Simple Evaluation of RAG with Ragas

Namrata Tanwani
5 min read · Mar 17, 2024

Like a toddler, an LLM is taught to “understand” a language during its training. But this training is not perfect: LLMs are trained on data only up to a certain cut-off point, so their knowledge can prove outdated in many situations, and they tend to hallucinate or generate gibberish when faced with an unfamiliar query or concept.


To overcome these challenges, we implement RAG. RAG, or Retrieval-Augmented Generation, is like an open-book examination for a large language model: the LLM now has a book (a knowledge base) to reference while generating answers.

The pipeline mainly consists of the following steps:

  1. Document Loading: Collecting all relevant documents from different sources to create the knowledge base.
  2. Document Chunking: Splitting large documents into smaller chunks so they can be passed as context to an LLM.
  3. Embedding Generation: Transforming the chunked documents into dense arrays of numbers, or embeddings.
  4. Vector Storage: Storing these embeddings in a vector store for easy retrieval.
  5. Context Retrieval: Retrieving the chunks most relevant to the user query and passing them as context to the LLM.
  6. Answer Generation: The LLM generates an answer based on the supplied context.

One might need to perform simple prompt engineering to get the best results.

But, the real question is how does one evaluate these RAG pipelines?


A RAG pipeline is evaluated against ground truths. So the evaluation dataset must contain user queries, ground truths, generated answers, and the supplied contexts. Before diving into code, we shall explore all the required metrics.

We can divide this process into two parts:

  1. Retrieval Evaluation: This part evaluates the context that is passed to the LLM. The metrics involved are:
  • Context Recall: This metric checks whether everything needed to answer the question is present in the context. For the user query “Who discovered the Galapagos Islands and how?”, a context with high recall answers both parts of the question, the who and the how, both of which are covered in the ground truth. The metric therefore uses the context and the ground truth to produce a score between 0 and 1, with 1 being the highest recall.
  • Context Precision: This metric checks whether the chunks most relevant to the ground truth are ranked highest in the retrieved context. The more relevant a chunk is to the ground truth, the higher the score. Context precision is computed from the ground truth, the context, and the user query, and ranges from 0 to 1, with higher scores meaning higher precision.
  • Context Entities Recall: This recall checks whether all entities present in the ground truth also appear in the supplied context. For the query “In which countries are snow leopards found?”, the ground truth mentions 12 countries; if the context contains the names of all of these countries, context entity recall will be high. (A toy sketch of this idea follows after the metric list.)

2. Generation Evaluation: This part evaluates the answer generated by the LLM.

  • Faithfulness: This metric outputs a score between 0 and 1 measuring the extent to which the generated response relies solely on the provided context. The lower the score, the less the answer is grounded in the supplied context, and the less trustworthy it is.
  • Answer Relevance: This measures how relevant and pertinent the generated answer is to the user query. For the query “What are the threats to penguin populations?”, an irrelevant answer might focus on where penguins live, while a relevant answer would describe the threats to penguin populations.
  • Answer Similarity: This calculates how semantically similar the generated answer and the ground truth are, in simple terms, how conceptually close the two are. For the query “In which countries are snow leopards found?”, a generated answer that mentions only a few of the countries can still score high on answer similarity because it is conceptually similar to the ground truth. Again, 1 is the highest similarity score and 0 the lowest.
  • Answer Correctness: This metric measures how factually correct the generated output is, using the ground truth and the generated answer. The higher the score, from 0 to 1, the better. It should not be confused with faithfulness: an answer can be factually correct yet unfaithful if it was not generated from the supplied context.
  • Answer Harmfulness: This metric simply flags whether the output is potentially offensive to an individual, a group, or society. The output is binary, 0 or 1.
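As a rough intuition for context entity recall, here is a toy, hand-rolled sketch with a made-up mini example. Ragas' actual implementation extracts entities with an LLM, so this only illustrates the ratio being measured:

```python
# Toy illustration of context entity recall: the fraction of ground-truth
# entities that also appear in the retrieved context. This is NOT Ragas'
# implementation, which uses an LLM to extract entities.
ground_truth_entities = {"India", "Nepal", "Bhutan", "China", "Mongolia", "Russia"}
context = "Snow leopards inhabit the high mountains of India, Nepal, Bhutan and China."

found = {entity for entity in ground_truth_entities if entity in context}
entity_recall = len(found) / len(ground_truth_entities)
print(round(entity_recall, 2))  # 4 entities found out of 6 -> 0.67
```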

We can now dive into the code to perform RAG and compute these metrics using the Ragas (RAG Assessment) framework. Ragas is a simple framework that helps evaluate RAG pipelines.

We start with importing libraries.
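The exact import list lives in the linked notebook; a representative set, assuming LangChain with the Chroma and sentence-transformers integrations plus Ragas 0.1.x (module paths vary between versions), might look like this:

```python
import os
from datasets import Dataset

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    context_entity_recall,
    answer_similarity,
    answer_correctness,
)
from ragas.metrics.critique import harmfulness
```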

Then, we read the OpenAI API key, as we’ll be using GPT-3.5. We also create our knowledge base from data scraped from four URLs.
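A minimal sketch of this step, assuming the key is available as an environment variable and using placeholder URLs (the notebook's actual four sources are not reproduced here):

```python
# The OpenAI API key is assumed to be set as an environment variable;
# the notebook may read it from a file or config instead.
assert "OPENAI_API_KEY" in os.environ

# Placeholder URLs standing in for the four pages scraped in the notebook.
urls = [
    "https://en.wikipedia.org/wiki/Gal%C3%A1pagos_Islands",
    "https://en.wikipedia.org/wiki/Snow_leopard",
    "https://en.wikipedia.org/wiki/Penguin",
    "https://en.wikipedia.org/wiki/Polar_bear",
]
loader = WebBaseLoader(urls)
documents = loader.load()  # one Document per URL
```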

These documents need to be converted to strings for preprocessing and, ultimately, arranged as a list of strings rather than a list of Document objects.
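One way to do this, assuming the LangChain Document objects loaded above:

```python
# Keep only the raw text of each Document and lightly normalise whitespace,
# leaving a plain list of strings for the chunker.
texts = [doc.page_content for doc in documents]
texts = [" ".join(text.split()) for text in texts]
```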

We now perform semantic chunking, which simply creates chunks of documents based on semantic similarity between sentences: two sentences fall in the same chunk if they are semantically similar. To find semantically similar data, we require dense sentence embeddings, so we use the sentence-transformers all-MiniLM-L6-v2 model to generate them.
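A sketch of the chunking step, assuming LangChain's experimental SemanticChunker driven by the sentence-transformers/all-MiniLM-L6-v2 embeddings:

```python
# Embedding model used both for semantic chunking and, later, for the vector store.
embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Split each string into chunks of semantically similar sentences.
text_splitter = SemanticChunker(embedding_model)
chunks = text_splitter.create_documents(texts)
print(f"Created {len(chunks)} chunks")
```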

Further, these chunked documents are stored in Chroma’s vector store, in a folder called chroma_db. We use the same sentence-transformers model to embed the documents when storing them.
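Persisting the chunks, assuming the Chroma integration shipped with LangChain (the retriever's k value below is an assumption, not the notebook's setting):

```python
# Embed the chunks with the same sentence-transformers model and persist them
# in a local Chroma collection under ./chroma_db.
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="chroma_db",
)

# Retriever used by the QA chain.
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
```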

Moving on, we define the prompt template for the LLM. We use the GPT-3.5 model with a 16k context length, and we make sure the result also returns all the contexts/source documents used.
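A sketch of the prompt and the QA chain; the prompt wording is illustrative, and the chain setup assumes LangChain's RetrievalQA with return_source_documents=True so the retrieved chunks come back alongside the answer:

```python
# Illustrative prompt template; the notebook's wording may differ.
template = """Answer the question using only the context provided below.
If the context does not contain the answer, say that you don't know.

Context: {context}

Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["context", "question"])

llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,  # keep the retrieved context for evaluation
    chain_type_kwargs={"prompt": prompt},
)
```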

We define our queries and ground truths for evaluation.
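The queries below are the ones quoted earlier in the article; the ground truths are abbreviated, illustrative placeholders rather than the full reference answers used in the notebook:

```python
questions = [
    "Who discovered the Galapagos Islands and how?",
    "In which countries are snow leopards found?",
    "What are the threats to penguin populations?",
]

# Abbreviated, illustrative ground truths.
ground_truths = [
    "The islands were discovered by chance in 1535 when the ship of Fray Tomás de Berlanga drifted off course.",
    "Snow leopards are found across twelve countries in Central and South Asia, including India, Nepal, Bhutan, China, Mongolia and Russia.",
    "Penguin populations are threatened by climate change, overfishing, habitat degradation, pollution and introduced predators.",
]
```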

Finally, we generate the results and store the context supplied to the LLM.
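A sketch of the generation loop, collecting both the answer and the text of the source documents returned by the chain:

```python
answers, contexts = [], []
for question in questions:
    output = qa_chain.invoke({"query": question})
    answers.append(output["result"])
    # Keep the raw text of every retrieved chunk for the Ragas dataset.
    contexts.append([doc.page_content for doc in output["source_documents"]])
```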

The evaluation dataset is prepared as a dictionary, and the evaluation metrics are calculated for each query using Ragas. These scores are then stored as a CSV file.
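A sketch of the evaluation step; the column names follow the Ragas 0.1.x convention (question, answer, contexts, ground_truth), which may differ in other versions, and ragas_scores.csv is just an illustrative file name:

```python
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truth": ground_truths,
}
dataset = Dataset.from_dict(data)

eval_result = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        context_entity_recall,
        answer_similarity,
        answer_correctness,
        harmfulness,
    ],
)

# Per-question scores as a DataFrame, persisted to disk.
scores_df = eval_result.to_pandas()
scores_df.to_csv("ragas_scores.csv", index=False)
```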

We can also look at the mean of each metric, which gives us a score for the entire dataset.
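Assuming the per-question scores were collected in scores_df as sketched above, the dataset-level numbers are just a column-wise mean:

```python
metric_columns = [
    "faithfulness", "answer_relevancy", "context_precision", "context_recall",
    "context_entity_recall", "answer_similarity", "answer_correctness", "harmfulness",
]
print(scores_df[metric_columns].mean())
```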

faithfulness             0.955000
answer_relevancy         0.923192
context_precision        0.733333
context_recall           0.916667
context_entity_recall    0.322197
answer_similarity        0.941792
answer_correctness       0.665889
harmfulness              0.000000
dtype: float64

The RAG pipeline exhibits strong performance in terms of faithfulness, answer relevancy, context recall, and answer similarity. However, there are areas for improvement, particularly context entity recall, context precision, and answer correctness, to enhance the overall quality and accuracy of the responses generated by the model.

You can also find this code, the chroma_db folder, and the evaluation metric scores in my GitHub repository: https://github.com/namratanwani/Evaluate-RAG/blob/master/RAG-evaluation.ipynb

References

Introduction | Ragas
Introduction | 🦜️🔗 Langchain

Hope you found this helpful! Please feel free to share your feedback :)

