Stylized performance dashboard for Retrieval-Augmented Generation
So far, we know that it is easy to build a proof of concept for a Retrieval-Augmented Generation (RAG) application, but it is very difficult to make it production-ready. Getting the RAG pipeline's performance to a satisfying state is especially difficult because the pipeline has different components:
- Retriever component: retrieves additional context from an external database for the LLM to answer the query.
- Generator component: generates an answer based on a prompt augmented with the retrieved context.
When evaluating a RAG pipeline, you must evaluate both components separately and together to understand if and where the RAG pipeline still needs improvement. Additionally, to understand whether your RAG application's performance is improving, you must evaluate it quantitatively. For this, you need two ingredients: an evaluation metric and an evaluation dataset.
Currently, determining the right evaluation metrics and collecting good validation data is an active area of research. As this is a quickly evolving topic, we are currently seeing various approaches for RAG evaluation frameworks emerge, such as the RAG Triad of metrics, ROUGE, ARES, BLEU, and RAGAs [1]. This article will focus on how you can evaluate a RAG pipeline using RAGAs [1].
What is RAGAs?
RAGAs (Retrieval-Augmented Generation Assessment) is a framework (GitHub, Docs) that provides you with the necessary ingredients to help you evaluate your RAG pipeline on a component level.
Evaluation Data
What is interesting about RAGAs is that it started out as a framework for "reference-free" evaluation [1]. That means, instead of having to rely on human-annotated ground truth labels in the evaluation dataset, RAGAs leverages LLMs under the hood to conduct the evaluations.
To evaluate the RAG pipeline, RAGAs expects the following information (a minimal example record is sketched after the list):
- question: The user query that is the input of the RAG pipeline.
- answer: The generated answer, which is the output of the RAG pipeline.
- contexts: The contexts retrieved from the external knowledge source that are used to answer the question.
- ground_truths: The ground truth answer to the question. This is the only human-annotated information. It is only required for the context_recall metric (see Evaluation Metrics).
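To make this format concrete, here is a minimal, hypothetical example of a single evaluation record with these four fields (the values are only illustrative and not taken from the evaluation run later in this article):
# Hypothetical example of a single evaluation record (illustrative values only)
sample = {
    "question": "What did the president say about Justice Breyer?",
    "answer": "He thanked Justice Breyer for his service to the country.",
    "contexts": ["Tonight, I'd like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer."],
    "ground_truths": ["The president thanked Justice Breyer for his service."],
}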
Leveraging LLMs for reference-free evaluation is an active research topic. While using as little human-annotated data as possible makes it a cheaper and faster evaluation method, there is still some discussion about its shortcomings, such as bias [3]. However, some papers have already shown promising results [4]. For detailed information, please refer to the "Related Work" section of the RAGAs paper [1].
Note that the framework has since been expanded to provide metrics and paradigms that require ground truth labels (e.g., context_recall and answer_correctness, see Evaluation Metrics).
In addition, the framework provides you with tools to automate test data generation.
Evaluation Metrics
RAGAs provides you with a few metrics to evaluate a RAG pipeline component-wise as well as end-to-end.
On a component level, RAGAs provides you with metrics to evaluate the retrieval component (context_relevancy and context_recall) and the generative component (faithfulness and answer_relevancy) separately [2]:
- Context precision measures the signal-to-noise ratio of the retrieved context. This metric is computed using the question and the contexts.
- Context recall measures whether all of the relevant information required to answer the question was retrieved. This metric is computed based on the ground_truth (this is the only metric in the framework that relies on human-annotated ground truth labels) and the contexts.
- Faithfulness measures the factual accuracy of the generated answer: the number of correct statements from the given contexts is divided by the total number of statements in the generated answer. This metric uses the question, the contexts, and the answer (a small numerical sketch of this ratio follows below).
- Answer relevancy measures how relevant the generated answer is to the question. This metric is computed using the question and the answer. For example, the answer "France is in Western Europe." to the question "Where is France and what is its capital?" would achieve a low answer relevancy because it only answers half of the question.
All metrics are scaled to the [0, 1] range, with higher values indicating better performance.
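As a purely numerical illustration of the faithfulness definition above (the framework itself uses an LLM to extract the statements from the answer and verify them against the context, so this is not RAGAs' actual implementation):
# Illustrative faithfulness calculation, assuming the statements were already
# extracted and checked (RAGAs does this with an LLM under the hood)
statements_in_answer = 4              # total statements in the generated answer
statements_supported_by_context = 3   # statements backed by the retrieved context
faithfulness_score = statements_supported_by_context / statements_in_answer
print(faithfulness_score)             # 0.75, a value in [0, 1] where higher is better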
RAGAs also provides you with metrics to evaluate the RAG pipeline end-to-end, such as answer semantic similarity and answer correctness. This article focuses on the component-level metrics.
Evaluating a RAG Application with RAGAs
This section uses RAGAs to evaluate a minimal vanilla RAG pipeline to show you how to use RAGAs and to give you an intuition about its evaluation metrics.
Prerequisites
Make sure you have installed the required Python packages:
- langchain, openai, and weaviate-client for the RAG pipeline
- ragas for evaluating the RAG pipeline
#!pip install langchain openai weaviate-client ragas
Additionally, define the relevant environment variables in a .env file in your root directory. To obtain an OpenAI API key, you need an OpenAI account, where you then "Create new secret key" under API keys.
OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
Setting up the RAG application
Before you can evaluate your RAG application, you need to set it up. We will use a vanilla RAG pipeline. We will keep this section short since we use the same setup described in detail in the following article.
Retrieval-Augmented Generation (RAG): From Theory to LangChain Implementation
From the theory of the original academic paper to the Python implementation using OpenAI, Weaviate, and LangChain
towardsdatascience.com
First, you have to prepare the data by loading and chunking the document.
import requests
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
url = "https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt"
res = requests.get(url)
with open("state_of_the_union.txt", "w") as f:
    f.write(res.text)
# Load the data
loader = TextLoader('./state_of_the_union.txt')
documents = loader.load()
# Chunk the data
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
Next, generate vector embeddings for each chunk with the OpenAI embedding model and store them in a vector database.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
import weaviate
from weaviate.embedded import EmbeddedOptions
from dotenv import load_dotenv, find_dotenv
# Load OpenAI API key from .env file
load_dotenv(find_dotenv())
# Setup vector database
client = weaviate.Client(
    embedded_options=EmbeddedOptions()
)
# Populate vector database
vectorstore = Weaviate.from_documents(
    client=client,
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    by_text=False
)
# Define vectorstore as retriever to enable semantic search
retriever = vectorstore.as_retriever()
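Optionally, you can run a quick sanity check on the retriever at this point (this step is an illustrative addition and assumes the setup above):
# Optional sanity check: inspect what the retriever returns for a sample query
sample_docs = retriever.get_relevant_documents("What did the president say about Justice Breyer?")
for doc in sample_docs:
    print(doc.page_content[:100])  # print the first 100 characters of each retrieved chunk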
Finally, set up the prompt template and OpenAI LLM and combine them with the retriever component into the RAG pipeline.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser
# Define LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# Define prompt template
template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use two sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
# Setup RAG pipeline
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
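At this point, you can optionally try the pipeline on a single query to confirm that it works end to end (an illustrative check, not part of the evaluation itself):
# Optional: run one query through the RAG pipeline before evaluating it
query = "What did the president say about Justice Breyer?"
print(rag_chain.invoke(query))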
Preparing the Evaluation Data
As RAGAs aims to be a reference-free evaluation framework, the required preparations of the evaluation dataset are minimal. You need to prepare question and ground_truths pairs, from which you can prepare the remaining information through inference as follows:
from datasets import Dataset
questions = [
    "What did the president say about Justice Breyer?",
    "What did the president say about Intel's CEO?",
    "What did the president say about gun violence?",
]
ground_truths = [
    ["The president said that Justice Breyer has dedicated his life to serve the country and thanked him for his service."],
    ["The president said that Pat Gelsinger is ready to increase Intel's investment to $100 billion."],
    ["The president asked Congress to pass proven measures to reduce gun violence."],
]
answers = []
contexts = []
# Inference
for query in questions:
    answers.append(rag_chain.invoke(query))
    contexts.append([doc.page_content for doc in retriever.get_relevant_documents(query)])
# To dict
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths,
}
# Convert dict to dataset
dataset = Dataset.from_dict(data)
If you are not interested in the context_recall metric, you don't need to provide the ground_truths information. In that case, all you need to prepare are the questions. A sketch of such a dataset follows below.
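For example, if you skip context_recall, a minimal dataset could look like this (a sketch that reuses the questions, answers, and contexts variables prepared above):
# Sketch: evaluation dataset without ground_truths (context_recall cannot be computed)
data_without_ground_truths = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
}
dataset_without_ground_truths = Dataset.from_dict(data_without_ground_truths)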
Evaluating the RAG application
First, import all the metrics you want to use from ragas.metrics. You can then use the evaluate() function and simply pass in the relevant metrics and the prepared data set.
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)
df = result.to_pandas()
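To inspect the scores, you can print the aggregated result object or look at the per-question scores in the DataFrame (a small sketch; the exact column names depend on your ragas version):
# Aggregated scores across all questions
print(result)
# Per-question scores, useful for spotting which question scores low on which metric
print(df.head())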
Below, you can see the results of an example RAGA score:
RAGAs scores for context precision, context recall, faithfulness, and answer relevancy.
We can make the following observations:
- context_relevancy (signal-to-noise ratio of the retrieved context): While the LLM judges all of the context for the last question to be relevant, it judges most of the retrieved context for the second question to be irrelevant. Based on this metric, you could experiment with different numbers of retrieved contexts to reduce the noise (see the sketch after this list).
- context_recall (whether all of the relevant information required to answer the question was retrieved): The LLM judges that the retrieved contexts contain the relevant information required to answer the questions correctly.
- faithfulness (factual accuracy of the generated answer): While the LLM judges that the first and last questions are answered correctly, the answer to the second question, which wrongly states that the president did not mention Intel's CEO, is judged with a faithfulness of 0.5.
- answer_relevancy (how relevant the generated answer is to the question): All of the generated answers are judged to be fairly relevant to the questions.
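To follow up on the context precision observation above, one knob you could turn is the number of retrieved contexts per query (a sketch using LangChain's search_kwargs option; the best value of k depends on your data):
# Sketch: retrieve only the top 2 chunks per query to reduce noise in the context
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})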
As mentioned in the Evaluation Data section, using LLMs for reference-free evaluation is an active research field. I am curious to see how this topic will evolve.
Summary
Building a proof-of-concept RAG application is easy, but making its performance production-ready is hard. As in a machine learning project, you should evaluate the RAG pipeline's performance with a validation dataset and an evaluation metric.
However, since a RAG pipeline consists of multiple components that must be evaluated separately and in combination, you need a set of evaluation metrics. Additionally, generating a high-quality validation dataset with human annotators is hard, time-consuming, and expensive.
This article introduced the RAGAs [1] evaluation framework. The framework proposes four evaluation metrics (context_relevancy, context_recall, faithfulness, and answer_relevancy) that together make up the RAGAs score. Additionally, RAGAs leverages LLMs under the hood for reference-free evaluation to save costs.
Disclaimer
At the time of writing, I'm a developer advocate for Weaviate, an open-source vector database.
References
[1] Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2023). RAGAs: Automated Evaluation of Retrieval Augmented Generation. arXiv preprint arXiv:2309.15217.
[2] RAGAs Documentation (2023) (accessed 11 December 2023).
[3] Wang, P., Li, L., Chen, L., Zhu, D., Lin, B., Cao, Y., … & Sui, Z. (2023). Large Language Models are not Fair Evaluators. arXiv preprint arXiv:2305.17926.
[4] Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment, May 2023. arXiv preprint arXiv:2303.16634.
Author: Leonie Monigatti
Source: https://towardsdatascience.com/evaluating-rag-applications-with-ragas-81d67b0ee31a