diff --git a/docs/community/index.md b/docs/community/index.md
index 7bb4cceb4..e99cea651 100644
--- a/docs/community/index.md
+++ b/docs/community/index.md
@@ -1,4 +1,4 @@
-# ❤️ Community
+# ❤️ Community
> "Alone we can do so little; together we can do so much." - Helen Keller
@@ -16,7 +16,7 @@ Meet some of our outstanding members who made significant contributions !
Explore insightful articles, tutorials, and stories written by and for our community members.
-- [Luka Panić](https://www.linkedin.com/in/luka-pani%C4%87-20b671277/) shares his work on
+- [Luka Panić](https://www.linkedin.com/in/luka-pani%C4%87-20b671277/) shares his work on
- [Ragas Evaluation: In-Depth Insights | PIXION Blog](https://pixion.co/blog/ragas-evaluation-in-depth-insights): A detailed explanation of the metrics and how they are calculated.
- [RAG in practice - Test Set Generation | PIXION Blog](https://pixion.co/blog/rag-in-practice-test-set-generation): A tutorial on how to generate a test set using Ragas.
- [Shanthi Vardhan](https://www.linkedin.com/in/shanthivardhan/) shares how his team at [Atomicwork uses ragas](https://www.atomicwork.com/blog/ragas-improving-atom-accuracy) to improve their AI system's ability to accurately identify and retrieve more precise information for enhanced service management.
@@ -25,16 +25,16 @@ Explore insightful articles, tutorials, and stories written by and for our commu
- Leonie (aka [@helloiamleonie](https://twitter.com/helloiamleonie?source=about_page-------------------------------------)) offers her perspective in the detailed article, ["Evaluating RAG Applications with RAGAs"](https://towardsdatascience.com/evaluating-rag-applications-with-ragas-81d67b0ee31a).
- The joint efforts of [Erika Cardenas](https://twitter.com/ecardenas300) and [Connor Shorten](https://twitter.com/CShorten30) are showcased in their collaborative piece, ["An Overview on RAG Evaluation | Weaviate"](https://weaviate.io/blog/rag-evaluation), and their podcast with the Ragas team.
- [Erika Cardenas](https://twitter.com/ecardenas300) further explores the "[RAG performance of hybrid search weightings (alpha)](https://www.linkedin.com/posts/erikacardenas300_i-tested-the-rag-performance-of-hybrid-search-activity-7139679925426376705-TVtc?utm_source=share&utm_medium=member_desktop)" in her recent experiment to tune weaviate alpha score using Ragas.
-- [Langchain’s](https://blog.langchain.dev/) work about [RAG Evaluating RAG pipelines with RAGAs and Langsmith](https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/) provides a complete tutorial on how to leverage both tools to evaluate RAG pipelines.
-- [Plaban Nayak](https://nayakpplaban.medium.com/) shares his work [Evaluate RAG Pipeline using RAGAS](https://medium.aiplanet.com/evaluate-rag-pipeline-using-ragas-fbdd8dd466c1) on building and evaluating a simple RAG using Langchain and RAGAS
-- [Stephen Kurniawan](https://www.linkedin.com/in/stepkurniawan/) compares different RAG elements such as [Chunk Size](https://medium.com/@stepkurniawan/rag-chunk-size-experiment-e5e5ca437f44), [Vector Stores: FAISS vs ChromaDB](https://medium.com/@stepkurniawan/comparing-faiss-with-chroma-vector-stores-0953e1e619eb), [Vector Stores 2: Multiple Documents](https://medium.com/@stepkurniawan/comparing-faiss-vs-chroma-vector-store-retrieve-multiple-documents-07ad81a18851), and [Similarity Searches / Distance Metrics / Index Strategies](https://medium.com/@stepkurniawan/comparing-similarity-searches-distance-metrics-in-vector-stores-rag-model-f0b3f7532d6f).
+- [LangChain’s](https://blog.langchain.dev/) post [Evaluating RAG pipelines with RAGAs and LangSmith](https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/) provides a complete tutorial on how to leverage both tools to evaluate RAG pipelines.
+- [Plaban Nayak](https://nayakpplaban.medium.com/) shares his work [Evaluate RAG Pipeline using RAGAS](https://medium.aiplanet.com/evaluate-rag-pipeline-using-ragas-fbdd8dd466c1) on building and evaluating a simple RAG using LangChain and RAGAS.
+- [Stephen Kurniawan](https://www.linkedin.com/in/stepkurniawan/) compares different RAG elements such as [Chunk Size](https://medium.com/@stepkurniawan/rag-chunk-size-experiment-e5e5ca437f44), [Vector Stores: FAISS vs ChromaDB](https://medium.com/@stepkurniawan/comparing-faiss-with-chroma-vector-stores-0953e1e619eb), [Vector Stores 2: Multiple Documents](https://medium.com/@stepkurniawan/comparing-faiss-vs-chroma-vector-store-retrieve-multiple-documents-07ad81a18851), and [Similarity Searches / Distance Metrics / Index Strategies](https://medium.com/@stepkurniawan/comparing-similarity-searches-distance-metrics-in-vector-stores-rag-model-f0b3f7532d6f).
- Discover [Devanshu Brahmbhatt](https://www.linkedin.com/in/devanshubrahmbhatt/)'s insights on optimizing RAG systems in his article, [Enhancing LLM's Accuracy with RAGAS](https://devanshus-organization.gitbook.io/llm-testing-ragas). Learn about RAG architecture, key evaluation metrics, and how to use RAGAS scores to improve performance.
- [Suzuki](https://www.linkedin.com/in/hirokazu-suzuki-206245110/) and [Hwang](https://www.linkedin.com/in/hwang-yongtae/) conducted an experiment to investigate if Ragas' performance is language-dependent by comparing the performance (correlation coefficient between human labels and scores from Ragas) using datasets of the same content in Japanese and English. They wrote blog about the result of the experiment and basic algorithm of Ragas.
- [RAG Evaluation: Necessity and Challenge](https://tech.beatrust.com/entry/2024/05/02/RAG_Evaluation%3A_Necessity_and_Challenge)
- [RAG Evaluation : Computational Metrics in RAG and Calculation Methods in Ragas](https://tech.beatrust.com/entry/2024/05/02/RAG_Evaluation_%3A_Computational_Metrics_in_RAG_and_Calculation_Methods_in_Ragas)
- [RAG Evaluation: Assessing the Usefulness of Ragas](https://tech.beatrust.com/entry/2024/05/02/RAG_Evaluation%3A_Assessing_the_Usefulness_of_Ragas)
-- [Atita Arora](https://www.linkedin.com/in/atitaarora/) writes about [Evaluating Retrieval Augmented Generation using RAGAS](https://superlinked.com/vectorhub/articles/retrieval-augmented-generation-eval-qdrant-ragas), an end-to-end tutorial on building RAG using [Qdrant](https://qdrant.tech/) and [Langchain](https://www.langchain.com/) and evaluating it with RAGAS.
- - *Bonus content* : Learn how to create an evaluation dataset that serves as a reference point for evaluating our RAG pipeline, Understand the RAGAS evaluation metrics and how to make sense of them and putting them in action to test a Naive RAG pipeline and measure its performance using RAGAS metrics.
+- [Atita Arora](https://www.linkedin.com/in/atitaarora/) writes about [Evaluating Retrieval Augmented Generation using RAGAS](https://superlinked.com/vectorhub/articles/retrieval-augmented-generation-eval-qdrant-ragas), an end-to-end tutorial on building RAG using [Qdrant](https://qdrant.tech/) and [LangChain](https://www.langchain.com/) and evaluating it with RAGAS.
+ - *Bonus content*: Learn how to create an evaluation dataset that serves as a reference point for evaluating the RAG pipeline, understand the RAGAS evaluation metrics and how to make sense of them, and put them into action to test a naive RAG pipeline and measure its performance using RAGAS metrics.
- *Code walkthrough* : https://github.com/qdrant/qdrant-rag-eval/tree/master/workshop-rag-eval-qdrant-ragas
- *Code walkthrough using [Deepset Haystack](https://haystack.deepset.ai/) and [Mixedbread.ai](https://www.mixedbread.ai/)* : https://github.com/qdrant/qdrant-rag-eval/tree/master/workshop-rag-eval-qdrant-ragas-haystack
- [Minoru Onda](https://x.com/minorun365) writes for beginners about how to start Ragas v0.2 evaluation with Amazon Bedrock, and integrate with Langfuse.
@@ -50,5 +50,5 @@ Explore insightful articles, tutorials, and stories written by and for our commu
Stay updated with our latest gatherings, meetups, and online webinars.
- OpenAI Engineers shares their [RAG tricks and features Ragas](https://youtu.be/ahnGLM-RC1Y?si=rS_WSQF8XB04PzhP) on DevDay.
-- [Langchain](https://python.langchain.com/docs/get_started/introduction)’s a [LangChain "RAG Evaluation” Webinar](https://www.crowdcast.io/c/bnx91nz59cqq) with the Ragas team
+- [LangChain](https://python.langchain.com/docs/get_started/introduction)’s [“RAG Evaluation” Webinar](https://www.crowdcast.io/c/bnx91nz59cqq) with the Ragas team
diff --git a/docs/concepts/components/eval_sample.md b/docs/concepts/components/eval_sample.md
index 35c153fcf..e78992da3 100644
--- a/docs/concepts/components/eval_sample.md
+++ b/docs/concepts/components/eval_sample.md
@@ -1,6 +1,6 @@
-# Evaluation Sample
+# Evaluation Sample
-An evaluation sample is a single structured data instance that is used to asses and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.
+An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.
## SingleTurnSample
SingleTurnSample represents a single-turn interaction between a user, LLM, and expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.
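+
+As a quick illustration (which fields you populate depends on the metric being used), a `SingleTurnSample` can be constructed like this:
+
+```python
+from ragas import SingleTurnSample
+
+# A single-turn interaction with the fields most metrics expect.
+sample = SingleTurnSample(
+    user_input="Where is the Eiffel Tower located?",
+    retrieved_contexts=["The Eiffel Tower is located in Paris."],
+    response="The Eiffel Tower is located in Paris.",
+    reference="The Eiffel Tower is located in Paris.",
+)
+```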
diff --git a/docs/concepts/metrics/available_metrics/agents.md b/docs/concepts/metrics/available_metrics/agents.md
index 156475a5a..2a72d8b94 100644
--- a/docs/concepts/metrics/available_metrics/agents.md
+++ b/docs/concepts/metrics/available_metrics/agents.md
@@ -7,8 +7,8 @@ Agentic or tool use workflows can be evaluated in multiple dimensions. Here are
AI systems deployed in real-world applications are expected to adhere to domains of interest while interacting with users but LLMs sometimes may answer general queries by ignoring this limitation. The topic adherence metric evaluates the ability of the AI to stay on predefined domains during the interactions. This metric is particularly important in conversational AI systems, where the AI is expected to only provide assistance to queries related to predefined domains.
-`TopicAdherenceScore` requires a predefined set of topics that the AI system is expected to adhere to which is provided using `reference_topics` along with `user_input`. The metric can compute precision, recall, and F1 score for topic adherence, defined as
-
+`TopicAdherenceScore` requires a predefined set of topics that the AI system is expected to adhere to, which is provided using `reference_topics` along with the `user_input`. The metric can compute precision, recall, and F1 score for topic adherence, defined as
+
$$
\text{Precision } = {|\text{Queries that are answered and are adheres to any present reference topics}| \over |\text{Queries that are answered and are adheres to any present reference topics}| + |\text{Queries that are answered and do not adheres to any present reference topics}|}
$$
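+
+A usage sketch (assuming an already-configured `evaluator_llm` and the `HumanMessage`/`AIMessage` classes from `ragas.messages`; the conversation content is only illustrative):
+
+```python
+from ragas.dataset_schema import MultiTurnSample
+from ragas.messages import HumanMessage, AIMessage
+from ragas.metrics import TopicAdherenceScore
+
+# A short banking conversation scored against the allowed reference topics.
+sample = MultiTurnSample(
+    user_input=[
+        HumanMessage(content="Can you help me reset my online banking password?"),
+        AIMessage(content="Sure. Go to Settings > Security and choose 'Reset password'."),
+    ],
+    reference_topics=["banking", "account security"],
+)
+
+scorer = TopicAdherenceScore(llm=evaluator_llm, mode="precision")
+await scorer.multi_turn_ascore(sample)
+```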
@@ -65,7 +65,7 @@ To change the mode to recall, set the `mode` parameter to `recall`.
```python
scorer = TopicAdherenceScore(llm = evaluator_llm, mode="recall")
-```
+```
Output
```
0.99999999995
@@ -75,7 +75,7 @@ Output
## Tool call Accuracy
-`ToolCallAccuracy` is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. This metric needs `user_input` and `reference_tool_calls` to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. The metric is computed by comparing the `reference_tool_calls` with the Tool calls made by the AI. The values range between 0 and 1, with higher values indicating better performance.
+`ToolCallAccuracy` is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. This metric needs `user_input` and `reference_tool_calls`, and is computed by comparing the `reference_tool_calls` with the tool calls actually made by the AI. The values range between 0 and 1, with higher values indicating better performance.
```python
from ragas.metrics import ToolCallAccuracy
@@ -113,7 +113,7 @@ Output
The tool call sequence specified in `reference_tool_calls` is used as the ideal outcome. If the tool calls made by the AI does not match the order or sequence of the `reference_tool_calls`, the metric will return a score of 0. This helps to ensure that the AI is able to identify and call the required tools in the correct order to complete a given task.
-By default the tool names and arguments are compared using exact string matching. But sometimes this might not be optimal, for example if the args are natural language strings. You can also use any ragas metrics (values between 0 and 1) as distance measure to identify if a retrieved context is relevant or not. For example,
+By default, the tool names and arguments are compared using exact string matching. But sometimes this might not be optimal, for example if the args are natural language strings. You can also use any ragas metric (values between 0 and 1) as a distance measure for comparing the arguments. For example,
```python
from ragas.metrics._string import NonLLMStringSimilarity
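+
+# A possible way to wire this in (the `arg_comparison_metric` attribute name is
+# an assumption, not shown in this snippet): use the string-similarity metric
+# to compare tool-call arguments instead of exact matching.
+# metric = ToolCallAccuracy()
+# metric.arg_comparison_metric = NonLLMStringSimilarity()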
diff --git a/docs/concepts/metrics/available_metrics/aspect_critic.md b/docs/concepts/metrics/available_metrics/aspect_critic.md
index 6bf1253a3..8b7cac73d 100644
--- a/docs/concepts/metrics/available_metrics/aspect_critic.md
+++ b/docs/concepts/metrics/available_metrics/aspect_critic.md
@@ -5,7 +5,7 @@ This is designed to assess submissions based on predefined aspects such as `harm
Critiques within the LLM evaluators evaluate submissions based on the provided aspect. Ragas Critiques offers a range of predefined aspects like correctness, harmfulness, etc. (Please refer to `SUPPORTED_ASPECTS` for a complete list). If you prefer, you can also create custom aspects to evaluate submissions according to your unique requirements.
-The `strictness` parameter plays a crucial role in maintaining a certain level of self-consistency in predictions, with an ideal range typically falling between 2 to 4.
+The `strictness` parameter plays a crucial role in maintaining a certain level of self-consistency in predictions, with an ideal range typically between 2 and 4.
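+
+A minimal sketch of how `strictness` is set (assuming the `AspectCritic` interface and an already-configured `evaluator_llm`; the aspect shown is illustrative):
+
+```python
+from ragas.metrics import AspectCritic
+
+# strictness controls the number of self-consistency checks; 2-4 is typical.
+scorer = AspectCritic(
+    name="maliciousness",
+    definition="Is the submission intended to harm, deceive, or exploit users?",
+    strictness=2,
+    llm=evaluator_llm,
+)
+```
+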
```{hint}
diff --git a/docs/concepts/metrics/available_metrics/context_entities_recall.md b/docs/concepts/metrics/available_metrics/context_entities_recall.md
index e28e4f3cb..a882367b2 100644
--- a/docs/concepts/metrics/available_metrics/context_entities_recall.md
+++ b/docs/concepts/metrics/available_metrics/context_entities_recall.md
@@ -1,13 +1,13 @@
## Context Entities Recall
-`ContextEntityRecall` metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it is a measure of what fraction of entities are recalled from `reference`. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in `reference`, because in cases where entities matter, we need the `retrieved_contexts` which cover them.
+`ContextEntityRecall` metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it is a measure of what fraction of entities is recalled from `reference`. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in `reference`, because in cases where entities matter, we need the `retrieved_contexts` which cover them.
-To compute this metric, we use two sets:
+To compute this metric, we use two sets:
-- **$RE$**: The set of entities in the reference.
-- **$RCE$**: The set of entities in the retrieved contexts.
+- **$RE$**: The set of entities in the reference.
+- **$RCE$**: The set of entities in the retrieved contexts.
-We calculate the number of entities common to both sets ($RCE \cap RE$) and divide it by the total number of entities in the reference ($RE$). The formula is:
+We calculate the number of entities common to both sets ($RCE \cap RE$) and divide it by the total number of entities in the reference ($RE$). The formula is:
$$
\text{Context Entity Recall} = \frac{\text{Number of common entities between $RCE$ and $RE$}}{\text{Total number of entities in $RE$}}
@@ -22,7 +22,7 @@ from ragas.metrics import ContextEntityRecall
sample = SingleTurnSample(
reference="The Eiffel Tower is located in Paris.",
- retrieved_contexts=["The Eiffel Tower is located in Paris."],
+ retrieved_contexts=["The Eiffel Tower is located in Paris."],
)
scorer = ContextEntityRecall(llm=evaluator_llm)
@@ -43,15 +43,15 @@ Output
**High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
**Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.
-Let us consider the refrence and the retrieved contexts given above.
+Let us consider the reference and the retrieved contexts given above.
-- **Step-1**: Find entities present in the refrence.
+- **Step-1**: Find entities present in the reference.
- Entities in ground truth (RE) - ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
- **Step-2**: Find entities present in the retrieved contexts.
- Entities in context (RCE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
- Entities in context (RCE2) - ['Taj Mahal', 'UNESCO', 'India']
- **Step-3**: Use the formula given above to calculate entity-recall
-
+
$$
\text{context entity recall 1} = \frac{| RCE1 \cap RE |}{| RE |}
= 4/6
@@ -63,5 +63,5 @@ Let us consider the refrence and the retrieved contexts given above.
= 1/6
$$
- We can see that the first context had a high entity recall, because it has a better entity coverage given the refrence. If these two retrieved contexts were fetched by two retrieval mechanisms on same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
+ We can see that the first context has higher entity recall, because it has better entity coverage given the reference. If these two retrieved contexts were fetched by two retrieval mechanisms on the same set of documents, we could say that the first mechanism was better than the other in use cases where entities are of importance.
diff --git a/docs/concepts/metrics/available_metrics/context_precision.md b/docs/concepts/metrics/available_metrics/context_precision.md
index 449dadcea..f4e1e5476 100644
--- a/docs/concepts/metrics/available_metrics/context_precision.md
+++ b/docs/concepts/metrics/available_metrics/context_precision.md
@@ -17,10 +17,10 @@ The following metrics uses LLM to identify if a retrieved context is relevant or
### Context Precision without reference
-`LLMContextPrecisionWithoutReference` metric can be used when you have both retrieved contexts and also reference contexts associated with a `user_input`. To estimate if a retrieved contexts is relevant or not this method uses the LLM to compare each of the retrieved context or chunk present in `retrieved_contexts` with `response`.
+`LLMContextPrecisionWithoutReference` metric can be used when you have retrieved contexts associated with a `user_input` but no reference. To estimate if a retrieved context is relevant or not, this method uses the LLM to compare each retrieved context or chunk present in `retrieved_contexts` with the `response`.
#### Example
-
+
```python
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithoutReference
@@ -30,7 +30,7 @@ context_precision = LLMContextPrecisionWithoutReference(llm=evaluator_llm)
sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
- retrieved_contexts=["The Eiffel Tower is located in Paris."],
+ retrieved_contexts=["The Eiffel Tower is located in Paris."],
)
@@ -43,10 +43,10 @@ Output
### Context Precision with reference
-`LLMContextPrecisionWithReference` metric is can be used when you have both retrieved contexts and also reference answer associated with a `user_input`. To estimate if a retrieved contexts is relevant or not this method uses the LLM to compare each of the retrieved context or chunk present in `retrieved_contexts` with `reference`.
+`LLMContextPrecisionWithReference` metric can be used when you have both retrieved contexts and also a reference answer associated with a `user_input`. To estimate if a retrieved context is relevant or not, this method uses the LLM to compare each retrieved context or chunk present in `retrieved_contexts` with the `reference`.
#### Example
-
+
```python
from ragas import SingleTurnSample
from ragas.metrics import LLMContextPrecisionWithReference
@@ -56,7 +56,7 @@ context_precision = LLMContextPrecisionWithReference(llm=evaluator_llm)
sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
reference="The Eiffel Tower is located in Paris.",
- retrieved_contexts=["The Eiffel Tower is located in Paris."],
+ retrieved_contexts=["The Eiffel Tower is located in Paris."],
)
await context_precision.single_turn_ascore(sample)
@@ -75,7 +75,7 @@ This metric uses traditional methods to determine whether a retrieved context is
The `NonLLMContextPrecisionWithReference` metric is designed for scenarios where both retrieved contexts and reference contexts are available for a `user_input`. To determine if a retrieved context is relevant, this method compares each retrieved context or chunk in `retrieved_context`s with every context in `reference_contexts` using a non-LLM-based similarity measure.
#### Example
-
+
```python
from ragas import SingleTurnSample
from ragas.metrics import NonLLMContextPrecisionWithReference
@@ -83,7 +83,7 @@ from ragas.metrics import NonLLMContextPrecisionWithReference
context_precision = NonLLMContextPrecisionWithReference()
sample = SingleTurnSample(
- retrieved_contexts=["The Eiffel Tower is located in Paris."],
+ retrieved_contexts=["The Eiffel Tower is located in Paris."],
reference_contexts=["Paris is the capital of France.", "The Eiffel Tower is one of the most famous landmarks in Paris."]
)
diff --git a/docs/concepts/metrics/available_metrics/context_recall.md b/docs/concepts/metrics/available_metrics/context_recall.md
index 7b986a8ff..1ee8dbb46 100644
--- a/docs/concepts/metrics/available_metrics/context_recall.md
+++ b/docs/concepts/metrics/available_metrics/context_recall.md
@@ -7,7 +7,7 @@ In short, recall is about not missing anything important. Since it is about not
## LLM Based Context Recall
-`LLMContextRecall` is computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy to `reference_contexts` which also makes it easier to use as annotating reference contexts can be very time consuming. To estimate context recall from the `reference`, the reference is broken down into claims each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.
+`LLMContextRecall` is computed using `user_input`, `reference` and the `retrieved_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses `reference` as a proxy for `reference_contexts`, which also makes it easier to use, as annotating reference contexts can be very time-consuming. To estimate context recall from the `reference`, the reference is broken down into claims, and each claim in the `reference` answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all claims in the reference answer should be attributable to the retrieved context.
The formula for calculating context recall is as follows:
@@ -17,7 +17,7 @@ $$
$$
### Example
-
+
```python
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import LLMContextRecall
@@ -26,7 +26,7 @@ sample = SingleTurnSample(
user_input="Where is the Eiffel Tower located?",
response="The Eiffel Tower is located in Paris.",
reference="The Eiffel Tower is located in Paris.",
- retrieved_contexts=["Paris is the capital of France."],
+ retrieved_contexts=["Paris is the capital of France."],
)
context_recall = LLMContextRecall(llm=evaluator_llm)
@@ -40,7 +40,7 @@ Output
## Non LLM Based Context Recall
-`NonLLMContextRecall` metric is computed using `retrieved_contexts` and `reference_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metrics uses non llm string comparison metrics to identify if a retrieved context is relevant or not. You can use any non LLM based metrics as distance measure to identify if a retrieved context is relevant or not.
+`NonLLMContextRecall` metric is computed using `retrieved_contexts` and `reference_contexts`, and the values range between 0 and 1, with higher values indicating better performance. This metric uses non-LLM string comparison metrics to identify if a retrieved context is relevant or not. You can use any non-LLM-based metric as a distance measure to identify if a retrieved context is relevant or not.
The formula for calculating context recall is as follows:
@@ -49,7 +49,7 @@ $$
$$
### Example
-
+
```python
@@ -57,7 +57,7 @@ from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import NonLLMContextRecall
sample = SingleTurnSample(
- retrieved_contexts=["Paris is the capital of France."],
+ retrieved_contexts=["Paris is the capital of France."],
reference_contexts=["Paris is the capital of France.", "The Eiffel Tower is one of the most famous landmarks in Paris."]
)
diff --git a/docs/concepts/metrics/available_metrics/faithfulness.md b/docs/concepts/metrics/available_metrics/faithfulness.md
index f85b0e9ac..2c29944c9 100644
--- a/docs/concepts/metrics/available_metrics/faithfulness.md
+++ b/docs/concepts/metrics/available_metrics/faithfulness.md
@@ -1,13 +1,13 @@
## Faithfulness
-The **Faithfulness** metric measures how factually consistent a `response` is with the `retrieved context`. It ranges from 0 to 1, with higher scores indicating better consistency.
+The **Faithfulness** metric measures how factually consistent a `response` is with the `retrieved context`. It ranges from 0 to 1, with higher scores indicating better consistency.
-A response is considered **faithful** if all its claims can be supported by the retrieved context.
+A response is considered **faithful** if all its claims can be supported by the retrieved context.
-To calculate this:
-1. Identify all the claims in the response.
-2. Check each claim to see if it can be inferred from the retrieved context.
-3. Compute the faithfulness score using the formula:
+To calculate this:
+1. Identify all the claims in the response.
+2. Check each claim to see if it can be inferred from the retrieved context.
+3. Compute the faithfulness score using the formula:
$$
\text{Faithfulness Score} = \frac{\text{Number of claims in the response supported by the retrieved context}}{\text{Total number of claims in the response}}
@@ -17,7 +17,7 @@ $$
### Example
```python
-from ragas.dataset_schema import SingleTurnSample
+from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import Faithfulness
sample = SingleTurnSample(
@@ -36,12 +36,12 @@ Output
```
-## Faithfullness with HHEM-2.1-Open
+## Faithfulness with HHEM-2.1-Open
[Vectara's HHEM-2.1-Open](https://vectara.com/blog/hhem-2-1-a-better-hallucination-detection-model/) is a classifier model (T5) that is trained to detect hallucinations from LLM generated text. This model can be used in the second step of calculating faithfulness, i.e. when claims are cross-checked with the given context to determine if it can be inferred from the context. The model is free, small, and open-source, making it very efficient in production use cases. To use the model to calculate faithfulness, you can use the following code snippet:
```python
-from ragas.dataset_schema import SingleTurnSample
+from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import FaithfulnesswithHHEM
@@ -69,7 +69,7 @@ await scorer.single_turn_ascore(sample)
```
-### How It’s Calculated
+### How It’s Calculated
!!! example
**Question**: Where and when was Einstein born?
diff --git a/docs/concepts/metrics/available_metrics/general_purpose.md b/docs/concepts/metrics/available_metrics/general_purpose.md
index 767a86a34..88cd6bcff 100644
--- a/docs/concepts/metrics/available_metrics/general_purpose.md
+++ b/docs/concepts/metrics/available_metrics/general_purpose.md
@@ -1,10 +1,10 @@
# General Purpose Metrics
-General purpose evaluation metrics are used to evaluate any given task.
+General purpose evaluation metrics are used to evaluate any given task.
-## Aspect Critic
+## Aspect Critic
-`AspectCritic` is an evaluation metric that can be used to evaluate responses based on predefined aspects in free form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.
+`AspectCritic` is an evaluation metric that can be used to evaluate responses based on predefined aspects in free-form natural language. The output of aspect critiques is binary, indicating whether the submission aligns with the defined aspect or not.
### Example
@@ -49,7 +49,7 @@ Critics are essentially basic LLM calls using the defined criteria. For example,
## Simple Criteria Scoring
-Course graned evaluation method is an evaluation metric that can be used to score (integer) responses based on predefined single free form scoring criteria. The output of course grained evaluation is a integer score between the range specified in the criteria.
+Coarse-grained evaluation is a method that can be used to score (integer) responses based on a single predefined free-form scoring criterion. The output of coarse-grained evaluation is an integer score within the range specified in the criteria.
```python
from ragas.dataset_schema import SingleTurnSample
@@ -63,7 +63,7 @@ sample = SingleTurnSample(
)
scorer = SimpleCriteriaScore(
- name="course_grained_score",
+ name="course_grained_score",
definition="Score 0 to 5 by similarity",
llm=evaluator_llm
)
@@ -77,7 +77,7 @@ Output
## Rubrics based criteria scoring
-The Rubric-Based Criteria Scoring Metric is used to do evaluations based on user-defined rubrics. Each rubric defines a detailed score description, typically ranging from 1 to 5. The LLM assesses and scores responses according to these descriptions, ensuring a consistent and objective evaluation.
+The Rubric-Based Criteria Scoring Metric is used to perform evaluations based on user-defined rubrics. Each rubric defines a detailed score description, typically ranging from 1 to 5. The LLM assesses and scores responses according to these descriptions, ensuring a consistent and objective evaluation.
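+
+A minimal sketch of defining such a rubric (assuming the `RubricsScore` metric and an already-configured `evaluator_llm`; exact class names can vary between versions):
+
+```python
+from ragas.metrics import RubricsScore
+
+# Score descriptions keyed by level; the LLM picks the best-matching level.
+rubrics = {
+    "score1_description": "The response contradicts the reference or is entirely off-topic.",
+    "score3_description": "The response is partially consistent with the reference but omits key details.",
+    "score5_description": "The response is fully consistent with the reference in facts and completeness.",
+}
+
+scorer = RubricsScore(rubrics=rubrics, llm=evaluator_llm)
+```
+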
!!! note
When defining rubrics, ensure consistency in terminology to match the schema used in the `SingleTurnSample` or `MultiTurnSample` respectively. For instance, if the schema specifies a term such as reference, ensure that the rubrics use the same term instead of alternatives like ground truth.
@@ -111,10 +111,10 @@ Output
## Instance Specific rubrics criteria scoring
-Instance Specific Evaluation Metric is a rubric-based method used to evaluate each item in a dataset individually. To use this metric, you need to provide a rubric along with the items you want to evaluate.
+The Instance-Specific Evaluation Metric is a rubric-based method used to evaluate each item in a dataset individually. To use this metric, you need to provide a rubric along with the items you want to evaluate.
!!! note
- This differs from the `Rubric Based Criteria Scoring Metric`, where a single rubric is applied to uniformly evaluate all items in the dataset. In the `Instance-Specific Evaluation Metric`, you decide which rubric to use for each item. It's like the difference between giving the entire class the same quiz (rubric-based) and creating a personalized quiz for each student (instance-specific).
+ This differs from the `Rubric Based Criteria Scoring Metric`, where a single rubric is applied to uniformly evaluate all items in the dataset. In the `Instance-Specific Evaluation Metric`, you decide which rubric to use for each item. It's like the difference between giving the entire class the same quiz (rubric-based) and creating a personalized quiz for each student (instance-specific).
#### Example
```python
diff --git a/docs/concepts/metrics/available_metrics/noise_sensitivity.md b/docs/concepts/metrics/available_metrics/noise_sensitivity.md
index 63ad2ffd9..f4c8d3b8c 100644
--- a/docs/concepts/metrics/available_metrics/noise_sensitivity.md
+++ b/docs/concepts/metrics/available_metrics/noise_sensitivity.md
@@ -1,6 +1,6 @@
# Noise Sensitivity
-`NoiseSensitivity` measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. The score ranges from 0 to 1, with lower values indicating better performance. Noise sensitivity is computed using the `user_input`, `reference`, `response`, and the `retrieved_contexts`.
+`NoiseSensitivity` measures how often a system makes errors by providing incorrect responses when utilizing either relevant or irrelevant retrieved documents. The score ranges from 0 to 1, with lower values indicating better performance. Noise sensitivity is computed using the `user_input`, `reference`, `response`, and the `retrieved_contexts`.
To estimate noise sensitivity, each claim in the generated response is examined to determine whether it is correct based on the ground truth and whether it can be attributed to the relevant (or irrelevant) retrieved context. Ideally, all claims in the answer should be supported by the relevant retrieved context.
@@ -37,7 +37,7 @@ Output
0.3333333333333333
```
-To calculate noise sensivity of irrelevant context, you can set the `mode` parameter to `irrelevant`.
+To calculate noise sensitivity of irrelevant context, you can set the `mode` parameter to `irrelevant`.
```python
scorer = NoiseSensitivity(mode="irrelevant")
@@ -51,12 +51,12 @@ await scorer.single_turn_ascore(sample)
Ground truth: The Life Insurance Corporation of India (LIC) is the largest insurance company in India, established in 1956 through the nationalization of the insurance industry. It is known for managing a large portfolio of investments.
- Relevant Retrieval:
+ Relevant Retrieval:
- The Life Insurance Corporation of India (LIC) was established in 1956 following the nationalization of the insurance industry in India.
- LIC is the largest insurance company in India, with a vast network of policyholders and a significant role in the financial sector.
- As the largest institutional investor in India, LIC manages a substantial life fund, contributing to the financial stability of the country.
-
- Irrelevant Retrieval:
+
+ Irrelevant Retrieval:
- The Indian economy is one of the fastest-growing major economies in the world, thanks to the sectors like finance, technology, manufacturing etc.
Let's examine how noise sensitivity in relevant context was calculated:
@@ -64,7 +64,7 @@ Let's examine how noise sensitivity in relevant context was calculated:
- **Step 1:** Identify the relevant contexts from which the ground truth can be inferred.
- Ground Truth:
- The Life Insurance Corporation of India (LIC) is the largest insurance company in India, established in 1956 through the nationalization of the insurance industry. It is known for managing a large portfolio of investments.
+ The Life Insurance Corporation of India (LIC) is the largest insurance company in India, established in 1956 through the nationalization of the insurance industry. It is known for managing a large portfolio of investments.
- Contexts:
- Context 1: The Life Insurance Corporation of India (LIC) was established in 1956 following the nationalization of the insurance industry in India.
@@ -74,7 +74,7 @@ Let's examine how noise sensitivity in relevant context was calculated:
- **Step 2:** Verify if the claims in the generated answer can be inferred from the relevant context.
- Answer:
- The Life Insurance Corporation of India (LIC) is the largest insurance company in India, known for its vast portfolio of investments. LIC contributs to the financial stability of the country.
+ The Life Insurance Corporation of India (LIC) is the largest insurance company in India, known for its vast portfolio of investments. LIC contributes to the financial stability of the country.
- Contexts:
- Context 1: The Life Insurance Corporation of India (LIC) was established in 1956 following the nationalization of the insurance industry in India.
@@ -88,7 +88,7 @@ Let's examine how noise sensitivity in relevant context was calculated:
The Life Insurance Corporation of India (LIC) is the largest insurance company in India, established in 1956 through the nationalization of the insurance industry. It is known for managing a large portfolio of investments.
- Answer:
- The Life Insurance Corporation of India (LIC) is the largest insurance company in India, known for its vast portfolio of investments. LIC contributs to the financial stability of the country.
+ The Life Insurance Corporation of India (LIC) is the largest insurance company in India, known for its vast portfolio of investments. LIC contributes to the financial stability of the country.
Explanation: The ground truth does not mention anything about LIC contributing to the financial stability of the country. Therefore, this statement in the answer is incorrect.
@@ -99,9 +99,9 @@ Let's examine how noise sensitivity in relevant context was calculated:
$$
\text{noise sensitivity} = { \text{1} \over \text{3} } = 0.333
- $$
+ $$
This results in a noise sensitivity score of 0.333, indicating that one out of three claims in the answer was incorrect.
-Credits: Noise senstivity was introduced in [RAGChecker](https://github.com/amazon-science/RAGChecker/tree/main/ragchecker)
\ No newline at end of file
+Credits: Noise sensitivity was introduced in [RAGChecker](https://github.com/amazon-science/RAGChecker/tree/main/ragchecker)
\ No newline at end of file
diff --git a/docs/concepts/metrics/available_metrics/nvidia_metrics.md b/docs/concepts/metrics/available_metrics/nvidia_metrics.md
index 89a40704b..ece565425 100644
--- a/docs/concepts/metrics/available_metrics/nvidia_metrics.md
+++ b/docs/concepts/metrics/available_metrics/nvidia_metrics.md
@@ -2,7 +2,7 @@
## Answer Accuracy
-**Answer Accuracy** measures the agreement between a model’s response and a reference ground truth for a given question. This is done via two distinct "LLM-as-a-judge" prompts that each return a rating (0, 2, or 4). The metric converts these ratings into a [0,1] scale and then takes the average of the two scores from the judges. Higher scores indicate that the model’s answer closely matches the reference.
+**Answer Accuracy** measures the agreement between a model’s response and a reference ground truth for a given question. This is done via two distinct "LLM-as-a-Judge" prompts that each return a rating (0, 2, or 4). The metric converts these ratings into a [0,1] scale and then takes the average of the two scores from the judges. Higher scores indicate that the model’s answer closely matches the reference.
- **0** → The **response** is inaccurate or does not address the same question as the **reference**.
- **2** → The **response** partially align with the **reference**.
@@ -67,7 +67,7 @@ Thus, the final **Answer Accuracy** score is **1**.
- **Explainability:** Answer Correctness offers high explainability by providing detailed insights into factual correctness and semantic similarity, whereas Answer Accuracy provides a straightforward raw score.
- **Robust Evaluation:** Answer Accuracy ensures consistency through dual LLM evaluations, while Answer Correctness offers a holistic view by deeply assessing the quality of the response.
-#### Answer Accuracy vs. Rubric Score
+#### Answer Accuracy vs. Rubric Score
- **LLM Calls**: Answer Accuracy makes two calls (one per LLM judge), while Rubric Score requires only one.
- **Token Usage**: Answer Accuracy is minimal since it outputs just a score, whereas Rubric Score generates reasoning, increasing token consumption.
@@ -76,7 +76,7 @@ Thus, the final **Answer Accuracy** score is **1**.
## Context Relevance
-**Context Relevance** evaluates whether the **retrieved_contexts** (chunks or passages) are pertinent to the **user_input**. This is done via two independent "LLM-as-a-judge" prompt calls that each rate the relevance on a scale of **0, 1, or 2**. The ratings are then converted to a [0,1] scale and averaged to produce the final score. Higher scores indicate that the contexts are more closely aligned with the user's query.
+**Context Relevance** evaluates whether the **retrieved_contexts** (chunks or passages) are pertinent to the **user_input**. This is done via two independent "LLM-as-a-Judge" prompt calls that each rate the relevance on a scale of **0, 1, or 2**. The ratings are then converted to a [0,1] scale and averaged to produce the final score. Higher scores indicate that the contexts are more closely aligned with the user's query.
- **0** → The retrieved contexts are not relevant to the user’s query at all.
- **1** → The contexts are partially relevant.
@@ -117,7 +117,7 @@ Output
- "Albert Einstein was born March 14, 1879."
- "Albert Einstein was born at Ulm, in Württemberg, Germany."
-In this example, the two retrieved contexts together fully address the user's query by providing both the birth date and location of Albert Einstein. Consequently, both prompts would rate the combined contexts as **2** (fully relevant). Normalizing each score yields **1.0** (2/2), and averaging the two results maintains the final Context Relevance score at **1**.
+In this example, the two retrieved contexts together fully address the user's query by providing both the birthdate and location of Albert Einstein. Consequently, both prompts would rate the combined contexts as **2** (fully relevant). Normalizing each score yields **1.0** (2/2), and averaging the two results maintains the final Context Relevance score at **1**.
### Similar Ragas Metrics
@@ -177,13 +177,13 @@ Output
- "Albert Einstein was born March 14, 1879."
- "Albert Einstein was born at Ulm, in Württemberg, Germany."
-In this example, the retrieved contexts provide both the birth date and location of Albert Einstein. Since the response's claim is supported by the context (even though the date is partially provided), both prompts would likely rate the grounding as **2** (fully grounded). Normalizing a score of 2 gives **1.0** (2/2), and averaging the two normalized ratings maintains the final Response Groundedness score at **1**.
+In this example, the retrieved contexts provide both the birthdate and location of Albert Einstein. Since the response's claim is supported by the context (even though the date is partially provided), both prompts would likely rate the grounding as **2** (fully grounded). Normalizing a score of 2 gives **1.0** (2/2), and averaging the two normalized ratings maintains the final Response Groundedness score at **1**.
### Similar Ragas Metrics
1. [Faithfulness](faithfulness.md): This metric measures how factually consistent a response is with the retrieved context, ensuring that every claim in the response is supported by the provided information. The Faithfulness score ranges from 0 to 1, with higher scores indicating better consistency.
-2. [Rubric Score](general_purpose.md#rubrics-based-criteria-scoring): This is a general-purpose metric that evaluates responses based on user-defined criteria and can be adapted to assess Answer Accuracy, Context Relevance or Response Groundedness by aligning the rubric with the requirements.
+2. [Rubric Score](general_purpose.md#rubrics-based-criteria-scoring): This is a general-purpose metric that evaluates responses based on user-defined criteria and can be adapted to assess Answer Accuracy, Context Relevance or Response Groundedness by aligning the rubric with the requirements.
### Comparison of Metrics
diff --git a/docs/concepts/metrics/available_metrics/sql.md b/docs/concepts/metrics/available_metrics/sql.md
index a319755d5..7d2675483 100644
--- a/docs/concepts/metrics/available_metrics/sql.md
+++ b/docs/concepts/metrics/available_metrics/sql.md
@@ -1,14 +1,14 @@
-# SQL
+# SQL
## Execution based metrics
-In these metrics the resulting SQL is compared after executing the SQL query on the database and then comparing the `response` with the expected results.
+In these metrics, the SQL query in the `response` is executed on the database and the resulting data is compared with the expected results.
### DataCompy Score
-`DataCompyScore` metric uses DataCompy, a python library that compares two pandas DataFrames. It provides a simple interface to compare two DataFrames and provides a detailed report of the differences. In this metric the `response` is executed on the database and the resulting data is compared with the expected data, ie `reference`. To enable comparison both `response` and `reference` should be in the form of a Comma-Separated Values as shown in the example.
+`DataCompyScore` metric uses DataCompy, a Python library that compares two pandas DataFrames. It provides a simple interface to compare two DataFrames and produces a detailed report of the differences. In this metric the `response` is executed on the database and the resulting data is compared with the expected data, i.e. `reference`. To enable comparison, both `response` and `reference` should be in Comma-Separated Values (CSV) format as shown in the example.
-Dataframes can be compared across rows or columns. This can be configured using `mode` parameter.
+DataFrames can be compared across rows or columns. This can be configured using the `mode` parameter.
If mode is `row` then the comparison is done row-wise. If mode is `column` then the comparison is done column-wise.
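+
+As a rough sketch (assuming `DataCompyScore` is importable from `ragas.metrics` and that the CSV strings below were produced by executing the generated and reference SQL against the database):
+
+```python
+from ragas.dataset_schema import SingleTurnSample
+from ragas.metrics import DataCompyScore
+
+# CSV results of running the generated SQL (response) and the reference SQL (reference).
+response_csv = """acct_id,dollar_amt,name
+10000001234,123.45,George Michael
+10000001235,0.45,Michael Bluth
+"""
+reference_csv = """acct_id,dollar_amt,name
+10000001234,123.4,George Michael
+10000001235,0.45,Michael Bluth
+"""
+
+sample = SingleTurnSample(response=response_csv, reference=reference_csv)
+scorer = DataCompyScore(mode="row")  # use mode="column" for column-wise comparison
+await scorer.single_turn_ascore(sample)
+```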
diff --git a/docs/concepts/metrics/overview/index.md b/docs/concepts/metrics/overview/index.md
index 889b6622d..4b2785dbe 100644
--- a/docs/concepts/metrics/overview/index.md
+++ b/docs/concepts/metrics/overview/index.md
@@ -16,7 +16,7 @@ A metric is a quantitative measure used to evaluate the performance of a AI appl
**Metrics can be classified into two categories based on the mechanism used underneath the hood**:
- **LLM-based metrics**: These metrics use LLM underneath to do the evaluation. There might be one or more LLM calls that are performed to arrive at the score or result. These metrics can be somewhat non deterministic as the LLM might not always return the same result for the same input. On the other hand, these metrics has shown to be more accurate and closer to human evaluation.
+ **LLM-based metrics**: These metrics use an LLM underneath to do the evaluation. One or more LLM calls may be performed to arrive at the score or result. These metrics can be somewhat non-deterministic, as the LLM might not always return the same result for the same input. On the other hand, these metrics have been shown to be more accurate and closer to human evaluation.
All LLM based metrics in ragas are inherited from `MetricWithLLM` class. These metrics expects a LLM object to be set before scoring.
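+
+For illustration, setting the evaluator LLM on an LLM-based metric might look like this (assuming the LangChain OpenAI integration; the model name is an arbitrary choice):
+
+```python
+from langchain_openai import ChatOpenAI
+from ragas.llms import LangchainLLMWrapper
+from ragas.metrics import Faithfulness
+
+# LLM-based metrics expect an evaluator LLM to be set before scoring.
+evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
+faithfulness = Faithfulness(llm=evaluator_llm)
+```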
@@ -30,7 +30,7 @@ Each LLM based metrics also will have prompts associated with it written using [
**Non-LLM-based metrics**: These metrics do not use LLM underneath to do the evaluation. These metrics are deterministic and can be used to evaluate the performance of the AI application without using LLM. These metrics rely on traditional methods to evaluate the performance of the AI application, such as string similarity, BLEU score, etc. Due to the same, these metrics are known to have a lower correlation with human evaluation.
-All LLM based metrics in ragas are inherited from `Metric` class.
+All non-LLM-based metrics in ragas are inherited from the `Metric` class.
**Metrics can be broadly classified into two categories based on the type of data they evaluate**:
@@ -57,19 +57,19 @@ await scorer.multi_turn_ascore(sample)
Designing effective metrics for AI applications requires following to a set of core principles to ensure their reliability, interpretability, and relevance. Here are five key principles we follow in ragas when designing metrics:
-**1. Single-Aspect Focus**
+**1. Single-Aspect Focus**
A single metric should target only one specific aspect of the AI application's performance. This ensures that the metric is both interpretable and actionable, providing clear insights into what is being measured.
-**2. Intuitive and Interpretable**
+**2. Intuitive and Interpretable**
Metrics should be designed to be easy to understand and interpret. Clear and intuitive metrics make it simpler to communicate results and draw meaningful conclusions.
-**3. Effective Prompt Flows**
+**3. Effective Prompt Flows**
When developing metrics using large language models (LLMs), use intelligent prompt flows that align closely with human evaluation. Decomposing complex tasks into smaller sub-tasks with specific prompts can improve the accuracy and relevance of the metric.
-**4. Robustness**
+**4. Robustness**
Ensure that LLM-based metrics include sufficient few-shot examples that reflect the desired outcomes. This enhances the robustness of the metric by providing context and guidance for the LLM to follow.
-**5.Consistent Scoring Ranges**
+**5. Consistent Scoring Ranges**
It is crucial to normalize metric score values or ensure they fall within a specific range, such as 0 to 1. This facilitates comparison between different metrics and helps maintain consistency and interpretability across the evaluation framework.
These principles serve as a foundation for creating metrics that are not only effective but also practical and meaningful in evaluating AI applications.
diff --git a/docs/concepts/test_data_generation/index.md b/docs/concepts/test_data_generation/index.md
index 4de8feee4..347c00a44 100644
--- a/docs/concepts/test_data_generation/index.md
+++ b/docs/concepts/test_data_generation/index.md
@@ -5,11 +5,11 @@ Curating a high quality test dataset is crucial for evaluating the performance o
## Characteristics of an Ideal Test Dataset
- Contains high quality data samples
-- Covers wide variety of scenarios as observed in real world.
-- Contains enough number of samples to be derive statistically significant conclusions.
+- Covers a wide variety of scenarios as observed in the real world.
+- Contains enough samples to derive statistically significant conclusions.
- Continually updated to prevent data drift
-Curating such a dataset manually can be time consuming and expensive. Ragas provides a set of tools to generate synthetic test datasets for evaluating your AI applications.
+Curating such a dataset manually can be time-consuming and expensive. Ragas provides a set of tools to generate synthetic test datasets for evaluating your AI applications.
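+
+As a rough sketch of what this looks like (assuming documents are already loaded as LangChain `Document` objects in `docs`, and that `generator_llm` and `generator_embeddings` are already-wrapped ragas models):
+
+```python
+from ragas.testset import TestsetGenerator
+
+# docs: a list of LangChain Document objects loaded from your corpus.
+generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
+dataset = generator.generate_with_langchain_docs(docs, testset_size=10)
+```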
diff --git a/docs/concepts/test_data_generation/rag.md b/docs/concepts/test_data_generation/rag.md
index c2dcdbff4..a863e75ef 100644
--- a/docs/concepts/test_data_generation/rag.md
+++ b/docs/concepts/test_data_generation/rag.md
@@ -1,6 +1,6 @@
# Testset Generation for RAG
-In RAG application, when a user interacts through your application to a set of documents the there can be different patterns of queries that the system can encounter. Let's first understand the different types of queries that can be encountered in RAG application.
+In a RAG application, when a user interacts with a set of documents through your application, the system can encounter different patterns of queries. Let's first understand the different types of queries that can be encountered in a RAG application.
## Query types in RAG
@@ -8,13 +8,13 @@ In RAG application, when a user interacts through your application to a set of d
graph TD
A[Queries] --> B[Single-Hop Query]
A --> C[Multi-Hop Query]
-
+
B --> D1[Specific Query]
-
+
B --> E1[Abstract Query]
C --> F1[Specific Query]
-
+
C --> G1[Abstract Query]
```
@@ -53,16 +53,16 @@ This abstract query requires the retrieval of multiple pieces of information ove
### Specific vs. Abstract Queries in a RAG
- **Specific Query:** Focuses on clear, fact-based retrieval. The goal in RAG is to retrieve highly relevant information from one or more documents that directly address the specific question.
-
+
- **Abstract Query:** Requires a broader, more interpretive response. In RAG, abstract queries challenge the retrieval system to pull from documents that contain higher-level reasoning, explanations, or opinions, rather than simple facts.
In both single-hop and multi-hop cases, the distinction between specific and abstract queries shapes the retrieval and generation process by determining whether the focus is on precision (specific) or on synthesizing broader ideas (abstract).
-Different types of queries requires different contexts to be synthesize. To solve this problem, Ragas uses a Knowledge Graph based approach to Test set Generation.
+Different types of queries require different contexts to be synthesized. To solve this problem, Ragas uses a Knowledge Graph based approach to Test set Generation.
## Knowledge Graph Creation
-Given that we want to manufacture different types of queries from the given set of documents, our major challenge is to identify the right set of chunks or documents to enable LLMs to create the queries. To solve this problem, Ragas uses a Knowledge Graph based approach to Test set Generation.
+Given that we want to manufacture different types of queries from the given set of documents, our major challenge is to identify the right set of chunks or documents to enable LLMs to create the queries. To solve this problem, Ragas uses a Knowledge Graph based approach to Test set Generation.
{width="auto"}
@@ -73,8 +73,8 @@ Given that we want to manufacture different types of queries from the given set
The knowledge graph is created by using the following components:
### Document Splitter
-
-The documents are chunked to form hierarchial nodes. The chunking can be done by using different splitters. For example, in the case of financial documents, the chunking can be done by using the splitter that splits the document based on the sections like Income Statement, Balance Sheet, Cash Flow Statement etc. You can write your own [custom splitters]() to split the document based on the sections that are relevant to your domain.
+
+The documents are chunked to form hierarchical nodes. The chunking can be done using different splitters. For example, in the case of financial documents, the chunking can be done using a splitter that splits the document based on sections like Income Statement, Balance Sheet, Cash Flow Statement, etc. You can write your own [custom splitters]() to split the document based on the sections that are relevant to your domain.
#### Example
@@ -103,7 +103,7 @@ graph TD
### Extractors
-Different extractors are used to extract information from each nodes that can be used to establish the relationship between the nodes. For example, in the case of financial documents, the extractor that can be used are entity extractor to extract the entities like Company Name, Keyphrase extractor to extract important key phrases present in each node, etc. You can write your own custom extractors to extract the information that is relevant to your domain.
+Different extractors are used to extract information from each node that can be used to establish the relationship between the nodes. For example, in the case of financial documents, the extractors that can be used are an entity extractor to extract entities like Company Name, a keyphrase extractor to extract important key phrases present in each node, etc. You can write your own custom extractors to extract the information that is relevant to your domain.
Extractors can be LLM-based, which inherit from `LLMBasedExtractor`, or rule-based, which inherit from `Extractor`.
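For illustration, here is a minimal sketch of running an LLM-based extractor over a single node. It assumes your installed ragas version exposes `NERExtractor` under `ragas.testset.transforms.extractors` with an async `extract(node)` method, and `Node`/`NodeType` under `ragas.testset.graph`; check the API of your version before relying on these names.

```python
import asyncio

from ragas.testset.graph import Node, NodeType
# Assumed import path for the built-in LLM-based entity extractor; adjust to your ragas version.
from ragas.testset.transforms.extractors import NERExtractor

# A single chunk node to run the extractor against.
node = Node(
    type=NodeType.CHUNK,
    properties={"page_content": "Company X reported 12% revenue growth in FY2023."},
)

async def main():
    # LLM-based extractor; assumed to use the default LLM unless one is passed in.
    extractor = NERExtractor()
    property_name, value = await extractor.extract(node)
    # The extracted property (e.g. entities) can later be attached to the node.
    print(property_name, value)

asyncio.run(main())
```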
@@ -166,7 +166,7 @@ graph TD
The extracted information is used to establish the relationship between the nodes. For example, in the case of financial documents, the relationship can be established between the nodes based on the entities present in the nodes.
You can write your own [custom relationship builder]() to establish the relationship between the nodes based on the information that is relevant to your domain.
-#### Example
+#### Example
```python
from ragas.testset.graph import KnowledgeGraph
@@ -196,7 +196,7 @@ graph TD
Now let's understand how to build the knowledge graph using the above components with a `transform`, which will make your job easier.
-### Transforms
+### Transforms
All of the components used to build the knowledge graph can be combined into a single `transform` that can be applied to the knowledge graph. A transform is made up of a list of components that are applied to the knowledge graph in a sequence. It can also handle parallel processing of the components. The `apply_transforms` method is used to apply the transforms to the knowledge graph.
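As a minimal sketch of this idea, the default transformation pipeline can be built and applied in one step. The `documents` argument to `default_transforms` and the exact model choices below are assumptions; adjust them to your ragas version and providers.

```python
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset.graph import KnowledgeGraph
from ragas.testset.transforms import apply_transforms, default_transforms

# Models used by the LLM-based components (extractors, relationship builders, ...).
transformer_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
embedding_model = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

docs = [Document(page_content="Company X grew revenue by 12% in FY2023 ...")]
kg = KnowledgeGraph()  # in practice, a graph already populated with nodes built from `docs`

# Build the default list of transforms and apply them to the graph in sequence.
transforms = default_transforms(
    documents=docs,  # assumed parameter; older versions may not require documents
    llm=transformer_llm,
    embedding_model=embedding_model,
)
apply_transforms(kg, transforms)
```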
@@ -232,17 +232,17 @@ apply_transforms(kg,transforms)
```
-Once the knowledge graph is created, the different types of queries can be generated by traversing the graph. For example, to generate the query “Compare the revenue growth of Company X and Company Y from FY2020 through FY2023”, the graph can be traversed to find the nodes that contain the information about the revenue growth of Company X and Company Y from FY2020 through FY2023.
+Once the knowledge graph is created, the different types of queries can be generated by traversing the graph. For example, to generate the query “Compare the revenue growth of Company X and Company Y from FY2020 through FY2023”, the graph can be traversed to find the nodes that contain the information about the revenue growth of Company X and Company Y from FY2020 through FY2023.
## Scenario Generation
Now we have the knowledge graph that can be used to manufacture the right context to generate any type of query. When a population of users interacts with a RAG system, they may formulate queries in various ways depending upon their persona (e.g., Senior Engineer, Junior Engineer), query length (short, long, etc.), and query style (formal, informal, etc.). To generate queries that cover all these scenarios, Ragas uses a Scenario based approach to Test set Generation.
-Each `Scenario` in Test set Generation is a combination of following parameters.
+Each `Scenario` in Test set Generation is a combination of the following parameters (illustrated in the sketch after this list).
- Nodes : The nodes that are used to generate the query
-- Query Length : The length of the desired query, it can be short, medium or long, etc.
-- Query Style : The style of the query, it can be web search, chat, etc.
+- Query Length : The length of the desired query, it can be short, medium or long, etc.
+- Query Style : The style of the query, it can be web search, chat, etc.
- Persona : The persona of the user, it can be Senior Engineer, Junior Engineer, etc. (Coming soon)
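Purely as an illustration, here is a hypothetical dataclass showing what such a scenario bundles together; this is not the actual Ragas class, and the real implementation may differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

from ragas.testset.graph import Node


class QueryLength(str, Enum):
    SHORT = "short"
    MEDIUM = "medium"
    LONG = "long"


class QueryStyle(str, Enum):
    WEB_SEARCH = "web search"
    CHAT = "chat"


@dataclass
class ExampleScenario:
    """Hypothetical illustration of what a Scenario combines; the real class differs."""

    nodes: List[Node]                 # KG nodes that supply the context for the query
    length: QueryLength               # short / medium / long
    style: QueryStyle                 # web search / chat / ...
    persona: str = "Senior Engineer"  # persona support is marked as coming soon
```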
@@ -253,13 +253,13 @@ Each `Scenario` in Test set Generation is a combination of following parameters.
### Query Synthesizer
-The `QuerySynthesizer` is responsible for generating different scenarios for a single query type. The `generate_scenarios` method is used to generate the scenarios for a single query type. The `generate_sample` method is used to generate the query and reference answer for a single scenario. Let's understand this with an example.
+The `QuerySynthesizer` is responsible for generating different scenarios for a single query type. The `generate_scenarios` method is used to generate the scenarios for a single query type. The `generate_sample` method is used to generate the query and reference answer for a single scenario. Let's understand this with an example.
#### Example
-In the previous example, we have created a knowledge graph that contains two nodes that are related to each other based on the entity similarity. Now imagine that you have 20 such pairs of nodes in your KG that are related to each other based on the entity similarity.
+In the previous example, we have created a knowledge graph that contains two nodes that are related to each other based on the entity similarity. Now imagine that you have 20 such pairs of nodes in your KG that are related to each other based on the entity similarity.
-Imagine your goal is to create 50 different queries where each query is about some abstract question comparing two entities. We first have to query the KG to get the pairs of nodes that are related to each other based on the entity similarity. Then we have to generate the scenarios for each pair of nodes untill we get 50 different scenarios. This logic is implemented in `generate_scenarios` method.
+Imagine your goal is to create 50 different queries where each query is about some abstract question comparing two entities. We first have to query the KG to get the pairs of nodes that are related to each other based on the entity similarity. Then we have to generate the scenarios for each pair of nodes until we get 50 different scenarios. This logic is implemented in the `generate_scenarios` method.
```python
diff --git a/docs/extra/components/choose_evaluator_llm.md b/docs/extra/components/choose_evaluator_llm.md
index c57f4f1e9..7edc07dc8 100644
--- a/docs/extra/components/choose_evaluator_llm.md
+++ b/docs/extra/components/choose_evaluator_llm.md
@@ -112,11 +112,11 @@
```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
-
+
# Choose the appropriate import based on your API:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_vertexai import ChatVertexAI
-
+
# Initialize with Google AI Studio
evaluator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
model=config["model"],
@@ -124,7 +124,7 @@
max_tokens=config["max_tokens"],
top_p=config["top_p"],
))
-
+
# Or initialize with Vertex AI
evaluator_llm = LangchainLLMWrapper(ChatVertexAI(
model=config["model"],
@@ -140,12 +140,12 @@
```python
from langchain_google_genai import HarmCategory, HarmBlockThreshold
-
+
safety_settings = {
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
# Add other safety settings as needed
}
-
+
# Apply to your LLM initialization
evaluator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
model=config["model"],
@@ -159,7 +159,7 @@
```python
# Google AI Studio Embeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings
-
+
evaluator_embeddings = LangchainEmbeddingsWrapper(GoogleGenerativeAIEmbeddings(
model="models/embedding-001", # Google's text embedding model
task_type="retrieval_document" # Optional: specify the task type
@@ -169,7 +169,7 @@
```python
# Vertex AI Embeddings
from langchain_google_vertexai import VertexAIEmbeddings
-
+
evaluator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(
model_name="textembedding-gecko@001", # or other available model
project=config["project"], # Your GCP project ID
@@ -231,7 +231,7 @@
=== "Others"
- If you are using a different LLM provider and using Langchain to interact with it, you can wrap your LLM in `LangchainLLMWrapper` so that it can be used with ragas.
+ If you are using a different LLM provider and using LangChain to interact with it, you can wrap your LLM in `LangchainLLMWrapper` so that it can be used with ragas.
```python
from ragas.llms import LangchainLLMWrapper
diff --git a/docs/extra/components/choose_generator_llm.md b/docs/extra/components/choose_generator_llm.md
index e20bcabaa..ac63a1aef 100644
--- a/docs/extra/components/choose_generator_llm.md
+++ b/docs/extra/components/choose_generator_llm.md
@@ -112,11 +112,11 @@
```python
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
-
+
# Choose the appropriate import based on your API:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_vertexai import ChatVertexAI
-
+
# Initialize with Google AI Studio
generator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
model=config["model"],
@@ -124,7 +124,7 @@
max_tokens=config["max_tokens"],
top_p=config["top_p"],
))
-
+
# Or initialize with Vertex AI
generator_llm = LangchainLLMWrapper(ChatVertexAI(
model=config["model"],
@@ -141,12 +141,12 @@
```python
from langchain_google_genai import HarmCategory, HarmBlockThreshold
-
+
safety_settings = {
HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
# Add other safety settings as needed
}
-
+
# Apply to your LLM initialization
generator_llm = LangchainLLMWrapper(ChatGoogleGenerativeAI(
model=config["model"],
@@ -160,7 +160,7 @@
```python
# Google AI Studio Embeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings
-
+
generator_embeddings = LangchainEmbeddingsWrapper(GoogleGenerativeAIEmbeddings(
model="models/embedding-001", # Google's text embedding model
task_type="retrieval_document" # Optional: specify the task type
@@ -170,7 +170,7 @@
```python
# Vertex AI Embeddings
from langchain_google_vertexai import VertexAIEmbeddings
-
+
generator_embeddings = LangchainEmbeddingsWrapper(VertexAIEmbeddings(
model_name="textembedding-gecko@001", # or other available model
project=config["project"], # Your GCP project ID
@@ -235,7 +235,7 @@
If you want more information on how to use other Azure services, please refer to the [langchain-azure](https://python.langchain.com/docs/integrations/chat/azure_chat_openai/) documentation.
=== "Others"
- If you are using a different LLM provider and using Langchain to interact with it, you can wrap your LLM in `LangchainLLMWrapper` so that it can be used with ragas.
+ If you are using a different LLM provider and using LangChain to interact with it, you can wrap your LLM in `LangchainLLMWrapper` so that it can be used with ragas.
```python
from ragas.llms import LangchainLLMWrapper
diff --git a/docs/getstarted/evals.md b/docs/getstarted/evals.md
index ed13493c9..870d16b29 100644
--- a/docs/getstarted/evals.md
+++ b/docs/getstarted/evals.md
@@ -42,7 +42,7 @@ As you may observe, this approach has two key limitations:
- **Time-Consuming Preparation:** Evaluating the application requires preparing the expected output (`reference`) for each input, which can be both time-consuming and challenging.
-- **Inaccurate Scoring:** Even though the `response` and `reference` are similar, the output score was low. This is a known limitation of non-LLM metrics like `BleuScore`.
+- **Inaccurate Scoring:** Even though the `response` and `reference` are similar, the output score was low. This is a known limitation of non-LLM metrics like `BleuScore`.
!!! info
@@ -88,7 +88,7 @@ Output
Success! Here 1 means pass and 0 means fail
!!! info
- There are many other types of metrics that are available in ragas (with and without `reference`), and you may also create your own metrics if none of those fits your case. To explore this more checkout [more on metrics](../concepts/metrics/index.md).
+ There are many other types of metrics that are available in ragas (with and without `reference`), and you may also create your own metrics if none of those fits your case. To explore this further, check out [more on metrics](../concepts/metrics/index.md).
### Evaluating on a Dataset
@@ -96,8 +96,8 @@ In the examples above, we used only a single sample to evaluate our application.
Here, we’ll load a dataset from Hugging Face Hub, but you can load data from any source, such as production logs or other datasets. Just ensure that each sample includes all the required attributes for the chosen metric.
-In our case, the required attributes are:
-- **`user_input`**: The input provided to the application (here the input text report).
+In our case, the required attributes are:
+- **`user_input`**: The input provided to the application (here the input text report).
- **`response`**: The output generated by the application (here the generated summary).
For example
@@ -149,7 +149,7 @@ Output
```
This score shows that out of all the samples in our test data, only 84% of summaries pass the given evaluation criteria. Now, **It
-s important to see why is this the case**.
+is important to see why this is the case**.
Export the sample level scores to pandas dataframe
@@ -171,7 +171,7 @@ Viewing the sample-level results in a CSV file, as shown above, is fine for quic
## Analyzing Results
-For this you may sign up and setup [app.ragas.io](https://app.ragas.io) easily. If not, you may use any alternative tools available to you.
+For this, you may sign up and set up [app.ragas.io](https://app.ragas.io) easily. If not, you may use any alternative tools available to you.
In order to use the [app.ragas.io](http://app.ragas.io) dashboard, you need to have an account on [app.ragas.io](https://app.ragas.io/). If you don't have one, you can sign up for one [here](https://app.ragas.io/login). You will also need to generate a [Ragas APP token](https://app.ragas.io/dashboard/settings/app-tokens).
@@ -194,7 +194,7 @@ results.upload()
## Aligning Metrics
-In the example above, we can see that the LLM-based metric mistakenly marks some summary as accurate, even though it missed critical details like growth numbers and market domain. Such mistakes can occur when the metric does not align with your specific evaluation preferences. For example,
+In the example above, we can see that the LLM-based metric mistakenly marks some summaries as accurate, even though they missed critical details like growth numbers and market domain. Such mistakes can occur when the metric does not align with your specific evaluation preferences. For example,
diff --git a/docs/getstarted/rag_testset_generation.md b/docs/getstarted/rag_testset_generation.md
index 101f2892d..75ac587d7 100644
--- a/docs/getstarted/rag_testset_generation.md
+++ b/docs/getstarted/rag_testset_generation.md
@@ -3,7 +3,7 @@
This simple guide will help you generate a testset for evaluating your RAG pipeline using your own documents.
## Quickstart
-Let's walk through an quick example of generating a testset for a RAG pipeline. Following that will will explore the main components of the testset generation pipeline.
+Let's walk through a quick example of generating a testset for a RAG pipeline. Following that we will explore the main components of the testset generation pipeline.
### Load Sample Documents
@@ -49,7 +49,7 @@ dataset = generator.generate_with_langchain_docs(docs, testset_size=10)
### Analyzing the testset
-Once you have generated a testset, you would want to view it and select the queries you see fit to include in your final testset. You can export the testset to a pandas dataframe and do various analysis on it.
+Once you have generated a testset, you would want to view it and select the queries you see fit to include in your final testset. You can export the testset to a pandas DataFrame and do various analysis on it.
```python
dataset.to_pandas()
@@ -117,7 +117,7 @@ Output
KnowledgeGraph(nodes: 10, relationships: 0)
```
-Now we will enrich the knowledge graph with additional information using [Transformations][ragas.testset.transforms.base.BaseGraphTransformation]. Here we will use [default_transforms][ragas.testset.transforms.default_transforms] to create a set of default transformations to apply with an LLM and Embedding Model of your choice.
+Now we will enrich the knowledge graph with additional information using [Transformations][ragas.testset.transforms.base.BaseGraphTransformation]. Here we will use [default_transforms][ragas.testset.transforms.default_transforms] to create a set of default transformations to apply with an LLM and Embedding Model of your choice.
But you can mix and match transforms or build your own as needed.
```python
diff --git a/docs/howtos/applications/_cost.md b/docs/howtos/applications/_cost.md
index 40db29688..c74504472 100644
--- a/docs/howtos/applications/_cost.md
+++ b/docs/howtos/applications/_cost.md
@@ -4,7 +4,7 @@ When using LLMs for evaluation and test set generation, cost will be an importan
## Implement `TokenUsageParser`
-By default Ragas does not calculate the usage of tokens for `evaluate()`. This is because langchain's LLMs do not always return information about token usage in a uniform way. So in order to get the usage data, we have to implement a `TokenUsageParser`.
+By default, Ragas does not calculate the usage of tokens for `evaluate()`. This is because langchain's LLMs do not always return information about token usage in a uniform way. So in order to get the usage data, we have to implement a `TokenUsageParser`.
A `TokenUsageParser` is a function that parses the `LLMResult` or `ChatResult` from langchain models' `generate_prompt()` function and outputs the `TokenUsage` which Ragas expects.
@@ -98,7 +98,7 @@ Output
## Token Usage for Testset Generation
-You can use the same parser for testset generation but you need to pass in the `token_usage_parser` to the `generate()` function. For now it only calculates the cost for the generation process and not the cost for the transforms.
+You can use the same parser for testset generation, but you need to pass in the `token_usage_parser` to the `generate()` function. For now, it only calculates the cost for the generation process and not the cost for the transforms.
For an example let's load an existing KnowledgeGraph and generate a testset. If you want to know more about how to generate a testset please check out the [testset generation](../../getstarted/rag_testset_generation.md#a-deeper-look).
diff --git a/docs/howtos/applications/add_to_ci.md b/docs/howtos/applications/add_to_ci.md
index 348dae231..78c9a2161 100644
--- a/docs/howtos/applications/add_to_ci.md
+++ b/docs/howtos/applications/add_to_ci.md
@@ -5,20 +5,20 @@ search:
# Adding to your CI pipeline with Pytest
-You can add Ragas evaluations as part of your Continious Integration pipeline
-to keep track of the qualitative performance of your RAG pipeline. Consider these as
+You can add Ragas evaluations as part of your Continuous Integration pipeline
+to keep track of the qualitative performance of your RAG pipeline. Consider these as
part of your end-to-end test suite which you run before major changes and releases.
-The usage is straight forward but the main things is to set the `in_ci` argument for the
-`evaluate()` function to `True`. This runs Ragas metrics in a special mode that ensures
-it produces more reproducable metrics but will be more costlier.
+The usage is straightforward, but the main thing is to set the `in_ci` argument for the
+`evaluate()` function to `True`. This runs Ragas metrics in a special mode that ensures
+it produces more reproducible metrics but will be costlier.
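As a minimal sketch (the dataset, metric list, and threshold below are placeholders you would define in your own test setup), the only change compared to a normal evaluation run is the extra flag:

```python
from ragas import evaluate

# `dataset` and `metrics` are assumed to be prepared elsewhere in your test suite.
result = evaluate(dataset, metrics=metrics, in_ci=True)

# Fail the CI job if quality regresses below your chosen threshold.
assert result["answer_relevancy"] >= 0.9
```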
-You can easily write a pytest test as follows
+You can easily write a Pytest test as follows
!!! note
This dataset is already populated with outputs from a reference RAG.
- When testing your own system make sure you use outputs from RAG pipeline
- you want to test. For more information on how to build your datasets check
+ When testing your own system, make sure you use outputs from the RAG pipeline
+ you want to test. For more information on how to build your datasets, check the
[Building HF `Dataset` with your own Data](./data_preparation.md) docs.
```python
@@ -58,9 +58,9 @@ def test_amnesty_e2e():
## Using Pytest Markers for Ragas E2E tests
-Because these are long end-to-end test one thing that you can leverage is [Pytest Markers](https://docs.pytest.org/en/latest/example/markers.html) which help you mark your tests with special tags. It is recommended to mark Ragas tests with special tags so you can run them only when needed.
+Because these are long end-to-end tests, one thing that you can leverage is [Pytest Markers](https://docs.pytest.org/en/latest/example/markers.html), which help you mark your tests with special tags. It is recommended to mark Ragas tests with special tags, so you can run them only when needed.
-To add a new `ragas_ci` tag to pytest add the following to your `conftest.py`
+To add a new `ragas_ci` tag to Pytest, add the following to your `conftest.py`
```python
def pytest_configure(config):
"""
diff --git a/docs/howtos/applications/compare_llms.md b/docs/howtos/applications/compare_llms.md
index 459f7b3d4..09794792c 100644
--- a/docs/howtos/applications/compare_llms.md
+++ b/docs/howtos/applications/compare_llms.md
@@ -5,9 +5,9 @@ search:
# Compare LLMs using Ragas Evaluations
-The llm used in the Retrieval Augmented Generation (RAG) system has a major impact in the quality of the generated output. Evaluating the results generated by different llms can give an idea about the right llm to use for a particular use case.
+The LLM used in the Retrieval Augmented Generation (RAG) system has a major impact on the quality of the generated output. Evaluating the results generated by different LLMs can give an idea about the right LLM to use for a particular use case.
-This tutorial notebook provides a step-by-step guide on how to compare and choose the most suitable llm for your own data using the Ragas library.
+This tutorial notebook provides a step-by-step guide on how to compare and choose the most suitable LLM for your own data using the Ragas library.
@@ -16,11 +16,11 @@ This tutorial notebook provides a step-by-step guide on how to compare and choos
-## Create synthetic test data
+## Create synthetic test data
!!! tip
- Ragas can also work with your dataset. Refer to [data preparation](./data_preparation.md) to see how you can use your dataset with ragas.
+ Ragas can also work with your dataset. Refer to [data preparation](./data_preparation.md) to see how you can use your dataset with ragas.
Ragas offers a unique test generation paradigm that enables the creation of evaluation datasets specifically tailored to your retrieval and generation tasks. Unlike traditional QA generators, Ragas can generate a wide variety of challenging test cases from your document corpus.
@@ -146,7 +146,7 @@ metrics = [
## Evaluate Zephyr 7B Alpha LLM
-For the first llm, I will be using HuggingFace [zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha). I am using HuggingFaceInferenceAPI to generate answers using the model. HuggingFaceInferenceAPI is free to use and token can be setup using [HuggingFaceToken](https://huggingface.co/docs/hub/security-tokens).
+For the first LLM, I will be using HuggingFace [zephyr-7b-alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha). I am using HuggingFaceInferenceAPI to generate answers using the model. HuggingFaceInferenceAPI is free to use, and a token can be set up using [HuggingFaceToken](https://huggingface.co/docs/hub/security-tokens).
```python
# Use zephyr model using HFInference API
@@ -194,7 +194,7 @@ result
Based on the evaluation results, it is apparent that the `faithfulness`, `answer_correctness` and `answer_relevancy` metrics of the HuggingFace zephyr-7b-alpha model slightly outperform the falcon-7b-instruct model in my RAG pipeline when applied to my own dataset.
-Refer to the complete colab notebook [here](https://colab.research.google.com/drive/10dNeU56XLOGUJ9gRuBFryyRwoy70rIeS?usp=sharing).
+Refer to the complete Colab notebook [here](https://colab.research.google.com/drive/10dNeU56XLOGUJ9gRuBFryyRwoy70rIeS?usp=sharing).
```python
import numpy as np
@@ -215,7 +215,7 @@ result_falcon_df = result.to_pandas()
analysis(
result_zephyr_df[['faithfulness', 'answer_relevancy', 'answer_correctness']],
result_falcon_df[['faithfulness', 'answer_relevancy', 'answer_correctness']]
-)
+)
```
### Score distribution analysis
diff --git a/docs/howtos/applications/evaluating_multi_turn_conversations.md b/docs/howtos/applications/evaluating_multi_turn_conversations.md
index db1b547cf..b6add292b 100644
--- a/docs/howtos/applications/evaluating_multi_turn_conversations.md
+++ b/docs/howtos/applications/evaluating_multi_turn_conversations.md
@@ -2,26 +2,26 @@
This tutorial is inspired by Hamel’s notes on evaluating multi-turn conversations for LLM-based applications. The goal is to create a simple and actionable evaluation framework using Ragas metrics that clearly defines what makes a conversation successful. By the end of this tutorial, you will be able to perform multi-turn evaluations based on insights gathered from the error analysis of your AI application.
-### Ragas Metrics
+### Ragas Metrics
-Ragas offers **AspectCritic**, a powerful evaluation metric for assessing multi-turn conversations with binary outcomes. It helps determine whether a conversation meets predefined success criteria.
+Ragas offers **AspectCritic**, a powerful evaluation metric for assessing multi-turn conversations with binary outcomes. It helps determine whether a conversation meets predefined success criteria.
-**[AspectCritic](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#aspect-critic)**
+**[AspectCritic](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/general_purpose/#aspect-critic)**
AspectCritic evaluates responses based on predefined aspects written in free-form natural language. It returns a binary output indicating whether the response aligns with the defined aspect.
This metric aligns with Hamel's [suggestion](https://hamel.dev/notes/llm/officehours/evalmultiturn.html#focus-on-binary-decisions) to focus on binary decisions, which eliminate ambiguity and provide a clear, actionable approach to improving conversation quality.
-### Practical Example – Evaluating a Banking Chatbot
+### Practical Example – Evaluating a Banking Chatbot
-When evaluating, focus on metrics that directly align with your users’ needs. Any change in the score should reflect a meaningful impact on the user experience.
+When evaluating, focus on metrics that directly align with your users’ needs. Any change in the score should reflect a meaningful impact on the user experience.
-Let’s consider an example where you are building a chatbot for a bank.
+Let’s consider an example where you are building a chatbot for a bank.
-After conducting [error analysis](https://hamel.dev/notes/llm/officehours/erroranalysis.html#the-data-first-approach), you find that the chatbot occasionally forgets tasks it was asked to complete or performs them only partially. To improve the chatbot’s performance, you need a reliable method to **measure and evaluate** this behavior.
+After conducting [error analysis](https://hamel.dev/notes/llm/officehours/erroranalysis.html#the-data-first-approach), you find that the chatbot occasionally forgets tasks it was asked to complete or performs them only partially. To improve the chatbot’s performance, you need a reliable method to **measure and evaluate** this behavior.
-> **Note:** When defining the scoring criteria, use standardized terminology.
-> - Refer to the user’s message as `human` message.
-> - Refer to the chatbot’s message as `AI` message.
+> **Note:** When defining the scoring criteria, use standardized terminology.
+> - Refer to the user’s message as `human` message.
+> - Refer to the chatbot’s message as `AI` message.
```python
@@ -148,7 +148,7 @@ Evaluating: 100%|██████████| 2/2 [00:00, ?it/s]
-When evaluating with LLM-based metrics, each metric may involve one or more calls to the LLM. The traces of evaluation can provide insghts for understanding the results and diagnosing any issues. You can find more details on this process by visiting [this page](https://docs.ragas.io/en/stable/howtos/applications/_metrics_llm_calls/).
+When evaluating with LLM-based metrics, each metric may involve one or more calls to the LLM. The traces of evaluation can provide insights for understanding the results and diagnosing any issues. You can find more details on this process by visiting [this page](https://docs.ragas.io/en/stable/howtos/applications/_metrics_llm_calls/).
Another pattern identified during error analysis is that your banking chatbot occasionally drifts from discussing basic account services into offering unauthorized investment advice. To maintain user trust and ensure regulatory compliance, you want the system to implement **graceful transitions** when conversations approach these boundaries. You can achieve this by defining a metric like the one below.
@@ -387,21 +387,21 @@ Evaluating: 100%|██████████| 4/4 [00:00, ?it/s]
The above evaluation result highlights that what is considered polite in Mexico may not be perceived as polite in Japan.
-### Checking for Brand Tone
+### Checking for Brand Tone
-In this section, we will explore how to evaluate whether the chatbot’s tone is consistent with the business’s values, target audience, and overall brand identity.
+In this section, we will explore how to evaluate whether the chatbot’s tone is consistent with the business’s values, target audience, and overall brand identity.
**What is a Brand Tone of Voice?**
-A brand’s tone of voice refers to its choice of words when communicating with its audience in written or spoken interactions. By defining a unique tone of voice, brands can develop an authentic personality, style, and attitude.
-[Reference](https://filestage.io/blog/brand-tone-of-voice-examples/)
+A brand’s tone of voice refers to its choice of words when communicating with its audience in written or spoken interactions. By defining a unique tone of voice, brands can develop an authentic personality, style, and attitude.
+[Reference](https://filestage.io/blog/brand-tone-of-voice-examples/)
-For example:
+For example:
-**Google – Informative and Helpful Brand Voice**
-Have you noticed how simple and intuitive everything feels when you use Google products? But as soon as you switch to another tool, things suddenly feel more complicated. This seamless experience results from Google’s mastery of its brand voice.
+**Google – Informative and Helpful Brand Voice**
+Have you noticed how simple and intuitive everything feels when you use Google products? But as soon as you switch to another tool, things suddenly feel more complicated. This seamless experience results from Google’s mastery of its brand voice.
-Google maintains a friendly and approachable tone while keeping user communication clear and concise. Their entire brand voice revolves around being helpful, clear, and accessible, making their products intuitive for everyone.
-[Reference](https://filestage.io/blog/brand-tone-of-voice-examples/)
+Google maintains a friendly and approachable tone while keeping user communication clear and concise. Their entire brand voice revolves around being helpful, clear, and accessible, making their products intuitive for everyone.
+[Reference](https://filestage.io/blog/brand-tone-of-voice-examples/)
You can assess whether your chatbot’s responses align with your brand identity by defining a custom evaluation metric like the one below.
diff --git a/docs/howtos/applications/gemini_benchmarking.md b/docs/howtos/applications/gemini_benchmarking.md
index 9ca6fa16b..c3343cf60 100644
--- a/docs/howtos/applications/gemini_benchmarking.md
+++ b/docs/howtos/applications/gemini_benchmarking.md
@@ -6,7 +6,7 @@ In this tutorial, we'll benchmark Gemini models on the AllenAI's QASPER dataset
QASPER (Question Answering over Scientific Papers) is a dataset consisting of 5,049 questions based on 1,585 NLP research papers. Annotators created these questions from titles and abstracts, and a different set of annotators answered them from the full paper texts.
-Data Collection Process:
+Data Collection Process:
1. Paper Selection: NLP domain papers from arXiv (LaTeX format) were selected from the S2ORC corpus.
2. Question Writing: Annotators wrote realistic, information-seeking questions based only on paper titles and abstracts.
@@ -15,7 +15,7 @@ Data Collection Process:
-Link to the [Dataset](https://huggingface.co/datasets/allenai/qasper) and further details about QASPER can be found [here](https://huggingface.co/datasets/allenai/qasper).
+Link to the [Dataset](https://huggingface.co/datasets/allenai/qasper) and further details about QASPER can be found [here](https://huggingface.co/datasets/allenai/qasper).
## Loading Dataset
@@ -43,7 +43,7 @@ Dataset({
## Processing Dataset
-Since our goal is to benchmark the model’s performance on academic question-answering tasks, we need answers generated by LLMs based on the entire text of each research paper. We extract the full text from the dataset’s "full_text" column and format it into markdown, clearly organizing it into sections and paragraphs for improved readability and context.
+Since our goal is to benchmark the model’s performance on academic question-answering tasks, we need answers generated by LLMs based on the entire text of each research paper. We extract the full text from the dataset’s "full_text" column and format it into Markdown, clearly organizing it into sections and paragraphs for improved readability and context.
To create question-answer pairs for evaluation, we use the dataset’s "qas" column. This column provides questions paired with answers in one of three formats: extractive spans, yes/no responses, or free-form answers. We then combine these into a single "golden answer" column, which serves as the ground truth for assessing model performance.
@@ -284,7 +284,7 @@ qa_prompt = (
)
```
-### Gemini 2.0 Falsh
+### Gemini 2.0 Flash
??? note "Code for AsyncExecutor"
@@ -457,7 +457,7 @@ processed_dataset["gemini_2_flash_responses"] = executor.results()
LLM Processing: 100%|██████████| 30/30 [00:04<00:00, 7.20it/s]
```
-### Gemini 1.5 Falsh
+### Gemini 1.5 Flash
```python
@@ -602,13 +602,13 @@ processed_dataset.head()
## Defining Metrics For Evaluation
-We are benchmarking a question-answering task and we want to ensure that each question is answered properly and accurately. To achieve this, we use the following metrics from Ragas you find the complete list of metrics available in Ragas [here](../../concepts/metrics/available_metrics/index.md)
+We are benchmarking a question-answering task, and we want to ensure that each question is answered properly and accurately. To achieve this, we use the following metrics from Ragas; you can find the complete list of metrics available in Ragas [here](../../concepts/metrics/available_metrics/index.md).
- Answer Accuracy: Measures how closely a response matches the reference answer.
- Answer Correctness: Assesses the alignment between the generated answer and the reference answer.
- Factual Correctness: Checks if all statements in a response are supported by the reference answer.
-For each question, we know whether it can be answered from the provided context, and we want to see if the model can correctly identify it or not. For this purpose, we define a custom binary metric using AspectCritique.
+For each question, we know whether it can be answered from the provided context, and we want to see if the model can correctly identify it or not. For this purpose, we define a custom binary metric using AspectCritic.
```python
@@ -642,7 +642,7 @@ metrics = [
We format the processed data into a Ragas EvaluationDataset, then apply the metrics to evaluate model performance, more information on it can be found [here](../../concepts/components/eval_dataset.md). We’ll construct the EvaluationDataset using the questions and the golden answer responses generated by the Gemini models from our processed dataset.
-### Gemini 2.0 Falsh
+### Gemini 2.0 Flash
We'll create EvaluationDataset for the Gemini 2.0 Flash.
@@ -733,7 +733,7 @@ gemini_2_dataset.to_pandas().head()
-Now, let’s evaluate the responses of Gemini 2.0 Falsh.
+Now, let’s evaluate the responses of Gemini 2.0 Flash.
```python
@@ -829,7 +829,7 @@ Evaluating: 100%|██████████| 120/120 [00:49<00:00, 2.44it/s
-A completely optional step, if you want to upload the evaluation results to your Ragas app, you can run the command below.You can learn more about Ragas app here.
+As a completely optional step, if you want to upload the evaluation results to your Ragas app, you can run the command below. You can learn more about the Ragas app here.
```python
@@ -1025,7 +1025,7 @@ Evaluating: 100%|██████████| 120/120 [01:02<00:00, 1.93it/s
## Comparing the Results
-Now that we have completed our evaluations, let’s compare how both models performed on acadmic question answering.
+Now that we have completed our evaluations, let’s compare how both models performed on academic question answering.
```python
@@ -1057,7 +1057,7 @@ Factual Correctness: 0.23633333333333334
Gemini 2.0 Flash performs slightly better overall.
-Let’s see how well the models performed on classifying if a given question can be answered with the provided text.
+Let’s see how well the models performed on classifying if a given question can be answered with the provided text.
For this, we’ll use the result from the “unanswerable” metric and compare it with the original ground truth from the “unanswerable” column in our pre-processed dataset.
@@ -1086,7 +1086,7 @@ def print_metrics(actuals, preds, model_name="Model", zero_division_value=0):
print("F1 Score:", f1_score(actuals, preds, zero_division=zero_division_value))
print("\nClassification Report:")
print(classification_report(actuals, preds, zero_division=zero_division_value))
-
+
gemini_1_5_flash_prediction = gemini_1_5_flash_score["unanswerable"]
gemini_2_flash_prediction = gemini_2_flash_score["unanswerable"]
groundtruth = processed_dataset["unanswerable"].astype(int)
@@ -1111,7 +1111,7 @@ Classification Report:
accuracy 0.93 30
macro avg 0.75 0.96 0.81 30
weighted avg 0.97 0.93 0.94 30
-```
+```
```python
print_metrics(groundtruth, gemini_1_5_flash_prediction, model_name="Gemini 1.5 Flash")
@@ -1123,7 +1123,7 @@ Accuracy: 0.9
Precision: 0.3333333333333333
Recall: 0.5
F1 Score: 0.4
-
+
Classification Report:
precision recall f1-score support
@@ -1133,14 +1133,14 @@ Classification Report:
accuracy 0.90 30
macro avg 0.65 0.71 0.67 30
weighted avg 0.92 0.90 0.91 30
-```
+```
Gemini 2.0 Flash also outperforms Gemini 1.5 Flash in identifying unanswerable questions.
## What's Next
-You can benchmark your models on any dataset using Ragas metrics as long as the dataset is formatted to Ragas EvaluationDatase. Try benchmarking your models on a variety of established benchmarking datasets.
+You can benchmark your models on any dataset using Ragas metrics, as long as the dataset is formatted as a Ragas EvaluationDataset. Try benchmarking your models on a variety of established benchmarking datasets.
- [PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA)
- [MultiHopRAG](https://huggingface.co/datasets/yixuantt/MultiHopRAG)
diff --git a/docs/howtos/applications/vertexai_alignment.md b/docs/howtos/applications/vertexai_alignment.md
index 40f4a50c1..c424c50e3 100644
--- a/docs/howtos/applications/vertexai_alignment.md
+++ b/docs/howtos/applications/vertexai_alignment.md
@@ -1,6 +1,6 @@
# Aligning LLM Evaluators with Human Judgment
-This tutorial is part of a three-part series on how to use Vertex AI models with Ragas. It is recommeded that you have gone through [Getting Started: Ragas with Vertex AI](./vertexai_x_ragas.md), even if you have not you can eaisly follow this. You can navigate to the Model Comparison tutorial using the [link](./vertexai_model_comparision.md).
+This tutorial is part of a three-part series on how to use Vertex AI models with Ragas. It is recommended that you have gone through [Getting Started: Ragas with Vertex AI](./vertexai_x_ragas.md), even if you have not, you can easily follow this. You can navigate to the Model Comparison tutorial using the [link](./vertexai_model_comparision.md).
## Overview
@@ -140,7 +140,7 @@ def alignment_score(human_score: List[float], llm_score: List[float]) -> float:
## Prepare your dataset
-The `process_hhh_dataset` function prepares data from the [HHH dataset](https://paperswithcode.com/dataset/hhh?utm_source=chatgpt.com) for use in training and aligning of the LLM evaluator. Alternate 0 and 1 scores (1 for helpful, 0 for non-helpful) are assigned to each example, indicating which response is preferred.
+The `process_hhh_dataset` function prepares data from the [HHH dataset](https://paperswithcode.com/dataset/hhh?utm_source=chatgpt.com) for use in training and aligning the LLM evaluator. Alternating 0 and 1 scores (1 for helpful, 0 for non-helpful) are assigned to each example, indicating which response is preferred.
```python
@@ -169,7 +169,7 @@ def process_hhh_dataset(split: str = "helpful", total_count: int = 50):
score = 0
label_index = labels.index(target_label)
-
+
response = choices[label_index]
data.append({
diff --git a/docs/howtos/applications/vertexai_model_comparision.md b/docs/howtos/applications/vertexai_model_comparision.md
index f50249af4..4af195a83 100644
--- a/docs/howtos/applications/vertexai_model_comparision.md
+++ b/docs/howtos/applications/vertexai_model_comparision.md
@@ -1,10 +1,10 @@
# Compare models provided by VertexAI on RAG-based Q&A task using Ragas metrics
-This tutorial is part of a three-part series on how to use Vertex AI models with Ragas. It is recommeded that you have gone through [Getting Started: Ragas with Vertex AI](./vertexai_x_ragas.md), even if you have not followed it you’ll be golden. You can check to the Align LLM Metrics tutorial by [clicking](./vertexai_alignment.md).
+This tutorial is part of a three-part series on how to use Vertex AI models with Ragas. It is recommended that you have gone through [Getting Started: Ragas with Vertex AI](./vertexai_x_ragas.md); even if you have not followed it, you’ll be golden. You can navigate to the Align LLM Metrics tutorial using the [link](./vertexai_alignment.md).
## Overview
-In this tutorial, you will learn how to use the Ragas to score and evaluate different LLM models for a **Question Answering** (QA) task. Then visualise and compare the evaluation results to select a generative model.
+In this tutorial, you will learn how to use Ragas to score and evaluate different LLM models for a **Question Answering** (QA) task, and then visualise and compare the evaluation results to select a generative model.
## Getting Started
@@ -427,7 +427,7 @@ Evaluating: 100%|██████████| 12/12 [00:00, ?it/s]
Evaluating: 100%|██████████| 12/12 [00:00, ?it/s]
```
-Wrap the results into google’s EvalResult structure:
+Wrap the results into Google’s EvalResult structure:
```python
diff --git a/docs/howtos/applications/vertexai_x_ragas.md b/docs/howtos/applications/vertexai_x_ragas.md
index fa8767fb1..2d3199e94 100644
--- a/docs/howtos/applications/vertexai_x_ragas.md
+++ b/docs/howtos/applications/vertexai_x_ragas.md
@@ -82,7 +82,7 @@ vertexai.init(project=PROJECT_ID, location=LOCATION)
In the sections below, you will learn how to leverage the various types of metrics available in Ragas:
-- **Custom Metrics:** Define and integrate your own metrics best tailored for you application evaluations.
+- **Custom Metrics:** Define and integrate your own metrics best tailored for your application evaluations.
- **Model-based Metrics:** Evaluations that analyse model outputs against specific criteria using LLM calls, either with or without references.
- **Computation-based Metrics:** Quantitative measures based on mathematical formulas that do not require LLM calls.
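To make the last category concrete, here is a small sketch using `BleuScore`, a computation-based metric that needs no LLM calls; the sample values below are made up for illustration.

```python
from ragas import SingleTurnSample
from ragas.metrics import BleuScore

# A single evaluation sample with a model response and a reference answer.
sample = SingleTurnSample(
    user_input="Where is the Eiffel Tower located?",
    response="The Eiffel Tower is located in Paris.",
    reference="The Eiffel Tower is in Paris, France.",
)

metric = BleuScore()
print(metric.single_turn_score(sample))  # purely computational, no LLM call involved
```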
diff --git a/docs/howtos/customizations/_caching.md b/docs/howtos/customizations/_caching.md
index ec71c6390..fefc95bb2 100644
--- a/docs/howtos/customizations/_caching.md
+++ b/docs/howtos/customizations/_caching.md
@@ -7,7 +7,7 @@ You can use the [DiskCacheBackend][ragas.cache.DiskCacheBackend] which uses a lo
## Using DefaultCacher
-Let's see how you can use the [DiskCacheBackend][ragas.cache.DiskCacheBackend] LLM and Embedding models.
+Let's see how you can use the [DiskCacheBackend][ragas.cache.DiskCacheBackend] with LLM and Embedding models.
diff --git a/docs/howtos/customizations/_run_config.md b/docs/howtos/customizations/_run_config.md
index 6c57bb8c3..39ffdcf41 100644
--- a/docs/howtos/customizations/_run_config.md
+++ b/docs/howtos/customizations/_run_config.md
@@ -9,7 +9,7 @@ How to configure the `RunConfig` in
## Rate Limits
-Ragas leverages parallelism with Async in python but the `RunConfig` has a field called `max_workers` which control the number of concurent requests allowed together. You adjust this to get the maximum concurency your provider allows
+Ragas leverages parallelism with async in Python, but the `RunConfig` has a field called `max_workers` which controls the number of concurrent requests allowed at a time. You can adjust this to get the maximum concurrency your provider allows.
```python
diff --git a/docs/howtos/customizations/customize_models.md b/docs/howtos/customizations/customize_models.md
index 2ee6b7dfe..848f48786 100644
--- a/docs/howtos/customizations/customize_models.md
+++ b/docs/howtos/customizations/customize_models.md
@@ -1,13 +1,13 @@
## Customize Models
-Ragas may use a LLM and or Embedding for evaluation and synthetic data generation. Both of these models can be customised according to you availabiity.
+Ragas may use an LLM and/or an Embedding model for evaluation and synthetic data generation. Both of these models can be customised according to your availability.
!!! note
- Ragas supports all the [LLMs](https://python.langchain.com/docs/integrations/chat/) and [Embeddings](https://python.langchain.com/docs/integrations/text_embedding/) available in langchain
+ Ragas supports all the [LLMs](https://python.langchain.com/docs/integrations/chat/) and [Embeddings](https://python.langchain.com/docs/integrations/text_embedding/) available in LangChain
-- `BaseRagasLLM` and `BaseRagasEmbeddings` are the base classes Ragas uses internally for LLMs and Embeddings. Any custom LLM or Embeddings should be a subclass of these base classes.
+- `BaseRagasLLM` and `BaseRagasEmbeddings` are the base classes Ragas uses internally for LLMs and Embeddings. Any custom LLM or Embeddings should be a subclass of these base classes.
-- If you are using Langchain, you can pass the Langchain LLM and Embeddings directly and Ragas will wrap it with `LangchainLLMWrapper` or `LangchainEmbeddingsWrapper` as needed.
+- If you are using LangChain, you can pass the LangChain LLM and Embeddings directly and Ragas will wrap it with `LangchainLLMWrapper` or `LangchainEmbeddingsWrapper` as needed.
## Examples
@@ -57,7 +57,7 @@ azure_embeddings = AzureOpenAIEmbeddings(
azure_llm = LangchainLLMWrapper(azure_llm)
azure_embeddings = LangchainEmbeddingsWrapper(azure_embeddings)
```
-Yay! Now are you ready to use ragas with Azure OpenAI endpoints
+Yay! Now you are ready to use ragas with Azure OpenAI endpoints
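As a quick sanity check, the wrapped models can be passed straight into `evaluate()`. This is a sketch that assumes you already have an `EvaluationDataset` called `eval_dataset` and a metric list of your own.

```python
from ragas import evaluate
from ragas.metrics import Faithfulness

# `eval_dataset` is assumed to be an EvaluationDataset you have already built.
result = evaluate(
    dataset=eval_dataset,
    metrics=[Faithfulness()],
    llm=azure_llm,                # wrapped Azure chat model from above
    embeddings=azure_embeddings,  # wrapped Azure embeddings from above
)
print(result)
```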
### Google Vertex
@@ -81,7 +81,7 @@ config = {
# authenticate to GCP
creds, _ = google.auth.default(quota_project_id=config["project_id"])
-# create Langchain LLM and Embeddings
+# create LangChain LLM and Embeddings
vertextai_llm = ChatVertexAI(
credentials=creds,
model_name=config["chat_model_id"],
@@ -95,7 +95,7 @@ def gemini_is_finished_parser(response: LLMResult) -> bool:
is_finished_list = []
for g in response.flatten():
resp = g.generations[0][0]
-
+
# Check generation_info first
if resp.generation_info is not None:
finish_reason = resp.generation_info.get("finish_reason")
@@ -104,7 +104,7 @@ def gemini_is_finished_parser(response: LLMResult) -> bool:
finish_reason in ["STOP", "MAX_TOKENS"]
)
continue
-
+
# Check response_metadata as fallback
if isinstance(resp, ChatGeneration) and resp.message is not None:
metadata = resp.message.response_metadata
@@ -114,20 +114,20 @@ def gemini_is_finished_parser(response: LLMResult) -> bool:
)
elif metadata.get("stop_reason"):
is_finished_list.append(
- metadata["stop_reason"] in ["STOP", "MAX_TOKENS"]
+ metadata["stop_reason"] in ["STOP", "MAX_TOKENS"]
)
-
+
# If no finish reason found, default to True
if not is_finished_list:
is_finished_list.append(True)
-
+
return all(is_finished_list)
vertextai_llm = LangchainLLMWrapper(vertextai_llm, is_finished_parser=gemini_is_finished_parser)
vertextai_embeddings = LangchainEmbeddingsWrapper(vertextai_embeddings)
```
-Yay! Now are you ready to use ragas with Google VertexAI endpoints
+Yay! Now you are ready to use ragas with Google VertexAI endpoints
### AWS Bedrock
@@ -167,4 +167,4 @@ bedrock_embeddings = BedrockEmbeddings(
bedrock_llm = LangchainLLMWrapper(bedrock_llm)
bedrock_embeddings = LangchainEmbeddingsWrapper(bedrock_embeddings)
```
-Yay! Now are you ready to use ragas with AWS Bedrock endpoints
+Yay! Now you are ready to use ragas with AWS Bedrock endpoints
diff --git a/docs/howtos/customizations/index.md b/docs/howtos/customizations/index.md
index 65f1d18c1..2eebbf7aa 100644
--- a/docs/howtos/customizations/index.md
+++ b/docs/howtos/customizations/index.md
@@ -2,7 +2,7 @@
How to customize various aspects of Ragas to suit your needs.
-## General
+## General
- [Customize models](customize_models.md)
- [Customize timeouts, retries and others](./_run_config.md)
@@ -16,7 +16,7 @@ How to customize various aspects of Ragas to suit your needs.
## Testset Generation
-- [Generate test data from non-english corpus](testgenerator/_language_adaptation.md)
+- [Generate test data from non-English corpus](testgenerator/_language_adaptation.md)
- [Configure or automatically generate Personas](testgenerator/_persona_generator.md)
- [Customize single-hop queries for RAG evaluation](testgenerator/_testgen-custom-single-hop.md)
- [Create custom multi-hop queries for RAG evaluation](testgenerator/_testgen-customisation.md)
diff --git a/docs/howtos/customizations/metrics/_cost.md b/docs/howtos/customizations/metrics/_cost.md
index 3cd5501a5..8a0f93190 100644
--- a/docs/howtos/customizations/metrics/_cost.md
+++ b/docs/howtos/customizations/metrics/_cost.md
@@ -4,9 +4,9 @@ When using LLMs for evaluation and test set generation, cost will be an importan
## Understanding `TokenUsageParser`
-By default Ragas does not calculate the usage of tokens for `evaluate()`. This is because langchain's LLMs do not always return information about token usage in a uniform way. So in order to get the usage data, we have to implement a `TokenUsageParser`.
+By default, Ragas does not calculate the usage of tokens for `evaluate()`. This is because LangChain's LLMs do not always return information about token usage in a uniform way. So in order to get the usage data, we have to implement a `TokenUsageParser`.
-A `TokenUsageParser` is function that parses the `LLMResult` or `ChatResult` from langchain models `generate_prompt()` function and outputs `TokenUsage` which Ragas expects.
+A `TokenUsageParser` is a function that parses the `LLMResult` or `ChatResult` from LangChain models' `generate_prompt()` function and outputs the `TokenUsage` which Ragas expects.
As an example, here is one that will parse OpenAI outputs, using a parser we have defined.
diff --git a/docs/howtos/customizations/metrics/_metrics_language_adaptation.md b/docs/howtos/customizations/metrics/_metrics_language_adaptation.md
index bd47b38c4..151c20c9e 100644
--- a/docs/howtos/customizations/metrics/_metrics_language_adaptation.md
+++ b/docs/howtos/customizations/metrics/_metrics_language_adaptation.md
@@ -1,6 +1,6 @@
# Adapting metrics to target language
-While using ragas to evaluate LLM application workflows, you may have applications to be evaluated that are in languages other than english. In this case, it is best to adapt your LLM powered evaluation metrics to the target language. One obivous way to do this is to manually change the instruction and demonstration, but this can be time consuming. Ragas here offers automatic language adaptation where you can automatically adapt any metrics to target language by using LLM itself. This notebook demonstrates this with simple example
+While using ragas to evaluate LLM application workflows, you may have applications to be evaluated that are in languages other than English. In this case, it is best to adapt your LLM-powered evaluation metrics to the target language. One obvious way to do this is to manually change the instruction and demonstration, but this can be time-consuming. Ragas offers automatic language adaptation, where you can adapt any metric to a target language by using the LLM itself. This notebook demonstrates this with a simple example.
For the sake of this example, let's choose a metric and inspect its default prompts.
@@ -26,7 +26,7 @@ scorer.get_prompts()
-As you can see, the instruction and demonstration are both in english. Setting up LLM to be used for this conversion
+As you can see, the instruction and demonstration are both in English. Setting up LLM to be used for this conversion
```python
@@ -36,7 +36,7 @@ llm = llm_factory()
```
Now let's adapt it to 'hindi' as the target language using `adapt` method.
-Language adaptation in Ragas works by translating few shot examples given along with the prompts to the target language. Instructions remains in english.
+Language adaptation in Ragas works by translating the few-shot examples given along with the prompts into the target language. Instructions remain in English.
```python
diff --git a/docs/howtos/customizations/metrics/_modifying-prompts-metrics.md b/docs/howtos/customizations/metrics/_modifying-prompts-metrics.md
index a8632bc79..f4fcd069d 100644
--- a/docs/howtos/customizations/metrics/_modifying-prompts-metrics.md
+++ b/docs/howtos/customizations/metrics/_modifying-prompts-metrics.md
@@ -1,12 +1,12 @@
# Modifying prompts in metrics
-Every metrics in ragas that uses LLM also uses one or more prompts to come up with intermediate results that is used for formulating scores. Prompts can be treated like hyperparameters when using LLM based metrics. An optimised prompt that suits your domain and use-case can increase the accuracy of your LLM based metrics by 10-20%. An optimal prompt is also depended on the LLM one is using, so as users you might want to tune prompts that powers each metric.
+Every metric in ragas that uses an LLM also uses one or more prompts to come up with intermediate results that are used for formulating scores. Prompts can be treated like hyperparameters when using LLM-based metrics. An optimised prompt that suits your domain and use-case can increase the accuracy of your LLM-based metrics by 10-20%. An optimal prompt also depends on the LLM one is using, so as a user you might want to tune the prompts that power each metric.
-Each prompt in Ragas is written using [Prompt Object][ragas.prompts.PydanticPrompt]lease make sure you have an understanding of it before going further.
+Each prompt in Ragas is written using the [Prompt Object][ragas.prompts.PydanticPrompt]. Please make sure you have an understanding of it before going further.
### Understand the prompts of your Metric
-Since Ragas treats prompts like hyperparameters in metrics, we have a unified interface of `get_prompts` to access prompts used underneath any metrics.
+Since Ragas treats prompts like hyperparameters in metrics, we have a unified interface of `get_prompts` to access prompts used underneath any metrics.
```python
@@ -33,7 +33,7 @@ Your task is to judge the faithfulness of a series of statements based on a give
```
### Modifying instruction in default prompt
-It is highly likely that one might want to modify the prompt to suit ones needs. Ragas provides `set_prompts` methods to allow you to do so. Let's change the one of the prompts used in `FactualCorrectness` metrics
+It is highly likely that one might want to modify the prompt to suit one's needs. Ragas provides the `set_prompts` method to allow you to do so. Let's change one of the prompts used in the `FactualCorrectness` metric.
```python
diff --git a/docs/howtos/customizations/metrics/_write_your_own_metric.md b/docs/howtos/customizations/metrics/_write_your_own_metric.md
index 83826fffc..81d8c85a3 100644
--- a/docs/howtos/customizations/metrics/_write_your_own_metric.md
+++ b/docs/howtos/customizations/metrics/_write_your_own_metric.md
@@ -1,8 +1,8 @@
-While Ragas has [a number of built-in metrics](./../../../concepts/metrics/available_metrics/index.md), you may find yourself needing to create a custom metric for your use case. This guide will help you do just that.
+While Ragas has [a number of built-in metrics](./../../../concepts/metrics/available_metrics/index.md), you may find yourself needing to create a custom metric for your use case. This guide will help you do just that.
For the sake of this tutorial, let's assume we want to build a custom metric that measures the hallucinations in a LLM application. While we do have a built-in metric called [Faithfulness][ragas.metrics.Faithfulness] which is similar but not exactly the same. `Faithfulness` measures the factual consistency of the generated answer against the given context while `Hallucinations` measures the presence of hallucinations in the generated answer.
-before we start, lets load the dataset and define the llm
+Before we start, let's load the dataset and define the LLM:
```python
@@ -39,7 +39,7 @@ evaluator_llm = llm_factory("gpt-4o")
## Aspect Critic - Simple Criteria Scoring
-[Aspect Critic](./../../../concepts/metrics/available_metrics/aspect_critic.md) that outputs a binary score for `definition` you provide. A simple pass/fail metric can be bring clarity and focus to what you are trying to measure and is a better alocation of effort than building a more complex metric from scratch, especially when starting out.
+[Aspect Critic](./../../../concepts/metrics/available_metrics/aspect_critic.md) outputs a binary score based on the `definition` you provide. A simple pass/fail metric can bring clarity and focus to what you are trying to measure and is a better allocation of effort than building a more complex metric from scratch, especially when starting out.
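+A minimal sketch of such a pass/fail critic for our hallucination check; it assumes the `evaluator_llm` defined earlier, and the sample values are illustrative:
+
+```python
+from ragas import SingleTurnSample
+from ragas.metrics import AspectCritic
+
+# a binary critic: 1 if the definition is satisfied, 0 otherwise
+hallucination_critic = AspectCritic(
+    name="hallucination_check",
+    definition="Does the response contain information not supported by the retrieved context?",
+    llm=evaluator_llm,
+)
+
+sample = SingleTurnSample(
+    user_input="Where is the Eiffel Tower located?",
+    response="The Eiffel Tower is in Paris, right next to the Colosseum.",
+    retrieved_contexts=["The Eiffel Tower is located in Paris, France."],
+)
+print(await hallucination_critic.single_turn_ascore(sample))
+```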
Check out these resources to learn more about the effectiveness of having a simple pass/fail metric:
@@ -84,7 +84,7 @@ rubric = {
}
```
-Now lets init the metric with the rubric and evaluator llm and evaluate the dataset.
+Now let's initialize the metric with the rubric and evaluator LLM and evaluate the dataset.
```python
@@ -115,9 +115,9 @@ If your use case is not covered by those two, you can build a custom metric by s
4. Do I need to use both LLM and Embeddings to evaluate my metric? If yes, subclass both the [MetricWithLLM][ragas.metrics.base.MetricWithLLM] and [MetricWithEmbeddings][ragas.metrics.base.MetricWithEmbeddings] classes.
-For our example, we need to to use LLMs to evaluate our metric so we will subclass the [MetricWithLLM][ragas.metrics.base.MetricWithLLM] class and we are working for only single turn interactions for now so we will subclass the [SingleTurnMetric][ragas.metrics.base.SingleTurnMetric] class.
+For our example, we need to use an LLM to evaluate our metric, so we will subclass the [MetricWithLLM][ragas.metrics.base.MetricWithLLM] class, and since we are working with only single-turn interactions for now, we will also subclass the [SingleTurnMetric][ragas.metrics.base.SingleTurnMetric] class.
-As for the implementation, we will use the [Faithfulness][ragas.metrics.Faithfulness] metric to evaluate our metric to measure the hallucinations with the formula
+As for the implementation, we will use the [Faithfulness][ragas.metrics.Faithfulness] metric to measure hallucinations with the formula
$$
\text{Hallucinations} = 1 - \text{Faithfulness}
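+One way the subclassing described above might look in code — a hedged sketch rather than the exact implementation used in this guide (the required-columns field and defaults are assumptions; check the Ragas base classes):
+
+```python
+from dataclasses import dataclass, field
+
+from ragas import SingleTurnSample
+from ragas.metrics import Faithfulness
+from ragas.metrics.base import MetricType, MetricWithLLM, SingleTurnMetric
+
+
+@dataclass
+class HallucinationsMetric(MetricWithLLM, SingleTurnMetric):
+    name: str = "hallucinations_metric"
+    # columns the metric expects in each sample (assumed naming)
+    _required_columns: dict = field(
+        default_factory=lambda: {
+            MetricType.SINGLE_TURN: {"user_input", "response", "retrieved_contexts"}
+        }
+    )
+
+    async def _single_turn_ascore(self, sample: SingleTurnSample, callbacks) -> float:
+        # reuse the built-in Faithfulness metric and invert its score
+        faithfulness = Faithfulness(llm=self.llm)
+        faithfulness_score = await faithfulness.single_turn_ascore(sample)
+        return 1 - faithfulness_score
+```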
diff --git a/docs/howtos/customizations/metrics/_write_your_own_metric_advanced.md b/docs/howtos/customizations/metrics/_write_your_own_metric_advanced.md
index af4ca8c17..bd3875f90 100644
--- a/docs/howtos/customizations/metrics/_write_your_own_metric_advanced.md
+++ b/docs/howtos/customizations/metrics/_write_your_own_metric_advanced.md
@@ -2,7 +2,7 @@ While evaluating your LLM application with Ragas metrics, you may find yourself
It assumes that you are already familiar with the concepts of [Metrics](/concepts/metrics/overview/index.md) and [Prompt Objects](/concepts/components/prompt.md) in Ragas. If not, please review those topics before proceeding.
-For the sake of this tutorial, let's build a custom metric that scores the refusal rate in applications.
+For the sake of this tutorial, let's build a custom metric that scores the refusal rate in applications.
## Formulate your metric
@@ -13,14 +13,14 @@ $$
\text{Refusal rate} = \frac{\text{Total number of refused requests}}{\text{Total number of human requests}}
$$
-**Step 2**: Decide how are you going to derive this information from the sample. Here I am going to use LLM to do it, ie to check whether the request was refused or answered. You may use Non LLM based methods too. Since I am using LLM based method, this would become an LLM based metric.
+**Step 2**: Decide how you are going to derive this information from the sample. Here I am going to use an LLM to do it, i.e. to check whether the request was refused or answered. You may use non-LLM-based methods too. Since I am using an LLM-based method, this becomes an LLM-based metric.
-**Step 3**: Decide if your metric should work in Single Turn and or Multi Turn data.
+**Step 3**: Decide if your metric should work with Single Turn and/or Multi Turn data.
## Import required base classes
-For refusal rate, I have decided it to be a LLM based metric that should work both in single turn and multi turn data samples.
+For refusal rate, I have decided it will be an LLM-based metric that should work with both single-turn and multi-turn data samples.
```python
@@ -69,7 +69,7 @@ class RefusalPrompt(PydanticPrompt[RefusalInput, RefusalOutput]):
]
```
-Now let's implement the new metric. Here, since I want this metric to work with both `SingleTurnSample` and `MultiTurnSample` I am implementing scoring methods for both types.
+Now let's implement the new metric. Here, since I want this metric to work with both `SingleTurnSample` and `MultiTurnSample`, I am implementing scoring methods for both types.
Also since for the sake of simplicity I am implementing a simple method to calculate refusal rate in multi-turn conversations
diff --git a/docs/howtos/customizations/metrics/tracing.md b/docs/howtos/customizations/metrics/tracing.md
index 326977185..16ed3531c 100644
--- a/docs/howtos/customizations/metrics/tracing.md
+++ b/docs/howtos/customizations/metrics/tracing.md
@@ -1,8 +1,8 @@
# Tracing and logging evaluations with Observability tools
-Logging and tracing results from llm are important for any language model-based application. This is a tutorial on how to do tracing with Ragas. Ragas provides `callbacks` functionality which allows you to hook various tracers like Langmsith, wandb, Opik, etc easily. In this notebook, I will be using Langmith for tracing.
+Logging and tracing results from LLMs are important for any language model-based application. This is a tutorial on how to do tracing with Ragas. Ragas provides `callbacks` functionality, which allows you to easily hook in various tracers like LangSmith, wandb, Opik, etc. In this notebook, I will be using LangSmith for tracing.
-To set up Langsmith, we need to set some environment variables that it needs. For more information, you can refer to the [docs](https://docs.smith.langchain.com/)
+To set up LangSmith, we need to set the environment variables it requires. For more information, you can refer to the [docs](https://docs.smith.langchain.com/).
```bash
export LANGCHAIN_TRACING_V2=true
@@ -11,10 +11,10 @@ export LANGCHAIN_API_KEY=
export LANGCHAIN_PROJECT= # if not specified, defaults to "default"
```
-Now we have to import the required tracer from langchain, here we are using `LangChainTracer` but you can similarly use any tracer supported by langchain like [WandbTracer](https://python.langchain.com/docs/integrations/providers/wandb_tracing) or [OpikTracer](https://comet.com/docs/opik/tracing/integrations/ragas?utm_source=ragas&utm_medium=docs&utm_campaign=opik&utm_content=tracing_how_to)
+Now we have to import the required tracer from LangChain. Here we are using `LangChainTracer`, but you can similarly use any tracer supported by LangChain, like [WandbTracer](https://python.langchain.com/docs/integrations/providers/wandb_tracing) or [OpikTracer](https://comet.com/docs/opik/tracing/integrations/ragas?utm_source=ragas&utm_medium=docs&utm_campaign=opik&utm_content=tracing_how_to).
```python
-# langsmith
+# LangSmith
from langchain.callbacks.tracers import LangChainTracer
tracer = LangChainTracer(project_name="callback-experiments")
@@ -37,9 +37,9 @@ evaluate(dataset, metrics=[LLMContextRecall()],callbacks=[tracer])
{'context_precision': 1.0000}
```
- 
- Tracing with Langsmith
+ 
+ Tracing with LangSmith
-
-You can also write your own custom callbacks using langchain’s `BaseCallbackHandler`, refer [here](https://www.notion.so/Docs-logging-and-tracing-6f21cde9b3cb4d499526f48fd615585d?pvs=21) to read more about it.
+
+You can also write your own custom callbacks using LangChain’s `BaseCallbackHandler`; refer [here](https://www.notion.so/Docs-logging-and-tracing-6f21cde9b3cb4d499526f48fd615585d?pvs=21) to read more about it.
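+A minimal sketch of such a handler, passed to `evaluate` through `callbacks` just like the `LangChainTracer` above (the handler name and what it logs are illustrative):
+
+```python
+from langchain_core.callbacks import BaseCallbackHandler
+
+
+class EvalLoggingHandler(BaseCallbackHandler):
+    """Logs every LLM call made by the evaluator metrics."""
+
+    def on_llm_start(self, serialized, prompts, **kwargs):
+        print(f"LLM call started with {len(prompts)} prompt(s)")
+
+    def on_llm_end(self, response, **kwargs):
+        print("LLM call finished")
+
+
+# evaluate(dataset, metrics=[LLMContextRecall()], callbacks=[EvalLoggingHandler()])
+```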
diff --git a/docs/howtos/customizations/metrics/train_your_own_metric.md b/docs/howtos/customizations/metrics/train_your_own_metric.md
index 3f11f4799..d09bd20c1 100644
--- a/docs/howtos/customizations/metrics/train_your_own_metric.md
+++ b/docs/howtos/customizations/metrics/train_your_own_metric.md
@@ -1,8 +1,8 @@
# Train and Align your own Metric
-[Open notebook in colab](https://colab.research.google.com/drive/16RIHEAJ0Ded3RuPoMq5498vBuhvPIruv?usp=sharing)
+[Open notebook in Colab](https://colab.research.google.com/drive/16RIHEAJ0Ded3RuPoMq5498vBuhvPIruv?usp=sharing)
-LLM as judge metric often makes mistakes and lack alignment with human evaluators. This makes them risky to use as their results cannot be trusted fully. Now, you can fix this using ragas. This simple tutorial notebook showcasing how to train and align any LLM as judge metric using ragas. One can use this to train any LLM based metric in ragas.
+LLM-as-judge metrics often make mistakes and lack alignment with human evaluators. This makes them risky to use, as their results cannot be fully trusted. Now, you can fix this using Ragas. This simple tutorial notebook showcases how to train and align any LLM-as-judge metric using Ragas. You can use this approach to train any LLM-based metric in Ragas.
## Import required modules
@@ -32,7 +32,7 @@ embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
## Evaluation
### Load sample evaluation dataset
-Here, we are loading the sample dataset for evaluation. You can replace it with your own dataset.
+Here, we are loading the sample dataset for evaluation. You can replace it with your own dataset.
```python
@@ -54,11 +54,11 @@ reference:
```
-The dataset contains user input, reference and response. Our goal is to evaluate the response based on the reference. The response here is in ELI5 format, which is a simple way of explaining complex topics.
+The dataset contains user input, reference and response. Our goal is to evaluate the response based on the reference. The response here is in ELI5 format, which is a simple way of explaining complex topics.
-In this particular application, we need to align our evaluation metric to evaluate the correctness of the response compared to the reference.
+In this particular application, we need to align our evaluation metric to evaluate the correctness of the response compared to the reference.
-- LLM as judge by default may regard the response as incorrect as it's not written in the same way as the reference, which is not the case here.
+- An LLM as judge may, by default, regard the response as incorrect because it is not written in the same way as the reference, which is not the case here.
- At the same time, we also need it to identify instances where response makes factual errors or misrepresents the reference.
@@ -100,7 +100,7 @@ os.environ['RAGAS_APP_TOKEN'] = 'your_app_token'
Once that's done, you can upload the evaluation results to app.ragas using the following code.
!!! note
Please ensure that you're in ragas 0.2.8 or above to use this feature.
-
+
```python
results.upload()
```
@@ -123,7 +123,7 @@ Here is a sample annotation for the above example. You can [download](../../../_
## Training and Alignment
### Train the metric
-Download the annotated samples from app.ragas.io using `Download annotated json` button.
+Download the annotated samples from app.ragas.io using the `Download annotated json` button.
Instruction and demonstration configurations are required tells ragas how to optimize instruction and few shot demonstrations respectively. You can customize these configurations as per your requirements.
```python
@@ -152,7 +152,7 @@ print(critic.get_prompts()['single_turn_aspect_critic_prompt'].instruction)
```
```
-Evaluate the provided user responses against the reference information for accuracy and completeness.
+Evaluate the provided user responses against the reference information for accuracy and completeness.
Assign a verdict of 1 if the response is accurate and aligns well with the reference, or 0 if it contains inaccuracies or misrepresentations.
```
@@ -187,7 +187,7 @@ Evaluation results uploaded! View at https://app.ragas.io/dashboard/alignment/ev
Go to [app.ragas](https://app.ragas.io/dashboard) dashboard and compare the results before and after training.
-Here in this case, the metric has improved significantly. You can see the difference in the scores. To show the difference, let's compares the scores and changed reasoning for one specific example before and after training.
+In this case, the metric has improved significantly, and you can see the difference in the scores. To show the difference, let's compare the scores and the changed reasoning for one specific example before and after training.
|  |  |
|:-------------------------------:|:-------------------------------:|
diff --git a/docs/howtos/customizations/testgenerator/_language_adaptation.md b/docs/howtos/customizations/testgenerator/_language_adaptation.md
index 2fb7f03d8..6c4a18bc4 100644
--- a/docs/howtos/customizations/testgenerator/_language_adaptation.md
+++ b/docs/howtos/customizations/testgenerator/_language_adaptation.md
@@ -1,6 +1,6 @@
-## Synthetic test generation from non-english corpus
+## Synthetic test generation from non-English corpus
-In this notebook, you'll learn how to adapt synthetic test data generation to non-english corpus settings. For the sake of this tutorial, I am generating queries in Spanish from Spanish wikipedia articles.
+In this notebook, you'll learn how to adapt synthetic test data generation to non-English corpus settings. For the sake of this tutorial, I am generating queries in Spanish from Spanish Wikipedia articles.
### Download and Load corpus
@@ -61,7 +61,7 @@ generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
### Setup Persona and transforms
-you may automatically create personas using this [notebook](./_persona_generator.md). For the sake of simplicity, I am using a pre-defined person, two basic tranforms and simple specic query distribution.
+You may automatically create personas using this [notebook](./_persona_generator.md). For the sake of simplicity, I am using a pre-defined persona, two basic transforms and a simple query distribution.
```python
@@ -96,7 +96,7 @@ generator = TestsetGenerator(
### Load and Adapt Queries
-Here we load the required query types and adapt them to the target language.
+Here we load the required query types and adapt them to the target language.
```python
@@ -131,7 +131,7 @@ dataset = generator.generate_with_langchain_docs(
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
- Generating Scenarios: 100%|██████████| 1/1 [00:07<00:00, 7.75s/it]
+ Generating Scenarios: 100%|██████████| 1/1 [00:07<00:00, 7.75s/it]
Generating Samples: 100%|██████████| 5/5 [00:03<00:00, 1.65it/s]
diff --git a/docs/howtos/customizations/testgenerator/_persona_generator.md b/docs/howtos/customizations/testgenerator/_persona_generator.md
index d0d32824c..47e7d9653 100644
--- a/docs/howtos/customizations/testgenerator/_persona_generator.md
+++ b/docs/howtos/customizations/testgenerator/_persona_generator.md
@@ -1,8 +1,8 @@
## Persona's in Testset Generation
-You can add different persona's to the testset generation process by defining the [Persona][ragas.testset.persona.Persona] class with the name and role description of the different persona's that might be relevant to your use case and you want to generate testset for.
+You can add different personas to the testset generation process by defining the [Persona][ragas.testset.persona.Persona] class with the name and role description of the different personas that might be relevant to your use case and that you want to generate a testset for.
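+A minimal sketch of defining one such persona (assuming the fields are `name` and `role_description`):
+
+```python
+from ragas.testset.persona import Persona
+
+persona_new_joinee = Persona(
+    name="New Joinee",
+    role_description="Don't know much about the company and is looking for information on how to get started.",
+)
+```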
-For example, for the [gitlab handbook](https://about.gitlab.com/handbook/) we might want to generate testset for different persona's like a new joinee, a manager, a senior manager, etc. And hence we will define them as follows:
+For example, for the [GitLab handbook](https://about.gitlab.com/handbook/) we might want to generate a testset for different personas like a new joinee, a manager, a senior manager, etc., and hence we will define them as follows:
1. New Joinee: Don't know much about the company and is looking for information on how to get started.
2. Manager: Wants to know about the different teams and how they collaborate with each other.
@@ -40,7 +40,7 @@ personas
-And then you can use these persona's in the testset generation process by passing them to the [TestsetGenerator][ragas.testset.generator.TestsetGenerator] class.
+And then you can use these personas in the testset generation process by passing them to the [TestsetGenerator][ragas.testset.generator.TestsetGenerator] class.
```python
diff --git a/docs/howtos/customizations/testgenerator/_testgen-custom-single-hop.md b/docs/howtos/customizations/testgenerator/_testgen-custom-single-hop.md
index 7551ea306..a4c73d722 100644
--- a/docs/howtos/customizations/testgenerator/_testgen-custom-single-hop.md
+++ b/docs/howtos/customizations/testgenerator/_testgen-custom-single-hop.md
@@ -1,7 +1,7 @@
# Create custom single-hop queries from your documents
### Load sample documents
-I am using documents from [sample of gitlab handbook](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can download it by running the below command.
+I am using documents from [sample of GitLab handbook](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can download it by running the below command.
```
! git clone https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown
@@ -55,7 +55,7 @@ embedding = embedding_factory()
Here we are using 2 extractors and 2 relationship builders.
-- Headline extrator: Extracts headlines from the documents
+- Headline extractor: Extracts headlines from the documents
- Keyphrase extractor: Extracts keyphrases from the documents
- Headline splitter: Splits the document into nodes based on headlines
@@ -94,7 +94,7 @@ Applying KeyphrasesExtractor: 72%|▋| 26/36 [00:04<00:00, 1Property 'keyphrase
Applying KeyphrasesExtractor: 81%|▊| 29/36 [00:04<00:00, Property 'keyphrases' already exists in node 'c230df'. Skipping!
Applying KeyphrasesExtractor: 89%|▉| 32/36 [00:04<00:00, 1Property 'keyphrases' already exists in node '4f2765'. Skipping!
Property 'keyphrases' already exists in node '4a4777'. Skipping!
-```
+```
### Configure personas
@@ -115,16 +115,16 @@ persona2 = Persona(
persona_list = [person1, persona2]
```
-##
+##
## SingleHop Query
-Inherit from `SingleHopQuerySynthesizer` and modify the function that generates scenarios for query creation.
+Inherit from `SingleHopQuerySynthesizer` and modify the function that generates scenarios for query creation.
**Steps**:
- find qualified set of nodes for the query creation. Here I am selecting all nodes with keyphrases extracted.
- For each qualified set
- - Match the keyphrase with one or more persona.
+ - Match the keyphrase with one or more persona.
- Create all possible combinations of (Node, Persona, Query Style, Query Length)
- Samples the required number of queries from the combinations
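+The scenario-building steps above amount to enumerating combinations and sampling from them. A rough, framework-agnostic sketch follows; `kg`, `persona_list`, and the style/length values stand in for the objects built earlier, and the real synthesizer also matches each node's keyphrases to suitable personas:
+
+```python
+import random
+from itertools import product
+
+# step 1: qualified nodes are those with keyphrases extracted
+qualified_nodes = [node for node in kg.nodes if node.properties.get("keyphrases")]
+
+# steps 2-3: enumerate (node, persona, style, length) combinations
+styles = ["MISSPELLED", "PERFECT_GRAMMAR"]  # placeholder query styles
+lengths = ["SHORT", "MEDIUM"]               # placeholder query lengths
+combinations = list(product(qualified_nodes, persona_list, styles, lengths))
+
+# step 4: sample the required number of combinations to turn into scenarios
+num_queries = 5
+sampled_combinations = random.sample(combinations, k=min(num_queries, len(combinations)))
+```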
@@ -202,18 +202,18 @@ result = await query.generate_sample(scenario=scenarios[-1])
```
### Modify prompt to customize the query style
-Here I am replacing the default prompt with an instruction to generate only Yes/No questions. This is an optional step.
+Here I am replacing the default prompt with an instruction to generate only Yes/No questions. This is an optional step.
```python
-instruction = """Generate a Yes/No query and answer based on the specified conditions (persona, term, style, length)
-and the provided context. Ensure the answer is entirely faithful to the context, using only the information
+instruction = """Generate a Yes/No query and answer based on the specified conditions (persona, term, style, length)
+and the provided context. Ensure the answer is entirely faithful to the context, using only the information
directly from the provided context.
### Instructions:
-1. **Generate a Yes/No Query**: Based on the context, persona, term, style, and length, create a question
+1. **Generate a Yes/No Query**: Based on the context, persona, term, style, and length, create a question
that aligns with the persona's perspective, incorporates the term, and can be answered with 'Yes' or 'No'.
-2. **Generate an Answer**: Using only the content from the provided context, provide a 'Yes' or 'No' answer
+2. **Generate an Answer**: Using only the content from the provided context, provide a 'Yes' or 'No' answer
to the query. Do not add any information not included in or inferable from the context."""
```
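+To apply this, you can fetch the synthesizer's query-generation prompt, overwrite its instruction, and set it back. A hedged sketch (the prompt key depends on your Ragas version, so inspect the keys rather than hard-coding one):
+
+```python
+prompts = query.get_prompts()   # `query` is the synthesizer instantiated above
+print(prompts.keys())           # inspect and pick the query-generation prompt
+prompt_name, prompt = next(iter(prompts.items()))  # assumes it is the first one
+prompt.instruction = instruction
+query.set_prompts(**{prompt_name: prompt})
+```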
diff --git a/docs/howtos/customizations/testgenerator/_testgen-customisation.md b/docs/howtos/customizations/testgenerator/_testgen-customisation.md
index 6731c93e4..7f71664c0 100644
--- a/docs/howtos/customizations/testgenerator/_testgen-customisation.md
+++ b/docs/howtos/customizations/testgenerator/_testgen-customisation.md
@@ -3,7 +3,7 @@
In this tutorial you will get to learn how to create custom multi-hop queries from your documents. This is a very powerful feature that allows you to create queries that are not possible with the standard query types. This also helps you to create queries that are more specific to your use case.
### Load sample documents
-I am using documents from [sample of gitlab handbook](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can download it by running the below command.
+I am using documents from [sample of GitLab handbook](https://huggingface.co/datasets/explodinggradients/Sample_Docs_Markdown). You can download it by running the below command.
```python
@@ -56,10 +56,10 @@ embedding = embedding_factory()
### Setup Extractors and Relationship builders
-To create multi-hop queries you need to undestand the set of documents that can be used for it. Ragas uses relationships between documents/nodes to quality nodes for creating multi-hop queries. To concretize, if Node A and Node B and conencted by a relationship (say entity or keyphrase overlap) then you can create a multi-hop query between them.
+To create multi-hop queries you need to understand the set of documents that can be used for it. Ragas uses relationships between documents/nodes to qualify nodes for creating multi-hop queries. To make this concrete, if Node A and Node B are connected by a relationship (say entity or keyphrase overlap), then you can create a multi-hop query between them.
Here we are using 2 extractors and 2 relationship builders.
-- Headline extrator: Extracts headlines from the documents
+- Headline extractor: Extracts headlines from the documents
- Keyphrase extractor: Extracts keyphrases from the documents
- Headline splitter: Splits the document into nodes based on headlines
- OverlapScore Builder: Builds relationship between nodes based on keyphrase overlap
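+A hedged sketch of wiring these up; the constructor arguments are assumptions, and `llm` and `kg` refer to the LLM and knowledge graph built earlier (names assumed), so check `ragas.testset.transforms` for the exact options:
+
+```python
+from ragas.testset.transforms import (
+    HeadlinesExtractor,
+    HeadlineSplitter,
+    KeyphrasesExtractor,
+    OverlapScoreBuilder,
+    apply_transforms,
+)
+
+transforms = [
+    HeadlinesExtractor(llm=llm),   # extract headlines from each document
+    HeadlineSplitter(),            # split documents into nodes on those headlines
+    KeyphrasesExtractor(llm=llm),  # extract keyphrases per node
+    OverlapScoreBuilder(),         # relate nodes whose keyphrases overlap
+]
+apply_transforms(kg, transforms)
+```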
@@ -105,7 +105,7 @@ Applying KeyphrasesExtractor: 78%|███████████████
Property 'keyphrases' already exists in node 'd68f83'. Skipping!
Applying KeyphrasesExtractor: 83%|████████████████████████████████████████████████████████████████████████████████████████████▌ | 30/36 [00:03<00:00, 9.35it/s]Property 'keyphrases' already exists in node '8fdbea'. Skipping!
Applying KeyphrasesExtractor: 89%|██████████████████████████████████████████████████████████████████████████████████████████████████▋ | 32/36 [00:04<00:00, 7.76it/s]Property 'keyphrases' already exists in node 'ef6ae0'. Skipping!
-```
+```
### Configure personas
@@ -126,14 +126,14 @@ persona2 = Persona(
persona_list = [person1, persona2]
```
-### Create multi-hop query
+### Create multi-hop query
-Inherit from `MultiHopQuerySynthesizer` and modify the function that generates scenarios for query creation.
+Inherit from `MultiHopQuerySynthesizer` and modify the function that generates scenarios for query creation.
**Steps**:
- find qualified set of (nodeA, relationship, nodeB) based on the relationships between nodes
- For each qualified set
- - Match the keyphrase with one or more persona.
+ - Match the keyphrase with one or more persona.
- Create all possible combinations of (Nodes, Persona, Query Style, Query Length)
- Samples the required number of queries from the combinations
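+As a rough illustration of the first step, the qualifying triples can be read off the knowledge graph's relationships (the relationship type string is a placeholder; it depends on the relationship builder used):
+
+```python
+# collect (nodeA, relationship, nodeB) triples produced by the overlap builder
+triples = [
+    (rel.source, rel, rel.target)
+    for rel in kg.relationships
+    if rel.type == "keyphrases_overlap"  # placeholder relationship type
+]
+```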
@@ -245,4 +245,4 @@ Output
Yay! You have created a multi-hop query. Now you can create any such queries by creating and exploring relationships between documents.
-##
+##
diff --git a/docs/howtos/index.md b/docs/howtos/index.md
index ea5c594ab..43a9ce86c 100644
--- a/docs/howtos/index.md
+++ b/docs/howtos/index.md
@@ -9,10 +9,10 @@ Each guide in this section provides a focused solution to real-world problems th
---
How to customize various aspects of Ragas to suit your needs.
-
+
Customize features such as [Metrics](customizations/index.md#metrics) and [Testset Generation](customizations/index.md#testset-generation).
-- :material-cube-outline:{ .lg .middle } [__Applications__](applications/index.md)
+- :material-cube-outline:{ .lg .middle } [__Applications__](applications/index.md)
---
@@ -26,6 +26,6 @@ Each guide in this section provides a focused solution to real-world problems th
How to integrate Ragas with other frameworks and observability tools.
- Use Ragas with frameworks like [Langchain](integrations/langchain.md), [LlamaIndex](integrations/_llamaindex.md), and [observability tools](./observability.md).
+ Use Ragas with frameworks like [LangChain](integrations/langchain.md), [LlamaIndex](integrations/_llamaindex.md), and [observability tools](./observability.md).
diff --git a/docs/howtos/integrations/_arize.md b/docs/howtos/integrations/_arize.md
index cf223e9ca..71fca030b 100644
--- a/docs/howtos/integrations/_arize.md
+++ b/docs/howtos/integrations/_arize.md
@@ -2,7 +2,7 @@
## 1. Introduction
-Building a baseline for a RAG pipeline is not usually difficult, but enhancing it to make it suitable for production and ensuring the quality of your responses is almost always hard. Choosing the right tools and parameters for RAG can itself be challenging when there is an abundance of options available. This tutorial shares a robust workflow for making the right choices while building your RAG and ensuring its quality.
+Building a baseline for a RAG pipeline is not usually difficult, but enhancing it to make it suitable for production and ensuring the quality of your responses is almost always hard. Choosing the right tools and parameters for RAG can itself be challenging when there is an abundance of options available. This tutorial shares a robust workflow for making the right choices while building your RAG and ensuring its quality.
This article covers how to evaluate, visualize and analyze your RAG using a combination of open-source libraries. We will be using:
@@ -34,7 +34,7 @@ Install and import Python dependencies.
```python
import pandas as pd
-# Display the complete contents of dataframe cells.
+# Display the complete contents of DataFrame cells.
pd.set_option("display.max_colwidth", None)
```
@@ -56,7 +56,7 @@ os.environ["OPENAI_API_KEY"] = openai_api_key
## 4. Generate Your Synthetic Test Dataset
-Curating a golden test dataset for evaluation can be a long, tedious, and expensive process that is not pragmatic — especially when starting out or when data sources keep changing. This can be solved by synthetically generating high quality data points, which then can be verified by developers. This can reduce the time and effort in curating test data by 90%.
+Curating a golden test dataset for evaluation can be a long, tedious, and expensive process that is not pragmatic — especially when starting out or when data sources keep changing. This can be solved by synthetically generating high quality data points, which then can be verified by developers. This can reduce the time and effort in curating test data by 90%.
Run the cell below to download a dataset of prompt engineering papers in PDF format from arXiv and read these documents using LlamaIndex.
@@ -100,7 +100,7 @@ You are free to change the question type distribution according to your needs. S
## 5. Build Your RAG Application With LlamaIndex
-LlamaIndex is an easy to use and flexible framework for building RAG applications. For the sake of simplicity, we use the default LLM (gpt-3.5-turbo) and embedding models (openai-ada-2).
+LlamaIndex is an easy-to-use and flexible framework for building RAG applications. For the sake of simplicity, we use the default LLM (gpt-3.5-turbo) and embedding models (openai-ada-2).
Launch Phoenix in the background and instrument your LlamaIndex application so that your OpenInference spans and traces are sent to and collected by Phoenix. [OpenInference](https://github.com/Arize-ai/openinference/tree/main/spec) is an open standard built atop OpenTelemetry that captures and stores LLM application executions. It is designed to be a category of telemetry data that is used to understand the execution of LLMs and the surrounding application context, such as retrieval from vector stores and the usage of external tools such as search engines or APIs.
@@ -134,7 +134,7 @@ def build_query_engine(documents):
query_engine = build_query_engine(documents)
```
-If you check Phoenix, you should see embedding spans from when your corpus data was indexed. Export and save those embeddings into a dataframe for visualization later in the notebook.
+If you check Phoenix, you should see embedding spans from when your corpus data was indexed. Export and save those embeddings into a DataFrame for visualization later in the notebook.
```python
@@ -208,7 +208,7 @@ print(session.url)

-We save out a couple of dataframes, one containing embedding data that we'll visualize later, and another containing our exported traces and spans that we plan to evaluate using Ragas.
+We save out a couple of DataFrames, one containing embedding data that we'll visualize later, and another containing our exported traces and spans that we plan to evaluate using Ragas.
```python
@@ -230,7 +230,7 @@ spans_dataframe = get_qa_with_reference(client)
spans_dataframe.head()
```
-Ragas uses LangChain to evaluate your LLM application data. Let's instrument LangChain with OpenInference so we can see what's going on under the hood when we evaluate our LLM application.
+Ragas uses LangChain to evaluate your LLM application data. Let's instrument LangChain with OpenInference, so we can see what's going on under the hood when we evaluate our LLM application.
```python
@@ -239,7 +239,7 @@ from openinference.instrumentation.langchain import LangChainInstrumentor
LangChainInstrumentor().instrument()
```
-Evaluate your LLM traces and view the evaluation scores in dataframe format.
+Evaluate your LLM traces and view the evaluation scores in DataFrame format.
```python
@@ -258,7 +258,7 @@ evaluation_result = evaluate(
eval_scores_df = pd.DataFrame(evaluation_result.scores)
```
-Submit your evaluations to Phoenix so they are visible as annotations on your spans.
+Submit your evaluations to Phoenix, so they are visible as annotations on your spans.
```python
@@ -347,8 +347,8 @@ Once you launch Phoenix, you can visualize your data with the metric of your cho
Congrats! You built and evaluated a LlamaIndex query engine using Ragas and Phoenix. Let's recap what we learned:
-- With Ragas, you bootstraped a test dataset and computed metrics such as faithfulness and answer correctness to evaluate your LlamaIndex query engine.
-- With OpenInference, you instrumented your query engine so you could observe the inner workings of both LlamaIndex and Ragas.
+- With Ragas, you bootstrapped a test dataset and computed metrics such as faithfulness and answer correctness to evaluate your LlamaIndex query engine.
+- With OpenInference, you instrumented your query engine, so you could observe the inner workings of both LlamaIndex and Ragas.
- With Phoenix, you collected your spans and traces, imported your evaluations for easy inspection, and visualized your embedded queries and retrieved documents to identify pockets of poor performance.
This notebook is just an introduction to the capabilities of Ragas and Phoenix. To learn more, see the [Ragas](https://docs.ragas.io/en/stable/) and [Phoenix docs](https://docs.arize.com/phoenix/).
diff --git a/docs/howtos/integrations/_athina.md b/docs/howtos/integrations/_athina.md
index e0dd15c52..7ae19bdbd 100644
--- a/docs/howtos/integrations/_athina.md
+++ b/docs/howtos/integrations/_athina.md
@@ -74,7 +74,7 @@ pd.DataFrame(batch_eval_result)
If you are [logging your production inferences to Athina](https://docs.athina.ai/logging/log_via_api), you can configure Ragas metrics to run automatically against your production logs.
1. Navigate to the [Athina Dashboard](https://app.athina.ai/evals/config)
-
+
2. Open the **Evals** page (lightning icon on the left)
3. Click the "New Eval" button on the top right
4. Select the **Ragas** tab
@@ -85,5 +85,5 @@ If you are [logging your production inferences to Athina](https://docs.athina.ai
#### Learn more about Athina
- **Website:** [https://athina.ai](https://athina.ai)
- **Docs:** [https://docs.athina.ai](https://docs.athina.ai)
-- **Github Library:** [https://github.com/athina-ai/athina-evals](https://github.com/athina-ai/athina-evals)
+- **GitHub Library:** [https://github.com/athina-ai/athina-evals](https://github.com/athina-ai/athina-evals)
- **Sandbox**: [https://demo.athina.ai](https://demo.athina.ai/observe?filters=dateSpan%3D30)
diff --git a/docs/howtos/integrations/_langfuse.md b/docs/howtos/integrations/_langfuse.md
index e4831b71a..4d7d4cb7e 100644
--- a/docs/howtos/integrations/_langfuse.md
+++ b/docs/howtos/integrations/_langfuse.md
@@ -4,7 +4,7 @@ Ragas and Langfuse is a powerful combination that can help you evaluate and moni
## What is Langfuse?
-Langfuse ([GitHub](https://github.com/langfuse/langfuse)) is an open-source platform for LLM [tracing](https://langfuse.com/docs/tracing), [prompt management](https://langfuse.com/docs/prompts/get-started), and [evaluation](https://langfuse.com/docs/scores/overview). It allows you to score your traces and spans, providing insights into the performance of your RAG pipelines. Langfuse supports various integrations, including [OpenAI](https://langfuse.com/docs/integrations/openai/python/get-started), [Langchain](https://langfuse.com/docs/integrations/langchain/tracing), and [more](https://langfuse.com/docs/integrations/overview).
+Langfuse ([GitHub](https://github.com/langfuse/langfuse)) is an open-source platform for LLM [tracing](https://langfuse.com/docs/tracing), [prompt management](https://langfuse.com/docs/prompts/get-started), and [evaluation](https://langfuse.com/docs/scores/overview). It allows you to score your traces and spans, providing insights into the performance of your RAG pipelines. Langfuse supports various integrations, including [OpenAI](https://langfuse.com/docs/integrations/openai/python/get-started), [LangChain](https://langfuse.com/docs/integrations/langchain/tracing), and [more](https://langfuse.com/docs/integrations/overview).
## Key Benefits of using Langfuse with Ragas
@@ -142,7 +142,7 @@ In this cookbook, we'll show you how to setup both.
### Score the Trace
-Lets take a small example of a single trace and see how you can score that with Ragas. First lets load the data.
+Let's take a small example of a single trace and see how you can score it with Ragas. First, let's load the data.
```python
@@ -153,23 +153,23 @@ print("answer: ", row["answer"])
question: What are the global implications of the USA Supreme Court ruling on abortion?
answer: The global implications of the USA Supreme Court ruling on abortion can be significant, as it sets a precedent for other countries and influences the global discourse on reproductive rights. Here are some potential implications:
-
+
1. Influence on other countries: The Supreme Court's ruling can serve as a reference point for other countries grappling with their own abortion laws. It can provide legal arguments and reasoning that advocates for reproductive rights can use to challenge restrictive abortion laws in their respective jurisdictions.
-
+
2. Strengthening of global reproductive rights movements: A favorable ruling by the Supreme Court can energize and empower reproductive rights movements worldwide. It can serve as a rallying point for activists and organizations advocating for women's rights, leading to increased mobilization and advocacy efforts globally.
-
+
3. Counteracting anti-abortion movements: Conversely, a ruling that restricts abortion rights can embolden anti-abortion movements globally. It can provide legitimacy to their arguments and encourage similar restrictive measures in other countries, potentially leading to a rollback of existing reproductive rights.
-
+
4. Impact on international aid and policies: The Supreme Court's ruling can influence international aid and policies related to reproductive health. It can shape the priorities and funding decisions of donor countries and organizations, potentially leading to increased support for reproductive rights initiatives or conversely, restrictions on funding for abortion-related services.
-
+
5. Shaping international human rights standards: The ruling can contribute to the development of international human rights standards regarding reproductive rights. It can influence the interpretation and application of existing human rights treaties and conventions, potentially strengthening the recognition of reproductive rights as fundamental human rights globally.
-
+
6. Global health implications: The Supreme Court's ruling can have implications for global health outcomes, particularly in countries with restrictive abortion laws. It can impact the availability and accessibility of safe and legal abortion services, potentially leading to an increase in unsafe abortions and related health complications.
-
+
It is important to note that the specific implications will depend on the nature of the Supreme Court ruling and the subsequent actions taken by governments, activists, and organizations both within and outside the United States.
-Now lets init a Langfuse client SDK to instrument you app.
+Now let's initialize the Langfuse client SDK to instrument your app.
```python
@@ -225,7 +225,7 @@ You compute the score with each request. Below we've outlined a dummy applicatio
2. Fetch context from the database or vector store that can be used to answer the question from the user
3. Pass the question and the contexts to the LLM to generate the answer
-All these step are logged as spans in a single trace in Langfuse. You can read more about traces and spans from the [Langfuse documentation](https://langfuse.com/docs/tracing).
+All these steps are logged as spans in a single trace in Langfuse. You can read more about traces and spans from the [Langfuse documentation](https://langfuse.com/docs/tracing).
```python
diff --git a/docs/howtos/integrations/_langgraph_agent_evaluation.md b/docs/howtos/integrations/_langgraph_agent_evaluation.md
index c32755ddc..d4b95c932 100644
--- a/docs/howtos/integrations/_langgraph_agent_evaluation.md
+++ b/docs/howtos/integrations/_langgraph_agent_evaluation.md
@@ -15,7 +15,7 @@ Click the [link](https://colab.research.google.com/github/explodinggradients/rag
- Basic understanding of LangGraph, LangChain and LLMs
## Installing Ragas and Other Dependencies
-Install Ragas and Langgraph with pip:
+Install Ragas and LangGraph with pip:
```python
@@ -29,13 +29,13 @@ Install Ragas and Langgraph with pip:
### Initializing External Components
To begin, you have two options for setting up the external components:
-1. Use a Live API Key:
+1. Use a Live API Key:
- - Sign up for an account on [metals.dev](https://metals.dev/) to get your API key.
-
-2. Simulate the API Response:
+ - Sign up for an account on [metals.dev](https://metals.dev/) to get your API key.
- - Alternatively, you can use a predefined JSON object to simulate the API response. This allows you to get started more quickly without needing a live API key.
+2. Simulate the API Response:
+
+ - Alternatively, you can use a predefined JSON object to simulate the API response. This allows you to get started more quickly without needing a live API key.
Choose the method that best fits your needs to proceed with the setup.
@@ -229,9 +229,9 @@ display(Image(react_graph.get_graph(xray=True).draw_mermaid_png()))
```
-
+

-
+
To test our setup, we will run the agent with a query. The agent will fetch the price of copper using the metals.dev API.
@@ -273,7 +273,7 @@ Each time a message is exchanged during agent execution, it gets added to the me
Ragas uses its own format to evaluate agent interactions. So, if you're using LangGraph, you will need to convert the LangChain message objects into Ragas message objects. This allows you to evaluate your AI agents with Ragas’ built-in evaluation tools.
-**Goal:** Convert the list of LangChain messages (e.g., HumanMessage, AIMessage, and ToolMessage) into the format expected by Ragas, so the evaluation framework can understand and process them properly.
+**Goal:** Convert the list of LangChain messages (e.g., HumanMessage, AIMessage, and ToolMessage) into the format expected by Ragas, so the evaluation framework can understand and process them properly.
To convert a list of LangChain messages into a format suitable for Ragas evaluation, Ragas provides the function [convert_to_ragas_messages][ragas.integrations.langgraph.convert_to_ragas_messages], which can be used to transform LangChain messages into the format expected by Ragas.
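+In practice the conversion is a single call — a short sketch, assuming the `result` returned by invoking the graph (shown later in this guide):
+
+```python
+from ragas.integrations.langgraph import convert_to_ragas_messages
+
+# convert the LangChain/LangGraph messages into Ragas message objects
+ragas_trace = convert_to_ragas_messages(result["messages"])
+```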
@@ -306,7 +306,7 @@ ragas_trace # List of Ragas messages
For this tutorial, let us evaluate the Agent with the following metrics:
-- [Tool call Accuracy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#tool-call-accuracy):ToolCallAccuracy is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task.
+- [Tool call Accuracy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#tool-call-accuracy): ToolCallAccuracy is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task.
- [Agent Goal accuracy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#agent-goal-accuracy): Agent goal accuracy is a metric that can be used to evaluate the performance of the LLM in identifying and achieving the goals of the user. This is a binary metric, with 1 indicating that the AI has achieved the goal and 0 indicating that the AI has not achieved the goal.
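+A hedged sketch of scoring the converted trace with tool call accuracy (the tool name and arguments are placeholders for whatever your agent actually called):
+
+```python
+from ragas.dataset_schema import MultiTurnSample
+from ragas.messages import ToolCall
+from ragas.metrics import ToolCallAccuracy
+
+sample = MultiTurnSample(
+    user_input=ragas_trace,  # the converted message trace from above
+    reference_tool_calls=[
+        ToolCall(name="get_metal_price", args={"metal_name": "copper"})  # placeholder
+    ],
+)
+await ToolCallAccuracy().multi_turn_ascore(sample)
+```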
@@ -358,7 +358,7 @@ result = react_graph.invoke({"messages": messages})
```python
-result["messages"] # List of Langchain messages
+result["messages"] # List of LangChain messages
```
diff --git a/docs/howtos/integrations/_llamaindex.md b/docs/howtos/integrations/_llamaindex.md
index 865880cb8..c4aebc1bd 100644
--- a/docs/howtos/integrations/_llamaindex.md
+++ b/docs/howtos/integrations/_llamaindex.md
@@ -1,14 +1,14 @@
# LlamaIndex
-[LlamaIndex](https://github.com/run-llama/llama_index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. Makes it super easy to connect LLMs with your own data. But in order to figure out the best configuration for llamaIndex and your data you need a object measure of the performance. This is where ragas comes in. Ragas will help you evaluate your `QueryEngine` and gives you the confidence to tweak the configuration to get hightest score.
+[LlamaIndex](https://github.com/run-llama/llama_index) is a data framework for LLM applications to ingest, structure, and access private or domain-specific data. It makes it super easy to connect LLMs with your own data. But in order to figure out the best configuration for LlamaIndex and your data, you need an objective measure of performance. This is where Ragas comes in. Ragas will help you evaluate your `QueryEngine` and give you the confidence to tweak the configuration to get the highest score.
-This guide assumes you have familarity with the LlamaIndex framework.
+This guide assumes you are familiar with the LlamaIndex framework.
## Building the Testset
-You will need an testset to evaluate your `QueryEngine` against. You can either build one yourself or use the [Testset Generator Module](./../../getstarted/rag_testset_generation.md) in Ragas to get started with a small synthetic one.
+You will need a testset to evaluate your `QueryEngine` against. You can either build one yourself or use the [Testset Generator Module](./../../getstarted/rag_testset_generation.md) in Ragas to get started with a small synthetic one.
-Let's see how that works with Llamaindex
+Let's see how that works with LlamaIndex.
## load the documents
@@ -19,7 +19,7 @@ from llama_index.core import SimpleDirectoryReader
documents = SimpleDirectoryReader("./nyc_wikipedia").load_data()
```
-Now lets init the `TestsetGenerator` object with the corresponding generator and critic llms
+Now let's initialize the `TestsetGenerator` object with the corresponding generator and critic LLMs.
```python
@@ -128,7 +128,7 @@ with a test dataset to test our `QueryEngine` lets now build one and evaluate it
## Building the `QueryEngine`
-To start lets build an `VectorStoreIndex` over the New York Citie's [wikipedia page](https://en.wikipedia.org/wiki/New_York_City) as an example and use ragas to evaluate it.
+To start, let's build a `VectorStoreIndex` over New York City's [Wikipedia page](https://en.wikipedia.org/wiki/New_York_City) as an example and use Ragas to evaluate it.
Since we already loaded the dataset into `documents` lets use that.
@@ -142,7 +142,7 @@ vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine()
```
-Lets try an sample question from the generated testset to see if it is working
+Let's try a sample question from the generated testset to see if it is working.
```python
@@ -170,17 +170,17 @@ print(response_vector)
## Evaluating the `QueryEngine`
-Now that we have a `QueryEngine` for the `VectorStoreIndex` we can use the llama_index integration Ragas has to evaluate it.
+Now that we have a `QueryEngine` for the `VectorStoreIndex` we can use the llama_index integration Ragas has to evaluate it.
In order to run an evaluation with Ragas and LlamaIndex you need 3 things
1. LlamaIndex `QueryEngine`: what we will be evaluating
2. Metrics: Ragas defines a set of metrics that can measure different aspects of the `QueryEngine`. The available metrics and their meaning can be found [here](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/)
-3. Questions: A list of questions that ragas will test the `QueryEngine` against.
+3. Questions: A list of questions that ragas will test the `QueryEngine` against.
-first lets generate the questions. Ideally you should use that you see in production so that the distribution of question with which we evaluate matches the distribution of questions seen in production. This ensures that the scores reflect the performance seen in production but to start off we'll be using a few example question.
+First, let's generate the questions. Ideally, you should use questions you see in production so that the distribution of questions with which we evaluate matches the distribution of questions seen in production. This ensures that the scores reflect the performance seen in production, but to start off we'll be using a few example questions.
-Now lets import the metrics we will be using to evaluate
+Now let's import the metrics we will be using to evaluate
```python
@@ -220,7 +220,7 @@ ragas_dataset
-Finally lets run the evaluation
+Finally, let's run the evaluation
```python
@@ -242,7 +242,7 @@ print(result)
{'faithfulness': 0.7454, 'answer_relevancy': 0.9348, 'context_precision': 0.6667, 'context_recall': 0.4667}
-You can convert into a pandas dataframe to run more analysis on it.
+You can convert it into a pandas DataFrame to run more analysis on it.
```python
diff --git a/docs/howtos/integrations/_openlayer.md b/docs/howtos/integrations/_openlayer.md
index 2ba31a7aa..d836e80da 100644
--- a/docs/howtos/integrations/_openlayer.md
+++ b/docs/howtos/integrations/_openlayer.md
@@ -1,4 +1,4 @@
-# OpenLayer
+# Openlayer
## Evaluating RAG pipelines with Openlayer and Ragas
[Openlayer](https://www.openlayer.com/) is an evaluation tool that fits into your development and production pipelines to help you ship high-quality models with confidence.
diff --git a/docs/howtos/integrations/_opik.md b/docs/howtos/integrations/_opik.md
index c980774d0..4e35e8f6b 100644
--- a/docs/howtos/integrations/_opik.md
+++ b/docs/howtos/integrations/_opik.md
@@ -189,7 +189,7 @@ rag_pipeline("What is the capital of France?")
#### Evaluating datasets
-If you looking at evaluating a dataset, you can use the Ragas `evaluate` function. When using this function, the Ragas library will compute the metrics on all the rows of the dataset and return a summary of the results.
+If you are looking at evaluating a dataset, you can use the Ragas `evaluate` function. When using this function, the Ragas library will compute the metrics on all the rows of the dataset and return a summary of the results.
You can use the OpikTracer callback to log the results of the evaluation to the Opik platform. For this we will configure the OpikTracer
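+A hedged sketch of what that configuration might look like (the `OpikTracer` import path and arguments are assumptions based on Opik's LangChain integration):
+
+```python
+from opik.integrations.langchain import OpikTracer  # import path assumed
+
+opik_tracer_eval = OpikTracer(tags=["ragas_evaluation"])
+
+# pass the tracer to Ragas so every evaluation trace lands in Opik
+# result = evaluate(dataset, metrics=ragas_metrics, callbacks=[opik_tracer_eval])
+```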
diff --git a/docs/howtos/integrations/haystack.md b/docs/howtos/integrations/haystack.md
index 2c279278d..ec31ac179 100644
--- a/docs/howtos/integrations/haystack.md
+++ b/docs/howtos/integrations/haystack.md
@@ -1,6 +1,6 @@
# Haystack Integration
-Haystack is a LLM orchestration framework to build customizable, production-ready LLM applications.
+Haystack is an LLM orchestration framework to build customizable, production-ready LLM applications.
The underlying concept of Haystack is that all individual tasks, such as storing documents, retrieving relevant data, and generating responses, are handled by modular components like Document Stores, Retrievers, and Generators, which are seamlessly connected and orchestrated using Pipelines.
@@ -46,7 +46,7 @@ document_store = InMemoryDocumentStore()
docs = [Document(content=doc) for doc in dataset]
```
-#### Initalize the Document and Text Embedder
+#### Initialize the Document and Text Embedder
```python
@@ -206,7 +206,7 @@ Output
```
Evaluating: 100%|██████████| 3/3 [00:14<00:00, 4.72s/it]
-Meta AI's LLaMA models stand out due to their open-source nature, which allows researchers and developers easy access to high-quality language models without the need for expensive resources. This accessibility fosters innovation and experimentation, enabling collaboration across various industries. Moreover, the strong performance of the LLaMA models further enhances their appeal, making them valuable tools for advancing AI development.
+Meta AI's LLaMA models stand out due to their open-source nature, which allows researchers and developers easy access to high-quality language models without the need for expensive resources. This accessibility fosters innovation and experimentation, enabling collaboration across various industries. Moreover, the strong performance of the LLaMA models further enhances their appeal, making them valuable tools for advancing AI development.
{'answer_relevancy': 0.9782, 'context_precision': 1.0000, 'faithfulness': 1.0000}
```
diff --git a/docs/howtos/integrations/index.md b/docs/howtos/integrations/index.md
index ecabc37c7..3ebd5d338 100644
--- a/docs/howtos/integrations/index.md
+++ b/docs/howtos/integrations/index.md
@@ -9,7 +9,7 @@ happy to look into it 🙂
## Frameworks
- [AWS Bedrock](./aws_bedrock.md) - AWS Bedrock is a managed framework for building, deploying, and scaling intelligent agents and integrated AI solutions; more information can be found [here](https://aws.amazon.com/bedrock/).
-- [Langchain](./langchain.md) - Langchain is a framework for building LLM applications, more information can be found [here](https://www.langchain.com/).
+- [LangChain](./langchain.md) - LangChain is a framework for building LLM applications, more information can be found [here](https://www.langchain.com/).
- [LlamaIndex](./_llamaindex.md) - LlamaIndex is a framework for building RAG applications, more information can be found [here](https://www.llamaindex.ai/).
- [Haystack](./haystack.md) - Haystack is a LLM orchestration framework to build customizable, production-ready LLM applications, more information can be found [here](https://haystack.deepset.ai/).
- [R2R](./r2r.md) - R2R is an all-in-one solution for AI Retrieval-Augmented Generation (RAG) with production-ready features, more information can be found [here](https://r2r-docs.sciphi.ai/introduction)
@@ -20,4 +20,4 @@ happy to look into it 🙂
Tools that help you trace the LLM calls can be integrated with Ragas to get the traces of the evaluator LLMs.
- [Arize Phoenix](./_arize.md) - Arize is a platform for observability and debugging of LLMs, more information can be found [here](https://phoenix.arize.com/).
-- [Langsmith](./langsmith.md) - Langsmith is a platform for observability and debugging of LLMs from Langchain, more information can be found [here](https://www.langchain.com/langsmith).
\ No newline at end of file
+- [LangSmith](./langsmith.md) - LangSmith is a platform for observability and debugging of LLMs from LangChain, more information can be found [here](https://www.langchain.com/langsmith).
\ No newline at end of file
diff --git a/docs/howtos/integrations/langchain.md b/docs/howtos/integrations/langchain.md
index b372d66ff..ecdc296ea 100644
--- a/docs/howtos/integrations/langchain.md
+++ b/docs/howtos/integrations/langchain.md
@@ -115,7 +115,7 @@ expected_responses = [
]
```
-To evaluate the Q&A system we need to structure the queries, expected_responses and other metric secpific requirments to [EvaluationDataset][ragas.dataset_schema.EvaluationDataset].
+To evaluate the Q&A system we need to structure the queries, expected_responses and other metric specific requirements to [EvaluationDataset][ragas.dataset_schema.EvaluationDataset].
```python
@@ -139,12 +139,12 @@ for query, reference in zip(sample_queries, expected_responses):
evaluation_dataset = EvaluationDataset.from_list(dataset)
```
-To evauate our Q&A application we will use the following metrices.
+To evaluate our Q&A application we will use the following metrics.
- `LLMContextRecall`: Evaluates how well retrieved contexts align with claims in the reference answer, estimating recall without manual reference context annotations.
- `Faithfulness`: Assesses whether all claims in the generated answer can be inferred directly from the provided context.
-- `Factual Correctness`: Checks the factual accuracy of the generated response by comparing it with a reference, using claim-based evaluation and natural language inference.
+- `Factual Correctness`: Checks the factual accuracy of the generated response by comparing it with a reference, using claim-based evaluation and natural language inference.
For more details on these metrics and how they apply to evaluating RAG systems, visit [Ragas Metrics Documentation](./../../concepts/metrics/available_metrics/).
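+A compact sketch of running these metrics over the dataset built above (it assumes the `evaluator_llm` wrapper configured elsewhere in this guide):
+
+```python
+from ragas import evaluate
+from ragas.metrics import FactualCorrectness, Faithfulness, LLMContextRecall
+
+result = evaluate(
+    dataset=evaluation_dataset,
+    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness()],
+    llm=evaluator_llm,
+)
+print(result)
+```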
diff --git a/docs/howtos/integrations/r2r.md b/docs/howtos/integrations/r2r.md
index 775d381a4..800920937 100644
--- a/docs/howtos/integrations/r2r.md
+++ b/docs/howtos/integrations/r2r.md
@@ -94,21 +94,21 @@ Meta AI’s LLaMA models stand out due to their open-source nature, which suppor
## Evaluations
-#### **Evaluating the `R2R Client` with Ragas**
+#### **Evaluating the `R2R Client` with Ragas**
-With the `R2R Client` in place, we can use Ragas `r2r` integration for evaluation. This process involves the following key components:
+With the `R2R Client` in place, we can use the Ragas `r2r` integration for evaluation. This process involves the following key components:
-- **1. R2R Client and Configurations**
-The `R2RClient` and `/rag` configurations specifying RAG settings.
+- **1. R2R Client and Configurations**
+The `R2RClient` and `/rag` configurations specifying RAG settings.
-- **2. Evaluation Dataset**
-You need a Ragas `EvaluationDataset` that includes all necessary inputs required by Ragas metrics.
+- **2. Evaluation Dataset**
+You need a Ragas `EvaluationDataset` that includes all necessary inputs required by Ragas metrics.
-- **3. Ragas Metrics**
-Ragas provides various evaluation metrics to assess different aspects of the RAG, such as faithfulness, answer relevance, and context recall. You can explore the full list of available metrics in the [Ragas documentation](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/).
+- **3. Ragas Metrics**
+Ragas provides various evaluation metrics to assess different aspects of a RAG system, such as faithfulness, answer relevance, and context recall. You can explore the full list of available metrics in the [Ragas documentation](https://docs.ragas.io/en/latest/concepts/metrics/available_metrics/).
-#### Constructing a Ragas EvaluationDataset
+#### Constructing a Ragas EvaluationDataset
The [`EvaluationDataset`](../../concepts/components/eval_dataset.md) is a data type in Ragas designed to represent evaluation samples. You can find more details about its structure and usage in the [core concepts section](../../concepts/components/eval_dataset.md).
We will use the `transform_to_ragas_dataset` function from Ragas to build the `EvaluationDataset` for our data.
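For intuition, the dataset the helper builds is roughly equivalent to assembling `SingleTurnSample` objects by hand; the field values below are hypothetical:

```python
from ragas import EvaluationDataset, SingleTurnSample

# Hand-built equivalent of what the helper assembles from the R2R responses.
manual_eval_dataset = EvaluationDataset(
    samples=[
        SingleTurnSample(
            user_input="What makes Meta AI's LLaMA models stand out?",
            retrieved_contexts=["Meta AI's LLaMA models stand out due to their open-source nature ..."],
            response="LLaMA models stand out because they are open source, which supports innovation.",
            reference="Their open-source nature, which supports innovation and experimentation.",
        )
    ]
)
```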
@@ -170,7 +170,7 @@ from ragas.llms import LangchainLLMWrapper
llm = ChatOpenAI(model="gpt-4o-mini")
evaluator_llm = LangchainLLMWrapper(llm)
-ragas_metrics = [AnswerRelevancy(llm=evaluator_llm ), ContextPrecision(llm=evaluator_llm ), Faithfulness(llm=evaluator_llm )]
+ragas_metrics = [AnswerRelevancy(llm=evaluator_llm), ContextPrecision(llm=evaluator_llm), Faithfulness(llm=evaluator_llm)]
results = evaluate(dataset=ragas_eval_dataset, metrics=ragas_metrics)
```
diff --git a/docs/howtos/integrations/swarm_agent_evaluation.md b/docs/howtos/integrations/swarm_agent_evaluation.md
index c8a759eb2..9665aef88 100644
--- a/docs/howtos/integrations/swarm_agent_evaluation.md
+++ b/docs/howtos/integrations/swarm_agent_evaluation.md
@@ -10,21 +10,21 @@ Install Ragas with pip and set up Swarm locally:
## Building the Customer Support Agent using Swarm
-In this tutorial, we will create an intelligent customer support agent using [swarm](https://github.com/openai/swarm) and evaluate its performance using [ragas](https://docs.ragas.io/en/stable/) metrics. The agent will focus on two key tasks:
+In this tutorial, we will create an intelligent customer support agent using [swarm](https://github.com/openai/swarm) and evaluate its performance using [ragas](https://docs.ragas.io/en/stable/) metrics. The agent will focus on two key tasks:
- Managing product returns
- Providing order tracking information
For product returns, the agent will collect details from the customer about their order ID and the reason for the return. It will then determine whether the return meets predefined eligibility criteria. If the return is eligible, the agent will guide the customer through the necessary steps to complete the process. If the return is not eligible, the agent will explain the reasons clearly.
-For order tracking, the agent will retrieve the current status of the customer’s order and provide a friendly and detailed update.
+For order tracking, the agent will retrieve the current status of the customer’s order and provide a friendly and detailed update.
Throughout the interaction, the agent will adhere strictly to the outlined process, maintaining a professional and empathetic tone at all times. Before concluding the conversation, the agent will confirm that the customer’s concerns have been fully addressed, ensuring a satisfactory resolution.
### Setting Up the Agents
-To build the customer support agent, we will use a modular design with three specialized agents, each responsible for a specific part of the customer service workflow.
+To build the customer support agent, we will use a modular design with three specialized agents, each responsible for a specific part of the customer service workflow.
-Each agent will follow a set of instructions, called routines, to handle customer requests. A routine is essentially a step-by-step guide written in natural language that helps the agent complete tasks like processing a return or tracking an order. These routines ensure that the agent follows a clear and consistent process for every task.
+Each agent will follow a set of instructions, called routines, to handle customer requests. A routine is essentially a step-by-step guide written in natural language that helps the agent complete tasks like processing a return or tracking an order. These routines ensure that the agent follows a clear and consistent process for every task.
If you want to learn more about routines and how they shape agent behavior, check out the detailed explanations and examples in the routine section of this website: [OpenAI Cookbook - Orchestrating Agents with Routines](https://cookbook.openai.com/examples/orchestrating_agents#routines).
@@ -62,30 +62,30 @@ tracker_agent = Agent(name="Tracker Agent", instructions=TRACKER_AGENT_INSTRUCTI
#### Return Agent
-The Return Agent is responsible for handling product return requests. The Return Agent follows a structured routine to ensure the process is handled smoothly, using specific tools (`valid_to_return`, `initiate_return`, and `case_resolved`) at key steps.
+The Return Agent is responsible for handling product return requests. It follows a structured routine to ensure the process is handled smoothly, using specific tools (`valid_to_return`, `initiate_return`, and `case_resolved`) at key steps.
-The routine works as follows:
+The routine works as follows:
-1. **Ask for Order ID**:
- The agent collects the customer’s order ID to proceed.
+1. **Ask for Order ID**:
+ The agent collects the customer’s order ID to proceed.
-2. **Ask for Return Reason**:
- The agent asks the customer for the reason for the return. It then checks whether the reason matches a predefined list of acceptable return reasons.
+2. **Ask for Return Reason**:
+ The agent asks the customer for the reason for the return. It then checks whether the reason matches a predefined list of acceptable return reasons.
-3. **Evaluate the Reason**:
- - If the reason is valid, the agent moves on to check eligibility.
- - If the reason is invalid, the agent responds empathetically and explains the return policy to the customer.
+3. **Evaluate the Reason**:
+ - If the reason is valid, the agent moves on to check eligibility.
+ - If the reason is invalid, the agent responds empathetically and explains the return policy to the customer.
-4. **Validate Eligibility**:
- The agent uses the `valid_to_return` tool to check if the product qualifies for a return based on the policy. Depending on the outcome, the agent provides a clear response to the customer.
+4. **Validate Eligibility**:
+ The agent uses the `valid_to_return` tool to check if the product qualifies for a return based on the policy. Depending on the outcome, the agent provides a clear response to the customer.
-5. **Initiate the Return**:
- If the product is eligible, the agent uses the `initiate_return` tool to start the return process and shares the next steps with the customer.
+5. **Initiate the Return**:
+ If the product is eligible, the agent uses the `initiate_return` tool to start the return process and shares the next steps with the customer.
-6. **Close the Case**:
- Before ending the conversation, the agent ensures the customer has no further questions. If everything is resolved, the agent uses the `case_resolved` tool to close the case.
+6. **Close the Case**:
+ Before ending the conversation, the agent ensures the customer has no further questions. If everything is resolved, the agent uses the `case_resolved` tool to close the case.
-Using the above logic, we will now create a structured workflow for the product return routine. You can learn more about routines and their implementation in the [OpenAI Cookbook](https://cookbook.openai.com/examples/orchestrating_agents#routines).
+Using the above logic, we will now create a structured workflow for the product return routine. You can learn more about routines and their implementation in the [OpenAI Cookbook](https://cookbook.openai.com/examples/orchestrating_agents#routines).
```python
@@ -105,19 +105,19 @@ You have the chat history, customer and order context available to you.
Here is the policy:"""
-PRODUCT_RETURN_POLICY = f"""1. Use the order ID provided by customer if not ask for it.
-2. Ask the customer for the reason they want to return the product.
-3. Check if the reason matches any of the following conditions:
- - "You received the wrong shipment."
- - "You received a damaged product."
- - "You received an expired product."
- 3a) If the reason matches any of these conditions, proceed to the step.
- 3b) If the reason does not match, politely inform the customer that the product is not eligible for return as per the policy.
-4. Call the `valid_to_return` function to validate the product's return eligibility based on the conditions:
- 4a) If the product is eligible for return: proceed to the next step.
- 4b) If the product is not eligible for return: politely inform the customer about the policy and why the return cannot be processed.
-5. Call the `initiate_return` function.
-6. If the customer has no further questions, call the `case_resolved` function to close the interaction.
+PRODUCT_RETURN_POLICY = f"""1. Use the order ID provided by the customer; if it is not provided, ask for it.
+2. Ask the customer for the reason they want to return the product.
+3. Check if the reason matches any of the following conditions:
+ - "You received the wrong shipment."
+ - "You received a damaged product."
+ - "You received an expired product."
+ 3a) If the reason matches any of these conditions, proceed to the next step.
+ 3b) If the reason does not match, politely inform the customer that the product is not eligible for return as per the policy.
+4. Call the `valid_to_return` function to validate the product's return eligibility based on the conditions:
+ 4a) If the product is eligible for return: proceed to the next step.
+ 4b) If the product is not eligible for return: politely inform the customer about the policy and why the return cannot be processed.
+5. Call the `initiate_return` function.
+6. If the customer has no further questions, call the `case_resolved` function to close the interaction.
"""
@@ -129,7 +129,7 @@ return_agent = Agent(
### Handoff Functions
-To allow the agent to transfer tasks smoothly to another specialized agent, we use handoff functions. These functions return an Agent object, such as `triage_agent`, `return_agent`, or `tracker_agent`, to specify which agent should handle the next steps.
+To allow the agent to transfer tasks smoothly to another specialized agent, we use handoff functions. These functions return an Agent object, such as `triage_agent`, `return_agent`, or `tracker_agent`, to specify which agent should handle the next steps.
For a detailed explanation of handoffs and their implementation, visit the [OpenAI Cookbook - Orchestrating Agents with Routines](https://cookbook.openai.com/examples/orchestrating_agents#handoff-functions).
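As a minimal sketch (assuming the three agents defined above; the exact function names in your project may differ), a handoff function simply returns the next agent:

```python
def transfer_to_triage_agent():
    """Hand the conversation back to the Triage Agent."""
    return triage_agent


def transfer_to_return_agent():
    """Hand the conversation to the Return Agent."""
    return return_agent


def transfer_to_tracker_agent():
    """Hand the conversation to the Tracker Agent."""
    return tracker_agent
```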
@@ -182,7 +182,7 @@ def initiate_return():
return status
```
-### Adding tools to the Agnets
+### Adding tools to the Agents
```py
@@ -191,7 +191,7 @@ tracker_agent.functions = [transfer_to_triage_agent, track_order, case_resolved]
return_agent.functions = [transfer_to_triage_agent, valid_to_return, initiate_return, case_resolved]
```
-We need to capture the messages exchanged during the [demo loop](https://github.com/openai/swarm/blob/main/swarm/repl/repl.py#L60) to evaluate the interactions between the user and the agents. This can be done by modifying the `run_demo_loop` function in the Swarm codebase. Specifically, you’ll need to update the function to return the list of messages once the while loop ends.
+We need to capture the messages exchanged during the [demo loop](https://github.com/openai/swarm/blob/main/swarm/repl/repl.py#L60) to evaluate the interactions between the user and the agents. This can be done by modifying the `run_demo_loop` function in the Swarm codebase. Specifically, you’ll need to update the function to return the list of messages once the while loop ends.
Alternatively, you can redefine the function with this modification directly in your project.
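A simplified variant (the `exit` keyword and the trimmed-down loop are assumptions for illustration; Swarm's actual REPL also handles streaming and pretty-printing) could look like this:

```python
from swarm import Swarm


def run_demo_loop(starting_agent, context_variables=None, debug=False):
    """Chat with the agents in a loop and return the collected messages."""
    client = Swarm()
    messages = []
    agent = starting_agent
    while True:
        user_input = input("User: ")
        if user_input.strip().lower() == "exit":  # assumed exit convention
            return messages  # the modification: hand the transcript back
        messages.append({"role": "user", "content": user_input})
        response = client.run(
            agent=agent,
            messages=messages,
            context_variables=context_variables or {},
            debug=debug,
        )
        messages.extend(response.messages)
        agent = response.agent
```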
@@ -294,13 +294,13 @@ AIMessage(content="You're welcome! 🎈 Your case is all wrapped up, and I'm thr
## Evaluating the Agent's Performance
-In this tutorial, we will evaluate the Agent using the following metrics:
+In this tutorial, we will evaluate the Agent using the following metrics:
-1. **[Tool Call Accuracy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#tool-call-accuracy)**: This metric measures how accurately the Agent identifies and uses the correct tools to complete a task.
+1. **[Tool Call Accuracy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#tool-call-accuracy)**: This metric measures how accurately the Agent identifies and uses the correct tools to complete a task.
-2. **[Agent Goal Accuracy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#agent-goal-accuracy)**: This binary metric evaluates whether the Agent successfully identifies and achieves the user’s goals. A score of 1 means the goal was achieved, while 0 means it was not.
+2. **[Agent Goal Accuracy](https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/agents/#agent-goal-accuracy)**: This binary metric evaluates whether the Agent successfully identifies and achieves the user’s goals. A score of 1 means the goal was achieved, while 0 means it was not.
-To begin, we will run the Agent with a few sample queries and ensure we have the ground truth labels for these queries. This will allow us to accurately evaluate the Agent’s performance.
+To begin, we will run the Agent with a few sample queries and ensure we have the ground truth labels for these queries. This will allow us to accurately evaluate the Agent’s performance.
### Tool Call Accuracy
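As an illustration of the shape this evaluation takes (the message contents and tool arguments below are hypothetical), a single conversation can be scored like this:

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="I want to return order #4321, it arrived damaged."),
        AIMessage(
            content="Sorry to hear that! Let me check whether it is eligible for a return.",
            tool_calls=[ToolCall(name="valid_to_return", args={})],
        ),
    ],
    # Ground-truth tool calls we expect the agent to make for this query.
    reference_tool_calls=[ToolCall(name="valid_to_return", args={})],
)

scorer = ToolCallAccuracy()
score = await scorer.multi_turn_ascore(sample)  # run inside an async context or a notebook
print(score)
```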
@@ -440,7 +440,7 @@ Output
```
-**Agent Goal Accuracy: 0.0**
+**Agent Goal Accuracy: 0.0**
The **AgentGoalAccuracyWithReference** metric compares the agent's final response to the expected goal. In this case, while the agent’s response follows company policy, it does not fulfill the user’s return request. Since the return request couldn’t be completed due to policy constraints, the reference goal ("successfully resolved the user's request") is not met. As a result, the score is 0.0.
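For reference, scoring a conversation against an expected goal follows the same pattern (the messages, the reference text, and `evaluator_llm` are assumptions for illustration):

```python
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import AIMessage, HumanMessage
from ragas.metrics import AgentGoalAccuracyWithReference

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="I want to return my previous order."),
        AIMessage(
            content="The return reason does not meet our policy, so the return cannot be processed."
        ),
    ],
    reference="The agent successfully resolves the user's return request.",
)

scorer = AgentGoalAccuracyWithReference(llm=evaluator_llm)  # assumes a wrapped evaluator LLM
score = await scorer.multi_turn_ascore(sample)  # 1.0 if the goal was met, else 0.0
```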
diff --git a/docs/howtos/migrations/migrate_from_v01_to_v02.md b/docs/howtos/migrations/migrate_from_v01_to_v02.md
index 4032029a5..68db21df0 100644
--- a/docs/howtos/migrations/migrate_from_v01_to_v02.md
+++ b/docs/howtos/migrations/migrate_from_v01_to_v02.md
@@ -11,9 +11,9 @@ v0.2 is the start of the transition for Ragas from an evaluation library for RAG
## Evaluation Dataset
-We have moved from using HuggingFace [`Datasets`](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset) to our own [`EvaluationDataset`][ragas.dataset_schema.EvaluationDataset] . You can read more about it from the core concepts section for [EvaluationDataset](../../concepts/components/eval_dataset.md) and [EvaluationSample](../../concepts/components/eval_sample.md)
+We have moved from using HuggingFace [`Datasets`](https://huggingface.co/docs/datasets/v3.0.1/en/package_reference/main_classes#datasets.Dataset) to our own [`EvaluationDataset`][ragas.dataset_schema.EvaluationDataset]. You can read more about it in the core concepts section for [EvaluationDataset](../../concepts/components/eval_dataset.md) and [EvaluationSample](../../concepts/components/eval_sample.md).
-You can easily translate
+You can easily translate your existing data into the new format:
```python
from ragas import EvaluationDataset, SingleTurnSample
@@ -30,11 +30,11 @@ eval_dataset = EvaluationDataset.from_csv("path/to/save/dataset.csv")
## Metrics
-All the default metrics are still supported and many new metrics have been added. Take a look at the [documentation page](../../concepts/metrics/available_metrics/index.md) for the entire list.
+All the default metrics are still supported, and many new metrics have been added. Take a look at the [documentation page](../../concepts/metrics/available_metrics/index.md) for the entire list.
-How ever there are a couple of changes in how you use metrics
+However, there are a couple of changes in how you use metrics.
-Firstly it is now preferred to initialize metrics with the evaluator LLM of your choice as oppose to using the initialized version of the metrics into [`evaluate()`][ragas.evaluation.evaluate] . This avoids a lot of confusion regarding which LLMs are used where.
+Firstly, it is now preferred to initialize metrics with the evaluator LLM of your choice, as opposed to passing pre-initialized metrics into [`evaluate()`][ragas.evaluation.evaluate]. This avoids a lot of confusion regarding which LLMs are used where.
```python
from ragas.metrics import faithfulness # old way, not recommended but still supported till v0.3
@@ -50,7 +50,7 @@ Second is that [`metrics.ascore`][ragas.metrics.base.Metric.ascore] is now being
from ragas import SingleTurnSample
sample = SingleTurnSample(
- user_input="user query",
+ user_input="user query",
response="response from your pipeline",
retrieved_contexts=["retrieved", "contexts", "from your pipeline" ]
)
@@ -67,7 +67,7 @@ Output
## Testset Generation
-[Testset Generation](../../concepts/test_data_generation/rag.md) has been redesigned to be much more cost efficient. If you were using the end-to-end workflow checkout the [getting started](../../getstarted/rag_testset_generation.md).
+[Testset Generation](../../concepts/test_data_generation/rag.md) has been redesigned to be much more cost-efficient. If you were using the end-to-end workflow, check out the [getting started guide](../../getstarted/rag_testset_generation.md).
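As a quick orientation (not a substitute for the getting-started guide; the model choice, `docs`, and test set size below are placeholders), the new end-to-end flow looks roughly like this:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    embedding_model=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
)
# `docs` is a list of LangChain documents loaded from your corpus.
testset = generator.generate_with_langchain_docs(docs, testset_size=10)
testset.to_pandas()
```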
**Notable Changes**
@@ -83,7 +83,7 @@ This might be a bit rough but if you do need help here, feel free to chat or men
All the prompts have been rewritten to use [`PydanticPrompt`][ragas.prompt.pydantic_prompt.PydanticPrompt], which is based on the [`BasePrompt`][ragas.prompt.base.BasePrompt] object. If you are using the old `Prompt` object, you will have to upgrade it to the new one; the docs below explain how, and a minimal sketch follows the links.
- [How-to guide on creating new prompts](./../customizations/metrics/_modifying-prompts-metrics.md)
-- [Github PR for the changes](https://github.com/explodinggradients/ragas/pull/1462)
+- [GitHub PR for the changes](https://github.com/explodinggradients/ragas/pull/1462)
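A minimal sketch of the new prompt style (the models and example below are illustrative, not a prompt shipped with Ragas):

```python
from pydantic import BaseModel

from ragas.prompt import PydanticPrompt


class QuestionInput(BaseModel):
    question: str


class AnswerOutput(BaseModel):
    answer: str


class ExamplePrompt(PydanticPrompt[QuestionInput, AnswerOutput]):
    instruction = "Answer the question in a single sentence."
    input_model = QuestionInput
    output_model = AnswerOutput
    examples = [
        (
            QuestionInput(question="What does Ragas help you evaluate?"),
            AnswerOutput(answer="Ragas helps you evaluate LLM applications such as RAG pipelines."),
        )
    ]
```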
!!! note "Need Further Assistance?"
diff --git a/docs/howtos/observability.md b/docs/howtos/observability.md
index 18a7116cf..d376acde9 100644
--- a/docs/howtos/observability.md
+++ b/docs/howtos/observability.md
@@ -4,7 +4,7 @@
### 1. Introduction
-Building a baseline for a RAG pipeline is not usually difficult, but enhancing it to make it suitable for production and ensuring the quality of your responses is almost always hard. Choosing the right tools and parameters for RAG can itself be challenging when there is an abundance of options available. This tutorial shares a robust workflow for making the right choices while building your RAG and ensuring its quality.
+Building a baseline for a RAG pipeline is not usually difficult, but enhancing it to make it suitable for production and ensuring the quality of your responses is almost always hard. Choosing the right tools and parameters for RAG can itself be challenging when there is an abundance of options available. This tutorial shares a robust workflow for making the right choices while building your RAG and ensuring its quality.
This article covers how to evaluate, visualize and analyze your RAG using a combination of open-source libraries. We will be using:
@@ -36,7 +36,7 @@ Install and import Python dependencies.
```python
import pandas as pd
-# Display the complete contents of dataframe cells.
+# Display the complete contents of DataFrame cells.
pd.set_option("display.max_colwidth", None)
```
@@ -58,7 +58,7 @@ os.environ["OPENAI_API_KEY"] = openai_api_key
### 4. Generate Your Synthetic Test Dataset
-Curating a golden test dataset for evaluation can be a long, tedious, and expensive process that is not pragmatic — especially when starting out or when data sources keep changing. This can be solved by synthetically generating high quality data points, which then can be verified by developers. This can reduce the time and effort in curating test data by 90%.
+Curating a golden test dataset for evaluation can be a long, tedious, and expensive process that is not pragmatic — especially when starting out or when data sources keep changing. This can be solved by synthetically generating high quality data points, which then can be verified by developers. This can reduce the time and effort in curating test data by 90%.
Run the cell below to download a dataset of prompt engineering papers in PDF format from arXiv and read these documents using LlamaIndex.
@@ -102,7 +102,7 @@ You are free to change the question type distribution according to your needs. S
### 5. Build Your RAG Application With LlamaIndex
-LlamaIndex is an easy to use and flexible framework for building RAG applications. For the sake of simplicity, we use the default LLM (gpt-3.5-turbo) and embedding models (openai-ada-2).
+LlamaIndex is an easy-to-use and flexible framework for building RAG applications. For the sake of simplicity, we use the default LLM (`gpt-3.5-turbo`) and embedding model (`text-embedding-ada-002`).
Launch Phoenix in the background and instrument your LlamaIndex application so that your OpenInference spans and traces are sent to and collected by Phoenix. [OpenInference](https://github.com/Arize-ai/openinference/tree/main/spec) is an open standard built atop OpenTelemetry that captures and stores LLM application executions. It is designed to be a category of telemetry data that is used to understand the execution of LLMs and the surrounding application context, such as retrieval from vector stores and the usage of external tools such as search engines or APIs.
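One way this wiring can look (the `register` helper and the OpenInference LlamaIndex instrumentor are assumptions about your Phoenix and OpenInference versions; the notebook's own cells may differ):

```python
import phoenix as px
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from phoenix.otel import register

session = px.launch_app()  # start a local Phoenix instance in the background
tracer_provider = register()  # route OpenTelemetry exports to Phoenix
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```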
@@ -136,7 +136,7 @@ def build_query_engine(documents):
query_engine = build_query_engine(documents)
```
-If you check Phoenix, you should see embedding spans from when your corpus data was indexed. Export and save those embeddings into a dataframe for visualization later in the notebook.
+If you check Phoenix, you should see embedding spans from when your corpus data was indexed. Export and save those embeddings into a DataFrame for visualization later in the notebook.
```python
@@ -210,7 +210,7 @@ print(session.url)

-We save out a couple of dataframes, one containing embedding data that we'll visualize later, and another containing our exported traces and spans that we plan to evaluate using Ragas.
+We save out a couple of DataFrames: one containing embedding data that we'll visualize later, and another containing our exported traces and spans that we plan to evaluate using Ragas.
```python
@@ -232,7 +232,7 @@ spans_dataframe = get_qa_with_reference(client)
spans_dataframe.head()
```
-Ragas uses LangChain to evaluate your LLM application data. Let's instrument LangChain with OpenInference so we can see what's going on under the hood when we evaluate our LLM application.
+Ragas uses LangChain to evaluate your LLM application data. Let's instrument LangChain with OpenInference so that we can see what's going on under the hood when we evaluate our LLM application.
```python
@@ -241,7 +241,7 @@ from openinference.instrumentation.langchain import LangChainInstrumentor
LangChainInstrumentor().instrument()
```
-Evaluate your LLM traces and view the evaluation scores in dataframe format.
+Evaluate your LLM traces and view the evaluation scores in DataFrame format.
```python
@@ -260,7 +260,7 @@ evaluation_result = evaluate(
eval_scores_df = pd.DataFrame(evaluation_result.scores)
```
-Submit your evaluations to Phoenix so they are visible as annotations on your spans.
+Submit your evaluations to Phoenix so that they are visible as annotations on your spans.
```python
@@ -349,8 +349,8 @@ Once you launch Phoenix, you can visualize your data with the metric of your cho
Congrats! You built and evaluated a LlamaIndex query engine using Ragas and Phoenix. Let's recap what we learned:
-- With Ragas, you bootstraped a test dataset and computed metrics such as faithfulness and answer correctness to evaluate your LlamaIndex query engine.
-- With OpenInference, you instrumented your query engine so you could observe the inner workings of both LlamaIndex and Ragas.
+- With Ragas, you bootstrapped a test dataset and computed metrics such as faithfulness and answer correctness to evaluate your LlamaIndex query engine.
+- With OpenInference, you instrumented your query engine so that you could observe the inner workings of both LlamaIndex and Ragas.
- With Phoenix, you collected your spans and traces, imported your evaluations for easy inspection, and visualized your embedded queries and retrieved documents to identify pockets of poor performance.
This notebook is just an introduction to the capabilities of Ragas and Phoenix. To learn more, see the [Ragas](https://docs.ragas.io/en/stable/) and [Phoenix docs](https://docs.arize.com/phoenix/).