
Corrected numerous typos in Markdown files. #1994


Open · wants to merge 2 commits into base: main

16 changes: 8 additions & 8 deletions docs/community/index.md
@@ -1,4 +1,4 @@
-# ❤️ Community
+# ❤️ Community

> "Alone we can do so little; together we can do so much." - Helen Keller

@@ -16,7 +16,7 @@ Meet some of our outstanding members who made significant contributions!

Explore insightful articles, tutorials, and stories written by and for our community members.

-- [Luka Panić](https://www.linkedin.com/in/luka-pani%C4%87-20b671277/) shares his work on
+- [Luka Panić](https://www.linkedin.com/in/luka-pani%C4%87-20b671277/) shares his work on
- [Ragas Evaluation: In-Depth Insights | PIXION Blog](https://pixion.co/blog/ragas-evaluation-in-depth-insights): A detailed explanation of the metrics and how they are calculated.
- [RAG in practice - Test Set Generation | PIXION Blog](https://pixion.co/blog/rag-in-practice-test-set-generation): A tutorial on how to generate a test set using Ragas.
- [Shanthi Vardhan](https://www.linkedin.com/in/shanthivardhan/) shares how his team at [Atomicwork uses ragas](https://www.atomicwork.com/blog/ragas-improving-atom-accuracy) to improve their AI system's ability to accurately identify and retrieve more precise information for enhanced service management.
@@ -25,16 +25,16 @@ Explore insightful articles, tutorials, and stories written by and for our community
- Leonie (aka [@helloiamleonie](https://twitter.com/helloiamleonie?source=about_page-------------------------------------)) offers her perspective in the detailed article, ["Evaluating RAG Applications with RAGAs"](https://towardsdatascience.com/evaluating-rag-applications-with-ragas-81d67b0ee31a).
- The joint efforts of [Erika Cardenas](https://twitter.com/ecardenas300) and [Connor Shorten](https://twitter.com/CShorten30) are showcased in their collaborative piece, ["An Overview on RAG Evaluation | Weaviate"](https://weaviate.io/blog/rag-evaluation), and their podcast with the Ragas team.
- [Erika Cardenas](https://twitter.com/ecardenas300) further explores the "[RAG performance of hybrid search weightings (alpha)](https://www.linkedin.com/posts/erikacardenas300_i-tested-the-rag-performance-of-hybrid-search-activity-7139679925426376705-TVtc?utm_source=share&utm_medium=member_desktop)" in her recent experiment to tune the Weaviate alpha score using Ragas.
-- [Langchain’s](https://blog.langchain.dev/) work about [RAG Evaluating RAG pipelines with RAGAs and Langsmith](https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/) provides a complete tutorial on how to leverage both tools to evaluate RAG pipelines.
-- [Plaban Nayak](https://nayakpplaban.medium.com/) shares his work [Evaluate RAG Pipeline using RAGAS](https://medium.aiplanet.com/evaluate-rag-pipeline-using-ragas-fbdd8dd466c1) on building and evaluating a simple RAG using Langchain and RAGAS
-- [Stephen Kurniawan](https://www.linkedin.com/in/stepkurniawan/) compares different RAG elements such as [Chunk Size](https://medium.com/@stepkurniawan/rag-chunk-size-experiment-e5e5ca437f44), [Vector Stores: FAISS vs ChromaDB](https://medium.com/@stepkurniawan/comparing-faiss-with-chroma-vector-stores-0953e1e619eb), [Vector Stores 2: Multiple Documents](https://medium.com/@stepkurniawan/comparing-faiss-vs-chroma-vector-store-retrieve-multiple-documents-07ad81a18851), and [Similarity Searches / Distance Metrics / Index Strategies](https://medium.com/@stepkurniawan/comparing-similarity-searches-distance-metrics-in-vector-stores-rag-model-f0b3f7532d6f).
+- [LangChain’s](https://blog.langchain.dev/) work about [RAG Evaluating RAG pipelines with RAGAs and LangSmith](https://blog.langchain.dev/evaluating-rag-pipelines-with-ragas-langsmith/) provides a complete tutorial on how to leverage both tools to evaluate RAG pipelines.
+- [Plaban Nayak](https://nayakpplaban.medium.com/) shares his work [Evaluate RAG Pipeline using RAGAS](https://medium.aiplanet.com/evaluate-rag-pipeline-using-ragas-fbdd8dd466c1) on building and evaluating a simple RAG using LangChain and RAGAS
+- [Stephen Kurniawan](https://www.linkedin.com/in/stepkurniawan/) compares different RAG elements such as [Chunk Size](https://medium.com/@stepkurniawan/rag-chunk-size-experiment-e5e5ca437f44), [Vector Stores: FAISS vs ChromaDB](https://medium.com/@stepkurniawan/comparing-faiss-with-chroma-vector-stores-0953e1e619eb), [Vector Stores 2: Multiple Documents](https://medium.com/@stepkurniawan/comparing-faiss-vs-chroma-vector-store-retrieve-multiple-documents-07ad81a18851), and [Similarity Searches / Distance Metrics / Index Strategies](https://medium.com/@stepkurniawan/comparing-similarity-searches-distance-metrics-in-vector-stores-rag-model-f0b3f7532d6f).
- Discover [Devanshu Brahmbhatt](https://www.linkedin.com/in/devanshubrahmbhatt/)'s insights on optimizing RAG systems in his article, [Enhancing LLM's Accuracy with RAGAS](https://devanshus-organization.gitbook.io/llm-testing-ragas). Learn about RAG architecture, key evaluation metrics, and how to use RAGAS scores to improve performance.
- [Suzuki](https://www.linkedin.com/in/hirokazu-suzuki-206245110/) and [Hwang](https://www.linkedin.com/in/hwang-yongtae/) conducted an experiment to investigate whether Ragas' performance is language-dependent by comparing the performance (correlation coefficient between human labels and scores from Ragas) using datasets of the same content in Japanese and English. They wrote blog posts about the results of the experiment and the basic algorithm of Ragas.
- [RAG Evaluation: Necessity and Challenge](https://tech.beatrust.com/entry/2024/05/02/RAG_Evaluation%3A_Necessity_and_Challenge)
- [RAG Evaluation : Computational Metrics in RAG and Calculation Methods in Ragas](https://tech.beatrust.com/entry/2024/05/02/RAG_Evaluation_%3A_Computational_Metrics_in_RAG_and_Calculation_Methods_in_Ragas)
- [RAG Evaluation: Assessing the Usefulness of Ragas](https://tech.beatrust.com/entry/2024/05/02/RAG_Evaluation%3A_Assessing_the_Usefulness_of_Ragas)
-- [Atita Arora](https://www.linkedin.com/in/atitaarora/) writes about [Evaluating Retrieval Augmented Generation using RAGAS](https://superlinked.com/vectorhub/articles/retrieval-augmented-generation-eval-qdrant-ragas), an end-to-end tutorial on building RAG using [Qdrant](https://qdrant.tech/) and [Langchain](https://www.langchain.com/) and evaluating it with RAGAS.
-- *Bonus content* : Learn how to create an evaluation dataset that serves as a reference point for evaluating our RAG pipeline, Understand the RAGAS evaluation metrics and how to make sense of them and putting them in action to test a Naive RAG pipeline and measure its performance using RAGAS metrics.
+- [Atita Arora](https://www.linkedin.com/in/atitaarora/) writes about [Evaluating Retrieval Augmented Generation using RAGAS](https://superlinked.com/vectorhub/articles/retrieval-augmented-generation-eval-qdrant-ragas), an end-to-end tutorial on building RAG using [Qdrant](https://qdrant.tech/) and [LangChain](https://www.langchain.com/) and evaluating it with RAGAS.
+- *Bonus content*: Learn how to create an evaluation dataset that serves as a reference point for evaluating our RAG pipeline, understand the RAGAS evaluation metrics and how to make sense of them, and put them into action to test a naive RAG pipeline and measure its performance using RAGAS metrics.
- *Code walkthrough* : https://github.com/qdrant/qdrant-rag-eval/tree/master/workshop-rag-eval-qdrant-ragas
- *Code walkthrough using [Deepset Haystack](https://haystack.deepset.ai/) and [Mixedbread.ai](https://www.mixedbread.ai/)* : https://github.com/qdrant/qdrant-rag-eval/tree/master/workshop-rag-eval-qdrant-ragas-haystack
- [Minoru Onda](https://x.com/minorun365) writes for beginners about how to get started with Ragas v0.2 evaluation on Amazon Bedrock and integrate it with Langfuse.
@@ -50,5 +50,5 @@ Explore insightful articles, tutorials, and stories written by and for our community
Stay updated with our latest gatherings, meetups, and online webinars.

- OpenAI engineers share their [RAG tricks and features Ragas](https://youtu.be/ahnGLM-RC1Y?si=rS_WSQF8XB04PzhP) on DevDay.
-- [Langchain](https://python.langchain.com/docs/get_started/introduction)’s a [LangChain "RAG Evaluation” Webinar](https://www.crowdcast.io/c/bnx91nz59cqq) with the Ragas team
+- [LangChain](https://python.langchain.com/docs/get_started/introduction) hosted a [LangChain “RAG Evaluation” Webinar](https://www.crowdcast.io/c/bnx91nz59cqq) with the Ragas team

4 changes: 2 additions & 2 deletions docs/concepts/components/eval_sample.md
@@ -1,6 +1,6 @@
-# Evaluation Sample
+# Evaluation Sample

-An evaluation sample is a single structured data instance that is used to asses and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.
+An evaluation sample is a single structured data instance that is used to assess and measure the performance of your LLM application in specific scenarios. It represents a single unit of interaction or a specific use case that the AI application is expected to handle. In Ragas, evaluation samples are represented using the `SingleTurnSample` and `MultiTurnSample` classes.

## SingleTurnSample
SingleTurnSample represents a single-turn interaction between a user, LLM, and expected results for evaluation. It is suitable for evaluations that involve a single question and answer pair, possibly with additional context or reference information.
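
For illustration (not taken from this diff), a minimal sketch of constructing a `SingleTurnSample`; the import path and field names are assumptions based on the Ragas 0.2 documentation.

```python
# Minimal sketch; assumes ragas 0.2 and the ragas.dataset_schema import path.
from ragas.dataset_schema import SingleTurnSample

sample = SingleTurnSample(
    user_input="When was the Eiffel Tower built?",                    # question posed to the application
    retrieved_contexts=["The Eiffel Tower was completed in 1889."],   # contexts returned by the retriever
    response="The Eiffel Tower was completed in 1889.",               # the application's answer
    reference="The Eiffel Tower was built in 1889.",                  # ground truth used by reference-based metrics
)
```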
10 changes: 5 additions & 5 deletions docs/concepts/metrics/available_metrics/agents.md
@@ -7,8 +7,8 @@ Agentic or tool use workflows can be evaluated in multiple dimensions. Here are

AI systems deployed in real-world applications are expected to adhere to domains of interest while interacting with users, but LLMs may sometimes answer general queries, ignoring this limitation. The topic adherence metric evaluates the ability of the AI to stay on predefined domains during the interactions. This metric is particularly important in conversational AI systems, where the AI is expected to only provide assistance to queries related to predefined domains.

-`TopicAdherenceScore` requires a predefined set of topics that the AI system is expected to adhere to which is provided using `reference_topics` along with `user_input`. The metric can compute precision, recall, and F1 score for topic adherence, defined as
+`TopicAdherenceScore` requires a predefined set of topics that the AI system is expected to adhere to, which is provided using `reference_topics` along with `user_input`. The metric can compute precision, recall, and F1 score for topic adherence, defined as

$$
\text{Precision } = {|\text{Queries that are answered and adhere to any present reference topics}| \over |\text{Queries that are answered and adhere to any present reference topics}| + |\text{Queries that are answered and do not adhere to any present reference topics}|}
$$
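
As an illustrative sketch (not taken from this diff) of how the precision mode described above might be invoked, assuming the Ragas 0.2 `MultiTurnSample`, its message classes, the `multi_turn_ascore` API, and a pre-configured `evaluator_llm`:

```python
# Illustrative sketch; class names and import paths assumed from ragas 0.2.
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage
from ragas.metrics import TopicAdherenceScore

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="Can you help me reset my banking password?"),
        AIMessage(content="Sure. Go to Settings > Security and choose 'Reset password'."),
        HumanMessage(content="Also, what's a good recipe for lasagna?"),
        AIMessage(content="Sorry, I can only help with banking-related questions."),
    ],
    reference_topics=["banking", "account security"],  # domains the assistant must stay within
)

scorer = TopicAdherenceScore(llm=evaluator_llm, mode="precision")
score = await scorer.multi_turn_ascore(sample)  # run inside an async context
```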
@@ -65,7 +65,7 @@ To change the mode to recall, set the `mode` parameter to `recall`.

```python
scorer = TopicAdherenceScore(llm = evaluator_llm, mode="recall")
-```
+```
Output
```
0.99999999995
```

@@ -75,7 +75,7 @@

## Tool call Accuracy

-`ToolCallAccuracy` is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. This metric needs `user_input` and `reference_tool_calls` to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. The metric is computed by comparing the `reference_tool_calls` with the Tool calls made by the AI. The values range between 0 and 1, with higher values indicating better performance.
+`ToolCallAccuracy` is a metric that can be used to evaluate the performance of the LLM in identifying and calling the required tools to complete a given task. This metric needs `user_input` and `reference_tool_calls` as inputs. It is computed by comparing the `reference_tool_calls` with the tool calls made by the AI. The values range between 0 and 1, with higher values indicating better performance.

```python
from ragas.metrics import ToolCallAccuracy
# ... (remainder of this example is collapsed in the diff)
```

@@ -113,7 +113,7 @@ Output

The tool call sequence specified in `reference_tool_calls` is used as the ideal outcome. If the tool calls made by the AI do not match the order or sequence of the `reference_tool_calls`, the metric will return a score of 0. This helps to ensure that the AI is able to identify and call the required tools in the correct order to complete a given task.
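
Because the original code example above is collapsed in this diff, here is a minimal end-to-end sketch of the usage described in this section, assuming the Ragas 0.2 message and tool-call classes:

```python
# Illustrative sketch; class names and import paths assumed from ragas 0.2.
from ragas.dataset_schema import MultiTurnSample
from ragas.messages import HumanMessage, AIMessage, ToolMessage, ToolCall
from ragas.metrics import ToolCallAccuracy

sample = MultiTurnSample(
    user_input=[
        HumanMessage(content="What's the weather in New York right now?"),
        AIMessage(
            content="Let me look that up.",
            tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
        ),
        ToolMessage(content="22 degrees Celsius and sunny."),
        AIMessage(content="It is currently 22 degrees Celsius and sunny in New York."),
    ],
    # The ideal tool-call sequence the agent is expected to produce.
    reference_tool_calls=[ToolCall(name="weather_check", args={"location": "New York"})],
)

scorer = ToolCallAccuracy()                     # no LLM needed; comparison is string-based by default
score = await scorer.multi_turn_ascore(sample)  # run inside an async context
```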

-By default the tool names and arguments are compared using exact string matching. But sometimes this might not be optimal, for example if the args are natural language strings. You can also use any ragas metrics (values between 0 and 1) as distance measure to identify if a retrieved context is relevant or not. For example,
+By default, the tool names and arguments are compared using exact string matching. But sometimes this might not be optimal, for example if the args are natural language strings. You can also use any ragas metric (values between 0 and 1) as a distance measure to identify if a retrieved context is relevant or not. For example,

```python
from ragas.metrics._string import NonLLMStringSimilarity
# ... (remainder of this example is collapsed in the diff)
```
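
The rest of that snippet is collapsed in the diff; a minimal sketch of the idea, assuming `arg_comparison_metric` is the attribute used to swap in a softer comparison:

```python
# Illustrative sketch; attribute name assumed from the ragas 0.2 docs.
from ragas.metrics import ToolCallAccuracy
from ragas.metrics._string import NonLLMStringSimilarity

scorer = ToolCallAccuracy()
# Compare tool-call arguments with a 0-1 string-similarity score instead of exact matching.
scorer.arg_comparison_metric = NonLLMStringSimilarity()
```
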
2 changes: 1 addition & 1 deletion docs/concepts/metrics/available_metrics/aspect_critic.md
@@ -5,7 +5,7 @@ This is designed to assess submissions based on predefined aspects such as `harm

Critiques within the LLM evaluators evaluate submissions based on the provided aspect. Ragas Critiques offers a range of predefined aspects like correctness, harmfulness, etc. (Please refer to `SUPPORTED_ASPECTS` for a complete list). If you prefer, you can also create custom aspects to evaluate submissions according to your unique requirements.

-The `strictness` parameter plays a crucial role in maintaining a certain level of self-consistency in predictions, with an ideal range typically falling between 2 to 4.
+The `strictness` parameter plays a crucial role in maintaining a certain level of self-consistency in predictions, with an ideal range typically falling from 2 to 4.
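
As an illustrative sketch (not taken from this diff), a custom aspect might be defined as below, assuming the Ragas 0.2 `AspectCritic` class with `definition` and `strictness` parameters and a pre-configured `evaluator_llm`:

```python
# Illustrative sketch; class name and parameters assumed from ragas 0.2.
from ragas.metrics import AspectCritic

scorer = AspectCritic(
    name="misleadingness",
    definition="Does the submission mislead the user or state unsupported claims?",
    llm=evaluator_llm,
    strictness=3,  # self-consistency setting; the note above suggests a range of 2 to 4
)
score = await scorer.single_turn_ascore(sample)  # run inside an async context
```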


```{hint}
(hint content collapsed in this diff)
```
20 changes: 10 additions & 10 deletions docs/concepts/metrics/available_metrics/context_entities_recall.md
@@ -1,13 +1,13 @@
## Context Entities Recall

-`ContextEntityRecall` metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it is a measure of what fraction of entities are recalled from `reference`. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in `reference`, because in cases where entities matter, we need the `retrieved_contexts` which cover them.
+`ContextEntityRecall` metric gives the measure of recall of the retrieved context, based on the number of entities present in both `reference` and `retrieved_contexts` relative to the number of entities present in the `reference` alone. Simply put, it is a measure of what fraction of entities is recalled from `reference`. This metric is useful in fact-based use cases like tourism help desk, historical QA, etc. This metric can help evaluate the retrieval mechanism for entities, based on comparison with entities present in `reference`, because in cases where entities matter, we need the `retrieved_contexts` which cover them.

-To compute this metric, we use two sets:
+To compute this metric, we use two sets:

-- **$RE$**: The set of entities in the reference.
-- **$RCE$**: The set of entities in the retrieved contexts.
+- **$RE$**: The set of entities in the reference.
+- **$RCE$**: The set of entities in the retrieved contexts.

-We calculate the number of entities common to both sets ($RCE \cap RE$) and divide it by the total number of entities in the reference ($RE$). The formula is:
+We calculate the number of entities common to both sets ($RCE \cap RE$) and divide it by the total number of entities in the reference ($RE$). The formula is:

$$
\text{Context Entity Recall} = \frac{\text{Number of common entities between $RCE$ and $RE$}}{\text{Total number of entities in $RE$}}
@@ -22,7 +22,7 @@ from ragas.metrics import ContextEntityRecall

sample = SingleTurnSample(
reference="The Eiffel Tower is located in Paris.",
-retrieved_contexts=["The Eiffel Tower is located in Paris."],
+retrieved_contexts=["The Eiffel Tower is located in Paris."],
)

scorer = ContextEntityRecall(llm=evaluator_llm)
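# (Illustrative, not from the diff) The scoring call itself is collapsed below;
# with the assumed ragas 0.2 API it would look like:
#     score = await scorer.single_turn_ascore(sample)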
@@ -43,15 +43,15 @@ Output
**High entity recall context**: The Taj Mahal is a symbol of love and architectural marvel located in Agra, India. It was built by the Mughal emperor Shah Jahan in memory of his beloved wife, Mumtaz Mahal. The structure is renowned for its intricate marble work and beautiful gardens surrounding it.
**Low entity recall context**: The Taj Mahal is an iconic monument in India. It is a UNESCO World Heritage Site and attracts millions of visitors annually. The intricate carvings and stunning architecture make it a must-visit destination.

-Let us consider the refrence and the retrieved contexts given above.
+Let us consider the reference and the retrieved contexts given above.

-- **Step-1**: Find entities present in the refrence.
+- **Step-1**: Find entities present in the reference.
- Entities in ground truth (RE) - ['Taj Mahal', 'Yamuna', 'Agra', '1631', 'Shah Jahan', 'Mumtaz Mahal']
- **Step-2**: Find entities present in the retrieved contexts.
- Entities in context (RCE1) - ['Taj Mahal', 'Agra', 'Shah Jahan', 'Mumtaz Mahal', 'India']
- Entities in context (RCE2) - ['Taj Mahal', 'UNESCO', 'India']
- **Step-3**: Use the formula given above to calculate entity-recall

$$
\text{context entity recall 1} = \frac{| RCE1 \cap RE |}{| RE |} = 4/6
$$

@@ -63,5 +63,5 @@ Let us consider the reference and the retrieved contexts given above.

$$
\text{context entity recall 2} = \frac{| RCE2 \cap RE |}{| RE |} = 1/6
$$

-We can see that the first context had a high entity recall, because it has a better entity coverage given the refrence. If these two retrieved contexts were fetched by two retrieval mechanisms on same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
+We can see that the first context had a high entity recall, because it has better entity coverage given the reference. If these two retrieved contexts were fetched by two retrieval mechanisms on the same set of documents, we could say that the first mechanism was better than the other in use-cases where entities are of importance.
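
The arithmetic above can be reproduced with plain set operations; a small illustrative snippet (not from the diff) using the entities listed in the example:

```python
# Entity sets taken from the worked example above.
reference_entities = {"Taj Mahal", "Yamuna", "Agra", "1631", "Shah Jahan", "Mumtaz Mahal"}
high_recall_context = {"Taj Mahal", "Agra", "Shah Jahan", "Mumtaz Mahal", "India"}
low_recall_context = {"Taj Mahal", "UNESCO", "India"}

recall_high = len(high_recall_context & reference_entities) / len(reference_entities)  # 4/6 ≈ 0.67
recall_low = len(low_recall_context & reference_entities) / len(reference_entities)    # 1/6 ≈ 0.17
```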
