DOCSP-51695: [IS Edits] Build a PDF Search Application with Vector Search and LLMs (#277)

mayaraman19 · web-flow · commit bc8cc6a2a4cb · 2025-07-18T10:25:21.000-04:00
* initial copy edit

* first pass edits

* JA Feedback

* build solution

* initial pass

* Diego feedback

* add privacy

* Revert "initial pass"

This reverts commit f82776aead47b4d80c7f5992977cfb80dd2580da.
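
Reviewer note: the Reference Architecture text changed below describes two mechanisms in prose only, sliding-window chunking with overlapping chunks and a pre-filtered vector search scoped to one customer's document. The sketch below illustrates both under stated assumptions and is not the repository's actual implementation. The window size, overlap, index name (``vector_index``), and field names (``embedding``, ``customer_id``, ``text``, ``page``) are all hypothetical; only the ``$vectorSearch`` stage options (``index``, ``path``, ``queryVector``, ``numCandidates``, ``limit``, ``filter``) come from Atlas Vector Search.

```python
from typing import Any


def sliding_window_chunks(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split a word list into windows of `size` that share `overlap` words.

    Consecutive windows overlap so that text near a chunk boundary
    appears in both neighbouring chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def build_filtered_vector_query(
    query_vector: list[float], customer_id: str, limit: int = 5
) -> list[dict[str, Any]]:
    """Build an aggregation pipeline that searches one customer's chunks only.

    Assumption: `customer_id` is declared as a filter field in the
    vector search index; all names here are illustrative.
    """
    return [
        {
            "$vectorSearch": {
                "index": "vector_index",        # hypothetical index name
                "path": "embedding",            # hypothetical vector field
                "queryVector": query_vector,
                "numCandidates": limit * 20,    # illustrative oversampling factor
                "limit": limit,
                "filter": {"customer_id": {"$eq": customer_id}},
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "page": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With PyMongo, the returned list could be passed to ``collection.aggregate(...)``. The functions themselves need no database connection, so they can be checked in isolation.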
diff --git a/source/solutions-library/pdf-search.txt b/source/solutions-library/pdf-search.txt
@@ -42,62 +42,58 @@ Superduper.io, and LLMs.
 Solution Overview
 -----------------
 
+Insurance firms rely heavily on data processing. To make investment
+decisions or handle claims, they leverage vast amounts of mostly
+unstructured data. Underwriters and claims adjusters need to comb
+through numerous pages of guidelines, contracts, and reports, typically
+in PDF format. Manually finding and reviewing every piece of information is
+time-consuming and can easily lead to expensive mistakes, such as
+incorrect risk estimations.
+
 `Retrieval-augmented generation
 <https://www.mongodb.com/resources/basics/retrieval-augmented-generation>`__
 (RAG) applications are a game changer for insurance companies, enabling
 them to harness the power of unstructured data while promoting
-accessibility and flexibility. Special attention goes to PDFs, which are
-ubiquitous yet difficult to search, leading claim adjusters and
-underwriters to spend hours reviewing contracts, claims, and guidelines
-in this common format. RAG for PDF search brings efficiency and accuracy
-to this historically cumbersome task. Now, users can simply type a
-question in natural language and the app will sift through the company
-data, provide an answer, summarize the content of the documents, and
-indicate the source of the information, including the page and paragraph
-where it was found.
+accessibility and flexibility. They are especially helpful for PDFs, which are
+common yet difficult to search. RAG makes PDF search more efficient and
+accurate. Now, users can type a question in natural language,
+and the application will sift through the company data, provide an answer,
+summarize the content of the documents, and indicate the source of the
+information, including the page and paragraph where it was found.
 
 In this `GitHub repo
 <https://github.com/mongodb-industry-solutions/Insurance-PDF-Search>`__,
-you will find detailed, step-by-step instructions on how to build the
-PDF search application combining MongoDB, Superduper, and LLMs. Our use
-case for this solution focuses on a claim adjuster or an underwriter
-handling a specific case. Analyzing the guidelines PDF associated with a
-specific customer helps determine the loss amount in the event of an
-accident or the new premium in the case of a policy renewal. The app
-assists by answering questions and displaying the relevant sections of
+you will find detailed, step-by-step instructions on how to build a
+PDF search application that combines MongoDB, Superduper, and LLMs. Our use
+case for this solution focuses on a claims adjuster or an underwriter
+handling a specific case. Analyzing a PDF of guidelines that is associated
+with a specific customer helps determine the loss amount in the event of an
+accident, or the new premium in the case of a policy renewal. The application
+answers questions and displays the relevant sections of
 the document.
 
-Insurance firms rely heavily on data processing. To make investment
-decisions or handle claims, they leverage vast amounts of data, mostly
-unstructured. Underwriters and claim adjusters need to comb through
-numerous pages of guidelines, contracts, and reports, typically in PDF
-format. Manually finding and reviewing every piece of information is
-time-consuming and can easily lead to expensive mistakes, such as
-incorrect risk estimations. Quickly finding and accessing relevant
-content is essential. Combining `Atlas Vector Search
+Combining `Atlas Vector Search
 <https://www.mongodb.com/products/platform/atlas-vector-search>`__ and
 `LLMs
 <https://www.mongodb.com/resources/basics/large-language-models>`__ to
 build RAG apps can directly impact the bottom line of an insurance
-company. Visit the `Atlas Vector Search Quick Start guide
-<https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/vector-search-quick-start/?tck=ai_as_web>`__
-to try our semantic search tool now.
+company. To try our semantic search tool, visit the
+:ref:`Atlas Vector Search Quick Start <vector-search-quick-start>`.
 
 .. video:: https://youtu.be/_w9okXmzTJw
 
-Building the Solution and Reference Architecture
-------------------------------------------------
+Reference Architecture
+----------------------
 
 Combining MongoDB and Superduper allows you to build an information
-retrieval system with ease. Let's break down the process:
+retrieval system with ease. The process involves the following steps:
 
 #. The user adds the PDFs that need to be searched.
     
 #. A script scans the PDFs, creates the chunks, and vectorizes them (see
-   Figure 1). The chunking step is carried out using a sliding window
-   methodology, which ensures that potentially important transitional
-   data between chunks is not lost, helping to preserve continuity of
-   context.
+   Figure 1). To avoid losing transitional data between chunks, the
+   script uses a sliding window method that produces overlapping
+   chunks.
 
 #. Vectors and chunk metadata are stored in MongoDB, and a
    `Vector Search
@@ -119,10 +115,8 @@ Each customer has a guidelines PDF associated with their account based
 on country of residency. When the user selects a customer and asks a
 question, the system runs a vector search query only on that particular
 document, seamlessly filtering out the non-relevant ones. This is made
-possible by the `pre-filtering
-<https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/#atlas-vector-search-pre-filter>`__
-(see code snippets below) field included in the index and in the search
-query. 
+possible by the :ref:`pre-filtering <vectorSearch-agg-pipeline-filter>`
+field included in the index and the search query (see code snippets below).
 
 Atlas Vector Search also takes advantage of MongoDB's new `Search Nodes
 <https://www.mongodb.com/blog/post/search-nodes-now-public-preview-performance-scale-dedicated-infrastructure>`__
@@ -131,9 +125,9 @@ of resourcing for specific workload needs. Search Nodes provide
 dedicated infrastructure for `Atlas Search
 <https://www.mongodb.com/products/platform/atlas-search>`__ and Vector
 Search workloads, allowing you to optimize compute resources and fully
-scale search needs independent of the database. Search Nodes provide
-better performance at scale, delivering workload isolation, higher
-availability, and the ability to better optimize resource usage.
+scale search needs independent of the database. Search Nodes provide
+workload isolation, higher availability, and the ability to better
+optimize resource usage.
 
 .. figure:: /includes/images/industry-solutions/is-pdf-search-fig2.svg
    :figwidth: 1200px
@@ -186,77 +180,57 @@ Superduper.io
 framework for integrating AI models and workflows directly with and
 across major databases for more flexible and scalable custom enterprise
 AI solutions. It enables developers to build, deploy, and manage AI on
-their existing data infrastructure and data, while using their preferred
+their existing infrastructure and data while using their preferred
 tools, eliminating data migration and duplication.
 
 **With Superduper.io, developers can:**
 
-- Bring AI to their databases, eliminating data pipelines and moving
-  data, minimizing engineering efforts, time to production, and
+- Bring AI to their databases, which eliminates data pipelines and
+  minimizes engineering efforts, time to production, and
   computation resources.
 
-- Implement AI workflows with any AI models and APIs, on any type of
-  data, with any AI and Python framework, package, class, or function.
+- Implement AI workflows with any AI models and APIs on any type of
+  data.
 
-- Safeguard data by switching from APIs to hosting and fine-tuning your
-  own models, on your own and existing infrastructure, whether
-  on-premises or in the cloud.
+- Safeguard data by switching from APIs to hosting and fine-tuning their
+  own models on their own infrastructure.
 
-- Easily switch between embedding models and LLMs to other API providers
-  as well as hosting your own models on HuggingFace or elsewhere just by
-  changing a small configuration.
+- Switch between embedding models, LLMs, and API providers. They can
+  also host their own models on Hugging Face or other platforms.
 
 Superduper.io provides an array of sample use cases and notebooks that
 developers can use to get started, including vector search with MongoDB,
 embedding generation, multimodal search, RAG, transfer learning, and
-many more. The demo showcased in this solution is adapted from an `app
-previously developed <https://github.com/SuperDuperDB/poc-volvo>`__ by
+many more. This solution's demo is adapted from a
+`previously developed app <https://github.com/SuperDuperDB/poc-volvo>`__ by
 Superduper.io. 
 
-Key Learnings
--------------
+Build the Solution
+------------------
+
+Build the solution following the instructions in this `GitHub repo
+<https://github.com/mongodb-industry-solutions/Insurance-PDF-Search/tree/main>`__.
+The solution is composed of two steps:
 
-- Build the solution following the instructions in this `Github repo
-  <https://github.com/mongodb-industry-solutions/Insurance-PDF-Search/tree/main>`__.
-  It is important to note that the solution is made of two logical
-  steps:
+#. The initialization script breaks down the PDFs into chunks and then
+   turns them into vector embeddings.
+#. The querying step allows the user to interrogate the documents.
 
-  - The initialization script breaks down the PDFs into chunks and then
-    turns them into vector embeddings.
-  - The querying step allows the user to interrogate the documents.
 
-- **Text embedding creation:** The embedding generation process can be
-  carried out using different models and deployment options. It is
-  important to consider privacy and data protection requirements. You
+Key Learnings
+-------------
+
+- **Embeddings can use different models and deployment options:** You
   can deploy a model locally if your data needs to remain on the
   servers. Otherwise, you can call an API and get your vector embeddings
-  back, as explained in `this
-  <https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/>`__
-  tutorial. You can use `Voyage AI (acquired by MongoDB)
+  back, as explained in
+  :ref:`this tutorial <create-vector-embeddings>`. You can use `Voyage AI
   <https://www.mongodb.com/blog/post/redefining-database-ai-why-mongodb-acquired-voyage-ai>`__
-  or open-source models. 
-
-- Superduper is the framework that helps us with the plumbing of the
-  moving pieces, providing a simple and standard interface to interact
-  with Vector Search and LLMs.
-
-Technologies and Products Used
-------------------------------
-
-MongoDB Developer Data Platform
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-- `MongoDB Atlas <https://www.mongodb.com/atlas>`__
-- `MongoDB Atlas Vector Search
-  <https://www.mongodb.com/products/platform/atlas-vector-search>`__
-- `Atlas Collections <https://www.mongodb.com/docs/manual/core/databases-and-collections/>`__
-- `Atlas Clusters <https://www.mongodb.com/resources/products/fundamentals/clusters>`__
-
-Partner Technologies
-~~~~~~~~~~~~~~~~~~~~
+  or open-source models. When building your system, consider privacy and
+  security requirements.
 
-- `Superduper.io <https://www.superduper.io/>`__
-- `FastAPI <https://www.mongodb.com/developer/technologies/fastapi>`__
+- **Superduper integrates AI models and workflows:** Superduper is the
+  framework that provides a simple, standard interface to Vector Search and LLMs.
 
 Authors
 -------