Commit bc8cc6a

DOCSP-51695: [IS Edits] Build a PDF Search Application with Vector Search and LLMs (#277)
* initial copy edit
* first pass edits
* JA Feedback
* build solution
* initial pass
* Diego feedback
* add privacy
* Revert "initial pass"

  This reverts commit f82776aead47b4d80c7f5992977cfb80dd2580da.
1 parent d9e5bda commit bc8cc6a

1 file changed

+65
-91
lines changed

source/solutions-library/pdf-search.txt

Lines changed: 65 additions & 91 deletions
Original file line number | Diff line number | Diff line change
@@ -42,62 +42,58 @@ Superduper.io, and LLMs.
4242
Solution Overview
4343
-----------------
4444

45+
Insurance firms rely heavily on data processing. To make investment
46+
decisions or handle claims, they leverage vast amounts of mostly
47+
unstructured data. Underwriters and claims adjusters need to comb
48+
through numerous pages of guidelines, contracts, and reports, typically
49+
in PDF format. Manually finding and reviewing every piece of information is
50+
time-consuming and can easily lead to expensive mistakes, such as
51+
incorrect risk estimations.
52+
4553
`Retrieval-augmented generation
4654
<https://www.mongodb.com/resources/basics/retrieval-augmented-generation>`__
4755
(RAG) applications are a game changer for insurance companies, enabling
4856
them to harness the power of unstructured data while promoting
49-
accessibility and flexibility. Special attention goes to PDFs, which are
50-
ubiquitous yet difficult to search, leading claim adjusters and
51-
underwriters to spend hours reviewing contracts, claims, and guidelines
52-
in this common format. RAG for PDF search brings efficiency and accuracy
53-
to this historically cumbersome task. Now, users can simply type a
54-
question in natural language and the app will sift through the company
55-
data, provide an answer, summarize the content of the documents, and
56-
indicate the source of the information, including the page and paragraph
57-
where it was found.
57+
accessibility and flexibility. They are especially helpful for PDFs, which are
58+
common yet difficult to search. RAG makes PDF search more efficient and
59+
accurate. Now, users can type a question in natural language
60+
and the application will sift through the company data, provide an answer,
61+
summarize the content of the documents, and indicate the source of the
62+
information, including the page and paragraph where it was found.
5863

5964
In this `GitHub repo
6065
<https://github.com/mongodb-industry-solutions/Insurance-PDF-Search>`__,
61-
you will find detailed, step-by-step instructions on how to build the
62-
PDF search application combining MongoDB, Superduper, and LLMs. Our use
63-
case for this solution focuses on a claim adjuster or an underwriter
64-
handling a specific case. Analyzing the guidelines PDF associated with a
65-
specific customer helps determine the loss amount in the event of an
66-
accident or the new premium in the case of a policy renewal. The app
67-
assists by answering questions and displaying the relevant sections of
66+
you will find detailed, step-by-step instructions on how to build a
67+
PDF search application that combines MongoDB, Superduper, and LLMs. Our use
68+
case for this solution focuses on a claims adjuster or an underwriter
69+
handling a specific case. Analyzing a PDF of guidelines that is associated
70+
with a specific customer helps determine the loss amount in the event of an
71+
accident, or the new premium in the case of a policy renewal. The application
72+
answers questions and displays the relevant sections of
6873
the document.
6974

70-
Insurance firms rely heavily on data processing. To make investment
71-
decisions or handle claims, they leverage vast amounts of data, mostly
72-
unstructured. Underwriters and claim adjusters need to comb through
73-
numerous pages of guidelines, contracts, and reports, typically in PDF
74-
format. Manually finding and reviewing every piece of information is
75-
time-consuming and can easily lead to expensive mistakes, such as
76-
incorrect risk estimations. Quickly finding and accessing relevant
77-
content is essential. Combining `Atlas Vector Search
75+
Combining `Atlas Vector Search
7876
<https://www.mongodb.com/products/platform/atlas-vector-search>`__ and
7977
`LLMs
8078
<https://www.mongodb.com/resources/basics/large-language-models>`__ to
8179
build RAG apps can directly impact the bottom line of an insurance
82-
company. Visit the `Atlas Vector Search Quick Start guide
83-
<https://www.mongodb.com/docs/atlas/atlas-vector-search/tutorials/vector-search-quick-start/?tck=ai_as_web>`__
84-
to try our semantic search tool now.
80+
company. To try our semantic search tool, visit the
81+
:ref:`Atlas Vector Search Quick Start <vector-search-quick-start>`.
8582

8683
.. video:: https://youtu.be/_w9okXmzTJw
8784

88-
Building the Solution and Reference Architecture
89-
------------------------------------------------
85+
Reference Architecture
86+
----------------------
9087

9188
Combining MongoDB and Superduper allows you to build an information
92-
retrieval system with ease. Let's break down the process:
89+
retrieval system with ease. The process involves the following steps:
9390

9491
#. The user adds the PDFs that need to be searched.
9592

9693
#. A script scans the PDFs, creates the chunks, and vectorizes them (see
97-
Figure 1). The chunking step is carried out using a sliding window
98-
methodology, which ensures that potentially important transitional
99-
data between chunks is not lost, helping to preserve continuity of
100-
context.
94+
Figure 1). To ensure that you don't lose transitional data between
95+
chunks, the script uses the sliding window methodology to produce
96+
overlapping chunks.
10197

10298
#. Vectors and chunk metadata are stored in MongoDB, and a
10399
`Vector Search
@@ -119,10 +115,8 @@ Each customer has a guidelines PDF associated with their account based
119115
on country of residency. When the user selects a customer and asks a
120116
question, the system runs a vector search query only on that particular
121117
document, seamlessly filtering out the non-relevant ones. This is made
122-
possible by the `pre-filtering
123-
<https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-stage/#atlas-vector-search-pre-filter>`__
124-
(see code snippets below) field included in the index and in the search
125-
query.
118+
possible by the :ref:`pre-filtering <vectorSearch-agg-pipeline-filter>`
119+
field included in the index and the search query (see code snippets below).
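
The pre-filtering described above can be sketched as follows. This is a minimal illustration, not the code from the repository: the field names ``embedding``, ``document_id``, ``text``, and ``page``, the index name ``pdf_vector_index``, and both helper functions are hypothetical. The sketch only builds the index definition and the aggregation pipeline as plain Python data, so it runs without a database connection.

```python
def build_index_definition(num_dimensions):
    """Return a vector index definition that also declares a filter field.

    The "filter" entry lets queries restrict the search to matching
    documents before the vector comparison runs.
    """
    return {
        "fields": [
            {
                "type": "vector",
                "path": "embedding",  # hypothetical field holding the chunk vector
                "numDimensions": num_dimensions,
                "similarity": "cosine",
            },
            {
                "type": "filter",
                "path": "document_id",  # hypothetical field identifying the source PDF
            },
        ]
    }


def build_filtered_pipeline(query_vector, document_id, limit=5):
    """Return a pipeline that searches only the chunks of one document."""
    return [
        {
            "$vectorSearch": {
                "index": "pdf_vector_index",  # hypothetical index name
                "path": "embedding",
                "queryVector": list(query_vector),
                # Consider more candidates than the number of results returned.
                "numCandidates": limit * 20,
                "limit": limit,
                # Pre-filter: only chunks from the selected customer's PDF.
                "filter": {"document_id": {"$eq": document_id}},
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "page": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With PyMongo, the returned pipeline could be passed to ``collection.aggregate(...)``. The filter only works if the filtered field is also declared in the index, which is why the two helpers use the same ``document_id`` path.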
126120

127121
Atlas Vector Search also takes advantage of MongoDB's new `Search Nodes
128122
<https://www.mongodb.com/blog/post/search-nodes-now-public-preview-performance-scale-dedicated-infrastructure>`__
@@ -131,9 +125,9 @@ of resourcing for specific workload needs. Search Nodes provide
131125
dedicated infrastructure for `Atlas Search
132126
<https://www.mongodb.com/products/platform/atlas-search>`__ and Vector
133127
Search workloads, allowing you to optimize compute resources and fully
134-
scale search needs independent of the database. Search Nodes provide
135-
better performance at scale, delivering workload isolation, higher
136-
availability, and the ability to better optimize resource usage.
128+
scale search needs independent of the database. Search Nodes provide
129+
workload isolation, higher availability, and the ability to better
130+
optimize resource usage.
137131

138132
.. figure:: /includes/images/industry-solutions/is-pdf-search-fig2.svg
139133
:figwidth: 1200px
@@ -186,77 +180,57 @@ Superduper.io
186180
framework for integrating AI models and workflows directly with and
187181
across major databases for more flexible and scalable custom enterprise
188182
AI solutions. It enables developers to build, deploy, and manage AI on
189-
their existing data infrastructure and data, while using their preferred
183+
their existing infrastructure and data while using their preferred
190184
tools, eliminating data migration and duplication.
191185

192186
**With Superduper.io, developers can:**
193187

194-
- Bring AI to their databases, eliminating data pipelines and moving
195-
data, minimizing engineering efforts, time to production, and
188+
- Bring AI to their databases, which eliminates data pipelines and
189+
minimizes engineering efforts, time to production, and
196190
computation resources.
197191

198-
- Implement AI workflows with any AI models and APIs, on any type of
199-
data, with any AI and Python framework, package, class, or function.
192+
- Implement AI workflows with any AI models and APIs on any type of
193+
data.
200194

201-
- Safeguard data by switching from APIs to hosting and fine-tuning your
202-
own models, on your own and existing infrastructure, whether
203-
on-premises or in the cloud.
195+
- Safeguard data by switching from APIs to hosting and fine-tuning their
196+
own models on their own infrastructure.
204197

205-
- Easily switch between embedding models and LLMs to other API providers
206-
as well as hosting your own models on HuggingFace or elsewhere just by
207-
changing a small configuration.
198+
- Switch from embedding models to other API providers. They can
199+
also host their models on HuggingFace or other platforms.
208200

209201
Superduper.io provides an array of sample use cases and notebooks that
210202
developers can use to get started, including vector search with MongoDB,
211203
embedding generation, multimodal search, RAG, transfer learning, and
212-
many more. The demo showcased in this solution is adapted from an `app
213-
previously developed <https://github.com/SuperDuperDB/poc-volvo>`__ by
204+
many more. This solution's demo is adapted from a
205+
`previously developed app <https://github.com/SuperDuperDB/poc-volvo>`__ from
214206
Superduper.io.
215207

216-
Key Learnings
217-
-------------
208+
Build the Solution
209+
------------------
210+
211+
Build the solution by following the instructions in this `GitHub repo
212+
<https://github.com/mongodb-industry-solutions/Insurance-PDF-Search/tree/main>`__.
213+
The solution is composed of two steps:
218214

219-
- Build the solution following the instructions in this `Github repo
220-
<https://github.com/mongodb-industry-solutions/Insurance-PDF-Search/tree/main>`__.
221-
It is important to note that the solution is made of two logical
222-
steps:
215+
#. The initialization script breaks down the PDFs into chunks and then
216+
turns them into vector embeddings.
217+
#. The querying step allows the user to interrogate the documents.
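
The chunking part of the initialization step can be illustrated with a minimal sketch of a sliding window that produces overlapping chunks. This is not the script from the repository: the function name ``sliding_window_chunks``, the word-based splitting, and the default ``window_size`` and ``stride`` values are assumptions chosen for illustration.

```python
def sliding_window_chunks(text, window_size=200, stride=150):
    """Split text into overlapping chunks of at most window_size words.

    Consecutive chunks start stride words apart, so they share
    window_size - stride words of context.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if stride > window_size:
        raise ValueError("stride must not exceed window_size, or words would be skipped")

    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + window_size]))
        # Stop once the current window has reached the end of the text.
        if start + window_size >= len(words):
            break
        start += stride
    return chunks
```

For example, ``sliding_window_chunks("a b c d e f g h i j", window_size=4, stride=2)`` produces four chunks in which each chunk repeats the last two words of the previous one. Each chunk would then be passed to an embedding model before being stored.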
223218

224-
- The initialization script breaks down the PDFs into chunks and then
225-
turns them into vector embeddings.
226-
- The querying step allows the user to interrogate the documents.
227219

228-
- **Text embedding creation:** The embedding generation process can be
229-
carried out using different models and deployment options. It is
230-
important to consider privacy and data protection requirements. You
220+
Key Learnings
221+
-------------
222+
223+
- **Embeddings can use different models and deployment options:** You
231224
can deploy a model locally if your data needs to remain on the
232225
servers. Otherwise, you can call an API and get your vector embeddings
233-
back, as explained in `this
234-
<https://www.mongodb.com/docs/atlas/atlas-vector-search/create-embeddings/>`__
235-
tutorial. You can use `Voyage AI (acquired by MongoDB)
226+
back, as explained in :ref:`this <create-vector-embeddings>`
227+
tutorial. You can use `Voyage AI
236228
<https://www.mongodb.com/blog/post/redefining-database-ai-why-mongodb-acquired-voyage-ai>`__
237-
or open-source models.
238-
239-
- Superduper is the framework that helps us with the plumbing of the
240-
moving pieces, providing a simple and standard interface to interact
241-
with Vector Search and LLMs.
242-
243-
Technologies and Products Used
244-
------------------------------
245-
246-
MongoDB Developer Data Platform
247-
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
248-
249-
- `MongoDB Atlas <https://www.mongodb.com/atlas>`__
250-
- `MongoDB Atlas Vector Search
251-
<https://www.mongodb.com/products/platform/atlas-vector-search>`__
252-
- `Atlas Collections <https://www.mongodb.com/docs/manual/core/databases-and-collections/>`__
253-
- `Atlas Clusters <https://www.mongodb.com/resources/products/fundamentals/clusters>`__
254-
255-
Partner Technologies
256-
~~~~~~~~~~~~~~~~~~~~
229+
or open-source models. When building your system, consider privacy and
230+
security requirements.
257231

258-
- `Superduper.io <https://www.superduper.io/>`__
259-
- `FastAPI <https://www.mongodb.com/developer/technologies/fastapi>`__
232+
- **Superduper integrates AI models and workflows:** Superduper is the framework
233+
that provides a simple and standard interface to interact with Vector Search and LLMs.
260234

261235
Authors
262236
-------
