Commit ef4046d

Merge pull request #71 from mayooear/feat/add-directory-loader
Add directory loader to load multiple pdf files
2 parents b4c88e1 + 90381f0 commit ef4046d

4 files changed: +23 -29 lines changed

README.md

Lines changed: 9 additions & 10 deletions
@@ -1,6 +1,6 @@
-# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Docs
+# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Files
 
-Use the new GPT-4 api to build a chatGPT chatbot for Large PDF docs (56 pages used in this example).
+Use the new GPT-4 api to build a chatGPT chatbot for multiple Large PDF files.
 
 Tech stack used includes LangChain, Pinecone, Typescript, Openai, and Next.js. LangChain is a framework that makes it easier to build scalable AI/LLM apps and chatbots. Pinecone is a vectorstore for storing embeddings and your PDF in text to later retrieve similar docs.

@@ -48,15 +48,15 @@ PINECONE_INDEX_NAME=
 
 5. In `utils/makechain.ts` chain change the `QA_PROMPT` for your own usecase. Change `modelName` in `new OpenAIChat` to `gpt-3.5-turbo`, if you don't have access to `gpt-4`. Please verify outside this repo that you have access to `gpt-4`, otherwise the application will not work with it.
 
-## Convert your PDF to embeddings
+## Convert your PDF files to embeddings
 
-1. In `docs` folder replace the pdf with your own pdf doc.
+**This repo can load multiple PDF files**
 
-2. In `scripts/ingest-data.ts` replace `filePath` with `docs/{yourdocname}.pdf`
+1. Inside `docs` folder, add your pdf files or folders that contain pdf files.
 
-3. Run the script `pnpm run ingest` to 'ingest' and embed your docs
+2. Run the script `npm run ingest` to 'ingest' and embed your docs. If you run into errors troubleshoot below.
 
-4. Check Pinecone dashboard to verify your namespace and vectors have been added.
+3. Check Pinecone dashboard to verify your namespace and vectors have been added.
 
 ## Run the app

@@ -73,16 +73,15 @@ In general, keep an eye out in the `issues` and `discussions` section of this re
 - Check that you've created an `.env` file that contains your valid (and working) API keys, environment and index name.
 - If you change `modelName` in `OpenAIChat` note that the correct name of the alternative model is `gpt-3.5-turbo`
 - Make sure you have access to `gpt-4` if you decide to use. Test your openAI keys outside the repo and make sure it works and that you have enough API credits.
+- Your pdf file is corrupted and cannot be parsed.
 
 **Pinecone errors**
 
 - Make sure your pinecone dashboard `environment` and `index` matches the one in the `pinecone.ts` and `.env` files.
 - Check that you've set the vector dimensions to `1536`.
 - Make sure your pinecone namespace is in lowercase.
 - Pinecone indexes of users on the Starter(free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone to reset the counter.
-- Retry with a new Pinecone index.
-
-If you're stuck after trying all these steps, delete `node_modules`, restart your computer, then `pnpm install` again.
+- Retry from scratch with a new Pinecone index and cloned repo.
 
 ## Credit
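Reviewer note: the troubleshooting list in this README hunk starts with checking `.env`. A minimal sketch of that check as a startup guard follows. Only `PINECONE_INDEX_NAME` appears in this diff; the other three variable names are assumptions to adjust to your own `.env`, and `missingEnvVars` is a hypothetical helper, not part of the repo.

```typescript
// Hypothetical helper: returns the names of required variables that are
// unset or contain only whitespace in the given environment object.
function missingEnvVars(
  required: readonly string[],
  env: Record<string, string | undefined>,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === '';
  });
}

// Assumed variable names; only PINECONE_INDEX_NAME is visible in the diff.
const REQUIRED_VARS: string[] = [
  'OPENAI_API_KEY',
  'PINECONE_API_KEY',
  'PINECONE_ENVIRONMENT',
  'PINECONE_INDEX_NAME',
];

// Example with a partially filled environment object (values are placeholders).
const missing = missingEnvVars(REQUIRED_VARS, {
  OPENAI_API_KEY: 'placeholder-key',
  PINECONE_INDEX_NAME: '   ',
});

if (missing.length > 0) {
  console.log(`Missing values in .env: ${missing.join(', ')}`);
}
```

In a Node script you would pass `process.env` as the second argument and stop before ingesting when the returned list is non-empty.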

docs/finance/turingfinance.pdf

3.32 MB
Binary file not shown.
File renamed without changes.
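
Reviewer note: the README now says `docs` may hold PDF files or folders of PDF files, and this commit adds a PDF under `docs/finance/`. The ingest script handles this by passing `DirectoryLoader` a map from file extension to a loader factory. The sketch below illustrates that dispatch idea only. It is not LangChain's implementation; `extensionOf` and `dispatchLoaders` are hypothetical names, and the path list is supplied by hand rather than read from disk.

```typescript
// A factory takes a file path and returns whatever should load that file.
type LoaderFactory<T> = (path: string) => T;

// Returns the extension including the dot ('.pdf'), or '' when there is none.
// A leading dot alone (e.g. '.gitkeep') is treated as "no extension".
function extensionOf(path: string): string {
  const base = path.slice(path.lastIndexOf('/') + 1);
  const dot = base.lastIndexOf('.');
  return dot <= 0 ? '' : base.slice(dot);
}

// For each path, look up a factory by extension and call it.
// Paths whose extension has no registered factory are skipped.
function dispatchLoaders<T>(
  paths: string[],
  loaders: Record<string, LoaderFactory<T>>,
): T[] {
  const result: T[] = [];
  for (const path of paths) {
    const factory = loaders[extensionOf(path)];
    if (factory) {
      result.push(factory(path));
    }
  }
  return result;
}

// Example: only the two PDF paths get a loader object.
const picked = dispatchLoaders(
  ['docs/a.pdf', 'docs/finance/turingfinance.pdf', 'docs/notes.txt'],
  { '.pdf': (path: string) => ({ source: path }) },
);
console.log(picked.map((loader) => loader.source));
```

The map shape mirrors the `'.pdf': (path) => new CustomPDFLoader(path)` entry in `scripts/ingest-data.ts`; supporting another file type would mean adding another key, assuming a suitable loader exists for it.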

scripts/ingest-data.ts

Lines changed: 14 additions & 19 deletions
@@ -4,18 +4,20 @@ import { PineconeStore } from 'langchain/vectorstores';
 import { pinecone } from '@/utils/pinecone-client';
 import { CustomPDFLoader } from '@/utils/customPDFLoader';
 import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
+import { DirectoryLoader } from 'langchain/document_loaders';
 
-/* Name of directory to retrieve files from. You can change this as required */
-const filePath = 'docs/MorseVsFrederick.pdf';
+/* Name of directory to retrieve your files from */
+const filePath = 'docs';
 
 export const run = async () => {
   try {
-    /*load raw docs from the pdf file in the directory */
-    const loader = new CustomPDFLoader(filePath);
-    // const loader = new PDFLoader(filePath);
-    const rawDocs = await loader.load();
+    /*load raw docs from the all files in the directory */
+    const directoryLoader = new DirectoryLoader(filePath, {
+      '.pdf': (path) => new CustomPDFLoader(path),
+    });
 
-    console.log(rawDocs);
+    // const loader = new PDFLoader(filePath);
+    const rawDocs = await directoryLoader.load();
 
     /* Split text into chunks */
     const textSplitter = new RecursiveCharacterTextSplitter({
@@ -32,18 +34,11 @@ export const run = async () => {
     const index = pinecone.Index(PINECONE_INDEX_NAME); //change to your own index name
 
     //embed the PDF documents
-
-    /* Pinecone recommends a limit of 100 vectors per upsert request to avoid errors*/
-    const chunkSize = 50;
-    for (let i = 0; i < docs.length; i += chunkSize) {
-      const chunk = docs.slice(i, i + chunkSize);
-      console.log('chunk', i, chunk);
-      await PineconeStore.fromDocuments(chunk, embeddings, {
-        pineconeIndex: index,
-        namespace: PINECONE_NAME_SPACE,
-        textKey: 'text',
-      });
-    }
+    await PineconeStore.fromDocuments(docs, embeddings, {
+      pineconeIndex: index,
+      namespace: PINECONE_NAME_SPACE,
+      textKey: 'text',
+    });
   } catch (error) {
     console.log('error', error);
     throw new Error('Failed to ingest your data');
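
Reviewer note: this hunk drops the loop that upserted 50 documents at a time and hands all of `docs` to `PineconeStore.fromDocuments` in a single call. Whether that call batches internally is not shown in this diff and may depend on the LangChain version. If upsert-size errors come back once a folder holds many PDFs, the batching can be restored. A minimal sketch of the removed loop as a reusable helper, where `chunkArray` and `upsertInBatches` are hypothetical names and not part of the repo or LangChain:

```typescript
// Splits an array into consecutive batches of at most `size` items.
function chunkArray<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error('size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Calls `upsert` once per batch, in order, waiting for each call to finish.
// Returns the number of calls made.
async function upsertInBatches<T>(
  items: T[],
  size: number,
  upsert: (batch: T[]) => Promise<void>,
): Promise<number> {
  let calls = 0;
  for (const batch of chunkArray(items, size)) {
    await upsert(batch);
    calls += 1;
  }
  return calls;
}

// Example: five items in batches of two gives batch sizes 2, 2, 1.
const exampleBatches = chunkArray(['a', 'b', 'c', 'd', 'e'], 2);
console.log(exampleBatches.map((batch) => batch.length));
```

In `scripts/ingest-data.ts` the `upsert` callback would wrap the same `PineconeStore.fromDocuments(batch, embeddings, { ... })` call the removed lines used, with a batch size of 50 as before.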
