19 changes: 9 additions & 10 deletions README.md
@@ -1,6 +1,6 @@
# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Docs
# GPT-4 & LangChain - Create a ChatGPT Chatbot for Your PDF Files

Use the new GPT-4 api to build a chatGPT chatbot for Large PDF docs (56 pages used in this example).
Use the new GPT-4 API to build a ChatGPT chatbot for multiple large PDF files.

Tech stack used includes LangChain, Pinecone, TypeScript, OpenAI, and Next.js. LangChain is a framework that makes it easier to build scalable AI/LLM apps and chatbots. Pinecone is a vector store that holds the embeddings and text of your PDFs so that similar chunks can be retrieved later.
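
To make the retrieval flow concrete, here is a minimal sketch of how a question could be matched against the stored embeddings, reusing the same client and config helpers that `scripts/ingest-data.ts` imports. The `searchDocs` name, the `fromExistingIndex` call, and the value of 4 results are illustrative assumptions, not code taken from this repo:

```typescript
// Minimal retrieval sketch (assumption: same helpers/config as scripts/ingest-data.ts).
import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import { pinecone } from '@/utils/pinecone-client';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';

export const searchDocs = async (query: string) => {
  const index = pinecone.Index(PINECONE_INDEX_NAME);

  // Point LangChain at the index that `npm run ingest` populated.
  const vectorStore = await PineconeStore.fromExistingIndex(new OpenAIEmbeddings(), {
    pineconeIndex: index,
    namespace: PINECONE_NAME_SPACE,
    textKey: 'text',
  });

  // Return the 4 chunks whose embeddings are closest to the question;
  // these chunks are what the chat chain passes to the model as context.
  return vectorStore.similaritySearch(query, 4);
};
```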

@@ -48,15 +48,15 @@ PINECONE_INDEX_NAME=

5. In the `utils/makechain.ts` chain, change the `QA_PROMPT` for your own use case. If you don't have access to `gpt-4`, change `modelName` in `new OpenAIChat` to `gpt-3.5-turbo`. Please verify outside this repo that you have access to `gpt-4`, otherwise the application will not work with it.
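
For example, the model swap is a one-line change to the chat model options. This is only a sketch, assuming `OpenAIChat` is imported from `langchain/llms`; the other options in your `utils/makechain.ts` may differ:

```typescript
// utils/makechain.ts (sketch) — pick the chat model the QA chain should use.
import { OpenAIChat } from 'langchain/llms';

const model = new OpenAIChat({
  temperature: 0, // illustrative; keep whatever value your makechain.ts already sets
  modelName: 'gpt-3.5-turbo', // use 'gpt-4' only if your OpenAI account has access to it
});
```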

## Convert your PDF to embeddings
## Convert your PDF files to embeddings

1. In `docs` folder replace the pdf with your own pdf doc.
**This repo can load multiple PDF files**

2. In `scripts/ingest-data.ts` replace `filePath` with `docs/{yourdocname}.pdf`
1. Inside the `docs` folder, add your PDF files or folders that contain PDF files.

3. Run the script `pnpm run ingest` to 'ingest' and embed your docs
2. Run the script `npm run ingest` to 'ingest' and embed your docs (a condensed sketch of what this script does follows this list). If you run into errors, see the troubleshooting section below.

4. Check Pinecone dashboard to verify your namespace and vectors have been added.
3. Check the Pinecone dashboard to verify that your namespace and vectors have been added.
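
The sketch below condenses what `npm run ingest` does, based on the `scripts/ingest-data.ts` changes later in this PR. The text-splitter values are not visible in this diff, so the numbers shown are illustrative:

```typescript
// Condensed from scripts/ingest-data.ts: load every PDF under docs/, split, embed, upsert to Pinecone.
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { OpenAIEmbeddings } from 'langchain/embeddings';
import { PineconeStore } from 'langchain/vectorstores';
import { DirectoryLoader } from 'langchain/document_loaders';
import { pinecone } from '@/utils/pinecone-client';
import { CustomPDFLoader } from '@/utils/customPDFLoader';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';

export const ingest = async () => {
  // Every .pdf found under docs/ (including nested folders) goes through the custom PDF loader.
  const directoryLoader = new DirectoryLoader('docs', {
    '.pdf': (path) => new CustomPDFLoader(path),
  });
  const rawDocs = await directoryLoader.load();

  // Split the raw text into overlapping chunks (sizes here are illustrative).
  const textSplitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 });
  const docs = await textSplitter.splitDocuments(rawDocs);

  // Embed each chunk with OpenAI and upsert the vectors into your Pinecone index/namespace.
  await PineconeStore.fromDocuments(docs, new OpenAIEmbeddings(), {
    pineconeIndex: pinecone.Index(PINECONE_INDEX_NAME),
    namespace: PINECONE_NAME_SPACE,
    textKey: 'text',
  });
};
```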

## Run the app

@@ -73,16 +73,15 @@ In general, keep an eye out in the `issues` and `discussions` section of this re
- Check that you've created an `.env` file that contains your valid (and working) API keys, environment and index name.
- If you change `modelName` in `OpenAIChat`, note that the correct name of the alternative model is `gpt-3.5-turbo`.
- Make sure you have access to `gpt-4` if you decide to use it. Test your OpenAI keys outside the repo to make sure they work and that you have enough API credits.
- Check that your PDF file is not corrupted; a corrupted file cannot be parsed.

**Pinecone errors**

- Make sure your Pinecone dashboard `environment` and `index` match the ones in your `pinecone.ts` and `.env` files.
- Check that you've set the vector dimensions to `1536`.
- Make sure your Pinecone namespace is in lowercase.
- Pinecone indexes of users on the Starter (free) plan are deleted after 7 days of inactivity. To prevent this, send an API request to Pinecone to reset the counter.
- Retry with a new Pinecone index.

If you're stuck after trying all these steps, delete `node_modules`, restart your computer, then `pnpm install` again.
- Retry from scratch with a new Pinecone index and a freshly cloned repo.

## Credit

Binary file added docs/finance/turingfinance.pdf
Binary file not shown.
File renamed without changes.
33 changes: 14 additions & 19 deletions scripts/ingest-data.ts
@@ -4,18 +4,20 @@ import { PineconeStore } from 'langchain/vectorstores';
import { pinecone } from '@/utils/pinecone-client';
import { CustomPDFLoader } from '@/utils/customPDFLoader';
import { PINECONE_INDEX_NAME, PINECONE_NAME_SPACE } from '@/config/pinecone';
import { DirectoryLoader } from 'langchain/document_loaders';

/* Name of directory to retrieve files from. You can change this as required */
const filePath = 'docs/MorseVsFrederick.pdf';
/* Name of directory to retrieve your files from */
const filePath = 'docs';

export const run = async () => {
try {
/*load raw docs from the pdf file in the directory */
const loader = new CustomPDFLoader(filePath);
// const loader = new PDFLoader(filePath);
const rawDocs = await loader.load();
/* Load raw docs from all the files in the directory */
const directoryLoader = new DirectoryLoader(filePath, {
'.pdf': (path) => new CustomPDFLoader(path),
});

console.log(rawDocs);
// const loader = new PDFLoader(filePath);
const rawDocs = await directoryLoader.load();

/* Split text into chunks */
const textSplitter = new RecursiveCharacterTextSplitter({
@@ -32,18 +34,11 @@ export const run = async () => {
const index = pinecone.Index(PINECONE_INDEX_NAME); //change to your own index name

//embed the PDF documents

/* Pinecone recommends a limit of 100 vectors per upsert request to avoid errors*/
const chunkSize = 50;
for (let i = 0; i < docs.length; i += chunkSize) {
const chunk = docs.slice(i, i + chunkSize);
console.log('chunk', i, chunk);
await PineconeStore.fromDocuments(chunk, embeddings, {
pineconeIndex: index,
namespace: PINECONE_NAME_SPACE,
textKey: 'text',
});
}
await PineconeStore.fromDocuments(docs, embeddings, {
pineconeIndex: index,
namespace: PINECONE_NAME_SPACE,
textKey: 'text',
});
} catch (error) {
console.log('error', error);
throw new Error('Failed to ingest your data');