llama_tokenize: too many tokens (Requested tokens exceed context window of 512) #416

Closed · 4 tasks done

frankandrobot opened this issue Jun 23, 2023 · 6 comments

@frankandrobot

frankandrobot commented Jun 23, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

  • Currently using llama-cpp with a langchain vector store.
  • The model is ggml-vic13b-q4_0.bin
  • I'm also chunking documents using RecursiveCharacterTextSplitter.
  • I expect chunks of up to 1000 tokens to work with "stuff"-type chained queries, especially when there are only a few matching chunks.
  • For comparison, Vicuna 7B (not using llama-cpp) works just fine with a chunk size of 1000.

See the notebook below.

Current Behavior

  • llama-cpp starts giving the "too many tokens" error whenever the chunk size is over 500 tokens (see the token-count sketch below).
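
A minimal sketch for confirming the overflow (assuming the LangChain `LlamaCpp` wrapper exposes its `n_ctx` setting and its underlying `llama_cpp.Llama` client as `llm.client`, as the traceback below suggests; the prompt here only approximates what the "stuff" chain actually builds):

```python
# Rough token count for the stuffed prompt (approximation of what
# load_qa_chain(chain_type="stuff") sends; `docs` and `query` are
# defined in the notebook below).
approx_prompt = "\n\n".join(d.page_content for d in docs) + "\n\nQuestion: " + query
n_tokens = len(llm.client.tokenize(approx_prompt.encode("utf-8")))
print(f"~{n_tokens} prompt tokens vs. a context window of {llm.n_ctx}")
```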

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

Mac M2 Pro

  • Operating System, e.g. for Linux:

Darwin UAVALOS-M-NR30 22.5.0 Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:19 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T6020 arm64

  • SDK version, e.g. for Linux:
Python 3.10.10
GNU Make 3.81
Apple clang version 14.0.3 (clang-1403.0.22.14.1)

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Here is the jupyter notebook:

```python
!pip uninstall llama-cpp-python -y
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
!pip install 'llama-cpp-python[server]'
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install -v datasets loralib sentencepiece 
!pip -v install bitsandbytes accelerate
!pip -v install langchain
!pip install scipy
!pip install xformers
!pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
!pip install langchain faiss-cpu
!pip install sentence-transformers
!pip install GitPython 
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

model_path = "/Users/uavalos/Downloads/ggml-vicuna-13b-1.1/ggml-vic13b-q4_0.bin"

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

n_gpu_layers = 1 # Change this value based on your model and your GPU VRAM pool.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=n_gpu_layers, n_batch=n_batch,
    callback_manager=callback_manager, 
    verbose=True,
)
```

```python
import locale
locale.getpreferredencoding = lambda: "UTF-8"

from git import Repo
import os

repo_path = "/Users/uavalos/Documents/manage"

if os.path.exists(repo_path):
  repo = Repo(repo_path)
  branch = repo.head.reference

from langchain.document_loaders import GitLoader
import re

import time
start_time = time.time()

loader = GitLoader(repo_path=repo_path,
                   branch=branch,
                   # match file extensions only
                   file_filter=lambda file_path: re.match(r".*\.(ts|tsx|json|md|rb)$", file_path))
data = loader.load()

print (f'--- {time.time() - start_time} seconds ---')
print (f'There are {len(data)} files')
!pip install chromadb

from langchain.text_splitter import RecursiveCharacterTextSplitter

import time
start_time = time.time()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50)
chunks = text_splitter.split_documents(data)

from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

print (f'--- {time.time() - start_time} seconds ---')
print (f'There are {len(chunks)} chunks')

from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="/Users/uavalos/Documents/chroma"
)
query = "How to convert the RSSI value to a text label?"
docs = vectorstore.similarity_search(query)

print (f'There are {len(docs)} matching chunks')

from langchain.chains.question_answering import load_qa_chain
chain = load_qa_chain(llm, chain_type="stuff")

import time
start_time = time.time()
result = chain.run(input_documents=docs, question=query)
print (f'--- {time.time() - start_time} seconds ---')
result
query = "How do you get RSSI from a store?"
docs = vectorstore.similarity_search(query)

print (f'There are {len(docs)} matching chunks')

chain = load_qa_chain(llm, chain_type="stuff")

start_time = time.time()
result = chain.run(input_documents=docs, question=query)
print (f'--- {time.time() - start_time} seconds ---')

result
```

Error Logs

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 9
      6 chain = load_qa_chain(llm, chain_type="stuff")
      8 start_time = time.time()
----> 9 result = chain.run(input_documents=docs, question=query)
     10 print (f'--- {time.time() - start_time} seconds ---')
     12 result

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:261, in Chain.run(self, callbacks, *args, **kwargs)
    258     return self(args[0], callbacks=callbacks)[self.output_keys[0]]
    260 if kwargs and not args:
--> 261     return self(kwargs, callbacks=callbacks)[self.output_keys[0]]
    263 if not kwargs and not args:
    264     raise ValueError(
    265         "`run` supported with either positional arguments or keyword arguments,"
    266         " but none were provided."
    267     )

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:147, in Chain.__call__(self, inputs, return_only_outputs, callbacks, include_run_info)
    145 except (KeyboardInterrupt, Exception) as e:
    146     run_manager.on_chain_error(e)
--> 147     raise e
    148 run_manager.on_chain_end(outputs)
    149 final_outputs: Dict[str, Any] = self.prep_outputs(
    150     inputs, outputs, return_only_outputs
    151 )

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:141, in Chain.__call__(self, inputs, return_only_outputs, callbacks, include_run_info)
    135 run_manager = callback_manager.on_chain_start(
    136     dumpd(self),
    137     inputs,
    138 )
    139 try:
    140     outputs = (
--> 141         self._call(inputs, run_manager=run_manager)
    142         if new_arg_supported
    143         else self._call(inputs)
    144     )
    145 except (KeyboardInterrupt, Exception) as e:
    146     run_manager.on_chain_error(e)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/combine_documents/base.py:84, in BaseCombineDocumentsChain._call(self, inputs, run_manager)
     82 # Other keys are assumed to be needed for LLM prediction
     83 other_keys = {k: v for k, v in inputs.items() if k != self.input_key}
---> 84 output, extra_return_dict = self.combine_docs(
     85     docs, callbacks=_run_manager.get_child(), **other_keys
     86 )
     87 extra_return_dict[self.output_key] = output
     88 return extra_return_dict

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/combine_documents/stuff.py:87, in StuffDocumentsChain.combine_docs(self, docs, callbacks, **kwargs)
     85 inputs = self._get_inputs(docs, **kwargs)
     86 # Call predict on the LLM.
---> 87 return self.llm_chain.predict(callbacks=callbacks, **inputs), {}

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/llm.py:218, in LLMChain.predict(self, callbacks, **kwargs)
    203 def predict(self, callbacks: Callbacks = None, **kwargs: Any) -> str:
    204     """Format prompt with kwargs and pass to LLM.
    205 
    206     Args:
   (...)
    216             completion = llm.predict(adjective="funny")
    217     """
--> 218     return self(kwargs, callbacks=callbacks)[self.output_key]

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:147, in Chain.__call__(self, inputs, return_only_outputs, callbacks, include_run_info)
    145 except (KeyboardInterrupt, Exception) as e:
    146     run_manager.on_chain_error(e)
--> 147     raise e
    148 run_manager.on_chain_end(outputs)
    149 final_outputs: Dict[str, Any] = self.prep_outputs(
    150     inputs, outputs, return_only_outputs
    151 )

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:141, in Chain.__call__(self, inputs, return_only_outputs, callbacks, include_run_info)
    135 run_manager = callback_manager.on_chain_start(
    136     dumpd(self),
    137     inputs,
    138 )
    139 try:
    140     outputs = (
--> 141         self._call(inputs, run_manager=run_manager)
    142         if new_arg_supported
    143         else self._call(inputs)
    144     )
    145 except (KeyboardInterrupt, Exception) as e:
    146     run_manager.on_chain_error(e)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/llm.py:74, in LLMChain._call(self, inputs, run_manager)
     69 def _call(
     70     self,
     71     inputs: Dict[str, Any],
     72     run_manager: Optional[CallbackManagerForChainRun] = None,
     73 ) -> Dict[str, str]:
---> 74     response = self.generate([inputs], run_manager=run_manager)
     75     return self.create_outputs(response)[0]

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/llm.py:84, in LLMChain.generate(self, input_list, run_manager)
     82 """Generate LLM result from inputs."""
     83 prompts, stop = self.prep_prompts(input_list, run_manager=run_manager)
---> 84 return self.llm.generate_prompt(
     85     prompts, stop, callbacks=run_manager.get_child() if run_manager else None
     86 )

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/base.py:139, in BaseLLM.generate_prompt(self, prompts, stop, callbacks, **kwargs)
    131 def generate_prompt(
    132     self,
    133     prompts: List[PromptValue],
   (...)
    136     **kwargs: Any,
    137 ) -> LLMResult:
    138     prompt_strings = [p.to_string() for p in prompts]
--> 139     return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/base.py:203, in BaseLLM.generate(self, prompts, stop, callbacks, **kwargs)
    201 except (KeyboardInterrupt, Exception) as e:
    202     run_manager.on_llm_error(e)
--> 203     raise e
    204 run_manager.on_llm_end(output)
    205 if run_manager:

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/base.py:195, in BaseLLM.generate(self, prompts, stop, callbacks, **kwargs)
    190 run_manager = callback_manager.on_llm_start(
    191     dumpd(self), prompts, invocation_params=params, options=options
    192 )
    193 try:
    194     output = (
--> 195         self._generate(
    196             prompts, stop=stop, run_manager=run_manager, **kwargs
    197         )
    198         if new_arg_supported
    199         else self._generate(prompts, stop=stop, **kwargs)
    200     )
    201 except (KeyboardInterrupt, Exception) as e:
    202     run_manager.on_llm_error(e)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/base.py:493, in LLM._generate(self, prompts, stop, run_manager, **kwargs)
    490 new_arg_supported = inspect.signature(self._call).parameters.get("run_manager")
    491 for prompt in prompts:
    492     text = (
--> 493         self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
    494         if new_arg_supported
    495         else self._call(prompt, stop=stop, **kwargs)
    496     )
    497     generations.append([Generation(text=text)])
    498 return LLMResult(generations=generations)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/llamacpp.py:226, in LlamaCpp._call(self, prompt, stop, run_manager, **kwargs)
    221 if self.streaming:
    222     # If streaming is enabled, we use the stream
    223     # method that yields as they are generated
    224     # and return the combined strings from the first choices's text:
    225     combined_text_output = ""
--> 226     for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
    227         combined_text_output += token["choices"][0]["text"]
    228     return combined_text_output

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/llamacpp.py:276, in LlamaCpp.stream(self, prompt, stop, run_manager)
    274 params = self._get_parameters(stop)
    275 result = self.client(prompt=prompt, stream=True, **params)
--> 276 for chunk in result:
    277     token = chunk["choices"][0]["text"]
    278     log_probs = chunk["choices"][0].get("logprobs", None)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/llama_cpp/llama.py:817, in Llama._create_completion(self, prompt, suffix, max_tokens, temperature, top_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor)
    814     llama_cpp.llama_reset_timings(self.ctx)
    816 if len(prompt_tokens) > self._n_ctx:
--> 817     raise ValueError(f"Requested tokens exceed context window of {self._n_ctx}")
    819 # Truncate max_tokens if requested tokens would exceed the context window
    820 max_tokens = (
    821     max_tokens
    822     if max_tokens + len(prompt_tokens) < self._n_ctx
    823     else (self._n_ctx - len(prompt_tokens))
    824 )

ValueError: Requested tokens exceed context window of 512
```

@gjmulder
Contributor

Is this error from langchain or llama-cpp-python?

@gjmulder added the bug ("Something isn't working") label on Jun 23, 2023
@frankandrobot
Author

@gjmulder As far as I can tell, it's from llama-cpp-python. LangChain is just an orchestration layer, and the vector store (chromadb) works just fine: it reports the number of hits before llama gives this error.

@frankandrobot changed the title from "llama_tokenize: too many tokens" to "llama_tokenize: too many tokens (Requested tokens exceed context window of 512)" on Jun 23, 2023
@frankandrobot
Author

Updated with failure logs. Note: I'm getting a `Requested tokens exceed context window of 512` error.

@gjmulder removed the bug ("Something isn't working") label on Jun 23, 2023
@gjmulder
Contributor

Why not increase n_ctx to avoid the error?

@frankandrobot
Author

frankandrobot commented Jun 23, 2023

I didn't know that parameter controlled the context size :-)

Changing it to `n_ctx=1100` made the error go away. Yay!
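
For reference, a minimal sketch of the constructor with the fix applied (same values as in the notebook above; 1100 is just what worked for this setup):

```python
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=1100,                  # default is 512; raise it so the stuffed prompt fits
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
```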

Thanks @gjmulder

@gjmulder
Contributor

Note that llama models support a maximum context size of 2048.
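
A rough sizing sketch to go with that limit (back-of-the-envelope numbers, not from this thread): the "stuff" chain concatenates every retrieved chunk into one prompt, so the chunk size has to leave room for the question, the prompt template, and the generated answer inside `n_ctx`.

```python
# Illustrative chunk-size budget under assumed values (not measurements):
n_ctx = 2048     # maximum context size for the original llama models
k = 4            # LangChain's similarity_search returns 4 chunks by default
reserved = 256   # rough allowance for the question, template, and answer
max_chunk_tokens = (n_ctx - reserved) // k
print(max_chunk_tokens)  # 448 tokens per chunk if all four retrieved chunks must fit
```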

antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this issue Oct 30, 2023
Delete this for now to avoid confusion since it contains some wrong checksums from the old tokenizer format
Re-add after abetlen#374 is resolved
antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this issue Oct 30, 2023
* Revert "Delete SHA256SUMS for now (abetlen#416)"

This reverts commit 8eea5ae.

* Remove ggml files until they can be verified
* Remove alpaca json
* Add also model/tokenizer.model to SHA256SUMS + update README

---------

Co-authored-by: Pavol Rusnak <[email protected]>