llama_tokenize: too many tokens (Requested tokens exceed context window of 512) #416

Closed · 4 tasks done

frankandrobot opened this issue Jun 23, 2023 · 6 comments

@frankandrobot

frankandrobot commented Jun 23, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

  • Currently using llama-cpp with a langchain vector store.
  • The model is ggml-vic13b-q4_0.bin
  • I'm also chunking documents using RecursiveCharacterTextSplitter.
  • I expect chunks of up to 1000 tokens to work with "stuff"-type chained queries, especially when there are only a few matching chunks.
  • For comparison, Vicuna 7B (not using llama-cpp) works just fine with a chunk size of 1000.

See the notebook below.

Current Behavior

  • llama-cpp starts giving the "too many tokens" error whenever the chunk size is over 500 tokens (see the token-count sketch below).
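
A minimal sketch for confirming the overflow (assuming the LangChain `LlamaCpp` wrapper exposes its `n_ctx` setting and its underlying `llama_cpp.Llama` client as `llm.client`, as the traceback below suggests; the prompt here only approximates what the "stuff" chain actually builds):

```python
# Rough token count for the stuffed prompt (approximation of what
# load_qa_chain(chain_type="stuff") sends; `docs` and `query` are
# defined in the notebook below).
approx_prompt = "\n\n".join(d.page_content for d in docs) + "\n\nQuestion: " + query
n_tokens = len(llm.client.tokenize(approx_prompt.encode("utf-8")))
print(f"~{n_tokens} prompt tokens vs. a context window of {llm.n_ctx}")
```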

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

Mac M2 Pro

  • Operating System, e.g. for Linux:

Darwin UAVALOS-M-NR30 22.5.0 Darwin Kernel Version 22.5.0: Mon Apr 24 20:53:19 PDT 2023; root:xnu-8796.121.2~5/RELEASE_ARM64_T6020 arm64

  • SDK version, e.g. for Linux:
Python 3.10.10
GNU Make 3.81
Apple clang version 14.0.3 (clang-1403.0.22.14.1)

Failure Information (for bugs)

Please help provide information about the failure if this is a bug. If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Here is the jupyter notebook:

```python
!pip uninstall llama-cpp-python -y
!CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install -U llama-cpp-python --no-cache-dir
!pip install 'llama-cpp-python[server]'
!pip install git+https://github.com/huggingface/peft.git
!pip install git+https://github.com/huggingface/transformers.git
!pip install -v datasets loralib sentencepiece 
!pip -v install bitsandbytes accelerate
!pip -v install langchain
!pip install scipy
!pip install xformers
!pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
!pip install langchain faiss-cpu
!pip install sentence-transformers
!pip install GitPython 
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

model_path = "/Users/uavalos/Downloads/ggml-vicuna-13b-1.1/ggml-vic13b-q4_0.bin"

callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

n_gpu_layers = 1 # Change this value based on your model and your GPU VRAM pool.
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

llm = LlamaCpp(
    model_path=model_path,
    n_gpu_layers=n_gpu_layers, n_batch=n_batch,
    callback_manager=callback_manager, 
    verbose=True,
)
```

```python
import locale
locale.getpreferredencoding = lambda: "UTF-8"

from git import Repo
import os

repo_path = "/Users/uavalos/Documents/manage"

if os.path.exists(repo_path):
  repo = Repo(repo_path)
  branch = repo.head.reference

from langchain.document_loaders import GitLoader
import re

import time
start_time = time.time()

loader = GitLoader(repo_path=repo_path,
                   branch=branch,
                   # match file extensions only
                   file_filter=lambda file_path: re.match(r".*\.(ts|tsx|json|md|rb)$", file_path))
data = loader.load()

print (f'--- {time.time() - start_time} seconds ---')
print (f'There are {len(data)} files')
!pip install chromadb

from langchain.text_splitter import RecursiveCharacterTextSplitter

import time
start_time = time.time()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50)
chunks = text_splitter.split_documents(data)

from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()

print (f'--- {time.time() - start_time} seconds ---')
print (f'There are {len(chunks)} chunks')

from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="/Users/uavalos/Documents/chroma"
)
query = "How to convert the RSSI value to a text label?"
docs = vectorstore.similarity_search(query)

print (f'There are {len(docs)} matching chunks')

from langchain.chains.question_answering import load_qa_chain
chain = load_qa_chain(llm, chain_type="stuff")

import time
start_time = time.time()
result = chain.run(input_documents=docs, question=query)
print (f'--- {time.time() - start_time} seconds ---')
result
query = "How do you get RSSI from a store?"
docs = vectorstore.similarity_search(query)

print (f'There are {len(docs)} matching chunks')

chain = load_qa_chain(llm, chain_type="stuff")

start_time = time.time()
result = chain.run(input_documents=docs, question=query)
print (f'--- {time.time() - start_time} seconds ---')

result
```

Error Logs

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[20], line 9
      6 chain = load_qa_chain(llm, chain_type="stuff")
      8 start_time = time.time()
----> 9 result = chain.run(input_documents=docs, question=query)
     10 print (f'--- {time.time() - start_time} seconds ---')
     12 result

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:261, in Chain.run(self, callbacks, *args, **kwargs)
    258     return self(args[0], callbacks=callbacks)[self.output_keys[0]]
    260 if kwargs and not args:
--> 261     return self(kwargs, callbacks=callbacks)[self.output_keys[0]]
    263 if not kwargs and not args:
    264     raise ValueError(
    265         "`run` supported with either positional arguments or keyword arguments,"
    266         " but none were provided."
    267     )

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:147, in Chain.__call__(self, inputs, return_only_outputs, callbacks, include_run_info)
    145 except (KeyboardInterrupt, Exception) as e:
    146     run_manager.on_chain_error(e)
--> 147     raise e
    148 run_manager.on_chain_end(outputs)
    149 final_outputs: Dict[str, Any] = self.prep_outputs(
    150     inputs, outputs, return_only_outputs
    151 )

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:141, in Chain.__call__(self, inputs, return_only_outputs, callbacks, include_run_info)
    135 run_manager = callback_manager.on_chain_start(
    136     dumpd(self),
    137     inputs,
    138 )
    139 try:
    140     outputs = (
--> 141         self._call(inputs, run_manager=run_manager)
    142         if new_arg_supported
    143         else self._call(inputs)
    144     )
    145 except (KeyboardInterrupt, Exception) as e:
    146     run_manager.on_chain_error(e)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/combine_documents/base.py:84, in BaseCombineDocumentsChain._call(self, inputs, run_manager)
     82 # Other keys are assumed to be needed for LLM prediction
     83 other_keys = {k: v for k, v in inputs.items() if k != self.input_key}
---> 84 output, extra_return_dict = self.combine_docs(
     85     docs, callbacks=_run_manager.get_child(), **other_keys
     86 )
     87 extra_return_dict[self.output_key] = output
     88 return extra_return_dict

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/combine_documents/stuff.py:87, in StuffDocumentsChain.combine_docs(self, docs, callbacks, **kwargs)
     85 inputs = self._get_inputs(docs, **kwargs)
     86 # Call predict on the LLM.
---> 87 return self.llm_chain.predict(callbacks=callbacks, **inputs), {}

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/llm.py:218, in LLMChain.predict(self, callbacks, **kwargs)
    203 def predict(self, callbacks: Callbacks = None, **kwargs: Any) -> str:
    204     """Format prompt with kwargs and pass to LLM.
    205 
    206     Args:
   (...)
    216             completion = llm.predict(adjective="funny")
    217     """
--> 218     return self(kwargs, callbacks=callbacks)[self.output_key]

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:147, in Chain.__call__(self, inputs, return_only_outputs, callbacks, include_run_info)
    145 except (KeyboardInterrupt, Exception) as e:
    146     run_manager.on_chain_error(e)
--> 147     raise e
    148 run_manager.on_chain_end(outputs)
    149 final_outputs: Dict[str, Any] = self.prep_outputs(
    150     inputs, outputs, return_only_outputs
    151 )

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/base.py:141, in Chain.__call__(self, inputs, return_only_outputs, callbacks, include_run_info)
    135 run_manager = callback_manager.on_chain_start(
    136     dumpd(self),
    137     inputs,
    138 )
    139 try:
    140     outputs = (
--> 141         self._call(inputs, run_manager=run_manager)
    142         if new_arg_supported
    143         else self._call(inputs)
    144     )
    145 except (KeyboardInterrupt, Exception) as e:
    146     run_manager.on_chain_error(e)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/llm.py:74, in LLMChain._call(self, inputs, run_manager)
     69 def _call(
     70     self,
     71     inputs: Dict[str, Any],
     72     run_manager: Optional[CallbackManagerForChainRun] = None,
     73 ) -> Dict[str, str]:
---> 74     response = self.generate([inputs], run_manager=run_manager)
     75     return self.create_outputs(response)[0]

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/chains/llm.py:84, in LLMChain.generate(self, input_list, run_manager)
     82 """Generate LLM result from inputs."""
     83 prompts, stop = self.prep_prompts(input_list, run_manager=run_manager)
---> 84 return self.llm.generate_prompt(
     85     prompts, stop, callbacks=run_manager.get_child() if run_manager else None
     86 )

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/base.py:139, in BaseLLM.generate_prompt(self, prompts, stop, callbacks, **kwargs)
    131 def generate_prompt(
    132     self,
    133     prompts: List[PromptValue],
   (...)
    136     **kwargs: Any,
    137 ) -> LLMResult:
    138     prompt_strings = [p.to_string() for p in prompts]
--> 139     return self.generate(prompt_strings, stop=stop, callbacks=callbacks, **kwargs)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/base.py:203, in BaseLLM.generate(self, prompts, stop, callbacks, **kwargs)
    201 except (KeyboardInterrupt, Exception) as e:
    202     run_manager.on_llm_error(e)
--> 203     raise e
    204 run_manager.on_llm_end(output)
    205 if run_manager:

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/base.py:195, in BaseLLM.generate(self, prompts, stop, callbacks, **kwargs)
    190 run_manager = callback_manager.on_llm_start(
    191     dumpd(self), prompts, invocation_params=params, options=options
    192 )
    193 try:
    194     output = (
--> 195         self._generate(
    196             prompts, stop=stop, run_manager=run_manager, **kwargs
    197         )
    198         if new_arg_supported
    199         else self._generate(prompts, stop=stop, **kwargs)
    200     )
    201 except (KeyboardInterrupt, Exception) as e:
    202     run_manager.on_llm_error(e)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/base.py:493, in LLM._generate(self, prompts, stop, run_manager, **kwargs)
    490 new_arg_supported = inspect.signature(self._call).parameters.get("run_manager")
    491 for prompt in prompts:
    492     text = (
--> 493         self._call(prompt, stop=stop, run_manager=run_manager, **kwargs)
    494         if new_arg_supported
    495         else self._call(prompt, stop=stop, **kwargs)
    496     )
    497     generations.append([Generation(text=text)])
    498 return LLMResult(generations=generations)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/llamacpp.py:226, in LlamaCpp._call(self, prompt, stop, run_manager, **kwargs)
    221 if self.streaming:
    222     # If streaming is enabled, we use the stream
    223     # method that yields as they are generated
    224     # and return the combined strings from the first choices's text:
    225     combined_text_output = ""
--> 226     for token in self.stream(prompt=prompt, stop=stop, run_manager=run_manager):
    227         combined_text_output += token["choices"][0]["text"]
    228     return combined_text_output

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/langchain/llms/llamacpp.py:276, in LlamaCpp.stream(self, prompt, stop, run_manager)
    274 params = self._get_parameters(stop)
    275 result = self.client(prompt=prompt, stream=True, **params)
--> 276 for chunk in result:
    277     token = chunk["choices"][0]["text"]
    278     log_probs = chunk["choices"][0].get("logprobs", None)

File ~/miniconda3/envs/llama/lib/python3.11/site-packages/llama_cpp/llama.py:817, in Llama._create_completion(self, prompt, suffix, max_tokens, temperature, top_p, logprobs, echo, stop, frequency_penalty, presence_penalty, repeat_penalty, top_k, stream, tfs_z, mirostat_mode, mirostat_tau, mirostat_eta, model, stopping_criteria, logits_processor)
    814     llama_cpp.llama_reset_timings(self.ctx)
    816 if len(prompt_tokens) > self._n_ctx:
--> 817     raise ValueError(f"Requested tokens exceed context window of {self._n_ctx}")
    819 # Truncate max_tokens if requested tokens would exceed the context window
    820 max_tokens = (
    821     max_tokens
    822     if max_tokens + len(prompt_tokens) < self._n_ctx
    823     else (self._n_ctx - len(prompt_tokens))
    824 )

ValueError: Requested tokens exceed context window of 512
```

@gjmulder
Contributor

Is this error from langchain or llama-cpp-python?

@gjmulder added the bug ("Something isn't working") label on Jun 23, 2023
@frankandrobot
Author

@gjmulder As far as I can tell, it's from llama-cpp-python. LangChain is just an orchestration layer, and the vector store (chromadb) works just fine: it reports the number of hits before llama gives this error.

@frankandrobot changed the title from "llama_tokenize: too many tokens" to "llama_tokenize: too many tokens (Requested tokens exceed context window of 512)" on Jun 23, 2023
@frankandrobot
Author

Updated with failure logs. Note: I'm getting a `Requested tokens exceed context window of 512` error.

@gjmulder removed the bug ("Something isn't working") label on Jun 23, 2023
@gjmulder
Contributor

Why not increase n_ctx to avoid the error?

@frankandrobot
Author

frankandrobot commented Jun 23, 2023

I didn't know that parameter controlled the context size :-)

Changing it to `n_ctx=1100` made the error go away. Yay!
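
For reference, a minimal sketch of the constructor with the fix applied (same values as in the notebook above; 1100 is just what worked for this setup):

```python
llm = LlamaCpp(
    model_path=model_path,
    n_ctx=1100,                  # default is 512; raise it so the stuffed prompt fits
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    callback_manager=callback_manager,
    verbose=True,
)
```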

Thanks @gjmulder

@gjmulder
Contributor

Note that llama models support a maximum context size of 2048.
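
A rough sizing sketch to go with that limit (back-of-the-envelope numbers, not from this thread): the "stuff" chain concatenates every retrieved chunk into one prompt, so the chunk size has to leave room for the question, the prompt template, and the generated answer inside `n_ctx`.

```python
# Illustrative chunk-size budget under assumed values (not measurements):
n_ctx = 2048     # maximum context size for the original llama models
k = 4            # LangChain's similarity_search returns 4 chunks by default
reserved = 256   # rough allowance for the question, template, and answer
max_chunk_tokens = (n_ctx - reserved) // k
print(max_chunk_tokens)  # 448 tokens per chunk if all four retrieved chunks must fit
```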

antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this issue Oct 30, 2023
Delete this for now to avoid confusion since it contains some wrong checksums from the old tokenizer format
Re-add after abetlen#374 is resolved
antoine-lizee pushed a commit to antoine-lizee/llama-cpp-python that referenced this issue Oct 30, 2023
* Revert "Delete SHA256SUMS for now (abetlen#416)"

This reverts commit 8eea5ae.

* Remove ggml files until they can be verified
* Remove alpaca json
* Add also model/tokenizer.model to SHA256SUMS + update README

---------

Co-authored-by: Pavol Rusnak <[email protected]>