Description
A recent paper by Meta/MIT/CMU proposed StreamingLLM, a simple yet efficient solution to enable "infinite" context. Better yet, the implementation in llama.cpp is as trivial as changing the `n_keep` value with the `--keep` option, as discussed in this issue. Unfortunately, the high-level API of llama-cpp-python does not support the `keep`/`n_keep` parameter.
It should be simple to add the parameter to the high-level API, ideally in the constructor of the `Llama` class, and to pass it along to `llama_cpp.llama_load_model_from_file` as part of the `lparams` parameter here.
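
For illustration, a minimal sketch of how the parameter might surface in the high-level API. The `LlamaWithKeep` wrapper, the `n_keep` keyword, and the model path below are hypothetical placeholders, not the actual llama-cpp-python implementation:

```python
import llama_cpp


class LlamaWithKeep(llama_cpp.Llama):
    """Illustrative wrapper: accept an `n_keep` value alongside the usual kwargs."""

    def __init__(self, model_path: str, n_keep: int = 0, **kwargs):
        super().__init__(model_path=model_path, **kwargs)
        # Number of initial prompt tokens to retain when the context window is
        # shifted (the llama.cpp `--keep` / `n_keep` behaviour). Storing it on
        # the instance is only a placeholder; a real patch would thread the
        # value into the context/generation parameters inside llama-cpp-python.
        self.n_keep = n_keep


# Desired high-level usage once the parameter is supported (hypothetical):
# llm = llama_cpp.Llama(model_path="./models/model.gguf", n_keep=64)
```

Whether the value ultimately belongs in `lparams` at model load time or in the per-context/generation settings is a design choice for the maintainers; the sketch only shows where the parameter could appear in the public API.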