Firstly, thanks to GG and contributors for a great library/utility.
When generating with the gpt-2 example, ggml bombs out at around 824 or 825 tokens: it reports an error and then dumps core.
I would expect a failure (hopefully not a fatal error and a core dump) once the total token count reaches the context size, but a limit at 824 or 825 tokens seems odd.
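Some quick arithmetic on the numbers in the error message (assuming, as I believe, that they are byte counts): 268435456 is exactly 256 * 1024 * 1024, i.e. a hard 256 MiB buffer rather than anything derived from the model's 1024-token context, and the shortfall at the failure point is only 268457104 - 268435456 = 21648 bytes (about 21 KiB). So this looks like a fixed scratch buffer being overrun by a small margin, not the context limit.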
The same error is referenced in the llama.cpp repo, but possibly for a different reason: ggml-org/llama.cpp#2404
REPRODUCE:
Clean build, CPU only, Ubuntu 22: git pull && rm -Rf build && mkdir build && cd build && cmake .. && make
with ggml-model-f16.bin (gpt2-xl), e.g. bin/gpt-2 -m ~/gpt-2/models/1558M/ggml-model-f16.bin -n ...:
-n 823: ok (run completes without error)
-n 824: ggml_new_object: not enough space in the context's memory pool (needed 268457104, available 268435456)
-n 825: ggml_new_object: not enough space in the context's memory pool (needed 268457104, available 268435456)
with ggml-model-f32.bin (gpt2-xl):
-n 823: ok
-n 824: ok
-n 825: ggml_new_object: not enough space in the context's memory pool (needed 268457104, available 268435456)
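My guess at the cause (unverified, so please treat the snippet below as a paraphrased sketch of what I think the example does, not a quote of the actual source): the gpt-2 example appears to allocate a fixed 256 MiB eval buffer and to grow it only when mem_per_token*N would exceed it. But mem_per_token is estimated on the first eval and N is 1 while generating token-by-token, whereas the memory actually needed per eval grows with n_past (the attention matrices scale with the number of past tokens), so the grow check never fires and the pool eventually overflows:

```c
#include <stdlib.h>
#include "ggml.h"

// Paraphrased sketch of how I believe examples/gpt-2 sizes its eval buffer;
// names and exact heuristics are from memory and may not match the source.
static size_t buf_size = 256u*1024*1024;   // == 268435456, the "available" value in the error
static void * buf      = NULL;

struct ggml_context * make_eval_ctx(size_t mem_per_token, int N) {
    if (buf == NULL) {
        buf = malloc(buf_size);
    }

    // Grow check: with N == 1 during generation, mem_per_token*N stays tiny
    // even though the memory needed per eval grows with n_past, so the
    // buffer is never enlarged.
    if (mem_per_token > 0 && mem_per_token*(size_t)N > buf_size) {
        buf_size = (size_t)(1.1*mem_per_token*N); // ~10% headroom
        buf      = realloc(buf, buf_size);
    }

    struct ggml_init_params params = {
        /*.mem_size   =*/ buf_size,
        /*.mem_buffer =*/ buf,
        /*.no_alloc   =*/ false,
    };
    return ggml_init(params);  // ggml_new_object fails once this pool runs out
}
```

An obvious (untested) workaround would be to bump the hard-coded 256 MiB, but if the above is right, the grow heuristic itself is the real bug, and the one-token difference between the f16 and f32 models would just reflect slightly different graph sizes.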
Note: I had to repeat some runs several times, as generation stops early if an <|endoftext|> token is sampled, so getting to 823+ tokens can take a few tries.