Description
What happened?
To reproduce:
Download the officially released GGUF model from Hugging Face (microsoft).
Run server.exe -m Phi3-mini-4k.gguf -c 4096
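When the input prompt is shorter than ~2048 tokens: output is fine at first, but starts getting weird right after the total (prompt plus generated text) reaches ~2048.
When the input prompt is longer than ~2048 tokens: output is weird from the start.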
The weird output looks like what we would expect when the context exceeds what the model supports, but it shows up at ~2048 rather than 4096, which suggests a bug.
Also tested Llama3-8B: it works fine with input prompts < 8192 (with -c 8192) and with input prompts < 4096 (with -c 4096), as expected.
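For anyone who wants to check the ~2048 threshold without pasting long prompts by hand, here is a rough sketch of a script that sends prompts of increasing length to the running server. It assumes the server listens on the default `http://127.0.0.1:8080` and accepts a JSON body with `prompt` and `n_predict` on `/completion`, returning the text in a `content` field. The helper names (`make_prompt`, `build_payload`, `complete`) and the filler text are made up for illustration, and the word counts only approximate token counts.

```python
import json
import urllib.request

# Assumption: server.exe from the steps above is running on the default address.
SERVER_URL = "http://127.0.0.1:8080/completion"


def make_prompt(n_words: int) -> str:
    """Build n_words filler words followed by a short instruction.

    Word count is only a rough proxy for token count.
    """
    filler = " ".join(["hello"] * n_words)
    return filler + "\nCount from 1 to 10:"


def build_payload(n_words: int, n_predict: int = 64) -> dict:
    """Request body for the /completion endpoint (assumed field names)."""
    return {"prompt": make_prompt(n_words), "n_predict": n_predict}


def complete(n_words: int, url: str = SERVER_URL) -> str:
    """POST one prompt to the server and return the generated text."""
    data = json.dumps(build_payload(n_words)).encode("utf-8")
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.loads(resp.read())["content"]


if __name__ == "__main__":
    # Lengths chosen to straddle the ~2048 mark reported above.
    for n_words in (500, 1500, 1900, 2100, 3000):
        print(n_words, repr(complete(n_words)[:80]))
```

If the behaviour matches the report, the shorter prompts should give sensible completions and the longer ones should not, even though all of them fit within -c 4096.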
Name and Version
version: 3015 (74b239b)
built with MSVC 19.39.33523.0 for x64
Tried both the CUDA and AVX2 builds.
Also tried the latest version, which I built myself with Intel SYCL:
version: 3075 (3d7ebf6)
built with IntelLLVM 2024.1.0
What operating system are you seeing the problem on?
Win10, Win11