
[User] No output on Windows with interactive mode. #1529


Closed
chigkim opened this issue May 19, 2023 · 6 comments


chigkim commented May 19, 2023

I get no output when running in interactive mode on Windows.
However, I do get output if I remove --color --interactive --reverse-prompt "User:" and run again.
I also get output if I run the same command on a Mac with --interactive --reverse-prompt "User:".
I built with w64devkit-1.19.0.
Here's the log.

main.exe --ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --batch_size 1024 --repeat_penalty 1.17647 --model "models/wizard-vicuna-13B.ggml.q4_0.bin" --n_predict 2048 --color --interactive --reverse-prompt "User:" --prompt "Text transcript of a never ending dialog, where User interacts with an AI assistant named ChatLLaMa. ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer User's requests immediately and with details and precision. There are no annotations like (30 seconds passed...) or (to himself), just what User and ChatLLaMa say aloud to each other. The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long. The transcript only includes text, it does not include markup like HTML and Markdown."
main: build = 565 (943e608)
main: seed  = 1684519772
llama.cpp: loading model from models/wizard-vicuna-13B.ggml.q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =   0.09 MB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling: repeat_last_n = 256, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.500000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 Text transcript of a never ending dialog, where User interacts with an AI assistant named ChatLLaMa. ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer User's requests immediately and with details and precision. There are no annotations like (30 seconds passed...) or (to himself), just what User and ChatLLaMa say aloud to each other. The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long. The transcript only includes text, it does not include markup like HTML and Markdown.

User:Hello!

llama_print_timings:        load time = 29434.94 ms
llama_print_timings:      sample time =     6.12 ms /     4 runs   (    1.53 ms per token)
llama_print_timings: prompt eval time = 27966.57 ms /   137 tokens (  204.14 ms per token)
llama_print_timings:        eval time =  1033.31 ms /     3 runs   (  344.44 ms per token)
llama_print_timings:       total time = 466281.64 ms
Terminate batch job (Y/N)? y

Here's the output without interactive mode.

main.exe --ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --batch_size 1024 --repeat_penalty 1.17647 --model "models/wizard-vicuna-13B.ggml.q4_0.bin" --n_predict 2048 --prompt "Hello!"
main: build = 565 (943e608)
main: seed  = 1684520348
llama.cpp: loading model from models/wizard-vicuna-13B.ggml.q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =   0.09 MB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 256, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.500000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0


 Hello!
Hello! is a fun and interactive app that allows users to connect with friends, family, and other people they know. With Hello!, you can easily send messages, share photos and videos, play games, and more. Whether you're looking for a new way to stay in touch with loved ones or want to meet new people, Hello! has something for everyone. So why wait? Download Hello! today and start connecting with the world around you! [end of text]

llama_print_timings:        load time =  2828.62 ms
llama_print_timings:      sample time =   142.54 ms /    93 runs   (    1.53 ms per token)
llama_print_timings: prompt eval time =  1651.94 ms /     3 tokens (  550.65 ms per token)
llama_print_timings:        eval time = 30975.91 ms /    92 runs   (  336.69 ms per token)
llama_print_timings:       total time = 34149.42 ms
DannyDaemonic (Contributor) commented:

Try #1462. If you don't know how to check out a pull request, you can apply this patch with patch -p1 < 1462.patch.
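
For reference, a hedged sketch of both routes (the PR number is the one cited above; the upstream repository path and the .patch URL form are assumptions based on how GitHub serves pull requests):

# Option 1: check out the pull request as a local branch
git fetch origin pull/1462/head:pr-1462
git checkout pr-1462

# Option 2: download the patch and apply it to the current checkout
curl -L -o 1462.patch https://github.com/ggerganov/llama.cpp/pull/1462.patch
patch -p1 < 1462.patch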


chigkim commented May 19, 2023

Thanks! That worked.
However, if I merge master, the program quits with this error:

llama.cpp: loading model from models/wizard-vicuna-13B.ggml.q4_0.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file

Hopefully you guys can sort it out when you merge.

DannyDaemonic (Contributor) commented:

That error is happening in llama.cpp, so I think it's unrelated to this one. If you're still getting that error after #1462 is merged, please open a new issue for it.


chigkim commented May 20, 2023

Got it. Thanks!

chigkim closed this as completed May 20, 2023
andzejsp commented:

I'm kind of confused. I'm trying to make it run on Windows, as per the guide:

pip uninstall llama-cpp-python -y
pip cache purge
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --upgrade

But when I load the model, it shows BLAS = 0 :(

What am I missing?
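
One hedged guess, not a confirmed fix: pip may be reusing a cached or prebuilt wheel of llama-cpp-python that was compiled without cuBLAS, so the CMake flags never take effect. Forcing a rebuild from source looks roughly like this (Windows cmd syntax; LLAMA_CUBLAS was the flag name at the time):

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir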

JerryYao80 commented:

@chigkim Congratulations! But I got this error:

main: build = 603 (0e730dd)
main: seed = 1685353971
llama.cpp: loading model from /media/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
Killed

Have you encountered this?
My environment is:

Docker Toolbox 1.13.1
Docker client: 1.13.1, OS/Arch: Windows 7/amd64
Docker server: 19.03.12, OS/Arch: Ubuntu 22.04/amd64
CPU: Intel Core i7-6700, supported instruction sets: MMX, SSE, SSE2, ..., AVX, AVX2, FMA3, TSX
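
For what it's worth, "Killed" right after the memory-requirement line usually means the Linux OOM killer stopped the process: the 7B model wants roughly 5.4 GB plus 1 GB of state, while the Docker Toolbox VirtualBox VM defaults to far less RAM than that. A hedged sketch of raising the VM memory, assuming the default docker-machine VM name "default":

docker-machine stop default
VBoxManage modifyvm default --memory 8192
docker-machine start default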
