
[User] No output on Windows with interactive mode. #1529


Closed
chigkim opened this issue May 19, 2023 · 6 comments


chigkim commented May 19, 2023

I get no output when running in interactive mode on Windows.
However, I do get output if I remove --color --interactive --reverse-prompt "User:" and run again.
I also get output if I run the same command on a Mac with --interactive --reverse-prompt "User:".
I built with w64devkit-1.19.0.
Here's the log.

main.exe --ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --batch_size 1024 --repeat_penalty 1.17647 --model "models/wizard-vicuna-13B.ggml.q4_0.bin" --n_predict 2048 --color --interactive --reverse-prompt "User:" --prompt "Text transcript of a never ending dialog, where User interacts with an AI assistant named ChatLLaMa. ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer User's requests immediately and with details and precision. There are no annotations like (30 seconds passed...) or (to himself), just what User and ChatLLaMa say aloud to each other. The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long. The transcript only includes text, it does not include markup like HTML and Markdown."
main: build = 565 (943e608)
main: seed  = 1684519772
llama.cpp: loading model from models/wizard-vicuna-13B.ggml.q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =   0.09 MB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: 'User:'
sampling: repeat_last_n = 256, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.500000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

 Text transcript of a never ending dialog, where User interacts with an AI assistant named ChatLLaMa. ChatLLaMa is helpful, kind, honest, friendly, good at writing and never fails to answer User's requests immediately and with details and precision. There are no annotations like (30 seconds passed...) or (to himself), just what User and ChatLLaMa say aloud to each other. The dialog lasts for years, the entirety of it is shared below. It's 10000 pages long. The transcript only includes text, it does not include markup like HTML and Markdown.

User:Hello!

llama_print_timings:        load time = 29434.94 ms
llama_print_timings:      sample time =     6.12 ms /     4 runs   (    1.53 ms per token)
llama_print_timings: prompt eval time = 27966.57 ms /   137 tokens (  204.14 ms per token)
llama_print_timings:        eval time =  1033.31 ms /     3 runs   (  344.44 ms per token)
llama_print_timings:       total time = 466281.64 ms
Terminate batch job (Y/N)? y

Here's the output without interactive mode.

main.exe --ctx_size 2048 --temp 0.7 --top_k 40 --top_p 0.5 --repeat_last_n 256 --batch_size 1024 --repeat_penalty 1.17647 --model "models/wizard-vicuna-13B.ggml.q4_0.bin" --n_predict 2048 --prompt "Hello!"
main: build = 565 (943e608)
main: seed  = 1684520348
llama.cpp: loading model from models/wizard-vicuna-13B.ggml.q4_0.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =   0.09 MB
llama_model_load_internal: mem required  = 9807.48 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  = 1600.00 MB

system_info: n_threads = 4 / 8 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 256, repeat_penalty = 1.176470, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.500000, typical_p = 1.000000, temp = 0.700000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 2048, n_batch = 512, n_predict = 2048, n_keep = 0


 Hello!
Hello! is a fun and interactive app that allows users to connect with friends, family, and other people they know. With Hello!, you can easily send messages, share photos and videos, play games, and more. Whether you're looking for a new way to stay in touch with loved ones or want to meet new people, Hello! has something for everyone. So why wait? Download Hello! today and start connecting with the world around you! [end of text]

llama_print_timings:        load time =  2828.62 ms
llama_print_timings:      sample time =   142.54 ms /    93 runs   (    1.53 ms per token)
llama_print_timings: prompt eval time =  1651.94 ms /     3 tokens (  550.65 ms per token)
llama_print_timings:        eval time = 30975.91 ms /    92 runs   (  336.69 ms per token)
llama_print_timings:       total time = 34149.42 ms
DannyDaemonic (Contributor) commented:

Try #1462. If you don't know how to check out a pull request, you can apply this patch with patch -p1 < 1462.patch.
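
For reference, a hedged sketch of both routes (the PR number is the one cited above; the upstream repository path and the .patch URL form are assumptions based on how GitHub serves pull requests):

# Option 1: check out the pull request as a local branch
git fetch origin pull/1462/head:pr-1462
git checkout pr-1462

# Option 2: download the patch and apply it to the current checkout
curl -L -o 1462.patch https://github.com/ggerganov/llama.cpp/pull/1462.patch
patch -p1 < 1462.patch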


chigkim commented May 19, 2023

Thanks! That worked.
However, if I merge master, the program quits with this error:

llama.cpp: loading model from models/wizard-vicuna-13B.ggml.q4_0.bin
terminate called after throwing an instance of 'std::runtime_error'
  what():  unexpectedly reached end of file

Hopefully you guys can sort it out when you merge.

DannyDaemonic (Contributor) commented:

That error is happening in llama.cpp, so I think it's unrelated to this one. If you're still getting that error after #1462 is merged, please open a new issue for it.


chigkim commented May 20, 2023

Got it. Thanks!

chigkim closed this as completed May 20, 2023
andzejsp commented:

I'm kind of confused. I'm trying to make it run on Windows, as per the guide:

pip uninstall llama-cpp-python -y
pip cache purge
set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --upgrade

But when I load the model, it shows BLAS = 0 :(

What am I missing?
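
One hedged guess, not a confirmed fix: pip may be reusing a cached or prebuilt wheel of llama-cpp-python that was compiled without cuBLAS, so the CMake flags never take effect. Forcing a rebuild from source looks roughly like this (Windows cmd syntax; LLAMA_CUBLAS was the flag name at the time):

set CMAKE_ARGS=-DLLAMA_CUBLAS=on
set FORCE_CMAKE=1
pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir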

JerryYao80 commented:

@chigkim Congratulations! But I got this error:

main: build = 603 (0e730dd)
main: seed = 1685353971
llama.cpp: loading model from /media/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
Killed

Have you encountered this?
My environment is:

Docker Toolbox 1.13.1
Docker client: 1.13.1, OS/Arch: Windows 7/amd64
Docker server: 19.03.12, OS/Arch: Ubuntu 22.04/amd64
CPU: Intel Core i7-6700, supported instruction sets: MMX, SSE, SSE2, ..., AVX, AVX2, FMA3, TSX
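
For what it's worth, "Killed" right after the memory-requirement line usually means the Linux OOM killer stopped the process: the 7B model wants roughly 5.4 GB plus 1 GB of state, while the Docker Toolbox VirtualBox VM defaults to far less RAM than that. A hedged sketch of raising the VM memory, assuming the default docker-machine VM name "default":

docker-machine stop default
VBoxManage modifyvm default --memory 8192
docker-machine start default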
