Question: How to access feature vector of the intermediate layer of network? #2047
You can see an example of how to extract the vector in the embedding example, but it only extracts after the last layer, and only the last vector. There is also my experiment #1472, where I extract the input to an arbitrary layer. It's a little more complex because it extracts the vector by multiplying it with a coefficient (like +1.0 or -1.0) and accumulating it. It also supports adding it back later during inference.
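Conceptually, that extraction is just a signed sum of layer activations. Here is a minimal NumPy sketch of the idea; layer_input() below is a made-up stand-in for the activations that llama.cpp actually captures inside its C++ layer loop, not a real API.

import numpy as np

n_embd = 4096  # embedding size of the 7B model

def layer_input(prompt: str, layer: int) -> np.ndarray:
    # Stand-in only: in llama.cpp the real per-token activations entering
    # `layer` would be read here; random data keeps the sketch self-contained.
    rng = np.random.default_rng(abs(hash((prompt, layer))) % 2**32)
    n_tokens = len(prompt.split())
    return rng.standard_normal((n_tokens, n_embd)).astype(np.float32)

steering = np.zeros(n_embd, dtype=np.float32)
for prompt, coeff in [("Love", +1.0), ("Hate", -1.0)]:
    # accumulate the positive prompt with +1.0 and the negative one with -1.0
    steering += coeff * layer_input(prompt, layer=4).sum(axis=0)

# later, during inference, the vector can be scaled and added back:
# layer_in += steering_mul * steering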
Thank you for your very informative comments.

$ ./main -m ./models/7B/ggml-model-q4_0.bin --seed 123 -n 64 --steering-add "Love" --steering-sub "Hate" --steering-source 4 --steering-layer 4 --steering-mul 5 --prompt "I hate you because "
main: build = 0 (unknown)
main: seed = 123
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.07 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 256.00 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
main: steering: ('Love' - 'Hate') * 5.000000
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0
I hate you because 1) You are the best thing that has ever happened to me and 2) I'm going to be in your life for a long time.
Love this one so much!
I wish you had said "You are the only woman who has ever been in my life" instead of "You are
llama_print_timings: load time = 60662.62 ms
llama_print_timings: sample time = 134.56 ms / 64 runs ( 2.10 ms per token)
llama_print_timings: prompt eval time = 4729.86 ms / 12 tokens ( 394.15 ms per token)
llama_print_timings: eval time = 74006.17 ms / 63 runs ( 1174.70 ms per token)
llama_print_timings: total time = 139246.77 ms

I have additional questions about the steering branch.

Best regards.
The .bin file is just a dump of the floating-point numbers; I think the code that writes it is still there but commented out. It can be read in easily with NumPy. The steering run processes the same model as normal, and the layers are processed in a loop. The ggml model files only contain the weights; the model definition lives in code only, in llama.cpp's evaluation function. If you want to add some numbers to a specific layer, you have to put a condition inside the loop that checks the layer number, and then add your numbers to that layer's input tensor. Anyway, this is a rough explanation.
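For example, reading the dump back is a single NumPy call. The 512 x 4096 reshape below is my assumption, based on n_ctx = 512 and n_embd = 4096 from the load log above; the actual layout in the steering branch may differ.

import numpy as np

vec = np.fromfile("steering.bin", dtype=np.float32)  # raw float32 dump
print(vec.shape)  # (2097152,) in the session below, i.e. 512 * 4096 values

# assumed layout: one 4096-dimensional vector per context position
mat = vec.reshape(512, 4096)
print(mat.shape)  # (512, 4096)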
Thank you for your explanation. As you taught me, I found the lines that save the .bin file, uncommented them, and ran the program.

>>> vec = np.fromfile('steering.bin', dtype=np.float32)
>>> vec
array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)
>>> vec.shape
(2097152,)
>>> vec.shape[0] / 512
4096.0

First question.

Second question.

ubuntu@ubuntu:~/dlbr/llama.cpp-steering$ ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
main: build = 0 (unknown)
main: quantizing './models/7B/ggml-model-f16.bin' to './models/7B/ggml-model-q4_0.bin' as q4_0
llama.cpp: loading model from ./models/7B/ggml-model-f16.bin
llama.cpp: saving model to ./models/7B/ggml-model-q4_0.bin
[ 1/ 291] tok_embeddings.weight - 4096 x 32000, type = f16, quantizing .. size = 250.00 MB -> 70.31 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 2/ 291] norm.weight - 4096, type = f32, size = 0.016 MB
[ 3/ 291] output.weight - 4096 x 32000, type = f16, quantizing .. size = 250.00 MB -> 70.31 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020
[ 4/ 291] layers.0.attention.wq.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.035 0.012 0.019 0.030 0.047 0.069 0.097 0.129 0.152 0.129 0.098 0.070 0.047 0.031 0.019 0.016
[ 5/ 291] layers.0.attention.wk.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.035 0.012 0.020 0.032 0.049 0.072 0.098 0.125 0.139 0.125 0.099 0.072 0.050 0.033 0.021 0.017
[ 6/ 291] layers.0.attention.wv.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.024 0.037 0.055 0.075 0.096 0.114 0.124 0.114 0.096 0.075 0.055 0.038 0.024 0.020
[ 7/ 291] layers.0.attention.wo.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.013 0.021 0.033 0.051 0.073 0.099 0.123 0.133 0.123 0.099 0.073 0.051 0.033 0.021 0.018
[ 8/ 291] layers.0.attention_norm.weight - 4096, type = f32, size = 0.016 MB
[ 9/ 291] layers.0.feed_forward.w1.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 10/ 291] layers.0.feed_forward.w2.weight - 11008 x 4096, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 11/ 291] layers.0.feed_forward.w3.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 12/ 291] layers.0.ffn_norm.weight - 4096, type = f32, size = 0.016 MB
[ 13/ 291] layers.1.attention.wq.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.120 0.114 0.097 0.076 0.055 0.038 0.024 0.020
[ 14/ 291] layers.1.attention.wk.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.037 0.024 0.020
[ 15/ 291] layers.1.attention.wv.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.097 0.114 0.122 0.115 0.097 0.076 0.055 0.037 0.024 0.020
[ 16/ 291] layers.1.attention.wo.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.013 0.021 0.033 0.050 0.072 0.098 0.124 0.136 0.124 0.098 0.072 0.050 0.033 0.021 0.018
[ 17/ 291] layers.1.attention_norm.weight - 4096, type = f32, size = 0.016 MB
[ 18/ 291] layers.1.feed_forward.w1.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 19/ 291] layers.1.feed_forward.w2.weight - 11008 x 4096, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 20/ 291] layers.1.feed_forward.w3.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 21/ 291] layers.1.ffn_norm.weight - 4096, type = f32, size = 0.016 MB
[ 22/ 291] layers.2.attention.wq.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020
[ 23/ 291] layers.2.attention.wk.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.038 0.024 0.020
[ 24/ 291] layers.2.attention.wv.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.119 0.112 0.096 0.076 0.056 0.039 0.025 0.021
[ 25/ 291] layers.2.attention.wo.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.113 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020
[ 26/ 291] layers.2.attention_norm.weight - 4096, type = f32, size = 0.016 MB
[ 27/ 291] layers.2.feed_forward.w1.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 28/ 291] layers.2.feed_forward.w2.weight - 11008 x 4096, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 29/ 291] layers.2.feed_forward.w3.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 30/ 291] layers.2.ffn_norm.weight - 4096, type = f32, size = 0.016 MB
[ 31/ 291] layers.3.attention.wq.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.096 0.076 0.056 0.038 0.025 0.021
[ 32/ 291] layers.3.attention.wk.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020
[ 33/ 291] layers.3.attention.wv.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021
[ 34/ 291] layers.3.attention.wo.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.116 0.111 0.097 0.077 0.057 0.039 0.025 0.021
[ 35/ 291] layers.3.attention_norm.weight - 4096, type = f32, size = 0.016 MB
[ 36/ 291] layers.3.feed_forward.w1.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 37/ 291] layers.3.feed_forward.w2.weight - 11008 x 4096, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 38/ 291] layers.3.feed_forward.w3.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 39/ 291] layers.3.ffn_norm.weight - 4096, type = f32, size = 0.016 MB
[ 40/ 291] layers.4.attention.wq.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021
[ 41/ 291] layers.4.attention.wk.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020
[ 42/ 291] layers.4.attention.wv.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021
[ 43/ 291] layers.4.attention.wo.weight - 4096 x 4096, type = f16, quantizing .. size = 32.00 MB -> 9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021
[ 44/ 291] layers.4.attention_norm.weight - 4096, type = f32, size = 0.016 MB
[ 45/ 291] layers.4.feed_forward.w1.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 46/ 291] layers.4.feed_forward.w2.weight - 11008 x 4096, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 47/ 291] layers.4.feed_forward.w3.weight - 4096 x 11008, type = f16, quantizing .. size = 86.00 MB -> 24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021
[ 48/ 291] layers.4.ffn_norm.weight - 4096, type = f32, size = 0.016 MB
.
.
.

Third question.

I have not tried inference from an intermediate layer yet, and will ask you about that again in the future.

Best regards.
Well, you should read up on what the steering experiment was about; after all, it is not a normal operation for llama.cpp. But I brought it up because of the way it works: it first reads from some arbitrary layer's output (or input, depending on the perspective) and it also writes it back later.

The steering works with multiple passes. First, the positive and negative steering strings are processed; at this time the interceptor reads the embedding vectors from the layer input (the layer number is configurable from the command line). They are added into the same output vector (multiplied by +1.0 when processing the positive string, and by -1.0 for the negative one). This gives one vector, the "steering vector", which is written into the .bin file.

Then the program runs normally as it used to before, except that now the steering vector is read back, multiplied by a user-definable coefficient, and added in (theoretically, more positive: more steering effect; negative: opposite effect; zero: no effect). We also experimented with injecting the steering into a different layer than the one it was extracted from, which may give a different effect (I didn't really have time to study this). But the main idea of how to mess with the vectors is sketched below.
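Purely as a rough illustration (a toy NumPy model with made-up names, not the actual C++ in the steering branch), the capture pass and the apply pass amount to something like this:

import numpy as np

n_layers, n_ctx, n_embd = 32, 512, 4096

def transformer_layer(il: int, x: np.ndarray) -> np.ndarray:
    # stand-in for the real attention + feed-forward block of layer `il`
    return x

def run(x, capture_layer=None, coeff=0.0, inject_layer=None, mul=0.0, steering=None):
    for il in range(n_layers):
        if il == capture_layer and steering is not None:
            # pass 1: read this layer's input and accumulate it, signed
            steering[: x.shape[0]] += coeff * x
        if il == inject_layer and steering is not None:
            # pass 2: write the saved data back, scaled by the user coefficient
            x = x + mul * steering[: x.shape[0]]
        x = transformer_layer(il, x)
    return x

steering = np.zeros((n_ctx, n_embd), dtype=np.float32)

# capture pass: positive and negative strings with coefficients +1.0 / -1.0
run(np.ones((1, n_embd), np.float32), capture_layer=4, coeff=+1.0, steering=steering)
run(np.ones((1, n_embd), np.float32), capture_layer=4, coeff=-1.0, steering=steering)
# steering.tofile("steering.bin")  # roughly what the dumped file contains

# apply pass: normal generation, with the scaled vector added at the chosen layer
out = run(np.ones((12, n_embd), np.float32), inject_layer=4, mul=5.0, steering=steering)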
If you don't need to read and write at the same time, some steps are optional. For example, embedding.cpp only reads data, so it is simpler.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Expected Behavior
I am interested in the difference between the feature vectors of the intermediate layers of the llama.cpp and PyTorch versions of the LLaMA model. For this purpose, I would like to know how I can get the feature vectors of an intermediate layer, for example in the way torchvision.models.feature_extraction.create_feature_extractor and the register_forward_hook method work in PyTorch.
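For reference, this is a minimal sketch of the register_forward_hook approach I mean on the PyTorch side; the toy nn.Sequential model below is only a stand-in, and the module names in a real LLaMA implementation would be different.

import torch
import torch.nn as nn

# Toy model standing in for a PyTorch LLaMA implementation; only the hook
# mechanics matter here, the real module structure would differ.
model = nn.Sequential(*[nn.Linear(4096, 4096) for _ in range(8)])

features = {}

def save_activation(module, inputs, output):
    # keep the intermediate activation; detach so it is plain data
    features["layer4"] = output.detach()

handle = model[4].register_forward_hook(save_activation)

x = torch.randn(1, 4096)
_ = model(x)
print(features["layer4"].shape)  # torch.Size([1, 4096])

handle.remove()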
Current Behavior
I browsed C++ programs but could not figure out how to get the feature vector.