Question: How to access feature vector of the intermediate layer of network? #2047

Closed
sohta94 opened this issue Jun 29, 2023 · 6 comments

sohta94 commented Jun 29, 2023

Prerequisites

  • [Yes] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [Yes] I carefully followed the README.md.
  • [Yes] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [Yes] I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

I am interested in the difference between the intermediate-layer feature vectors of the llama.cpp and PyTorch versions of the LLaMA model.
For this purpose, I would like to know how I can get the feature vectors of an intermediate layer, similar to torchvision.models.feature_extraction.create_feature_extractor or the register_forward_hook method in PyTorch.

Current Behavior

I browsed the C++ code but could not figure out how to get these feature vectors.

SlyEcho (Collaborator) commented Jun 29, 2023

You can see an example of how to extract the vector in the embedding example, but it only extracts the output after the last layer, and only the last vector.

There is also my experiment #1472, where I extract the input to an arbitrary layer. It's a little more complex because it extracts the vector by multiplying it with a coefficient (like +1.0 or -1.0) and accumulating it. It also supports adding it back later during inference.
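
For reference, below is a minimal sketch of the first approach, in the spirit of examples/embedding and assuming the C API as it was around mid-2023 (llama_init_from_file, llama_eval, llama_get_embeddings); the model path, prompt, and thread count are placeholders, not anything required by this issue.

// Minimal sketch: run one forward pass and read the last-layer embedding of the
// last token, similar to what examples/embedding does (mid-2023 C API).
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>

int main() {
    llama_context_params params = llama_context_default_params();
    params.embedding = true; // keep the final-layer embedding around after eval

    llama_context * ctx = llama_init_from_file("./models/7B/ggml-model-q4_0.bin", params);
    if (!ctx) return 1;

    const std::string prompt = "I hate you because ";
    std::vector<llama_token> tokens(prompt.size() + 4);
    const int n_tok = llama_tokenize(ctx, prompt.c_str(), tokens.data(), (int) tokens.size(), /*add_bos=*/true);
    if (n_tok < 0) return 1;
    tokens.resize(n_tok);

    // one forward pass over the whole prompt
    if (llama_eval(ctx, tokens.data(), (int) tokens.size(), /*n_past=*/0, /*n_threads=*/4) != 0) return 1;

    // n_embd floats: the post-final-norm vector of the last evaluated token
    const int     n_embd = llama_n_embd(ctx);
    const float * embd   = llama_get_embeddings(ctx);
    for (int i = 0; i < 8 && i < n_embd; i++) {
        printf("%f ", embd[i]);
    }
    printf("\n");

    llama_free(ctx);
    return 0;
}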

sohta94 (Author) commented Jun 29, 2023

Thank you for your very informative comments.
With reference to #1472, I cloned the steering branch and tried the steering options as follows.

$ ./main -m ./models/7B/ggml-model-q4_0.bin --seed 123 -n 64   --steering-add "Love"   --steering-sub "Hate"   --steering-source 4   --steering-layer 4   --steering-mul 5   --prompt "I hate you because "
main: build = 0 (unknown)
main: seed  = 123
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =   0.07 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 | 
main: steering: ('Love' - 'Hate') * 5.000000
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 64, n_keep = 0


 I hate you because 1) You are the best thing that has ever happened to me and 2) I'm going to be in your life for a long time.
Love this one so much!
I wish you had said "You are the only woman who has ever been in my life" instead of "You are
llama_print_timings:        load time = 60662.62 ms
llama_print_timings:      sample time =   134.56 ms /    64 runs   (    2.10 ms per token)
llama_print_timings: prompt eval time =  4729.86 ms /    12 tokens (  394.15 ms per token)
llama_print_timings:        eval time = 74006.17 ms /    63 runs   ( 1174.70 ms per token)
llama_print_timings:       total time = 139246.77 ms

I have additional questions about the steering branch.

  • I cannot find the file indicated at "~/src/llama.cpp/build/steering.bin". I would like to know how to generate the .bin file containing the intermediate-layer feature vector so that I can load it with NumPy.
  • I thought --steering-layer 4 was an operation on the output of the 4th layer, but where is the order of the model's layers defined? I would like to know how to identify which layer is layer n. Does it match the output of print(model) in the PyTorch version?
  • I am also interested in "It also supports adding it back later on inference." Given a NumPy file intermediate_feature.npy, I would like to know how to run inference while feeding it into layer n.

Best regards.

SlyEcho (Collaborator) commented Jun 29, 2023

The .bin file is just a dump of the floating-point numbers; I think the code that writes it is still in there but commented out. It can be read easily with NumPy.

The steering processes the same model as normal, and the layers are processed in a loop. The ggml model files only contain the weights; the model definition is in code only, in llama.cpp, in the function llama_eval_internal(). As far as I understand, the weight names don't match the PyTorch models exactly; you can see in convert.py how they are mapped.

If you want to add some numbers at a specific layer, you have to have a condition inside the loop that checks the layer number, and then add your numbers to inpL, inpSA, cur, or whatever vector you need using ggml operations. That means you also have to read your data into a ggml tensor at the beginning.

Anyway, this is a rough explanation.
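
To make that concrete, here is a small self-contained sketch of the ggml side of it (mid-2023 ggml API): load a raw float32 dump into a ggml tensor and add it to another tensor through the compute graph. The file name, the sizes, and the zero-filled stand-in for the activation (inpL/cur in llama.cpp) are placeholders only, not the actual llama.cpp code.

// Self-contained sketch (mid-2023 ggml API): read a float32 dump into a ggml
// tensor and add it to an activation-like tensor through the compute graph.
#include "ggml.h"
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const int n_embd = 4096; // placeholder size (one embedding vector)

    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16u*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx0 = ggml_init(ip);

    // read the raw float32 dump from disk into host memory
    std::vector<float> data(n_embd, 0.0f);
    FILE * f = fopen("steering.bin", "rb"); // placeholder file name
    if (f) {
        size_t nread = fread(data.data(), sizeof(float), data.size(), f);
        (void) nread;
        fclose(f);
    }

    // copy the data into a ggml tensor so it can take part in the graph
    struct ggml_tensor * steer = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, n_embd);
    memcpy(steer->data, data.data(), ggml_nbytes(steer));

    // stand-in for the activation you would intercept inside the layer loop
    struct ggml_tensor * act = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, n_embd);
    memset(act->data, 0, ggml_nbytes(act));

    // describe the operation; nothing is computed yet
    struct ggml_tensor * sum = ggml_add(ctx0, act, steer);

    // build and run the graph, then read the result back
    struct ggml_cgraph gf = {};
    gf.n_threads = 4;
    ggml_build_forward_expand(&gf, sum);
    ggml_graph_compute(ctx0, &gf);

    printf("sum[0] = %f\n", ggml_get_data_f32(sum)[0]);

    ggml_free(ctx0);
    return 0;
}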

sohta94 (Author) commented Jun 30, 2023

Thank you for your explanation.

As you explained, I found the lines that save the .bin file, uncommented them, and ran the program again.
Now I can get steering.bin and load it in Python with np.fromfile('steering.bin', dtype=np.float32).

>>> import numpy as np
>>> vec = np.fromfile('steering.bin', dtype=np.float32)
>>> vec
array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)
>>> vec.shape
(2097152,)
>>> vec.shape[0] / 512   # 2097152 = 512 (n_ctx) * 4096 (n_embd)
4096.0

First question.
Since the LLaMA 7B model here handles up to 512 tokens (n_ctx = 512), does this mean that an array for 512 tokens is allocated in advance, and that in the C++ implementation it is filled only up to the current token length, with the rest left as zeros?

Second question.
Is it correct that --steering-layer 4 refers to the feature vector of layers.4, as in the layer names displayed during quantization below?

ubuntu@ubuntu:~/dlbr/llama.cpp-steering$ ./quantize ./models/7B/ggml-model-f16.bin ./models/7B/ggml-model-q4_0.bin q4_0
main: build = 0 (unknown)
main: quantizing './models/7B/ggml-model-f16.bin' to './models/7B/ggml-model-q4_0.bin' as q4_0
llama.cpp: loading model from ./models/7B/ggml-model-f16.bin
llama.cpp: saving model to ./models/7B/ggml-model-q4_0.bin
[   1/ 291]                tok_embeddings.weight -     4096 x 32000, type =    f16, quantizing .. size =   250.00 MB ->    70.31 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[   2/ 291]                          norm.weight -             4096, type =    f32, size =    0.016 MB
[   3/ 291]                        output.weight -     4096 x 32000, type =    f16, quantizing .. size =   250.00 MB ->    70.31 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020 
[   4/ 291]         layers.0.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.035 0.012 0.019 0.030 0.047 0.069 0.097 0.129 0.152 0.129 0.098 0.070 0.047 0.031 0.019 0.016 
[   5/ 291]         layers.0.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.035 0.012 0.020 0.032 0.049 0.072 0.098 0.125 0.139 0.125 0.099 0.072 0.050 0.033 0.021 0.017 
[   6/ 291]         layers.0.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.037 0.055 0.075 0.096 0.114 0.124 0.114 0.096 0.075 0.055 0.038 0.024 0.020 
[   7/ 291]         layers.0.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.013 0.021 0.033 0.051 0.073 0.099 0.123 0.133 0.123 0.099 0.073 0.051 0.033 0.021 0.018 
[   8/ 291]       layers.0.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[   9/ 291]      layers.0.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  10/ 291]      layers.0.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  11/ 291]      layers.0.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  12/ 291]             layers.0.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
[  13/ 291]         layers.1.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.113 0.120 0.114 0.097 0.076 0.055 0.038 0.024 0.020 
[  14/ 291]         layers.1.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.037 0.024 0.020 
[  15/ 291]         layers.1.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.037 0.055 0.076 0.097 0.114 0.122 0.115 0.097 0.076 0.055 0.037 0.024 0.020 
[  16/ 291]         layers.1.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.013 0.021 0.033 0.050 0.072 0.098 0.124 0.136 0.124 0.098 0.072 0.050 0.033 0.021 0.018 
[  17/ 291]       layers.1.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[  18/ 291]      layers.1.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  19/ 291]      layers.1.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.056 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  20/ 291]      layers.1.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  21/ 291]             layers.1.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
[  22/ 291]         layers.2.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[  23/ 291]         layers.2.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.024 0.038 0.055 0.076 0.097 0.114 0.121 0.114 0.097 0.076 0.055 0.038 0.024 0.020 
[  24/ 291]         layers.2.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.076 0.096 0.112 0.119 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  25/ 291]         layers.2.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.077 0.097 0.113 0.118 0.112 0.097 0.077 0.056 0.038 0.025 0.020 
[  26/ 291]       layers.2.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[  27/ 291]      layers.2.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  28/ 291]      layers.2.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  29/ 291]      layers.2.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  30/ 291]             layers.2.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
[  31/ 291]         layers.3.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.096 0.076 0.056 0.038 0.025 0.021 
[  32/ 291]         layers.3.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.113 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020 
[  33/ 291]         layers.3.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  34/ 291]         layers.3.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.116 0.111 0.097 0.077 0.057 0.039 0.025 0.021 
[  35/ 291]       layers.3.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[  36/ 291]      layers.3.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  37/ 291]      layers.3.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.036 0.016 0.025 0.039 0.057 0.077 0.097 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  38/ 291]      layers.3.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  39/ 291]             layers.3.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
[  40/ 291]         layers.4.attention.wq.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.076 0.097 0.112 0.118 0.112 0.096 0.077 0.056 0.039 0.025 0.021 
[  41/ 291]         layers.4.attention.wk.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.038 0.056 0.076 0.097 0.112 0.119 0.112 0.097 0.076 0.056 0.038 0.025 0.020 
[  42/ 291]         layers.4.attention.wv.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.016 0.025 0.039 0.056 0.077 0.096 0.112 0.118 0.112 0.096 0.076 0.056 0.039 0.025 0.021 
[  43/ 291]         layers.4.attention.wo.weight -     4096 x  4096, type =    f16, quantizing .. size =    32.00 MB ->     9.00 MB | hist: 0.036 0.015 0.025 0.039 0.056 0.077 0.097 0.112 0.117 0.112 0.097 0.077 0.056 0.039 0.025 0.021 
[  44/ 291]       layers.4.attention_norm.weight -             4096, type =    f32, size =    0.016 MB
[  45/ 291]      layers.4.feed_forward.w1.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.116 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  46/ 291]      layers.4.feed_forward.w2.weight -    11008 x  4096, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  47/ 291]      layers.4.feed_forward.w3.weight -     4096 x 11008, type =    f16, quantizing .. size =    86.00 MB ->    24.19 MB | hist: 0.037 0.016 0.025 0.039 0.057 0.077 0.096 0.111 0.117 0.111 0.096 0.077 0.057 0.039 0.025 0.021 
[  48/ 291]             layers.4.ffn_norm.weight -             4096, type =    f32, size =    0.016 MB
.
.
.

Third question.
I also have a question about when the file is written during generation.
I thought that while the model was generating text, steering.bin would be written every time a token was generated, i.e., on every inference step, but in fact it was only written once at the beginning. Also, the values in steering.bin were all 0 after the third element. What am I misunderstanding?

I have not yet tried running inference with a vector injected at an intermediate layer, and I will ask about that in the future.

Best regards.

SlyEcho (Collaborator) commented Jun 30, 2023

Well, you should read up on what the steering experiment was about; it is not a normal operation for llama.cpp. But I brought it up because the way it works is that it first reads from some arbitrary layer's output (or input, depending on the perspective) and it also writes it back later.

The steering works in multiple passes. First the positive and negative steering strings are processed; at this time the interceptor reads the embedding vectors from the layer input (the layer number is configurable from the command line). They are accumulated into the same output vector (multiplied by +1.0 when processing the positive string, and by -1.0 for the negative one). This gives one vector, the "steering vector", which is written into the .bin file.

Then the program runs normally as it did before, except that now the steering vector is read back and added, multiplied by a user-definable coefficient (theoretically, more positive: stronger steering effect; negative: opposite effect; zero: no effect). We also experimented with injecting the steering vector into a different layer than the one it was extracted from, which may give a somewhat different effect (we didn't really have time to study this).

But the main idea of how to mess with the vectors:

  1. Create a big enough storage object in the llama context object (like a C++ vector of floats with size n_ctx * n_embd).
  2. Create a ggml tensor at the beginning of the evaluation for using the data in the llama code. It doesn't have to cover the whole context, only the N elements of the batch.
  3. Copy the existing data into the tensor using memcpy(), from the appropriate place: the context offset is n_past and the size is N.
  4. In the llama evaluation code you can do operations on tensors using the ggml functions. ggml does not calculate the numbers immediately; it builds a graph first and the computation happens later (this allows training and optimization like the big libraries TensorFlow and PyTorch do).
  5. The result has to be copied into the tensor created in step 2, and then ggml_build_forward_expand() has to be called on it.
  6. After the graph is computed by ggml_graph_compute(), copy the data into the storage created in step 1, at the appropriate location.

If you don't need to read and write at the same time, some steps are optional. For example, embedding.cpp only reads data, so it is simpler.
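
To make the six steps concrete, here is a self-contained toy that walks through them with plain ggml (mid-2023 API) outside of llama.cpp; n_ctx, n_embd, N, n_past, and the coefficient are made-up values, and the "activation" tensor only stands in for inpL/cur.

// Toy walk-through of the six steps above with plain ggml (mid-2023 API).
#include "ggml.h"
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const int n_ctx = 512, n_embd = 4096;
    const int N = 8, n_past = 16; // batch size and position of this evaluation (made up)

    // step 1: persistent storage for the whole context
    // (in llama.cpp this would live in the llama context object)
    std::vector<float> storage((size_t) n_ctx*n_embd, 0.0f);

    struct ggml_init_params ip = {
        /*.mem_size   =*/ 32u*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx0 = ggml_init(ip);

    struct ggml_cgraph gf = {};
    gf.n_threads = 4;

    // step 2: a tensor that covers only the N tokens of this batch
    struct ggml_tensor * batch = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, N*n_embd);

    // step 3: copy the stored slice for positions [n_past, n_past + N) into the tensor
    memcpy(batch->data, storage.data() + (size_t) n_past*n_embd, ggml_nbytes(batch));

    // step 4: describe the computation with ggml ops (nothing runs yet);
    // here: scale the slice by a coefficient and add it to an activation-like tensor
    struct ggml_tensor * act = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, N*n_embd);
    memset(act->data, 0, ggml_nbytes(act));
    struct ggml_tensor * result = ggml_add(ctx0, act, ggml_scale(ctx0, batch, ggml_new_f32(ctx0, 5.0f)));

    // step 5: copy the result back into the batch tensor and make it part of the graph
    ggml_build_forward_expand(&gf, ggml_cpy(ctx0, result, batch));

    ggml_graph_compute(ctx0, &gf);

    // step 6: after the graph has run, copy the data back into the persistent storage
    memcpy(storage.data() + (size_t) n_past*n_embd, ggml_get_data_f32(batch), ggml_nbytes(batch));

    printf("storage[%zu] = %f\n", (size_t) n_past*n_embd, storage[(size_t) n_past*n_embd]);

    ggml_free(ctx0);
    return 0;
}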

github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
