I plan to PR today, though it depends on final progress.
The computation speed is slow because we have no mulmat kernel with interleaving broadcast support yet, so tests are time consuming.
Falcon has twice the vocabulary of llama; in practice that means Falcon naturally has a performance benefit of 30-40% on English text and about 20-25% on code and foreign languages.
This also means that 50 tokens/sec of Falcon speed is about as fast as 70 tokens/sec on llama in terms of language throughput.
So an 8k context window on Falcon is equivalent to a ~12k context on llama.
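Restating those estimates as rough numbers (my own arithmetic on the figures above, not a measurement):

$$
50\ \text{Falcon tok/s} \times 1.4 \approx 70\ \text{llama tok/s}, \qquad 8192\ \text{Falcon tokens} \times 1.5 \approx 12\text{k llama tokens of equivalent text}
$$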
The task: pre-processing a large input such as a book chapter, complex code, a tutorial, or a transcription of a meeting.
Now I want to be able to interview Falcon about this huge text to work with it, extend it, or transform it.
For the current work I copied the entire falcon_eval_internal() function from the current libfalcon.cpp; that's 20 kB of source code and almost exactly 7k Falcon tokens. The question asked is "<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"
I'm processing this with a high quality quantization: the 40B Q5_K (OpenAssistant).
Default
Normal Falcon result on the above question and libfalcon.cpp input:
" when, as. and for are a to:, the by , use for a and on that: a,. for it and in, this you from is. for ,,.
.' of.рен if( you they with,"
What is going on? If we look below the surface of how the model understands text, the most essential part of the relationship between tokens is the positional encoding done through "RoPE". Sounds super complicated, but actually all it is is a 2D rotation of each token based on its position in the total context.
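To make that concrete, here is a minimal standalone sketch of the rotation (illustration only, not the actual ggml/libfalcon kernel; the function name and the usual 10000 frequency base are my own assumptions):

```cpp
#include <cmath>
#include <vector>

// Rotate each pair of embedding dimensions (x0, x1) by an angle that depends
// on the token position and on the pair index; higher pairs rotate more slowly.
void rope_rotate(std::vector<float> & emb, float pos, float freq_base = 10000.0f) {
    const int n_dims = (int) emb.size();
    for (int i = 0; i + 1 < n_dims; i += 2) {
        const float theta = pos * std::pow(freq_base, -(float) i / n_dims);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = emb[i], x1 = emb[i + 1];
        emb[i]     = x0 * c - x1 * s;   // plain 2D rotation of the pair
        emb[i + 1] = x0 * s + x1 * c;
    }
}
```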
Visualization of this rotation for one embedding:
This is how the model was trained to understand relationships between tokens and sequences within a 2048-token context. I am not entirely sure why this quite tight rotation is used; I assume (hope) someone mathed those parameters out.
Beyond that 2048 context, the model quite quickly stops calculating proper attention; at 7k context it's completely braindead.
But by adapting the angle of rotation we can push it back into reality.
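A minimal sketch of that fix, assuming plain linear scaling of the position fed into the rotation sketched above (rope_rotate_scaled is a hypothetical helper, and this is just one possible variant; the branches mentioned further down try several):

```cpp
#include <vector>

void rope_rotate(std::vector<float> & emb, float pos, float freq_base); // sketch from above

// Compress positions so the enlarged target context spans the same angular
// range the model saw during its 2048-token training.
void rope_rotate_scaled(std::vector<float> & emb, int pos,
                        int n_ctx_train = 2048, int n_ctx_target = 8192) {
    const float scale = (float) n_ctx_train / (float) n_ctx_target;  // 2048/8192 = 0.25
    rope_rotate(emb, pos * scale, 10000.0f);  // position 8000 is rotated as if it were 2000
}
```

With scale = 0.25, every position in an 8k window maps back into the 0-2048 range the model was trained on, which is why the attention becomes coherent again.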
For example 8k context with a fixed scaled rotation angle:
The model output now:
"<|prompter|>Write a summary of 10 sentences covering everything this function does<|endoftext|><|assistant|>"
Sure, here's a summary of the function:
- Initialize the context struct falcon with various parameters, such as the model, input tensor, output tensor, etc.
- Check if broadcasting is enabled and whether to use it. If so, set up g_repeat accordingly for the first 3 dimensions.
- Load input tensor from disk into a tensor using gml_tensor_from_file() or create an empty tensor with gml_zeros().
- Create output tensor for embedding using gml_tensor_new_d() and initialize it with zeros if necessary.
- Initialize the current layer to zero and set its input to the previous one, i.e., self.layer = -1.
- Loop over each attention head in a sequence of length n_head:
- Load token tensor from disk into a tensor using gml_tensor_from_file() or create an empty tensor with gml_zeros().
- Normalize the token tensor using gnorm() to get the embeddings, and store it as self.data.
- Compute scores for every token in sequence using llama_forward() and store them in self.scores.
- Repeat the above steps for each token until end of sequence is reached.
- Store the scores for all tokens in a matrix with gml_tensor_mul().
- Normalize the matrix by dividing it by the sum of squares of squares, add one to avoid division by zero.
- Softmax the matrix and store result as self.data. This is the token representation.
- If embedding is enabled, load embeddings from the model.<|endoftext|>
Here is another variant:
Sure, here's a summary of what this function does:
The function performs the self-attention operation for one step in the transformer model. It takes in the input embeddings from the previous layer (inpL), the current token mask, and the query vectors for each head (Q) and computes the attention weight matrix (K). The attention weight matrix is used to compute the weighted sum of the embeddings from the previous step, scaled by sqrt(n_embd/head), and then softmaxed. This result is then multiplied with the value vector V to produce the updated embeddings for the current token, which are stored in KV. The attention is applied over the entire input sequence (N) and the final output is added to the value for each head. If broad casting is enabled, the attention weight matrix is broadcasted across all heads.
The function then takes in the memory from the previous layer (LL), rotates it by shifting the rows by (n_past-1)*head_dims and adds it to the current token embeddings to create the final output LLK. This is stored in cur and returned as the result of the function. If LLK is empty, it returns a single value from K for the first token.
The function then takes in the past tokens (pL), the attention mask, and the query vectors for each head (Q) and computes the attention weight matrix (KV). The attention weight matrix is used to compute the weighted sum of embeddings from the previous step, scaled by sqrt(n_embd/head), and then softmaxed. This result is added to the value for each token, which are stored in Q. The past tokens are also rotated by (n_past-1)*head_dims and added to K for the first token. If broad casting is enabled, the weight matrix is broadcast across all heads.
The function then takes in the attention output LLK from the previous layer, adds it to the result V, and adds the memory from LLK-1 to produce the final output LLK_merged. This is stored in cur and returned as the result of the function.
This is WIP. I currently have a bunch of different variants running that all perform a bit differently.
The amount of hallucination is striking.
The benchmark is the best OpenAI currently has to offer; of course they not only have good parameters but were also fine-tuned for that purpose. Fine-tuning is something we can do once the Falcon large-context parameters are chosen.
Turbo-16k
This function evaluates the Falcon neural network model on a given sequence of tokens.
First, it initializes various variables and parameters related to the model and the context.
Then, it creates a ggml_context and ggml_cgraph to perform the computation.
Next, it processes each layer of the model using a loop. For each layer, it performs self-attention, applies a linear transformation, and computes the output.
After all the layers have been processed, the function applies batch normalization and computes the embeddings.
Then, it applies the language modeling head to the embeddings.
Finally, it computes the logits and updates the kv token count and storage buffer.
The function also performs optional offloading to the GPU and prints timing information for each ggml operation.
It returns the logits and embeddings as output.
GPT4 at 8k:
The function falcon_eval_internal is a critical piece of the Falcon Natural Language Processing model. It processes tokens through the trained Falcon model for evaluation and returns the generated logits. The function begins by initializing various values and parameters based on the inputs and the model configurations. It then sets up a computational graph and loops through all the layers in the model. In each layer iteration, the function performs a series of mathematical operations, such as matrix multiplications, rearrangements, and normalizations, to perform self-attention on the provided tokens. For each layer, the function also manages memory buffers and uses API calls to calculate embeddings and language model heads. After the layer loop, normalization is performed on the input layer, and the language model head is computed. Before final logits can be returned, the function checks if all versus only last token logits are required and manages memory accordingly. The function concludes by measuring and tracking the time taken for execution.
Overall, Turbo as well as GPT4 provide a definitely better roundup, especially regarding hallucinations, though not super convincing in all cases either, which is also caused by the code being above the understanding of any LLM today.