
Commit abd8e2c

Docs for lowering smaller models to MPS/CoreML/QNN

Differential Revision: [D56340028](https://our.internmc.facebook.com/intern/diff/D56340028/)
ghstack-source-id: 223154810
Pull Request resolved: #3146

1 parent 1eed125 commit abd8e2c

File tree: 1 file changed


examples/models/llama2/README.md

Lines changed: 12 additions & 2 deletions
@@ -17,9 +17,9 @@ Please note that the models are subject to the [acceptable use policy](https://g
# Results

Since the 7B Llama2 model needs at least 4-bit quantization to fit even within some of the high-end phones, the results presented here correspond to a 4-bit groupwise post-training quantized model. (At 4 bits per weight, the 7B parameters alone take roughly 3.5GB, versus about 14GB at fp16.)

For Llama3, we can use the same process. Note that it's only supported in the ExecuTorch main branch.

## Quantization:
We employed 4-bit groupwise per-token dynamic quantization of all the linear layers of the model. Dynamic quantization refers to quantizing activations dynamically, such that the quantization parameters for activations are calculated, from the min/max range, at runtime. Here we quantized activations with 8-bit signed integers. Furthermore, weights are statically quantized; in our case weights were per-channel groupwise quantized with 4-bit signed integers. For more information refer to this [page](https://github.com/pytorch-labs/ao/).
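To make the scheme concrete, an export along these lines applies it end to end. This is a hedged sketch, assuming the `export_llama` options documented for this example (`-qmode 8da4w` for 8-bit dynamic activations with 4-bit weights, `--group_size` for the groupwise granularity); the checkpoint and params paths are placeholders:

```bash
# A sketch, not the authoritative command: flags assumed from export_llama's
# documented options; checkpoint and params paths are placeholders.
python -m examples.models.llama2.export_llama \
    --checkpoint consolidated.00.pth \
    --params params.json \
    -kv -X -qmode 8da4w --group_size 128 -d fp32
```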
@@ -243,6 +243,16 @@ Please refer to [this tutorial](https://pytorch.org/executorch/main/llm/llama-de
### Android
Please refer to [this tutorial](https://pytorch.org/executorch/main/llm/llama-demo-android.html) for full instructions on building the Android LLAMA Demo App.
## Optional: Smaller models delegated to other backends
Currently we support lowering the stories model to other backends, including CoreML, MPS and QNN. Please refer to the instructions for each backend ([CoreML](https://pytorch.org/executorch/main/build-run-coreml.html), [MPS](https://pytorch.org/executorch/main/build-run-mps.html), [QNN](https://pytorch.org/executorch/main/build-run-qualcomm.html)) before trying to lower them. After the backend library is installed, the scripts to export a lowered model are:

- Lower to CoreML: `python -m examples.models.llama2.export_llama -kv --coreml -c stories110M.pt -p params.json`
- MPS: `python -m examples.models.llama2.export_llama -kv --mps -c stories110M.pt -p params.json`
- QNN: `python -m examples.models.llama2.export_llama -kv --qnn -c stories110M.pt -p params.json`

The iOS LLAMA app supports the CoreML and MPS models, and the Android LLAMA app supports the QNN model. On Android, you can also cross-compile the llama runner binary, push it to the device, and run it there, as sketched below.
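A minimal sketch of that Android flow follows, assuming an arm64 device, the Android NDK, and this repo's CMake layout. The toolchain path, build output location, binary name (`llama_main`), and runner flags are all assumptions here, so treat the Android tutorial linked above as authoritative:

```bash
# Hedged sketch: CMake options, output paths, and runner flags are assumptions.
cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a \
      -Bcmake-out-android .
cmake --build cmake-out-android -j16

# Push the runner, the exported .pte model, and the tokenizer, then run on-device.
adb push cmake-out-android/examples/models/llama2/llama_main /data/local/tmp/
adb push stories110M.pte /data/local/tmp/
adb push tokenizer.bin /data/local/tmp/
adb shell /data/local/tmp/llama_main \
    --model_path=/data/local/tmp/stories110M.pte \
    --tokenizer_path=/data/local/tmp/tokenizer.bin \
    --prompt="Once upon a time"
```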
# What is coming next?
## Quantization
- Enabling FP16 model to leverage smaller groupsize for 4-bit quantization.
