> Note: Quantization is currently experimental, and you may encounter issues, especially on multi-GPU machines.
> Note: Pre-quantization and CPU offloading are not currently supported, which is why other quantization methods are not available at the moment.
ServerlessLLM currently supports `bitsandbytes` quantization through `transformers`.
For further information, consult the [HuggingFace Documentation for Quantization](https://huggingface.co/docs/transformers/en/main_classes/quantization).
### Usage
To use quantization, create a quantization config object with your desired settings using the `transformers` format:
```python
from sllm_store.transformers import load_model
from transformers import BitsAndBytesConfig
import torch

# For 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# For 4-bit quantization (NF4)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4"
)

# For 4-bit quantization (FP4)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4"
)

# Then load your model with the chosen config
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models/",
    fully_parallel=True,
    quantization_config=quantization_config,
)
```
A full example can be found [here](https://github.com/ServerlessLLM/ServerlessLLM/blob/main/sllm_store/examples/load_quantized_transformers_model.py).
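Since the object returned by `load_model` is a standard `transformers` model, inference works as usual. Below is a minimal sketch, assuming the tokenizer is fetched directly from the Hugging Face Hub (the linked example shows the complete workflow):

```python
from transformers import AutoTokenizer

# The tokenizer is loaded from the Hub as usual; only the model weights
# are served through the ServerlessLLM store.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```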
For users with multi-GPU setups, ensure that the number of CUDA-visible devices is the same on both the store server and the user environment via `export CUDA_VISIBLE_DEVICES=<num_gpus>`.
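As a quick sanity check (an illustrative snippet, not part of the ServerlessLLM API), you can run the following in both the store-server environment and the client environment and confirm the output matches:

```python
import os
import torch

# The visible device set should be identical in the store server and the client
# before loading a quantized model across multiple GPUs.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("Visible GPU count:", torch.cuda.device_count())
```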
# Fine-tuning
ServerlessLLM currently supports LoRA fine-tuning through the Hugging Face PEFT (`peft`) library.
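As a rough illustration (a minimal sketch using the standard `peft` API; the parameter values are placeholders rather than ServerlessLLM defaults), a LoRA adapter configuration looks like this:

```python
from peft import LoraConfig

# Illustrative LoRA settings for a causal language model; tune for your model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```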