Commit 4409b0b

Update documentation from main repository
1 parent 919547d commit 4409b0b

1 file changed: +0 -47 lines changed

docs/stable/store/quickstart.md

Lines changed: 0 additions & 47 deletions
@@ -191,53 +191,6 @@ for output in outputs:
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```
 
-## Quantization
-
-> Note: Quantization is currently experimental, especially on multi-GPU machines. You may encounter issues when using this feature in multi-GPU environments.
-> Note: Our current capabilities do not support pre-quantization or CPU offloading, which is why other quantization methods are not available at the moment.
-
-ServerlessLLM currently supports `bitsandbytes` quantization through `transformers`.
-
-For further information, consult the [HuggingFace Documentation for Quantization](https://huggingface.co/docs/transformers/en/main_classes/quantization).
-
-### Usage
-To use quantization, create a quantization config object with your desired settings using the `transformers` format:
-
-```python
-from transformers import BitsAndBytesConfig
-import torch
-
-# For 8-bit quantization
-quantization_config = BitsAndBytesConfig(
-    load_in_8bit=True
-)
-
-# For 4-bit quantization (NF4)
-quantization_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4"
-)
-
-# For 4-bit quantization (FP4)
-quantization_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="fp4"
-)
-
-# Then load your model with the config
-model = load_model(
-    "facebook/opt-1.3b",
-    device_map="auto",
-    torch_dtype=torch.float16,
-    storage_path="./models/",
-    fully_parallel=True,
-    quantization_config=quantization_config,
-)
-```
-A full example can be found [here](https://github.com/ServerlessLLM/ServerlessLLM/blob/main/sllm_store/examples/load_quantized_transformers_model.py).
-
-For users with multi-GPU setups, ensure that the number of CUDA-visible devices is the same on both the store server and the user environment via `export CUDA_VISIBLE_DEVICES=<num_gpus>`.
-
 # Fine-tuning
 ServerlessLLM currently supports LoRA fine-tuning with `peft` through Hugging Face Transformers.
 
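The Fine-tuning section retained below the removed hunk relies on LoRA via Hugging Face PEFT. For orientation only, a minimal sketch of that approach using the standard `peft` API is shown here; the base model, rank, and target modules are illustrative assumptions, not ServerlessLLM defaults:

```python
# Minimal LoRA sketch using the standard Hugging Face `peft` API.
# The base model, rank, and target modules are illustrative assumptions,
# not values prescribed by the ServerlessLLM documentation.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

lora_config = LoraConfig(
    r=8,                                   # LoRA rank (assumed)
    lora_alpha=16,                         # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the base model so that only the LoRA adapter weights are trainable.
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```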