> Note: Quantization is currently experimental, and you may encounter issues, especially on multi-GPU machines.
> Note: Pre-quantization and CPU offloading are not currently supported, which is why other quantization methods are not available at the moment.
ServerlessLLM currently supports `bitsandbytes` quantization through `transformers`.
For further information, consult the [HuggingFace Documentation for Quantization](https://huggingface.co/docs/transformers/en/main_classes/quantization).
### Usage
To use quantization, create a quantization config object with your desired settings using the `transformers` format:
```python
from sllm_store.transformers import load_model
from transformers import BitsAndBytesConfig
import torch

# For 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

# For 4-bit quantization (NF4)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4"
)

# For 4-bit quantization (FP4)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4"
)

# Then load your model with the chosen config
model = load_model(
    "facebook/opt-1.3b",
    device_map="auto",
    torch_dtype=torch.float16,
    storage_path="./models/",
    fully_parallel=True,
    quantization_config=quantization_config,
)
```
A full example can be found [here](https://github.com/ServerlessLLM/ServerlessLLM/blob/main/sllm_store/examples/load_quantized_transformers_model.py).
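Since the object returned by `load_model` is a standard `transformers` model, inference works as usual. Below is a minimal sketch, assuming the tokenizer is fetched directly from the Hugging Face Hub (the linked example shows the complete workflow):

```python
from transformers import AutoTokenizer

# The tokenizer is loaded from the Hub as usual; only the model weights
# are served through the ServerlessLLM store.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```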
For users with multi-GPU setups, ensure that the number of CUDA-visible devices is the same on both the store server and the user environment via `export CUDA_VISIBLE_DEVICES=<num_gpus>`.
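As a quick sanity check (an illustrative snippet, not part of the ServerlessLLM API), you can run the following in both the store-server environment and the client environment and confirm the output matches:

```python
import os
import torch

# The visible device set should be identical in the store server and the client
# before loading a quantized model across multiple GPUs.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("Visible GPU count:", torch.cuda.device_count())
```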
# Fine-tuning
ServerlessLLM currently supports LoRA fine-tuning through the Hugging Face PEFT (`peft`) library.
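As a rough illustration (a minimal sketch using the standard `peft` API; the parameter values are placeholders rather than ServerlessLLM defaults), a LoRA adapter configuration looks like this:

```python
from peft import LoraConfig

# Illustrative LoRA settings for a causal language model; tune for your model.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
```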