* add `extension` property to QuantizeConfig + EoRA Extension/Config
* test shihyang push
* match/validate correct kernel to extension
* model.quantize now returns the quantized weight for EoRA
* allow test_perplexity to run without buffered_fwd arg
* limit test to only 1 for fast debug
* reduce verbosity of logs (meant for debug)
* fix python 3.10 compat
* finish first EoRA version (not optimized; might only work for llama-type models)
* dummy (non-working) eora torch kernel
* add `BACKEND.EORA_TORCH` and correctly register the eora_torch kernel
* fix eora torch backend selection
* fix typo causing dtype mismatch
* trying to get EoRA loading to work, but it still fails
* refactor eora config/loading
* refactor eora config
* add `test_eora.py`, loading not fixed yet
* fix config loading, and quant model loading (non-lora weights) with eora config.
* load A and B weights
* fix transposed tensors for inference
* move a/b to correct device
* rename `extension` to `adapter`
* half-way done with eora
* fix eora device-mismatch bug
* fix eora v2 generation code (non-concatenated version)
* added GPTQ-EoRA kernel based on the exllama vllm GPTQ implementation
* refactor adapter a/b loading and math into the EoRA adapter and out of the kernel
* fix adapter not being copied, causing shape errors since all adapters were the same instance
* fix loader cache ci bug
* create eora_load_and_infer.py at root to avoid recompiling
* use local model dir
* load local datasets
* fix setting CUDA_DEVICE_ORDER
* add local model path
* fix merge error
* move adapter code to adapter.py
* rename EoRA to Lora
* rename `lora.path_or_id` to `lora.path` (adapter usage sketch after the sign-off block below)
* added sweep test for different k and r that conform to the condition that (128 * r / k) is an integer >= 1 (see the sketch after the sign-off block below)
* relaxed r to be any rank < k
* add default value for pack_dtype & adapter
* Revert "add default value for pack_dtype & adapter"
This reverts commit e56b86a.
* add pack_dtype & adapter for hf_select_quant_linear
* set adapter to None
* remove unexpected char
* default `name` to None and set it to the kernel name
* 1. use dict for model args. 2. accept extra args
* use dict for model args
* add lm eval tests
* use triton backend
* optimization: reordering for loop to have unrolled inner for loops
* add test_kernel_output.py
* cleanup test
Signed-off-by: Qubitium <[email protected]>
* fix: the eora kernel must use the gptq_v1 format, not the internal v2 format used by other kernels with the zeropoint offset fix
Signed-off-by: Qubitium <[email protected]>
* also skip the v1 to v2 conversion for marlin
Signed-off-by: Qubitium <[email protected]>
* enabled fused eora kernel
Signed-off-by: Qubitium <[email protected]>
* remove `eora_torch` backend (useless)
Signed-off-by: Qubitium <[email protected]>
* add test_kernel_output_with_lora()
* remove FORMAT_FIELD_COMPAT_MARLIN
* test: add BACKEND.CUDA
* EXLLAMA-EORA: add SUPPORTS_BITS 2, 3
* fix gptq_marlin error
Signed-off-by: ZX-ModelCloud <[email protected]>
* merge changes from main and fix: the v2_to_v1 conversion should be bypassed for the marlin + eora kernels
Signed-off-by: Qubitium <[email protected]>
* fix wrong eq check
* fixing kernel bug
* .
* add test file for eora kernel
* fix the eora_kernel bug
* git add .
* format
Signed-off-by: Qubitium <[email protected]>
# Conflicts:
# tests/test_kernel_output.py
* format
Signed-off-by: Qubitium <[email protected]>
* format
Signed-off-by: Qubitium <[email protected]>
---------
Signed-off-by: Qubitium <[email protected]>
Signed-off-by: ZX-ModelCloud <[email protected]>
Co-authored-by: Qubitium <[email protected]>
Co-authored-by: shihyangl <[email protected]>
Co-authored-by: nbasyl <[email protected]>
Co-authored-by: Maksim Khadkevich <[email protected]>
Co-authored-by: CSY-ModelCloud <[email protected]>
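For reference, a minimal usage sketch of the adapter API these commits converge on (`adapter` on QuantizeConfig, the `Lora` class, `lora.path`). The import paths, method names, and the model/adapter ids below are assumptions for illustration only, not taken from this log.

```python
# Sketch only: attach a Lora (EoRA) adapter to a quantize config.
# Names other than `adapter`, `Lora`, and `path` are assumed, not confirmed above.
from gptqmodel import GPTQModel, QuantizeConfig   # assumed top-level exports
from gptqmodel.adapter.adapter import Lora        # adapter.py per the commits; full module path assumed

# `path` replaces the old `path_or_id`; `rank` is the EoRA rank r (hypothetical values).
eora = Lora(path="path/to/eora_rank64", rank=64)

quant_config = QuantizeConfig(bits=4, group_size=128, adapter=eora)

# Load a quantized model with the adapter attached (method and kwarg names assumed).
model = GPTQModel.load("path/to/quantized-llama", quantize_config=quant_config)
```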
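And a small illustration of the (k, r) sweep condition from the commit list: only pairs where 128 * r / k is an integer >= 1 are kept (a later commit relaxes r to any rank < k). The function and variable names below are illustrative, not from the test suite.

```python
# Illustrative only: enumerate (k, r) pairs accepted by the sweep condition
# "(128 * r / k) is an integer >= 1".
def valid_k_r_pairs(ks, rs):
    pairs = []
    for k in ks:
        for r in rs:
            ratio = 128 * r / k
            if ratio >= 1 and ratio == int(ratio):
                pairs.append((k, r))
    return pairs

# Example: for k = 4096 the smallest admissible rank is r = 32, since 128 * 32 / 4096 == 1.
print(valid_k_r_pairs(ks=[2048, 4096], rs=[16, 32, 64, 128]))
```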
f"Exllama v2 kernel requires a float16 input activation, while {x.dtype} was passed. Casting to float16.\nMake sure you loaded your model with torch_dtype=torch.float16, that the model definition does not inadvertently cast to float32, or disable AMP Autocast that may produce float32 intermediate activations in the model."
160
+
f"Exllama EoRA kernel requires a float16 input activation, while {x.dtype} was passed. Casting to float16.\nMake sure you loaded your model with torch_dtype=torch.float16, that the model definition does not inadvertently cast to float32, or disable AMP Autocast that may produce float32 intermediate activations in the model."