
Commit 7939d1a

Authored by Qubitium, CSY-ModelCloud, ZX-ModelCloud, and nbasyl
Eora (#1302)
* fix override
* simplify
* fix missing `modules` item
* breaking: fix module.state update
* fix state should contain both W and WQ
* fix no super() for class obj
* remove get attr
* call LoopProcessor.post_process() Signed-off-by: ZX-ModelCloud <[email protected]>
* call processor.finalize
* Correctly call methods from self.gptq_model Signed-off-by: ZX-ModelCloud <[email protected]>
* rename to calibration_data
* cleanup pack()..no need to clone weights..use T instead of t()
* LoopProcessor add model_finalize() Signed-off-by: ZX-ModelCloud <[email protected]>
* cleanup pack()..rename var for clarity
* pop wq from state
* clean code..de-indent logic
* add safety code to store original in/out features of W in NamedModule state since the weight will be heavily changed during quant
* add stats() api and stats fields to processor
* ruff
* Fix circular import Signed-off-by: ZX-ModelCloud <[email protected]>
* add license
* add clearml back
* fix NamedModule.__getattr__() error Signed-off-by: ZX-ModelCloud <[email protected]>
* add `require_fwd` property to processor
* simplify
* fix cannot set weight.data to None
* fix the error that tasks is empty Signed-off-by: ZX-ModelCloud <[email protected]>
* add todo
* fix parameter position & name
* fix import
* fix named module override
* fix __dict__ name error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix module type error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix layer_inputs index out of range Signed-off-by: ZX-ModelCloud <[email protected]>
* rename
* add lm_head quantize config Signed-off-by: ZX-ModelCloud <[email protected]>
* pop `w` at submodule finalize
* simplify...quantize should only be called once
* release quantizer for module on post_process
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* refactor
* cleanup
* fix circular import Signed-off-by: ZX-ModelCloud <[email protected]>
* refactor quantize() args and override
* Fix GPTQProcessor log Signed-off-by: ZX-ModelCloud <[email protected]>
* fix wrong damp_percent returned
* return log Signed-off-by: ZX-ModelCloud <[email protected]>
* fix hf api compat
* use const, not str
* rename to `finalize`
* fix import
* rename quantize() to quantize_old() Signed-off-by: ZX-ModelCloud <[email protected]>
* fix import
* If calibration_dataset is None or Empty, the input_cache of the previous processor is used Signed-off-by: ZX-ModelCloud <[email protected]>
* add fixme for hf api compat of fasterquant
* add EoraConfig Signed-off-by: ZX-ModelCloud <[email protected]>
* remove .module
* add eora processor
* fix misc
* fix misc
* fix isinstance can't check subclass
* fix lora config storage
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* change name to class method
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* format
* fix adapter.name() should be classmethod
* fix eora logging
* move all eora test code into eora_test (pending removal)
* move eora algorithm to nvidia licensed eora file
* remove unused
* fix hf api compat for quantize()
* use EoraProcessor() Signed-off-by: ZX-ModelCloud <[email protected]>
* fix processor.num_batches setting Signed-off-by: ZX-ModelCloud <[email protected]>
* async move wq to cpu
* fix not a python package
* fix exllama was not compiled
* add async move for gptq processor
* move prepare_dataset() to LoopProcessor Signed-off-by: ZX-ModelCloud <[email protected]>
* add release_calibration_dataset() Signed-off-by: ZX-ModelCloud <[email protected]>
* update error for lm_head and model with tied_weights=True
* consolidate dynamic skipped logic
* Fix eigen_scaling_diag_matrix not initialized Signed-off-by: ZX-ModelCloud <[email protected]>
* Fix subset repeated quantization Signed-off-by: ZX-ModelCloud <[email protected]>
* add processed_subset Signed-off-by: ZX-ModelCloud <[email protected]>
* Fix the error that the type of wq obtained is tuple Signed-off-by: ZX-ModelCloud <[email protected]>
* fix weight.data should not be moved to cpu for process code
* del and overwrite is the same for gc
* Fix layer_inputs where the last layer is empty Signed-off-by: ZX-ModelCloud <[email protected]>
* cleanup
* use Lora.name() class method for mapping
* fix adapter save and load Signed-off-by: ZX-ModelCloud <[email protected]>
* move `quant_result` from gptq_process to base loop_process as `_results`
* add `stream: bool` toggle in `move_to` for Tensors type only
* format
* compat: make sure lora key can be found for all HF AutoModel api
* save eora and test
* fix streaming
* fix compat loading for hf names
* fix BitBLASQuantLinear's adapter argument error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix ugly mess in lm_eval integration, vars mismatch, type mis-match
* remove util.eval calls.. always use GPTQModel.eval()
* rename eval backend to llm_backend and add real gptqmodel specific backend var
* add gen_kwargs
* use ellama v2 for lm-eval and use acc_norm only
* use ellama v2 for lm-eval and use acc_norm only
* fix ci test
* comment out special kernels
* fix Lora.apply() error when batched generate Signed-off-by: ZX-ModelCloud <[email protected]>
* fix compile
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* fix `generate()` not applying correct pad_token_id from tokenizer
* protect against null (Optional) tokenizer
* cleanup compile
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* fix cuda kernel
* disable eora kernels except for torch
* add `adapter` control/override in `quantize()`
* remove quantize_config.eora_dataset property
* patch evalplus to allow passing a model directly
* change test to pass adapter on GPTQModel.load(). Since `adapter` config is not saved in model config.json and quantize_config.json, we need to always pass `adapter` to enable gptq/lora/eora
* Fix module.bias not being able to be assigned Signed-off-by: ZX-ModelCloud <[email protected]>
* comment
* print Adapter loaded post-init so user knows adapter is correctly loaded from disk
* fix evalplus oom
* fix ci tests..random seed consolidated into one var
* fix ci tests
* disable streaming and fix ci test
* add base vs eora arc-challenge benchmarks to eora test
* fix module.compile overriding nn.module compile. rename to `g_compile`
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* rename `g_compile` to `opimize`
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* refactor eora_generate() Signed-off-by: ZX-ModelCloud <[email protected]>
* fix argument error Signed-off-by: ZX-ModelCloud <[email protected]>
* add `kernels()` api to see which kernels have been loaded at end of model load
* add DequantizeProcessor
* add DequantizeProcessor
* refactor: add `retrain_w` option to GPTQProcessor
* cleanup
* comments
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* Fix Assignment Error Signed-off-by: ZX-ModelCloud <[email protected]>
* DequantizeProcessor does not perform any operations on dataset Signed-off-by: ZX-ModelCloud <[email protected]>
* refactor: upcast w to float32 before delta calculation in case of bfloat16 and float16 mismatch
* fix wrong assert (reversed)
* cleanup
* fix summary log Signed-off-by: ZX-ModelCloud <[email protected]>
* call eora_save() Signed-off-by: ZX-ModelCloud <[email protected]>
* fix argument name error Signed-off-by: ZX-ModelCloud <[email protected]>
* add code for assert eora weight Signed-off-by: ZX-ModelCloud <[email protected]>
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* add test_eora_post_quant() Signed-off-by: ZX-ModelCloud <[email protected]>
* clean up `test_quant_erao` so we have config at top and print config before lm-eval results # Conflicts: # tests/test_quant_and_eora.py
* add test_eora_post_quant.py Signed-off-by: ZX-ModelCloud <[email protected]>
* default to group_size 128 for test. group_size 64 has strange regression
* rename
* refactor api to `GPTQModel.adapter.generate`
* cleanup
* cleanup
* avoid converting to scalar via item() as torch.compile doesn't like it
* try to speed things for eora gen with compile
* increase cache and disable scalar captures
* use local model path
* revert making adapter a module
* use torch_compile helper instead of torch.compile
* use torch_compile helper instead of torch.compile
* move dequantize_weight() to PackableQuantLinear Signed-off-by: ZX-ModelCloud <[email protected]>
* bump intel_extension_for_pytorch to 2.6.0 & remove pack() for ipex & remove xpu check for fp16
* Revert "move dequantize_weight() to PackableQuantLinear" This reverts commit b5d311d.
* merge main's eval() changes
* push `wf` and dequantize code into packable. refactor ipex to be based on torch kernel # Conflicts: # gptqmodel/nn_modules/qlinear/ipex.py
* eora has been moved to eora-copy branch
* fix test didn't pass any model
* add register_buffers to init
* remove unused args
* revert register_buffers changes
* revert deleting eora dir
* remove eora test code
* update eora license to apache and attribute nvidia/arxiv
* Eora_main branch merge to Eora (#1301)
* fix type hint
* update warning msg
* update eora license to apache and attribute nvidia/arxiv
* remove early eora test files
* ipex doesn't need to pass register_buffers to Torch
* refactor ipex
* refactor ipex2
* fix typo
* make ipex packable & add missing register_buffers
* cleanup ipex, add lora + bias check
* remove duplicated codes
* ignore two folders for pytest
* fix test lora. fix wrong tokenizer type
* compile adapter
* Fix `generation_config.json` not auto-saved (#1292)
* Fix `generation_config.json` not auto-saved
* Update writer.py
* update transformers 4.49.0
* [CI] update ci for requirements installation
* [CI] don't update intel_extension_for_pytorch for now
* [CI] remove ipex
* correct name backend to exllama_eora
* use hf save hack to fix config saves
* fix param name changed
* [SAVE] Save config files with empty state dict (#1293)
* Save model and config files with empty state dict
* cleanup
* cleanup
* print lora adapter loaded count vs total number of quantized modules
* print lora adapter loaded count vs total number of quantized modules
* fix wrong model.save
* Test GSM8K
* patch __repr__ for evalplus
* Save processor related config files. For example: preprocessor_config.json, chat_template.json (#1295)
* Fix adapter/eora for ipex kernel
* Fix eora for ipex/marlin
* Clean eora for exllama v1/v2
* fix shape does not match in Backend.Marlin
* add comment
* type hint use torch.dtype instead of torch.float32
* get _supports_flash_attn_2 from transformers
* fix prepare_dataset() error
* add color to logs
* fix ci: lm_head test
* fix pb and logging conflicting on output
* refactor logging/pb
* move wf_ buffer to post_init
* fix logger + pb compat
* rename pb.set_description to pb.info
* fix progressbar padding so cli ui width is stable
* add progressbar test
* fix progressbar display at close()/end
* todo fixme for pb
* fix pb display at end of iterable
* fix pb: reserve 1 char for cursor and remove external dependency
* fix pb: render end
* fix minicpm layer_modules error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix sharded models were deleted
* fix wrong order of config save causing sharded tensors to be removed (#1297)
* fix wrong order of config save causing zero tensors
* add processor to config block
* check for ProcessorMixin before calling save
* sync with main..fix save
* clean logs
* [CI] install color log
* fix hf is doing config validation on save which cause model save failure
* [FIX] not pack when group_size=-1 (#1298)
* Fix skipping pack() when group_size = -1
* assert len(qModules) > 0
* Update __init__.py
* Update __init__.py
---------
Co-authored-by: Qubitium-ModelCloud <[email protected]>
* disable eora kernel until validated
* [CI] clean evalplus cache
* [CI] fix colorlog for xpu
* fix merge error
* ruff
---------
Signed-off-by: ZX-ModelCloud <[email protected]>
Co-authored-by: CSY <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
* remove unused eora kernel Signed-off-by: Qubitium <[email protected]>
* remove unused eora kernel Signed-off-by: Qubitium <[email protected]>
* apply bias after eora adapter Signed-off-by: Qubitium <[email protected]>
* add new bits test
* revert bad commit. cannot use logic true/false on self.bias directly since boolean tensor (multi-value) is not supported (conflicting) Signed-off-by: Qubitium <[email protected]>
* revert bad commit. cannot use logic true/false on self.bias directly since boolean tensor (multi-value) is not supported (conflicting) Signed-off-by: Qubitium <[email protected]>
* not do pad
* fix var name not exists
* missed pad code removal Signed-off-by: Qubitium <[email protected]>
* removing padding code like torch kernel for triton Signed-off-by: Qubitium <[email protected]>
* fix var rename Signed-off-by: Qubitium <[email protected]>
* start deprecation of DynamicCuda kernel. Do not allow it to be auto-selected. Signed-off-by: Qubitium <[email protected]>
* do not log too verbose json result on cli Signed-off-by: Qubitium <[email protected]>
* Fix `do_sample` config errors on load (also fixed config save). Fix `generation_config.json` is not loaded post-quantization Signed-off-by: Qubitium <[email protected]>
* log only class simple name Signed-off-by: Qubitium <[email protected]>
* fix old transformer compat Signed-off-by: Qubitium <[email protected]>
* fix vllm doesn't have can_generate
* refactor: hf auto config fix Signed-off-by: Qubitium <[email protected]>
* log txt changes Signed-off-by: Qubitium <[email protected]>
* disable auto-padding in exllama kernels Signed-off-by: Qubitium <[email protected]>
* falcon is merged into HF, does not need trust_remote=True Signed-off-by: Qubitium <[email protected]>
* fix deepseek2-lite ci test, add `layer_modules_strict: bool` control to model defs Signed-off-by: Qubitium <[email protected]>
* fix deepseek v2-lite again: do not process already processed module Signed-off-by: Qubitium <[email protected]>
* merge deepseek v2 possible layer_modules into single def Signed-off-by: Qubitium <[email protected]>
* revert partial looper change now that deepseek v2 layer_modules are merged Signed-off-by: Qubitium <[email protected]>
* set default data size to 256
* fix self.in_features was not set
* [CI] use latest CI docker image
* [CI] install colorlog
* Correctly use torch.no_grad() to avoid OOM when quantizing VL model
* fix vllm doesn't have named_children()
* [CI] pass exclusive for gpu service
* revert module check for vllm
* if model is not a nn.Module, skip finding
* fix checking
* fix env must be before torch imports Signed-off-by: Qubitium <[email protected]>
* move PYTORCH_ENABLE_MPS_FALLBACK to top
* ovis model requires transformers<=4.48.3
* print expected value
* [CI] fix names
* [CI] fix xpu env reinstalled torch
* torch kernel will enable compile optimizations by default for torch 2.6.0 Signed-off-by: Qubitium <[email protected]>
* fix transformers compat Signed-off-by: Qubitium <[email protected]>
* disable exllama kernel from quantization (remove from packable) Signed-off-by: Qubitium <[email protected]>
* fix evalplus try toString a Decoder
* replace subprocess run by raising an error
* fix ci test_dynamic scores Signed-off-by: Qubitium <[email protected]>
* cleanup eora test Signed-off-by: Qubitium <[email protected]>
* fix sglang's transformers error
* OVIS is compatible with transformers v4.49.0
* move ipex to new test files
* Update ovis.py
* decrease batch to 16
* format Signed-off-by: Qubitium <[email protected]>
* logs Signed-off-by: Qubitium <[email protected]>
* fix ci lora config test Signed-off-by: Qubitium <[email protected]>
* fix ci: dynamic Signed-off-by: Qubitium <[email protected]>
* fix ci: opt expects exllama when triton is used for quant Signed-off-by: Qubitium <[email protected]>
* fix ci: transformers test oom Signed-off-by: Qubitium <[email protected]>
* Add some comments to eora.py
* add comments to eora.py
---------
Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: Qubitium <[email protected]>
Co-authored-by: CSY <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: LIU, Shih-Yang <[email protected]>
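For context, a minimal sketch of the quantize-then-load flow this PR enables, based only on the commit message above: the `adapter` override in `quantize()`, the need to pass `adapter` again on `GPTQModel.load()` because it is not persisted in config.json or quantize_config.json, and the new `kernels()` api. The `Lora` import path, its `path`/`rank` parameters, and the model id are assumptions, not taken from this commit.

```python
# Sketch only: import path and Lora(...) parameters are assumed, not verified against this commit.
from gptqmodel import GPTQModel, QuantizeConfig
from gptqmodel.adapter.adapter import Lora  # assumed module path

quant_path = "./llama-3.2-1b-gptq-4bit"           # hypothetical output dir
eora = Lora(path=f"{quant_path}/eora", rank=128)  # assumed constructor arguments

calibration_dataset = ["gptqmodel is an easy-to-use llm quantization toolkit."]  # toy data

# Quantize with the EoRA adapter enabled via the `adapter` override in quantize()
model = GPTQModel.load("meta-llama/Llama-3.2-1B", QuantizeConfig(bits=4, group_size=128))
model.quantize(calibration_dataset, adapter=eora)
model.save(quant_path)

# The adapter config is not saved alongside the model, so it must be passed on every load
model = GPTQModel.load(quant_path, adapter=eora)
print(model.kernels())  # new api: shows which kernels were selected at load time
```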
1 parent 728d593 commit 7939d1a


88 files changed: 4505 additions, 977 deletions

.github/workflows/unit_tests.yml

Lines changed: 61 additions & 31 deletions
@@ -61,8 +61,7 @@ env:
 PYTORCH_CUDA_ALLOC_CONF: 'expandable_segments:True'
 MAX_JOBS: 8
 RUNNER: 10.0.13.31
-TRANSFORMERS_DIFF_TESTS: "models/test_internlm.py,models/test_internlm2_5.py,models/test_xverse.py"
-TORCH_2_5_TESTS: "test_evalplus.py,test_perplexity.py,test_q4_ipex.py,test_ipex_xpu.py,test_save_loaded_quantized_model.py,test_quant_formats.py,models/test_hymba.py"
+LEGACY_TESTS: "models/test_internlm.py,models/test_internlm2_5.py,models/test_xverse.py"
 IGNORED_TEST_FILES: "test_tgi.py,test_gptneox.py,models/test_mixtral.py,models/test_phi_3_moe.py"
 GPTQMODEL_FORCE_BUILD: 1
 repo: ${{ github.event.inputs.repo || github.repository }}
@@ -139,15 +138,15 @@ jobs:
 import os
 import re
 
-TRANSFORMERS_DIFF_TESTS = '${TRANSFORMERS_DIFF_TESTS}'
+LEGACY_TESTS = '${LEGACY_TESTS}'
 IGNORED_TEST_FILES = '${IGNORED_TEST_FILES}'
 
 TEST_NAMES='${{ github.event.inputs.test_names }}'
 TEST_REGEX='${{ github.event.inputs.test_regex }}'
 
 input_test_files_list = [f.strip().removesuffix('.py') for f in TEST_NAMES.split(',') if f.strip()]
 
-transformers_test_files = [f.strip().removesuffix('.py') for f in f'{TRANSFORMERS_DIFF_TESTS}'.split(',') if f.strip()]
+transformers_test_files = [f.strip().removesuffix('.py') for f in f'{LEGACY_TESTS}'.split(',') if f.strip()]
 transformers_test_files = [f for f in transformers_test_files if not input_test_files_list or f in input_test_files_list]
 
 all_tests = [f.removesuffix('.py') for f in os.listdir('tests/') if f.startswith('test_') and f.endswith('.py') and f.strip().removesuffix('py') not in f'{IGNORED_TEST_FILES}']
@@ -190,8 +189,8 @@ jobs:
 
 echo "Conditions:"
 echo "will build run: ${{ github.event.inputs.m4-only != 'true' && needs.list-test-files.outputs.torch-files != '[]' && needs.list-test-files.outputs.transformers-files != '[]' && !(needs.list-test-files.outputs.m4-files == '[]' && needs.list-test-files.outputs.m4-files == '[]') }}"
-echo "will transformers_diff run: ${{ (needs.build.result == 'success' || github.event.inputs.artifact_id != '') && github.event.inputs.m4-only != 'true' && needs.list-test-files.outputs.transformers-files != '[]' }}"
-echo "will torch2_5 run: ${{ (needs.build.result == 'success' || github.event.inputs.artifact_id != '') && github.event.inputs.m4-only != 'true' && needs.list-test-files.outputs.torch-files != '[]' }}"
+echo "will legacy run: ${{ (needs.build.result == 'success' || github.event.inputs.artifact_id != '') && github.event.inputs.m4-only != 'true' && needs.list-test-files.outputs.transformers-files != '[]' }}"
+echo "will torch run: ${{ (needs.build.result == 'success' || github.event.inputs.artifact_id != '') && github.event.inputs.m4-only != 'true' && needs.list-test-files.outputs.torch-files != '[]' }}"
 echo "will m4 run: ${{ (github.event.inputs.test_names == '' || contains(github.event.inputs.test_names, 'apple') || contains(github.event.inputs.test_names, 'mlx') ) && (needs.list-test-files.outputs.m4-files != '' || needs.list-test-files.outputs.m4-files != '[]') }}"
 
 build:
@@ -201,7 +200,13 @@ jobs:
 - list-test-files
 if: github.event.inputs.m4-only != 'true' && (needs.list-test-files.outputs.torch-files != '[]' || needs.list-test-files.outputs.transformers-files != '[]')
 container:
-image: ${{ needs.check-vm.outputs.ip }}:5000/modelcloud/gptqmodel:github-ci-v5
+image: ${{ needs.check-vm.outputs.ip }}:5000/modelcloud/gptqmodel:github-ci-v7
+options: --device /dev/dri --ipc=host --runtime=nvidia --gpus all
+volumes:
+- /dev/dri/by-path:/dev/dri/by-path
+- /home/ci/models:/monster/data/model
+- /home/ci/models/huggingface:/github/home/.cache/huggingface
+
 steps:
 - name: Checkout Codes
 uses: actions/checkout@v4
@@ -286,15 +291,15 @@ jobs:
 if: always()
 run: pip cache purge && uv cache clean && rm -rf ./* ./.*
 
-transformers_diff:
+legacy:
 needs:
 - build
 - list-test-files
 - check-vm
 runs-on: [ self-hosted, xeon5 ]
 if: always() && !cancelled() && (needs.build.result == 'success' || github.event.inputs.artifact_id != '') && github.event.inputs.m4-only != 'true' && needs.list-test-files.outputs.transformers-files != '[]'
 container:
-image: ${{ needs.check-vm.outputs.ip }}:5000/modelcloud/gptqmodel:github-ci-v5
+image: ${{ needs.check-vm.outputs.ip }}:5000/modelcloud/gptqmodel:github-ci-v7
 volumes:
 - /home/ci/models:/monster/data/model
 - /home/ci/models/huggingface:/github/home/.cache/huggingface
@@ -383,7 +388,7 @@ jobs:
 
 - name: Install wheel
 run: |
-uv pip install git+https://github.com/ModelCloud/Tokenicer -U
+uv pip install colorlog git+https://github.com/ModelCloud/Tokenicer -U
 echo "===== install optimum bitblas parameterized uvicorn ====="
 uv pip install optimum bitblas==0.0.1.dev13 parameterized uvicorn -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
 echo "===== install dist/whl ====="
@@ -407,10 +412,10 @@ jobs:
 gpu_id=-1
 
 while [ "$gpu_id" -lt 0 ]; do
-gpu_id=$(curl -s "http://${{ needs.check-vm.outputs.ip }}/gpu/get?id=${{ github.run_id }}&timestamp=$timestamp&test=${{ matrix.test_script }}&runner=${RUNNER_NAME}")
+gpu_id=$(curl -s "http://${{ needs.check-vm.outputs.ip }}/gpu/get?id=${{ github.run_id }}&timestamp=$timestamp&test=${{ matrix.test_script }}&runner=${RUNNER_NAME}&exclusive=${{ github.event.inputs.exclusive-gpu }}")
 
 if [ "$gpu_id" -lt 0 ]; then
-echo "http://${{ needs.check-vm.outputs.ip }}/gpu/get?id=${{ github.run_id }}&timestamp=$timestamp&test=${{ matrix.test_script }}&runner=${RUNNER_NAME} returned $gpu_id"
+echo "http://${{ needs.check-vm.outputs.ip }}/gpu/get?id=${{ github.run_id }}&timestamp=$timestamp&test=${{ matrix.test_script }}&runner=${RUNNER_NAME}&exclusive=${{ github.event.inputs.exclusive-gpu }} returned $gpu_id"
 echo "No available GPU, waiting 5 seconds..."
 sleep 5
 else
@@ -441,15 +446,15 @@ jobs:
 if: always()
 run: pip cache purge && uv cache clean && rm -rf ./* ./.*
 
-torch2_5:
+torch:
 needs:
 - build
 - list-test-files
 - check-vm
 runs-on: [ self-hosted, xeon5 ]
 if: always() && !cancelled() && (needs.build.result == 'success' || github.event.inputs.artifact_id != '') && github.event.inputs.m4-only != 'true' && needs.list-test-files.outputs.torch-files != '[]'
 container:
-image: ${{ needs.check-vm.outputs.ip }}:5000/modelcloud/gptqmodel:github-ci-v5
+image: ${{ needs.check-vm.outputs.ip }}:5000/modelcloud/gptqmodel:github-ci-v7
 options: --device /dev/dri --ipc=host --runtime=nvidia --gpus all
 volumes:
 - /dev/dri/by-path:/dev/dri/by-path
@@ -541,52 +546,75 @@ jobs:
 
 - name: Install wheel
 run: |
-if [ "${{ matrix.test_script }}" == "test_quant_formats" ] || [ "${{ matrix.test_script }}" == "test_perplexity" ]; then
-echo "===== install auto_round ====="
-uv pip install auto_round -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
+uv pip install -U transformers colorlog
+if [ "${{ matrix.test_script }}" == "test_quant_formats" ] || [ "${{ matrix.test_script }}" == "test_perplexity" ] || [ "${{ matrix.test_script }}" == "test_q4_bitblas" ]; then
+echo "===== install auto_round bitblas==0.0.1.dev13 ====="
+uv pip install auto_round bitblas==0.0.1.dev13 -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
 fi
+
 if [ "${{ matrix.test_script }}" == "models/test_cohere2" ] || [ "${{ matrix.test_script }}" == "models/test_gemma" ]; then
 echo "===== install transformers from git ====="
-uv pip install -U git+https://github.com/huggingface/transformers.git -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
+uv pip install -U transformers -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
 fi
+
 if [[ "${{ matrix.test_script }}" == *xpu* ]]; then
+echo "===== switching to xpu env ====="
 source /etc/profile.d/pyenv.sh && pyenv activate xpu
+uv pip install colorlog
+fi
+
+if [[ "${{ matrix.test_script }}" == "test_sglang.py" ]]; then
+uv pip install transformers==4.48.3
+fi
+
+if [[ "${{ matrix.test_script }}" == *ipex* ]] && [[ "${{ matrix.test_script }}" != *xpu* ]]; then
+uv pip uninstall torchvision torch flash_attn # fix ipex can't be used with torch+cu126
+uv pip install torchvision torch
+uv pip install -U intel_extension_for_pytorch -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
 fi
 
 if [[ "${{ matrix.test_script }}" == *"mlx"* ]]; then
 uv pip install mlx_lm --no-build-isolation -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
 fi
+
 if [[ "${{ matrix.test_script }}" == "test_modelscope" ]]; then
+echo "===== installing modelscope ====="
 uv pip install modelscope --no-build-isolation -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
 fi
 
-echo "===== install dist/whl ====="
 uv pip install git+https://github.com/ModelCloud/Tokenicer -U
-uv pip install dist/*.whl -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
+
+# ipex doesn't need to compile kernels. xpu can't install cuda package
+if [[ "${{ matrix.test_script }}" != *ipex* && "${{ matrix.test_script }}" != *xpu* ]]; then
+echo "===== install dist/whl ====="
+uv pip install dist/*.whl -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }} --extra-index-url https://pypi.org/simple
+else
+echo "===== install with local files for xpu env ====="
+export CUDA_VISIBLE_DEVICES=""
+unset TORCH_CUDA_ARCH_LIST
+uv pip install . --no-build-isolation
+fi
 
 if [ "${{ matrix.test_script }}" == "test_transformers" ]; then
 echo "===== install optimum from git ====="
 uv pip install -U git+https://github.com/huggingface/optimum.git -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }}
-echo "===== install transformers from git ====="
-uv pip install -U git+https://github.com/huggingface/transformers.git -i http://${{ needs.check-vm.outputs.ip }}/simple/ --trusted-host ${{ needs.check-vm.outputs.ip }}
-uv pip install torch==2.5.1 # fix optimum will install torch 2.6.0
 fi
 
 if [[ "${{ matrix.test_script }}" == "test_sglang" ]]; then
 uv pip install numpy==1.26.3
 fi
 
 - name: Find suitable GPU
-if: ${{ !contains(matrix.test_script, 'ipex') && !cancelled() }}
+if: ${{ !contains(matrix.test_script, 'ipex') && !contains(matrix.test_script, 'xpu') && !cancelled() }}
 run: |
 timestamp=$(date +%s%3N)
 gpu_id=-1
 
 while [ "$gpu_id" -lt 0 ]; do
-gpu_id=$(curl -s "http://${{ needs.check-vm.outputs.ip }}/gpu/get?id=${{ github.run_id }}&timestamp=$timestamp&test=${{ matrix.test_script }}&runner=${RUNNER_NAME}")
+gpu_id=$(curl -s "http://${{ needs.check-vm.outputs.ip }}/gpu/get?id=${{ github.run_id }}&timestamp=$timestamp&test=${{ matrix.test_script }}&runner=${RUNNER_NAME}&exclusive=${{ github.event.inputs.exclusive-gpu }}")
 
 if [ "$gpu_id" -lt 0 ]; then
-echo "http://${{ needs.check-vm.outputs.ip }}/gpu/get?id=${{ github.run_id }}&timestamp=$timestamp&test=${{ matrix.test_script }}&runner=${RUNNER_NAME} returned $gpu_id"
+echo "http://${{ needs.check-vm.outputs.ip }}/gpu/get?id=${{ github.run_id }}&timestamp=$timestamp&test=${{ matrix.test_script }}&runner=${RUNNER_NAME}&exclusive=${{ github.event.inputs.exclusive-gpu }} returned $gpu_id"
 echo "No available GPU, waiting 5 seconds..."
 sleep 5
 else
@@ -617,21 +645,23 @@ jobs:
 curl "http://${{ needs.check-vm.outputs.ip }}/gpu/log_test_vram?id=${{ github.run_id }}&gpu=${{ env.CUDA_VISIBLE_DEVICES }}&range=$execution_time&unit=second&test=${{ matrix.test_script }}"
 
 - name: Release GPU
-if: always() && !contains(matrix.test_script, 'ipex')
+if: always() && !contains(matrix.test_script, 'ipex') && !contains(matrix.test_script, 'xpu')
 run: curl -X GET "http://${{ needs.check-vm.outputs.ip }}/gpu/release?id=${{ github.run_id }}&gpu=${{ env.CUDA_VISIBLE_DEVICES }}&timestamp=${{ env.STEP_TIMESTAMP }}&test=${{ matrix.test_script }}&runner=${RUNNER_NAME}"
-
+
 - name: Clean cache
 if: always()
-run: pip cache purge && uv cache clean && rm -rf ./* ./.*
+run: |
+rm ~/.cache/evalplus/*pkl || true
+pip cache purge && uv cache clean && rm -rf ./* ./.*
 
 show-statistics:
 runs-on: [ self-hosted, xeon5 ]
 if: github.event.inputs.exclusive-gpu != 'true'
 container:
 image: modelcloud/gptqmodel:alpine-ci-v1
 needs:
-- transformers_diff
-- torch2_5
+- legacy
+- torch
 steps:
 - name: Print statistics
 run: curl "http://10.0.14.248/gpu/get_vram_logs?id=${{ github.run_id }}"

README.md

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 
 ## News
 * 02/12/2025 [1.9.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.9.0): ⚡ Offload `tokenizer` fixes to [Toke(n)icer](https://github.com/modelcloud/tokenicer) pkg. Optimized `lm_head` quant time and vram usage.
-Optimized `DeekSeek v3/R1` model quant vram usage. Fixed `Optimum` compat regresion in `v1.8.1`. 3x speed-up for `Torch` kernel when using Pytorch >= 2.5.0 with `model.compile()`. New `calibration_dataset_concat_size` option to enable calibration data `concat` mode to mimic original GPTQ data packing strategy which may improve quant speed and accuracy for datasets like `wikitext2`.
+Optimized `DeekSeek v3/R1` model quant vram usage. Fixed `Optimum` compat regresion in `v1.8.1`. 3x speed-up for `Torch` kernel when using Pytorch >= 2.5.0 with `model.optimize()`. New `calibration_dataset_concat_size` option to enable calibration data `concat` mode to mimic original GPTQ data packing strategy which may improve quant speed and accuracy for datasets like `wikitext2`.
 * 02/08/2025 [1.8.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.8.1): ⚡ `DeekSeek v3/R1` model support. New flexible weight `packing`: allow quantized weights to be packed to `[int32, int16, int8]` dtypes.
 `Triton` and `Torch` kernels supports full range of new `QuantizeConfig.pack_dtype`.
 New `auto_gc: bool` control in `quantize()` which can reduce quantization time for small model with no chance of oom.
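As a companion to the News entries quoted in the diff above, a hedged sketch of how the named options fit together; the model id, calibration text, and numeric values are placeholders rather than documented defaults.

```python
import torch
from gptqmodel import GPTQModel, QuantizeConfig

qcfg = QuantizeConfig(
    bits=4,
    group_size=128,
    pack_dtype=torch.int32,  # 1.8.1 note: Triton/Torch kernels accept int32, int16, or int8
)

model = GPTQModel.load("facebook/opt-125m", qcfg)  # placeholder model id
model.quantize(
    ["gptqmodel is an easy-to-use llm quantization toolkit."],  # toy calibration data
    calibration_dataset_concat_size=2048,  # 1.9.0 note: concat mode mimics original GPTQ packing
    auto_gc=False,                         # 1.8.1 note: skip per-step gc for small models
)
model.save("opt-125m-gptq-4bit")
```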

examples/benchmark/generation_speed.py

Lines changed: 3 additions & 3 deletions
@@ -195,8 +195,8 @@ def load_model_tokenizer(
 def benchmark_generation_speed(model, tokenizer, examples, generation_config):
 generation_time_list = []
 num_generated_tokens_list = []
-progress_bar = ProgressBar(examples)
-for example in progress_bar:
+pb = ProgressBar(examples)
+for example in pb:
 input_ids = example["input_ids"].to(model.device)
 
 start = time.time()
@@ -217,7 +217,7 @@ def benchmark_generation_speed(model, tokenizer, examples, generation_config):
 )
 num_generated_tokens_list.append(num_generated_tokens)
 
-progress_bar.set_postfix(
+pb.set_postfix(
 num_tokens=num_generated_tokens_list[-1],
 time=generation_time_list[-1],
 speed=f"{num_generated_tokens_list[-1] / generation_time_list[-1]:.3f} tokens/s",

format/format.sh

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 cd "$(dirname "$0")" || exit
 
 # force ruff/isort to be same version as setup.py
-pip install -U ruff==0.9.5 isort==6.0.0
+pip install -U gptqmodel["quality"]
 
 ruff check ../gptqmodel/models ../gptqmodel/nn_modules ../gptqmodel/quantization ../gptqmodel/utils ../gptqmodel/__init__.py ../examples ../tests ../setup.py --fix --unsafe-fixes
 ruff_status=$?

gptqmodel/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -14,13 +14,14 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+import os
+
 from .models import GPTQModel, get_best_device
 from .quantization import BaseQuantizeConfig, QuantizeConfig
 from .utils import BACKEND
 from .utils.exllama import exllama_set_max_input_length
 from .version import __version__
 
-import os
 if os.getenv('GPTQMODEL_USE_MODELSCOPE', 'False').lower() in ['true', '1']:
 try:
 from modelscope.utils.hf_util.patcher import patch_hub
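The hunk above moves `import os` so the ModelScope toggle can be evaluated at import time. A small usage sketch, assuming only what the diff shows (the `GPTQMODEL_USE_MODELSCOPE` flag and the `patch_hub` call); the model id is a placeholder.

```python
import os

# The flag is read in gptqmodel/__init__.py, so it must be set before the first import.
os.environ["GPTQMODEL_USE_MODELSCOPE"] = "1"  # 'true' or '1' enables the ModelScope hub patch

from gptqmodel import GPTQModel  # modelscope's patch_hub() is applied during this import

model = GPTQModel.load("Qwen/Qwen2.5-0.5B-Instruct-GPTQ-Int4")  # placeholder id, resolved via ModelScope
```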

gptqmodel/adapter/__init__.py

Whitespace-only changes.
