[SAVE] Save config files with empty state dict #1293
Merged
Conversation
Qubitium added a commit that referenced this pull request on Feb 19, 2025:
* fix type hint
* update warning msg
* update eora license to apache and attribute nvidia/arxiv
* remove early eora test files
* ipex doesn't need to pass register_buffers to Torch
* refactor ipex
* refactor ipex2
* fix typo
* make ipex packable & add missing register_buffers
* cleanup ipex, add lora + bias check
* remove duplicated code
* ignore two folders for pytest
* fix test lora. fix wrong tokenizer type
* compile adapter
* Fix `generation_config.json` not auto-saved (#1292)
* Fix `generation_config.json` not auto-saved
* Update writer.py
* update transformers 4.49.0
* [CI] update ci for requirements installation
* [CI] don't update intel_extension_for_pytorch for now
* [CI] remove ipex
* correct name backend to exllama_eora
* use hf save hack to fix config saves
* fix param name changed
* [SAVE] Save config files with empty state dict (#1293) (sketched below)
* Save model and config files with empty state dict
* cleanup
* cleanup
* print lora adapter loaded count vs total number of quantized modules
* print lora adapter loaded count vs total number of quantized modules
* fix wrong model.save
* Test GSM8K
* patch __repr__ for evalplus
* Save processor related config files. For example: preprocessor_config.json, chat_template.json (#1295)
* Fix adapter/eora for ipex kernel
* Fix eora for ipex/marlin
* Clean eora for exllama v1/v2
* fix shape does not match in Backend.Marlin
* add comment
* type hint use torch.dtype instead of torch.float32
* get _supports_flash_attn_2 from transformers
* fix prepare_dataset() error
* add color to logs
* fix ci: lm_head test
* fix pb and logging conflicting on output
* refactor logging/pb
* move wf_ buffer to post_init
* fix logger + pb compat
* rename pb.set_description to pb.info
* fix progressbar padding so cli ui width is stable
* add progressbar test
* fix progressbar display at close()/end
* todo fixme for pb
* fix pb display at end of iterable
* fix pb: reserve 1 char for cursor and remove external dependency
* fix pb: render end
* fix minicpm layer_modules error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix sharded models being deleted
* fix wrong order of config save causing sharded tensors to be removed (#1297)
* fix wrong order of config save causing zero tensors
* add processor to config block
* check for ProcessorMixin before calling save
* sync with main..fix save
* clean logs
* [CI] install color log
* fix hf is doing config validation on save which causes model save failure
* [FIX] not pack when group_size=-1 (#1298)
* Fix skipping pack() when group_size = -1
* assert len(qModules) > 0
* Update __init__.py
* Update __init__.py

---------

Co-authored-by: Qubitium-ModelCloud <[email protected]>

* disable eora kernel until validated
* [CI] clean evalplus cache
* [CI] fix colorlog for xpu
* fix merge error
* ruff

---------

Signed-off-by: ZX-ModelCloud <[email protected]>
Co-authored-by: CSY <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
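The headline item above, "[SAVE] Save config files with empty state dict (#1293)", reuses Hugging Face's own save path to emit config.json / generation_config.json without serializing real weights. Below is a minimal sketch of that idea using only the public transformers API; the model id and output directory are illustrative, and this is not the PR's actual writer code, so treat it as an approximation of the approach.

```python
# Hedged sketch (not the PR's code): emit config files through HF's
# save_pretrained() while passing an empty state_dict so no real weights
# need to be held or serialized. Model id / paths are illustrative.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# `state_dict` overrides what save_pretrained() would otherwise pull from the
# model; with an empty dict only the config-side files carry real content.
# The empty-dict path may behave differently across transformers versions.
model.save_pretrained("opt-125m-config-only", state_dict={})

# The narrower, always-available way to write just the config files:
model.config.save_pretrained("opt-125m-config-only")
if model.generation_config is not None:
    model.generation_config.save_pretrained("opt-125m-config-only")
```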
Qubitium added a commit that referenced this pull request on Feb 21, 2025:
* fix override
* simplify
* fix missing `modules` item
* breaking: fix module.state update
* fix state should contain both W and WQ
* fix no super() for class obj
* remove get attr
* call LoopProcessor.post_process() Signed-off-by: ZX-ModelCloud <[email protected]>
* call processor.finalize
* Correctly call methods from self.gptq_model Signed-off-by: ZX-ModelCloud <[email protected]>
* rename to calibration_data
* cleanup pack()..no need to clone weights..use T instead of t()
* LoopProcessor add model_finalize() Signed-off-by: ZX-ModelCloud <[email protected]>
* cleanup pack()..rename var for clarity
* pop wq from state
* clean code..de-indent logic
* add safety code to store original in/out features of W in NamedModule state since the weight will be heavily changed during quant
* add stats() api and stats fields to processor
* ruff
* Fix circular import Signed-off-by: ZX-ModelCloud <[email protected]>
* add license
* add clearml back
* fix NamedModule.__getattr__() error Signed-off-by: ZX-ModelCloud <[email protected]>
* add `require_fwd` property to processor
* simplify
* fix cannot set weight.data to None
* fix the error that tasks is empty Signed-off-by: ZX-ModelCloud <[email protected]>
* add todo
* fix parameter position & name
* fix import
* fix named module override
* fix __dict__ name error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix module type error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix layer_inputs index out of range Signed-off-by: ZX-ModelCloud <[email protected]>
* rename
* add lm_head quantize config Signed-off-by: ZX-ModelCloud <[email protected]>
* pop `w` at submodule finalize
* simplify...quantize should only be called once
* release quantizer for module on post_process
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* refactor
* cleanup
* fix circular import Signed-off-by: ZX-ModelCloud <[email protected]>
* refactor quantize() args and override
* Fix GPTQProcessor log Signed-off-by: ZX-ModelCloud <[email protected]>
* fix wrong damp_percent returned
* return log Signed-off-by: ZX-ModelCloud <[email protected]>
* fix hf api compat
* use const, not str
* rename to `finalize`
* fix import
* rename quantize() to quantize_old() Signed-off-by: ZX-ModelCloud <[email protected]>
* fix import
* If calibration_dataset is None or Empty, the input_cache of the previous processor is used Signed-off-by: ZX-ModelCloud <[email protected]>
* add fixme for hf api compat of fasterquant
* add EoraConfig Signed-off-by: ZX-ModelCloud <[email protected]>
* remove .module
* add eora processor
* fix misc
* fix misc
* fix isinstance can't check subclass
* fix lora config storage
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* change name to class method
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* format
* fix adapter.name() should be classmethod
* fix eora logging
* move all eora test code into eora_test (pending removal)
* move eora algorithm to nvidia licensed eora file
* remove unused
* fix hf api compat for quantize()
* use EoraProcessor() Signed-off-by: ZX-ModelCloud <[email protected]>
* fix processor.num_batches setting Signed-off-by: ZX-ModelCloud <[email protected]>
* async move wq to cpu
* fix not a python package
* fix exllama was not compiled
* add async move for gptq processor
* move prepare_dataset() to LoopProcessor Signed-off-by: ZX-ModelCloud <[email protected]>
* add release_calibration_dataset() Signed-off-by: ZX-ModelCloud <[email protected]>
* update error for lm_head and model with tied_weights=True
* consolidate dynamic skipped logic
* Fix eigen_scaling_diag_matrix not initialized Signed-off-by: ZX-ModelCloud <[email protected]>
* Fix subset repeated quantization Signed-off-by: ZX-ModelCloud <[email protected]>
* add processed_subset Signed-off-by: ZX-ModelCloud <[email protected]>
* Fix the error that the type of wq obtained is tuple Signed-off-by: ZX-ModelCloud <[email protected]>
* fix weight.data should not be moved to cpu for process code
* del and overwrite is the same for gc
* Fix layer_inputs where the last layer is empty Signed-off-by: ZX-ModelCloud <[email protected]>
* cleanup
* use Lora.name() class method for mapping
* fix adapter save and load Signed-off-by: ZX-ModelCloud <[email protected]>
* move `quant_result` from gptq_process to base loop_process as `_results`
* add `stream: bool` toggle in `move_to` for Tensors type only
* format
* compat: make sure lora key can be found for all HF AutoModel api
* save eora and test
* fix streaming
* fix compat loading for hf names
* fix BitBLASQuantLinear's adapter argument error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix ugly mess in lm_eval integration, vars mismatch, type mismatch
* remove util.eval calls.. always use GPTQModel.eval()
* rename eval backend to llm_backend and add real gptqmodel specific backend var
* add gen_kwargs
* use ellama v2 for lm-eval and use acc_norm only
* use ellama v2 for lm-eval and use acc_norm only
* fix ci test
* comment out special kernels
* fix Lora.apply() error when batched generate Signed-off-by: ZX-ModelCloud <[email protected]>
* fix compile
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* fix `generate()` not applying correct pad_token_id from tokenizer
* protect against null (Optional) tokenizer
* cleanup compile
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* fix cuda kernel
* disable eora kernels except for torch
* add `adapter` control/override in `quantize()`
* remove quantize_config.eora_dataset property
* patch evalplus to allow passing a model directly
* change test to pass adapter on GPTQModel.load(). Since `adapter` config is not saved in model config.json and quantize_config.json, we need to always pass `adapter` to enable gptq/lora/eora
* Fix module.bias not being able to be assigned Signed-off-by: ZX-ModelCloud <[email protected]>
* comment
* print Adapter loaded post-init so user knows adapter is correctly loaded from disk
* fix evalplus oom
* fix ci tests..random seed consolidated into one var
* fix ci tests
* disable streaming and fix ci test
* add base vs eora arc-challenge benchmarks to eora test
* fix module.compile overriding nn.module compile. rename to `g_compile`
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* rename `g_compile` to `opimize`
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* refactor eora_generate() Signed-off-by: ZX-ModelCloud <[email protected]>
* fix argument error Signed-off-by: ZX-ModelCloud <[email protected]>
* add `kernels()` api so user knows which kernels have been loaded at end of model load
* add DequantizeProcessor
* add DequantizeProcessor
* refactor: add `retrain_w` option to GPTQProcessor
* cleanup
* comments
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* Fix Assignment Error Signed-off-by: ZX-ModelCloud <[email protected]>
* DequantizeProcessor does not perform any operations on dataset Signed-off-by: ZX-ModelCloud <[email protected]>
* refactor: upcast w to float32 before delta calculation in case of bfloat16 and float16 mismatch (sketched after this list)
* fix wrong assert (reversed)
* cleanup
* fix summary log Signed-off-by: ZX-ModelCloud <[email protected]>
* call eora_save() Signed-off-by: ZX-ModelCloud <[email protected]>
* fix argument name error Signed-off-by: ZX-ModelCloud <[email protected]>
* add code for assert eora weight Signed-off-by: ZX-ModelCloud <[email protected]>
* cleanup Signed-off-by: ZX-ModelCloud <[email protected]>
* add test_eora_post_quant() Signed-off-by: ZX-ModelCloud <[email protected]>
* clean up `test_quant_erao` so we have config at top and print config before lm-eval results
  # Conflicts:
  #	tests/test_quant_and_eora.py
* add test_eora_post_quant.py Signed-off-by: ZX-ModelCloud <[email protected]>
* default to group_size 128 for test. group_size 64 has strange regression
* rename
* refactor api to `GPTQModel.adapter.generate`
* cleanup
* cleanup
* avoid converting to scalar via item() as torch.compile doesn't like it
* try to speed things up for eora gen with compile
* increase cache and disable scalar captures
* use local model path
* revert making adapter a module
* use torch_compile helper instead of torch.compile
* use torch_compile helper instead of torch.compile
* move dequantize_weight() to PackableQuantLinear Signed-off-by: ZX-ModelCloud <[email protected]>
* bump intel_extension_for_pytorch to 2.6.0 & remove pack() for ipex & remove xpu check for fp16
* Revert "move dequantize_weight() to PackableQuantLinear" This reverts commit b5d311d.
* merge main's eval() changes
* push `wf` and dequantize code into packable. refactor ipex to be based on torch kernel
  # Conflicts:
  #	gptqmodel/nn_modules/qlinear/ipex.py
* eora has been moved to eora-copy branch
* fix test didn't pass any model
* add register_buffers to init
* remove unused args
* revert register_buffers changes
* revert deleting eora dir
* remove eora test code
* update eora license to apache and attribute nvidia/arxiv
* Eora_main branch merge to Eora (#1301)
* fix type hint
* update warning msg
* update eora license to apache and attribute nvidia/arxiv
* remove early eora test files
* ipex doesn't need to pass register_buffers to Torch
* refactor ipex
* refactor ipex2
* fix typo
* make ipex packable & add missing register_buffers
* cleanup ipex, add lora + bias check
* remove duplicated code
* ignore two folders for pytest
* fix test lora. fix wrong tokenizer type
* compile adapter
* Fix `generation_config.json` not auto-saved (#1292)
* Fix `generation_config.json` not auto-saved
* Update writer.py
* update transformers 4.49.0
* [CI] update ci for requirements installation
* [CI] don't update intel_extension_for_pytorch for now
* [CI] remove ipex
* correct name backend to exllama_eora
* use hf save hack to fix config saves
* fix param name changed
* [SAVE] Save config files with empty state dict (#1293)
* Save model and config files with empty state dict
* cleanup
* cleanup
* print lora adapter loaded count vs total number of quantized modules
* print lora adapter loaded count vs total number of quantized modules
* fix wrong model.save
* Test GSM8K
* patch __repr__ for evalplus
* Save processor related config files. For example: preprocessor_config.json, chat_template.json (#1295)
* Fix adapter/eora for ipex kernel
* Fix eora for ipex/marlin
* Clean eora for exllama v1/v2
* fix shape does not match in Backend.Marlin
* add comment
* type hint use torch.dtype instead of torch.float32
* get _supports_flash_attn_2 from transformers
* fix prepare_dataset() error
* add color to logs
* fix ci: lm_head test
* fix pb and logging conflicting on output
* refactor logging/pb
* move wf_ buffer to post_init
* fix logger + pb compat
* rename pb.set_description to pb.info
* fix progressbar padding so cli ui width is stable
* add progressbar test
* fix progressbar display at close()/end
* todo fixme for pb
* fix pb display at end of iterable
* fix pb: reserve 1 char for cursor and remove external dependency
* fix pb: render end
* fix minicpm layer_modules error Signed-off-by: ZX-ModelCloud <[email protected]>
* fix sharded models being deleted
* fix wrong order of config save causing sharded tensors to be removed (#1297)
* fix wrong order of config save causing zero tensors
* add processor to config block
* check for ProcessorMixin before calling save
* sync with main..fix save
* clean logs
* [CI] install color log
* fix hf is doing config validation on save which causes model save failure
* [FIX] not pack when group_size=-1 (#1298)
* Fix skipping pack() when group_size = -1
* assert len(qModules) > 0
* Update __init__.py
* Update __init__.py

---------

Co-authored-by: Qubitium-ModelCloud <[email protected]>

* disable eora kernel until validated
* [CI] clean evalplus cache
* [CI] fix colorlog for xpu
* fix merge error
* ruff

---------

Signed-off-by: ZX-ModelCloud <[email protected]>
Co-authored-by: CSY <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>

* remove unused eora kernel Signed-off-by: Qubitium <[email protected]>
* remove unused eora kernel Signed-off-by: Qubitium <[email protected]>
* apply bias after eora adapter Signed-off-by: Qubitium <[email protected]>
* add new bits test
* revert bad commit. cannot use logic true/false on self.bias directly since boolean tensor (multi-value) is not supported (conflicting) Signed-off-by: Qubitium <[email protected]>
* revert bad commit. cannot use logic true/false on self.bias directly since boolean tensor (multi-value) is not supported (conflicting) Signed-off-by: Qubitium <[email protected]>
* do not pad
* fix var name that does not exist
* missed pad code removal Signed-off-by: Qubitium <[email protected]>
* removing padding code like torch kernel for triton Signed-off-by: Qubitium <[email protected]>
* fix var rename Signed-off-by: Qubitium <[email protected]>
* start deprecation of DynamicCuda kernel. Do not allow it to be auto-selected. Signed-off-by: Qubitium <[email protected]>
* do not log too verbose json result on cli Signed-off-by: Qubitium <[email protected]>
* Fix `do_sample` config errors on load (also fixed config save). Fix `generation_config.json` is not loaded post-quantization Signed-off-by: Qubitium <[email protected]>
* log only class simple name Signed-off-by: Qubitium <[email protected]>
* fix old transformer compat Signed-off-by: Qubitium <[email protected]>
* fix vllm doesn't have can_generate
* refactor: hf auto config fix Signed-off-by: Qubitium <[email protected]>
* log txt changes Signed-off-by: Qubitium <[email protected]>
* disable auto-padding in exllama kernels Signed-off-by: Qubitium <[email protected]>
* falcon is merged into HF, does not need trust_remote=True Signed-off-by: Qubitium <[email protected]>
* fix deepseek2-lite ci test, add `layer_modules_strict: bool` control to model defs Signed-off-by: Qubitium <[email protected]>
* fix deepseek v2-lite again: do not process already processed module Signed-off-by: Qubitium <[email protected]>
* merge deepseek v2 possible layer_modules into single def Signed-off-by: Qubitium <[email protected]>
* revert partial looper change now that deepseek v2 layer_modules are merged Signed-off-by: Qubitium <[email protected]>
* set default data size to 256
* fix self.in_features was not set
* [CI] use latest CI docker image
* [CI] install colorlog
* Correctly use torch.no_grad() to avoid OOM when quantizing VL Model
* fix vllm doesn't have named_children()
* [CI] pass exclusive for gpu service
* revert module check for vllm
* if model is not a nn.Module, skip finding
* fix checking
* fix env must be before torch imports Signed-off-by: Qubitium <[email protected]>
* move PYTORCH_ENABLE_MPS_FALLBACK to top
* ovis model requires transformers<=4.48.3
* print expected value
* [CI] fix names
* [CI] fix xpu env reinstalled torch
* torch kernel will enable compile optimizations by default for torch 2.6.0 Signed-off-by: Qubitium <[email protected]>
* fix transformers compat Signed-off-by: Qubitium <[email protected]>
* disable exllama kernel from quantization (remove from packable) Signed-off-by: Qubitium <[email protected]>
* fix evalplus trying to toString a Decoder
* replace subprocess run by raising an error
* fix ci test_dynamic scores Signed-off-by: Qubitium <[email protected]>
* cleanup eora test Signed-off-by: Qubitium <[email protected]>
* fix sglang's transformers error
* OVIS is compatible with transformers v4.49.0
* move ipex to new test files
* Update ovis.py
* decrease batch to 16
* format Signed-off-by: Qubitium <[email protected]>
* logs Signed-off-by: Qubitium <[email protected]>
* fix ci lora config test Signed-off-by: Qubitium <[email protected]>
* fix ci: dynamic Signed-off-by: Qubitium <[email protected]>
* fix ci: opt expects exllama when triton is used for quant Signed-off-by: Qubitium <[email protected]>
* fix ci: transformers test oom Signed-off-by: Qubitium <[email protected]>
* Add some comments to eora.py
* add comments to eora.py

---------

Signed-off-by: ZX-ModelCloud <[email protected]>
Signed-off-by: Qubitium <[email protected]>
Co-authored-by: CSY <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: ZX-ModelCloud <[email protected]>
Co-authored-by: LIU, Shih-Yang <[email protected]>
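One numerics item in the list above, "upcast w to float32 before delta calculation in case of bfloat16 and float16 mismatch", is easy to show in isolation. The snippet below is a generic reconstruction of that idea, not the PR's code; `w`, `wq`, and the shapes are illustrative. Explicitly promoting both operands to float32 keeps the residual in full precision rather than relying on whatever mixed-dtype promotion the subtraction would otherwise pick.

```python
# Hedged sketch of the "upcast before delta" idea; tensors and shapes are illustrative.
import torch

w = torch.randn(4, 8, dtype=torch.bfloat16)                # original weight, e.g. bf16
wq = (w + 0.01 * torch.randn_like(w)).to(torch.float16)    # (de)quantized copy, e.g. fp16

# Promote both sides to float32 before computing the residual so the subtraction
# happens in full precision instead of a mixed bf16/fp16 promotion.
delta = w.to(torch.float32) - wq.to(torch.float32)
print(delta.dtype)  # torch.float32
```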