For very large models, multiple GPUs may be needed for quantization, but the `max_memory` argument appears to be broken. Device placement should be handled entirely by accelerate, so this argument should not be needed. Investigate.
Deleting `max_memory=max_memory` lets it run.
Originally posted by @Xu-Chen in #48 (comment)
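
A minimal sketch of the workaround described above, assuming the loader forwards its keyword arguments to a Hugging Face `from_pretrained`-style call that accepts `device_map` and `max_memory`. The helper `build_load_kwargs` is hypothetical and not part of this repository. It leaves `max_memory` out unless a cap is explicitly given, so accelerate decides placement on its own:

```python
from typing import Any, Dict, Optional, Union

# Format accelerate accepts for memory caps: GPU index or "cpu" mapped to a size string.
MaxMemory = Dict[Union[int, str], str]


def build_load_kwargs(max_memory: Optional[MaxMemory] = None) -> Dict[str, Any]:
    """Build kwargs for a from_pretrained-style loader (hypothetical helper)."""
    # device_map="auto" asks accelerate to spread the weights across visible devices.
    kwargs: Dict[str, Any] = {"device_map": "auto"}
    # Forward max_memory only when the caller explicitly set a non-empty cap.
    if max_memory:
        kwargs["max_memory"] = dict(max_memory)
    return kwargs


if __name__ == "__main__":
    # Default: no max_memory key at all, placement is left to accelerate.
    print(build_load_kwargs())
    # Explicit caps, should the argument be needed again after the fix.
    print(build_load_kwargs({0: "20GiB", 1: "20GiB", "cpu": "64GiB"}))
```

Whether omitting the argument is the right long-term fix, or only a workaround until the underlying bug is found, is still open.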