Multi-GPU inference is essential for GPUs with small VRAM: a 13B LLaMA model cannot fit on a single 3090 without quantization. llama.cpp merged its multi-GPU branch yesterday (ggerganov/llama.cpp#1703), which lets us deploy LLMs across several small-VRAM GPUs. I hope llama-cpp-python can support multi-GPU inference in the future as well. Many thanks!!!
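
For illustration, here is a minimal sketch of what such a Python-side interface might look like, assuming parameters that mirror llama.cpp's multi-GPU options (`--n-gpu-layers`, `--tensor-split`, `--main-gpu`). The parameter names and the model path below are hypothetical placeholders, not an existing llama-cpp-python API:

```python
from llama_cpp import Llama

# Hypothetical multi-GPU parameters mirroring llama.cpp's CLI flags;
# names and semantics are illustrative only.
llm = Llama(
    model_path="./models/13B/ggml-model-q4_0.bin",  # placeholder path
    n_gpu_layers=-1,          # offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # split the weights evenly across two GPUs
    main_gpu=0,               # GPU used for small tensors / scratch buffers
)

out = llm("Q: Why is multi-GPU inference useful? A:", max_tokens=64)
print(out["choices"][0]["text"])
```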