I'm looking for a way to load multiple models at different times, but I also need a way to unload a model from GPU/RAM when I'm done using it in the main process.
For example, in your main process you use llama-cpp-python to load your model and interface with it.
However, say you need to switch to another model from the main process. Typically you're supposed to unload the current model from RAM and then load the next one, but I can't seem to figure out a way to gracefully tell llama-cpp to shut down and unload from memory, so I don't run into OOM issues.
Anyone have any ideas on how to do this?
There's an open bug upstream with llama.cpp about it not cleaning up GPU VRAM. More details in #223.
AFAIK, the Python garbage collector should clean up a model object that resides in CPU RAM once there are no references to it. Or you can explicitly use `del llama_obj` to destroy it in the case where you immediately need the RAM freed in order to create a new instance in the current scope.
More experienced Python programmers are invited to correct me here 😄
This should all be fixed now: once the Llama object is garbage collected, you should be able to load a new model without running out of memory. To do this, either set `llama = None` or use `del llama` (assuming your reference to the model is `llama`).
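For reference, a minimal sketch of swapping models this way (the model paths and prompts below are placeholders, not anything from this repo):

```python
import gc
from llama_cpp import Llama

# Load the first model and use it.
llama = Llama(model_path="./models/model-a.gguf")  # placeholder path
print(llama("Q: What is 2 + 2? A:", max_tokens=8)["choices"][0]["text"])

# Drop the only reference so the model's memory (GPU and CPU) can be released.
del llama          # or: llama = None
gc.collect()       # optional: force collection before allocating the next model

# Now it should be safe to load the second model without hitting OOM.
llama = Llama(model_path="./models/model-b.gguf")  # placeholder path
print(llama("Q: Name a primary color. A:", max_tokens=8)["choices"][0]["text"])
```

The key point is that no other variable may still reference the old model when you `del` it, otherwise the destructor never runs and the memory stays allocated.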