
Commit 256e696

[docs] add notes for stateful model changes (#3252)
* [docs] add notes for stateful model changes

* Update docs/source/en/optimization/fp16.mdx

Co-authored-by: Pedro Cuenca <[email protected]>

* link to accelerate docs for discarding hooks

---------

Co-authored-by: Pedro Cuenca <[email protected]>
1 parent 329d1df commit 256e696

1 file changed: +7 -0 lines changed


docs/source/en/optimization/fp16.mdx

Lines changed: 7 additions & 0 deletions
@@ -202,6 +202,8 @@ image = pipe(prompt).images[0]

**Note**: When using `enable_sequential_cpu_offload()`, it is important to **not** move the pipeline to CUDA beforehand or else the gain in memory consumption will only be minimal. See [this issue](https://github.com/huggingface/diffusers/issues/1934) for more information.

+**Note**: `enable_sequential_cpu_offload()` is a stateful operation that installs hooks on the models.
+

<a name="model_offloading"></a>
## Model offloading for fast inference and memory savings
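
For context (this is not part of the commit's diff), a minimal sketch of the usage pattern the new note refers to; the checkpoint name and prompt are illustrative placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Stateful: this installs offload hooks on the pipeline's models. Do not call
# pipe.to("cuda") beforehand, otherwise the memory savings are mostly lost.
pipe.enable_sequential_cpu_offload()

image = pipe("a photo of an astronaut riding a horse on mars").images[0]
```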
@@ -251,6 +253,11 @@ image = pipe(prompt).images[0]
This feature requires `accelerate` version 0.17.0 or larger.
</Tip>

+**Note**: `enable_model_cpu_offload()` is a stateful operation that installs hooks on the models and state on the pipeline. In order to properly offload
+models after they are called, it is required that the entire pipeline is run and models are called in the order the pipeline expects them to be. Exercise caution
+if models are re-used outside the context of the pipeline after hooks have been installed. See [accelerate](https://huggingface.co/docs/accelerate/v0.18.0/en/package_reference/big_modeling#accelerate.hooks.remove_hook_from_module)
+for further docs on removing hooks.
+
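
Again for context rather than as part of the diff, a hedged sketch of what discarding the installed hooks could look like with `accelerate`'s `remove_hook_from_module`, which the note links to; the checkpoint and the choice of the UNet component are assumptions for illustration:

```python
import torch
from accelerate.hooks import remove_hook_from_module
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()  # stateful: installs offload hooks on the models

# Run the whole pipeline so each model is called (and offloaded) in the order it expects.
image = pipe("a photo of an astronaut riding a horse on mars").images[0]

# To re-use a component outside the pipeline, discard its offload hooks first;
# recurse=True also removes hooks from all submodules.
remove_hook_from_module(pipe.unet, recurse=True)
unet = pipe.unet.to("cuda")
```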
## Using Channels Last memory format

Channels last memory format is an alternative way of ordering NCHW tensors in memory that preserves dimension ordering. Channels last tensors are ordered in such a way that channels become the densest dimension (aka storing images pixel-per-pixel). Since not all operators currently support the channels last format, it may result in worse performance, so it's better to try it and see if it works for your model.
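
As a side note (not from the commit), a small sketch of switching a pipeline component to channels last and checking the result; `pipe` is assumed to be a previously loaded diffusers pipeline, and `conv_out` is just one convenient 4D weight to inspect:

```python
import torch

# Convert the UNet's parameters to channels last (in-place for an nn.Module).
pipe.unet.to(memory_format=torch.channels_last)

# A stride of 1 in the channels dimension of a 4D weight indicates the conversion took effect.
print(pipe.unet.conv_out.state_dict()["weight"].stride())
```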
