
Commit be50082

Removing autocast from optimization markdown
1 parent: b7b0637

File tree

1 file changed (+2, -2 lines)


docs/source/optimization/fp16.mdx

Lines changed: 2 additions & 2 deletions
```diff
@@ -296,7 +296,7 @@ with torch.inference_mode():
 ## Memory Efficient Attention
 Recent work on optimizing the bandwidth in the attention block has generated huge speedups and gains in GPU memory usage. The most recent is Flash Attention (from @tridao, [code](https://github.com/HazyResearch/flash-attention), [paper](https://arxiv.org/pdf/2205.14135.pdf)).
 Here are the speedups we obtain on a few Nvidia GPUs when running inference at 512x512 with a batch size of 1 (one prompt):
-
+
 | GPU | Base Attention FP16 | Memory Efficient Attention FP16 |
 |------------------ |--------------------- |--------------------------------- |
 | NVIDIA Tesla T4 | 3.5it/s | 5.5it/s |
```
```diff
@@ -323,7 +323,7 @@ pipe = StableDiffusionPipeline.from_pretrained(
 
 pipe.enable_xformers_memory_efficient_attention()
 
-with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
+with torch.inference_mode():
     sample = pipe("a small cat")
 
 # optional: You can disable it via
```
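
Since the example in fp16.mdx already runs the pipeline with half-precision weights, the extra `torch.autocast("cuda", dtype=torch.float16)` context adds nothing, which is presumably what this commit cleans up. Below is a minimal sketch of the snippet after the change; the checkpoint id and the fp16 loading arguments are assumptions, because the diff only shows the truncated `from_pretrained(` call:

```python
# Minimal sketch of the snippet after this commit. The checkpoint id and the
# fp16 loading arguments are assumptions: the diff truncates the
# `from_pretrained(` call and only shows the lines around the autocast change.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed model id, not shown in the diff
    torch_dtype=torch.float16,         # weights are already loaded in fp16
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

# With fp16 weights, a separate torch.autocast("cuda", dtype=torch.float16)
# context adds nothing, so plain inference_mode is enough.
with torch.inference_mode():
    sample = pipe("a small cat")
```

`torch.inference_mode()` on its own still disables autograd tracking during generation, which is the part of the original wrapper worth keeping.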

0 commit comments
