
Commit be50082

Removing autocast from optimization markdown
1 parent: b7b0637

File tree

1 file changed (+2, -2 lines)


docs/source/optimization/fp16.mdx

Lines changed: 2 additions & 2 deletions
```diff
@@ -296,7 +296,7 @@ with torch.inference_mode():
 ## Memory Efficient Attention
 Recent work on optimizing the bandwidth in the attention block has generated huge speedups and gains in GPU memory usage. The most recent is Flash Attention (from @tridao, [code](https://github.com/HazyResearch/flash-attention), [paper](https://arxiv.org/pdf/2205.14135.pdf)).
 Here are the speedups we obtain on a few Nvidia GPUs when running inference at 512x512 with a batch size of 1 (one prompt):
-
+
 | GPU | Base Attention FP16 | Memory Efficient Attention FP16 |
 |------------------ |--------------------- |--------------------------------- |
 | NVIDIA Tesla T4 | 3.5it/s | 5.5it/s |
```
```diff
@@ -323,7 +323,7 @@ pipe = StableDiffusionPipeline.from_pretrained(
 
 pipe.enable_xformers_memory_efficient_attention()
 
-with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
+with torch.inference_mode():
     sample = pipe("a small cat")
 
 # optional: You can disable it via
```
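
Since the example in fp16.mdx already runs the pipeline with half-precision weights, the extra `torch.autocast("cuda", dtype=torch.float16)` context adds nothing, which is presumably what this commit cleans up. Below is a minimal sketch of the snippet after the change; the checkpoint id and the fp16 loading arguments are assumptions, because the diff only shows the truncated `from_pretrained(` call:

```python
# Minimal sketch of the snippet after this commit. The checkpoint id and the
# fp16 loading arguments are assumptions: the diff truncates the
# `from_pretrained(` call and only shows the lines around the autocast change.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed model id, not shown in the diff
    torch_dtype=torch.float16,         # weights are already loaded in fp16
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

# With fp16 weights, a separate torch.autocast("cuda", dtype=torch.float16)
# context adds nothing, so plain inference_mode is enough.
with torch.inference_mode():
    sample = pipe("a small cat")
```

`torch.inference_mode()` on its own still disables autograd tracking during generation, which is the part of the original wrapper worth keeping.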

0 commit comments
