From 6289de68be01bc57f4744258c6713a37fddd6ca1 Mon Sep 17 00:00:00 2001
From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Date: Mon, 10 Mar 2025 13:02:35 -0700
Subject: [PATCH 01/14] nightly_fixed_aiter_integration_final_20250305 README
 update (perf results only)

---
 docs/dev-docker/README.md | 82 +++++++++++++++++++--------------------
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index 5b79fb18dff..4ae288d2a29 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -43,14 +43,14 @@ The table below shows performance data where a local inference client is fed req
 
 | Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
 |-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15105 |
-| | | | 128 | 4096 | 1500 | 1500 | 10505 |
-| | | | 500 | 2000 | 2000 | 2000 | 12664 |
-| | | | 2048 | 2048 | 1500 | 1500 | 8239 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4065 |
-| | | | 128 | 4096 | 1500 | 1500 | 3171 |
-| | | | 500 | 2000 | 2000 | 2000 | 2985 |
-| | | | 2048 | 2048 | 500 | 500 | 1999 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15919.0 |
+| | | | 128 | 4096 | 1500 | 1500 | 12053.3 |
+| | | | 500 | 2000 | 2000 | 2000 | 13089.0 |
+| | | | 2048 | 2048 | 1500 | 1500 | 8352.4 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4219.7 |
+| | | | 128 | 4096 | 1500 | 1500 | 3328.7 |
+| | | | 500 | 2000 | 2000 | 2000 | 3109.3 |
+| | | | 2048 | 2048 | 500 | 500 | 2121.7 |
 
 *TP stands for Tensor Parallelism.*
 
@@ -58,40 +58,40 @@ The table below shows performance data where a local inference client is fed req
 
 The table below shows latency measurement, which typically involves assessing the time from when the system receives an input to when the model produces a result.
 
-| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (ms) |
+| Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
 |-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 19088.59 |
-| | | | 2 | 128 | 2048 | 19610.46 |
-| | | | 4 | 128 | 2048 | 19911.30 |
-| | | | 8 | 128 | 2048 | 21858.80 |
-| | | | 16 | 128 | 2048 | 23537.59 |
-| | | | 32 | 128 | 2048 | 25342.94 |
-| | | | 64 | 128 | 2048 | 32548.19 |
-| | | | 128 | 128 | 2048 | 45216.37 |
-| | | | 1 | 2048 | 2048 | 19154.43 |
-| | | | 2 | 2048 | 2048 | 19670.60 |
-| | | | 4 | 2048 | 2048 | 19976.32 |
-| | | | 8 | 2048 | 2048 | 22485.63 |
-| | | | 16 | 2048 | 2048 | 25246.27 |
-| | | | 32 | 2048 | 2048 | 28967.08 |
-| | | | 64 | 2048 | 2048 | 39920.41 |
-| | | | 128 | 2048 | 2048 | 59514.25 |
-| Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 51739.70 |
-| | | | 2 | 128 | 2048 | 52769.15 |
-| | | | 4 | 128 | 2048 | 54557.07 |
-| | | | 8 | 128 | 2048 | 56901.86 |
-| | | | 16 | 128 | 2048 | 60432.12 |
-| | | | 32 | 128 | 2048 | 67353.01 |
-| | | | 64 | 128 | 2048 | 81085.33 |
-| | | | 128 | 128 | 2048 | 116138.51 |
-| | | | 1 | 2048 | 2048 | 52217.76 |
-| | | | 2 | 2048 | 2048 | 53227.47 |
-| | | | 4 | 2048 | 2048 | 55512.44 |
-| | | | 8 | 2048 | 2048 | 59931.41 |
-| | | | 16 | 2048 | 2048 | 66890.14 |
-| | | | 32 | 2048 | 2048 | 80687.64 |
-| | | | 64 | 2048 | 2048 | 108503.12 |
-| | | | 128 | 2048 | 2048 | 168845.50 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.654 |
+| | | | 2 | 128 | 2048 | 18.269 |
+| | | | 4 | 128 | 2048 | 18.561 |
+| | | | 8 | 128 | 2048 | 20.180 |
+| | | | 16 | 128 | 2048 | 22.541 |
+| | | | 32 | 128 | 2048 | 25.454 |
+| | | | 64 | 128 | 2048 | 33.666 |
+| | | | 128 | 128 | 2048 | 48.466 |
+| | | | 1 | 2048 | 2048 | 17.771 |
+| | | | 2 | 2048 | 2048 | 18.304 |
+| | | | 4 | 2048 | 2048 | 19.173 |
+| | | | 8 | 2048 | 2048 | 21.326 |
+| | | | 16 | 2048 | 2048 | 24.375 |
+| | | | 32 | 2048 | 2048 | 29.284 |
+| | | | 64 | 2048 | 2048 | 40.200 |
+| | | | 128 | 2048 | 2048 | 62.420 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.632 |
+| | | | 2 | 128 | 2048 | 47.370 |
+| | | | 4 | 128 | 2048 | 49.945 |
+| | | | 8 | 128 | 2048 | 53.010 |
+| | | | 16 | 128 | 2048 | 56.348 |
+| | | | 32 | 128 | 2048 | 65.222 |
+| | | | 64 | 128 | 2048 | 82.688 |
+| | | | 128 | 128 | 2048 | 115.980 |
+| | | | 1 | 2048 | 2048 | 46.918 |
+| | | | 2 | 2048 | 2048 | 48.132 |
+| | | | 4 | 2048 | 2048 | 52.281 |
+| | | | 8 | 2048 | 2048 | 55.874 |
+| | | | 16 | 2048 | 2048 | 61.822 |
+| | | | 32 | 2048 | 2048 | 76.925 |
+| | | | 64 | 2048 | 2048 | 105.400 |
+| | | | 128 | 2048 | 2048 | 162.503 |
 
 *TP stands for Tensor Parallelism.*

From bb968f58fc5499c1fcd7093ae98fdaaa9207687a Mon Sep 17 00:00:00 2001
From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Date: Mon, 10 Mar 2025 13:36:36 -0700
Subject: [PATCH 02/14] Update Docker Manifest git hash

---
 docs/dev-docker/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index 4ae288d2a29..33ff5246988 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -482,7 +482,7 @@ To reproduce the release docker:
 ```bash
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout c24ea633f928d77582bc85aff922d07f3bca9d78
+git checkout c0dd5adf68dd997d7d2c3f04da785d7ef9415e36
 docker build -f Dockerfile.rocm -t --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
 ```

From 245c94c48781b48bad6c0c0bca4f1dc8e448d582 Mon Sep 17 00:00:00 2001
From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Date: Mon, 10 Mar 2025 13:41:42 -0700
Subject: [PATCH 03/14] Update Docker Manifest and added
 nightly_fixed_aiter_integration_final_20250305

---
 docs/dev-docker/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index 33ff5246988..c2b28215b18 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -20,6 +20,8 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main
 
 ## What is New
 
+nightly_fixed_aiter_integration_final_20250305:
+- Performance improvement
 20250207_aiter:
 - More performant AITER
 - Bug fixes
@@ -483,7 +485,7 @@ To reproduce the release docker:
 git clone https://github.com/ROCm/vllm.git
 cd vllm
 git checkout c0dd5adf68dd997d7d2c3f04da785d7ef9415e36
-docker build -f Dockerfile.rocm -t --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
+docker build -f Dockerfile.rocm -t --build-arg USE_CYTHON=1 .
 ```
 
 ### AITER

From 4883d6b752a9819c48cece0d9628e6c07a760876 Mon Sep 17 00:00:00 2001
From: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Date: Mon, 10 Mar 2025 14:17:13 -0700
Subject: [PATCH 04/14] some more updates

---
 docs/dev-docker/README.md | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index c2b28215b18..f6a3012c170 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -11,7 +11,7 @@ This documentation includes information for running the popular Llama 3.1 series
 The pre-built image includes:
 
 - ROCm™ 6.3.1
-- vLLM 0.6.6
+- vLLM 0.7.3
 - PyTorch 2.7dev (nightly)
 
 ## Pull latest Docker Image
@@ -20,18 +20,25 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main
 
 ## What is New
 
-nightly_fixed_aiter_integration_final_20250305:
-- Performance improvement
+20250305_aiter:
+- vllm 0.7.3
+- HipblasLT 0.13
+- AITER improvements
+- Support for FP8 skinny GEMM
+
 20250207_aiter:
 - More performant AITER
 - Bug fixes
+
 20250205_aiter:
 - [AITER](https://github.com/ROCm/aiter) support
 - Performance improvement for custom paged attention
 - Reduced memory overhead bug fix
+
 20250124:
 - Fix accuracy issue with 405B FP8 Triton FA
 - Fixed accuracy issue with TP8
+
 20250117:
 - [Experimental DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
@@ -359,7 +366,7 @@ docker run -it --rm --ipc=host --network=host --group-add render \
     --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
     --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
     -e VLLM_USE_TRITON_FLASH_ATTN=0 \
-    -e VLLM_FP8_PADDING=0 \
+    -e VLLM_MLA_DISABLE=1 \
     rocm/vllm-dev:main
 # Online serving
 vllm serve deepseek-ai/DeepSeek-V3 \

From 721f350574d2e5a7532bcc14da07d83cecff391f Mon Sep 17 00:00:00 2001
From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Date: Mon, 10 Mar 2025 17:49:48 -0700
Subject: [PATCH 05/14] Update AITER section with example

---
 docs/dev-docker/README.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index f6a3012c170..bf4f3d19643 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -458,6 +458,11 @@ There is a published release candidate image at `rocm/vllm-dev:nightly_aiter_int
 To enable the feature make sure the following environment is set: `VLLM_USE_AITER=1`.
 The default value is `0` in vLLM, but is set to `1` in the aiter docker.
 
+```bash
+export VLLM_USE_AITER=1
+python3 /appl/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B --max-model-len 26720 -tp 8 --batch-size 1 --input-len 1024 --output-len 128
+```
+
 ## MMLU_PRO_Biology Accuracy Evaluation
 
 ### FP16

From eaed222784d5c74317add2501d18cda3c581d6d0 Mon Sep 17 00:00:00 2001
From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Date: Mon, 10 Mar 2025 17:52:49 -0700
Subject: [PATCH 06/14] Updated AITER command with larger batch size and model
 name

---
 docs/dev-docker/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index bf4f3d19643..52855e357f3 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -460,7 +460,7 @@ The default value is `0` in vLLM, but is set to `1` in the aiter docker.
 ```bash
 export VLLM_USE_AITER=1
-python3 /appl/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B --max-model-len 26720 -tp 8 --batch-size 1 --input-len 1024 --output-len 128
+python3 /appl/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV --max-model-len 26720 -tp 8 --batch-size 128 --input-len 1024 --output-len 128
 ```
 
 ## MMLU_PRO_Biology Accuracy Evaluation

From e6974fed163d6fbe874fd0e27dc53396633633cd Mon Sep 17 00:00:00 2001
From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Date: Mon, 10 Mar 2025 17:54:14 -0700
Subject: [PATCH 07/14] Fixing typo

---
 docs/dev-docker/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index 52855e357f3..92d01a3bd1f 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -460,7 +460,7 @@ The default value is `0` in vLLM, but is set to `1` in the aiter docker.
 ```bash
 export VLLM_USE_AITER=1
-python3 /appl/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV --max-model-len 26720 -tp 8 --batch-size 128 --input-len 1024 --output-len 128
+python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV --max-model-len 26720 -tp 8 --batch-size 128 --input-len 1024 --output-len 128
 ```
 
 ## MMLU_PRO_Biology Accuracy Evaluation

From 8f69b22bdfa44e6babdd0094a2fc2d5bf160e613 Mon Sep 17 00:00:00 2001
From: Mcirino1 <57415822+Mcirino1@users.noreply.github.com>
Date: Mon, 10 Mar 2025 17:57:45 -0700
Subject: [PATCH 08/14] Removed --max-model-len in AITER command

---
 docs/dev-docker/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index 92d01a3bd1f..aab59f78fee 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -460,7 +460,7 @@ The default value is `0` in vLLM, but is set to `1` in the aiter docker.
 ```bash
 export VLLM_USE_AITER=1
-python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV --max-model-len 26720 -tp 8 --batch-size 128 --input-len 1024 --output-len 128
+python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 128 --input-len 1024 --output-len 128
 ```
 
 ## MMLU_PRO_Biology Accuracy Evaluation

From a2ceb33477734f89ea35a74ccc4d0daed3403e8a Mon Sep 17 00:00:00 2001
From: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Date: Mon, 10 Mar 2025 18:09:10 -0700
Subject: [PATCH 09/14] Updating AITER instructions

---
 docs/dev-docker/README.md | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index aab59f78fee..af7e3802c0f 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -450,17 +450,17 @@ python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Inst
 
 You should see some performance improvement in the e2e latency.
 
-### AITER
+### AITER use cases
 
-To get [AITER](https://github.com/ROCm/aiter) kernels support, follow the [Docker build steps](#Docker-manifest) using the [aiter_intergration_final](https://github.com/ROCm/vllm/tree/aiter_intergration_final) branch
-There is a published release candidate image at `rocm/vllm-dev:nightly_aiter_intergration_final_20250130`
+`rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support, and can yield significant performance increase for some model/input/output/batch size configurations. To enable the feature make sure the following environment is set: `VLLM_USE_AITER=1`, the default value is `0`. When building your own image follow the [Docker build steps](#Docker-manifest) using the [aiter_intergration_final](https://github.com/ROCm/vllm/tree/aiter_intergration_final) branch.
 
-To enable the feature make sure the following environment is set: `VLLM_USE_AITER=1`.
-The default value is `0` in vLLM, but is set to `1` in the aiter docker.
+Some use cases include:
+- amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
+- amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
 
 ```bash
 export VLLM_USE_AITER=1
-python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 128 --input-len 1024 --output-len 128
+python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 1024 --output-len 128
 ```
@@ -500,13 +500,13 @@ To reproduce the release docker:
 docker build -f Dockerfile.rocm -t --build-arg USE_CYTHON=1 .
 ```
 
-### AITER
+### Building AITER Image
 
 Use Aiter release candidate branch instead:
 
 ```bash
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout aiter_intergration_final
-docker build -f Dockerfile.rocm -t --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
+git checkout aiter_integration_final
+docker build -f Dockerfile.rocm -t --build-arg --build-arg USE_CYTHON=1 .
 ```

From 52a332189ad05e3e0be52e6e56815a2e9e080ac2 Mon Sep 17 00:00:00 2001
From: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Date: Mon, 10 Mar 2025 18:09:45 -0700
Subject: [PATCH 10/14] typo

---
 docs/dev-docker/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index af7e3802c0f..e194c956afe 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -502,7 +502,7 @@ To reproduce the release docker:
 ### Building AITER Image
 
-Use Aiter release candidate branch instead:
+Use AITER release candidate branch instead:
 
 ```bash
 git clone https://github.com/ROCm/vllm.git

From 67bc4c94617ab196d54df812a78a1c17355d655a Mon Sep 17 00:00:00 2001
From: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Date: Mon, 10 Mar 2025 18:10:26 -0700
Subject: [PATCH 11/14] Another typo

---
 docs/dev-docker/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index e194c956afe..ae7ef36678f 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -452,7 +452,7 @@ You should see some performance improvement in the e2e latency.
 
 ### AITER use cases
 
-`rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support, and can yield significant performance increase for some model/input/output/batch size configurations. To enable the feature make sure the following environment is set: `VLLM_USE_AITER=1`, the default value is `0`. When building your own image follow the [Docker build steps](#Docker-manifest) using the [aiter_intergration_final](https://github.com/ROCm/vllm/tree/aiter_intergration_final) branch.
+`rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support, and can yield significant performance increase for some model/input/output/batch size configurations. To enable the feature make sure the following environment is set: `VLLM_USE_AITER=1`, the default value is `0`. When building your own image follow the [Docker build steps](#Docker-manifest) using the [aiter_integration_final](https://github.com/ROCm/vllm/tree/aiter_integration_final) branch.
 
 Some use cases include:
 - amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV

From 8890306f2ced422e3d4a592491713866734fee7b Mon Sep 17 00:00:00 2001
From: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Date: Mon, 10 Mar 2025 18:21:35 -0700
Subject: [PATCH 12/14] Whitespace

---
 docs/dev-docker/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index ae7ef36678f..151b02b7116 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -450,7 +450,7 @@ python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Inst
 
 You should see some performance improvement in the e2e latency.
 
-### AITER use cases 
+### AITER use cases
 
 `rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support, and can yield significant performance increase for some model/input/output/batch size configurations. To enable the feature make sure the following environment is set: `VLLM_USE_AITER=1`, the default value is `0`. When building your own image follow the [Docker build steps](#Docker-manifest) using the [aiter_integration_final](https://github.com/ROCm/vllm/tree/aiter_integration_final) branch.

From b309ea79418c6fc9058a6b2dec312375d3c75753 Mon Sep 17 00:00:00 2001
From: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Date: Mon, 10 Mar 2025 18:22:28 -0700
Subject: [PATCH 13/14] modifying whats new section

---
 docs/dev-docker/README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index 151b02b7116..5e26344bccd 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -11,6 +11,7 @@ This documentation includes information for running the popular Llama 3.1 series
 The pre-built image includes:
 
 - ROCm™ 6.3.1
+- HipblasLT 0.13
 - vLLM 0.7.3
 - PyTorch 2.7dev (nightly)
 
@@ -21,8 +22,6 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main
 ## What is New
 
 20250305_aiter:
-- vllm 0.7.3
-- HipblasLT 0.13
 - AITER improvements
 - Support for FP8 skinny GEMM

From 0c5c5a978cd50466dad0dcf2be24549b146f856f Mon Sep 17 00:00:00 2001
From: arakowsk-amd <182798202+arakowsk-amd@users.noreply.github.com>
Date: Tue, 11 Mar 2025 09:04:37 -0700
Subject: [PATCH 14/14] Another typo

---
 docs/dev-docker/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index 5e26344bccd..8b339bf5fd0 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -507,5 +507,5 @@ Use AITER release candidate branch instead:
 git clone https://github.com/ROCm/vllm.git
 cd vllm
 git checkout aiter_integration_final
-docker build -f Dockerfile.rocm -t --build-arg --build-arg USE_CYTHON=1 .
+docker build -f Dockerfile.rocm -t --build-arg USE_CYTHON=1 .
 ```
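PATCH 01/14 records both the old and the new serving-throughput figures, so the size of the improvement can be derived directly from the diff. The short script below does that arithmetic for the Llama 3.1 70B rows; the numbers are copied from the table above, and the script itself is purely illustrative (it is not part of the release tooling):

```python
# Old vs. new serving throughput (tokens/s) for Llama 3.1 70B FP8 on MI300X,
# keyed by (input length, output length). Values copied from PATCH 01/14.
old = {(128, 2048): 15105, (128, 4096): 10505, (500, 2000): 12664, (2048, 2048): 8239}
new = {(128, 2048): 15919.0, (128, 4096): 12053.3, (500, 2000): 13089.0, (2048, 2048): 8352.4}

def pct_gain(before: float, after: float) -> float:
    """Relative throughput improvement in percent."""
    return (after - before) / before * 100.0

for cfg in old:
    # e.g. the 128/2048 row works out to roughly +5.4%
    print(f"in={cfg[0]:5d} out={cfg[1]:5d}: {pct_gain(old[cfg], new[cfg]):+.1f}%")
```

The 405B rows can be checked the same way by substituting their before/after values.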