
[WIP] Port to ROCm/HIP #178


Merged on Mar 14, 2025 (10 commits)
1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

<h2 id="Updates">🔥 Updates</h2>

* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
1 change: 1 addition & 0 deletions doc/README.md
@@ -22,6 +22,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

<h2 id="Updates">🔥 Updates</h2>

* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. The detailed tutorial is [here](./en/DeepseekR1_V3_tutorial.md).
1 change: 1 addition & 0 deletions doc/SUMMARY.md
@@ -10,6 +10,7 @@
- [Injection Tutorial](en/injection_tutorial.md)
- [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
- [Use FP8 GPU Kernel](en/fp8_kernel.md)
- [Use AMD GPU](en/ROCm.md)
# Server
- [Server](en/api/server/server.md)
- [Website](en/api/server/website.md)
96 changes: 96 additions & 0 deletions doc/en/ROCm.md
@@ -0,0 +1,96 @@
# ROCm Support for ktransformers (Beta)

## Introduction

### Overview
In our effort to expand GPU architecture support beyond NVIDIA, we are excited to introduce **AMD GPU support through ROCm** in ktransformers (Beta release). This implementation has been developed and tested on AMD EPYC 9274F processors and AMD Radeon 7900 XTX GPUs.

## Installation Guide

### 1. Install ROCm Driver
Begin by installing the ROCm drivers for your AMD GPU:
- [Official ROCm Installation Guide for Radeon GPUs](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/native_linux/install-radeon.html)
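After installation (and a reboot, if prompted), you can optionally confirm that the driver and runtime see the GPU. This is a minimal sanity check using the standard ROCm utilities, assuming they were installed along with the driver:

```bash
# Optional sanity check: both utilities ship with the ROCm stack.
rocm-smi             # should list the Radeon GPU, temperature, and VRAM usage
rocminfo | grep gfx  # should print the GPU's gfx target (e.g. gfx1100 for the 7900 XTX)
```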

### 2. Set Up Conda Environment
We recommend using Miniconda3/Anaconda3 for environment management:

```bash
# Download and install Miniconda (follow the installer prompts)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Create environment
conda create --name ktransformers python=3.11
conda activate ktransformers

# Install required libraries
conda install -c conda-forge libstdcxx-ng

# Verify GLIBCXX version (should include 3.4.32)
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
```

> **Note:** Adjust the Anaconda path if your installation directory differs from `~/anaconda3`

### 3. Install PyTorch for ROCm
Install PyTorch with ROCm 6.2.4 support:

```bash
pip3 install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/rocm6.2.4
pip3 install packaging ninja cpufeature numpy
```

> **Tip:** For other ROCm versions, visit [PyTorch Previous Versions](https://pytorch.org/get-started/previous-versions/)
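Before building ktransformers, it is worth confirming that the ROCm build of PyTorch actually sees the GPU. ROCm wheels reuse the `torch.cuda` API surface and report the HIP version through `torch.version.hip`; a quick check:

```bash
# Prints the torch version, the HIP/ROCm version it was built against,
# whether the AMD GPU is visible, and its name.
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```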

### 4. Build ktransformers

```bash
# Clone repository
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init

# Optional: Compile web interface
# See: api/server/website.md

# Install dependencies
bash install.sh
```
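Once `install.sh` finishes, a minimal import check confirms the Python package installed correctly (this does not by itself exercise the C++ extension or the GPU path):

```bash
python -c "import ktransformers; print('ktransformers imported OK')"
```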

## Running DeepSeek-R1 Models

### Configuration for 24GB VRAM GPUs
Use our optimized configuration for constrained VRAM:

```bash
python ktransformers/local_chat.py \
--model_path deepseek-ai/DeepSeek-R1 \
--gguf_path <path_to_gguf_files> \
--optimize_config_path ktransformers/optimize/optimize_rules/rocm/DeepSeek-V3-Chat.yaml \
--cpu_infer <cpu_cores + 1>
```

> **Beta Note:** The current Q8 linear implementation (used in place of Marlin) is not yet fully optimized; expect performance improvements in future releases.

### Configuration for 40GB+ VRAM GPUs
For better performance on high-VRAM GPUs:

1. Modify `DeepSeek-V3-Chat.yaml` (see the sketch below for one way to apply the change):
```yaml
# Replace all instances of:
KLinearMarlin → KLinearTorch
```

2. Execute with:
```bash
python ktransformers/local_chat.py \
--model_path deepseek-ai/DeepSeek-R1 \
--gguf_path <path_to_gguf_files> \
--optimize_config_path <modified_yaml_path> \
--cpu_infer <cpu_cores + 1>
```
> **Tip:** If you have two 24GB AMD GPUs, apply the same modification to `ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` and pass that file as the optimize config instead.
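The substitution from step 1 can be done in an editor or scripted. A minimal sketch using `sed`, assuming the stock rule file ships at `ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml` (the copy's destination path is illustrative):

```bash
# Copy the stock rule file so the original stays untouched, then swap the op.
cp ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml ./DeepSeek-V3-Chat-highvram.yaml
sed -i 's/KLinearMarlin/KLinearTorch/g' ./DeepSeek-V3-Chat-highvram.yaml
# Pass ./DeepSeek-V3-Chat-highvram.yaml as --optimize_config_path in step 2.
```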

## Known Limitations
- Marlin operations are not supported on the ROCm platform
- The current Q8 linear implementation shows reduced performance (Beta limitation)
39 changes: 39 additions & 0 deletions ktransformers/ktransformers_ext/CMakeLists.txt
@@ -32,6 +32,7 @@ endif()
option(LLAMA_AVX512_FANCY_SIMD "llama: enable AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-VNNI" OFF)
option(KTRANSFORMERS_USE_CUDA "ktransformers: use CUDA" OFF)
option(KTRANSFORMERS_USE_MUSA "ktransformers: use MUSA" OFF)
option(KTRANSFORMERS_USE_ROCM "ktransformers: use ROCM" OFF)

# Architecture specific
# TODO: probably these flags need to be tweaked on some architectures
@@ -201,6 +202,31 @@ endif()
# message(STATUS "Can't found CUDA lib")
# endif()

if (NOT EXISTS $ENV{ROCM_PATH})
if (NOT EXISTS /opt/rocm)
set(ROCM_PATH /usr)
else()
set(ROCM_PATH /opt/rocm)
endif()
else()
set(ROCM_PATH $ENV{ROCM_PATH})
endif()

list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH})
list(APPEND CMAKE_PREFIX_PATH "${ROCM_PATH}/lib64/cmake")

if (NOT EXISTS $ENV{MUSA_PATH})
if (NOT EXISTS /opt/musa)
set(MUSA_PATH /usr/local/musa)
else()
set(MUSA_PATH /opt/musa)
endif()
else()
set(MUSA_PATH $ENV{MUSA_PATH})
endif()

list(APPEND CMAKE_MODULE_PATH "${MUSA_PATH}/cmake")

add_compile_options("$<$<COMPILE_LANGUAGE:CXX>:${ARCH_FLAGS}>")
add_compile_options("$<$<COMPILE_LANGUAGE:C>:${ARCH_FLAGS}>")

@@ -218,6 +244,14 @@ elseif (UNIX)
add_compile_definitions(KTRANSFORMERS_USE_CUDA=1)
endif()

if (KTRANSFORMERS_USE_ROCM)
find_package(HIP REQUIRED)
if(HIP_FOUND)
include_directories("${HIP_INCLUDE_DIRS}")
add_compile_definitions(KTRANSFORMERS_USE_ROCM=1)
endif()
endif()

if (KTRANSFORMERS_USE_MUSA)
if (NOT EXISTS $ENV{MUSA_PATH})
if (NOT EXISTS /opt/musa)
@@ -258,6 +292,11 @@ elseif(UNIX)
endif()
target_link_libraries(${PROJECT_NAME} PRIVATE "$ENV{CUDA_HOME}/lib64/libcudart.so")
endif()
if (KTRANSFORMERS_USE_ROCM)
add_compile_definitions(USE_HIP=1)
target_link_libraries(${PROJECT_NAME} PRIVATE "${ROCM_PATH}/lib/libamdhip64.so")
message(STATUS "Building for HIP")
endif()
if(KTRANSFORMERS_USE_MUSA)
target_link_libraries(${PROJECT_NAME} PRIVATE MUSA::musart)
endif()
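For reference, the new option is consumed like any other CMake flag. `install.sh` normally drives the build, but a hand-driven configure of the C++ extension would look roughly like this (a sketch; the exact flags install.sh passes may differ):

```bash
# ROCM_PATH is read from the environment if set; otherwise the CMakeLists
# falls back to /opt/rocm and then /usr, as shown in the diff above.
export ROCM_PATH=/opt/rocm
cmake -S ktransformers/ktransformers_ext -B build -DKTRANSFORMERS_USE_ROCM=ON
cmake --build build -j
```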
156 changes: 80 additions & 76 deletions ktransformers/ktransformers_ext/cpu_backend/cpuinfer.h
@@ -7,79 +7,83 @@
* @LastEditTime : 2024-08-07 09:47:43
* @Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
**/
#ifndef CPUINFER_CPUINFER_H
#define CPUINFER_CPUINFER_H

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#ifdef KTRANSFORMERS_USE_CUDA
#include "vendors/cuda.h"
#elif KTRANSFORMERS_USE_MUSA
#include "vendors/musa.h"
#endif

#include "backend.h"
#include "task_queue.h"

#include "llama.cpp/ggml-impl.h"

class CPUInfer {
public:
CPUInfer(int thread_num) {
backend_ = new Backend(thread_num - 1);
task_queue_ = new TaskQueue();
for (int i = 0; i < (1 << 16); ++i) {
ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(i);
}
}

~CPUInfer() {
delete backend_;
delete task_queue_;
}

template <typename Func, typename Obj, typename... Args>
void enqueue(Func f, Obj* obj, Args... args) {
task_queue_->enqueue([=]() {
std::invoke(f, *obj, args..., backend_);
});
}

void submit(std::pair<intptr_t, intptr_t> params) {
void (*func)(void*) = (void (*)(void*))params.first;
void* args = (void*)params.second;
*((CPUInfer**)args) = this;
func(args);
}

void sync() {
task_queue_->sync();
}

void submit_with_cuda_stream(intptr_t user_cuda_stream, std::pair<intptr_t, intptr_t> params) {
void (*func)(void*) = (void (*)(void*))params.first;
void* args = (void*)params.second;
*((CPUInfer**)args) = this;
cudaLaunchHostFunc((cudaStream_t)user_cuda_stream, (cudaHostFn_t)func, args);
}

static void sync_(void* cpu_infer_ptr) {
CPUInfer* cpuinfer = (CPUInfer*)cpu_infer_ptr;
cpuinfer->sync();
}

void sync_with_cuda_stream(intptr_t user_cuda_stream) {
cudaLaunchHostFunc((cudaStream_t)user_cuda_stream, (cudaHostFn_t)&sync_, (void*)this);
}

public:
Backend* backend_;
TaskQueue* task_queue_;
};

#endif
#ifndef CPUINFER_CPUINFER_H
#define CPUINFER_CPUINFER_H

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#ifdef KTRANSFORMERS_USE_CUDA
#include "vendors/cuda.h"
#elif KTRANSFORMERS_USE_MUSA
#include "vendors/musa.h"
#elif KTRANSFORMERS_USE_ROCM
#define __HIP_PLATFORM_AMD__
#include "vendors/hip.h"
#endif

#include "backend.h"
#include "task_queue.h"
#include "../vendors/vendor.h"

#include "llama.cpp/ggml-impl.h"

class CPUInfer {
public:
CPUInfer(int thread_num) {
backend_ = new Backend(thread_num - 1);
task_queue_ = new TaskQueue();
for (int i = 0; i < (1 << 16); ++i) {
ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(i);
}
}

~CPUInfer() {
delete backend_;
delete task_queue_;
}

template <typename Func, typename Obj, typename... Args>
void enqueue(Func f, Obj* obj, Args... args) {
task_queue_->enqueue([=]() {
std::invoke(f, *obj, args..., backend_);
});
}

void submit(std::pair<intptr_t, intptr_t> params) {
void (*func)(void*) = (void (*)(void*))params.first;
void* args = (void*)params.second;
*((CPUInfer**)args) = this;
func(args);
}

void sync() {
task_queue_->sync();
}

void submit_with_cuda_stream(intptr_t user_cuda_stream, std::pair<intptr_t, intptr_t> params) {
void (*func)(void*) = (void (*)(void*))params.first;
void* args = (void*)params.second;
*((CPUInfer**)args) = this;
cudaLaunchHostFunc((cudaStream_t)user_cuda_stream, (cudaHostFn_t)func, args);
}

static void sync_(void* cpu_infer_ptr) {
CPUInfer* cpuinfer = (CPUInfer*)cpu_infer_ptr;
cpuinfer->sync();
}

void sync_with_cuda_stream(intptr_t user_cuda_stream) {
cudaLaunchHostFunc((cudaStream_t)user_cuda_stream, (cudaHostFn_t)&sync_, (void*)this);
}

public:
Backend* backend_;
TaskQueue* task_queue_;
};

#endif
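The new `KTRANSFORMERS_USE_ROCM` branch force-defines `__HIP_PLATFORM_AMD__` before pulling in the HIP vendor header, which presumably maps the `cuda*` calls used above onto their HIP counterparts on AMD hardware. If the platform define ever seems to be picked up wrongly, `hipconfig` reports what the installed HIP toolchain targets (a quick check, assuming the ROCm toolchain is on PATH):

```bash
hipconfig --platform   # expected to print "amd" on a ROCm system
hipconfig --version    # HIP version bundled with the ROCm install
```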
14 changes: 13 additions & 1 deletion ktransformers/ktransformers_ext/cpu_backend/vendors/cuda.h
@@ -1,3 +1,15 @@
#pragma once

#include <cuda_runtime.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <cublas_v2.h>
#include <cuda_bf16.h>
#include <cuda_fp16.h>

#if CUDART_VERSION < 11020
#define CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED CU_DEVICE_ATTRIBUTE_VIRTUAL_ADDRESS_MANAGEMENT_SUPPORTED
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#define CUBLAS_COMPUTE_16F CUDA_R_16F
#define CUBLAS_COMPUTE_32F CUDA_R_32F
#define cublasComputeType_t cudaDataType_t
#endif // CUDART_VERSION < 11020