
[WIP] Port to ROCm/HIP #178


Merged on Mar 14, 2025 (10 commits)
1 change: 1 addition & 0 deletions README.md
@@ -23,6 +23,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

<h2 id="Updates">🔥 Updates</h2>

* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
1 change: 1 addition & 0 deletions doc/README.md
@@ -22,6 +22,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

<h2 id="Updates">🔥 Updates</h2>

* **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.
* **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
* **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. The detailed tutorial is [here](./en/DeepseekR1_V3_tutorial.md).
1 change: 1 addition & 0 deletions doc/SUMMARY.md
@@ -10,6 +10,7 @@
- [Injection Tutorial](en/injection_tutorial.md)
- [Multi-GPU Tutorial](en/multi-gpu-tutorial.md)
- [Use FP8 GPU Kernel](en/fp8_kernel.md)
- [Use AMD GPU](en/ROCm.md)
# Server
- [Server](en/api/server/server.md)
- [Website](en/api/server/website.md)
96 changes: 96 additions & 0 deletions doc/en/ROCm.md
@@ -0,0 +1,96 @@
# ROCm Support for ktransformers (Beta)

## Introduction

### Overview
In our effort to expand GPU architecture support beyond NVIDIA, we are excited to introduce **AMD GPU support through ROCm** in ktransformers (Beta release). This implementation has been developed and tested on AMD EPYC 9274F processors and AMD Radeon 7900 XTX GPUs.

## Installation Guide

### 1. Install ROCm Driver
Begin by installing the ROCm drivers for your AMD GPU:
- [Official ROCm Installation Guide for Radeon GPUs](https://rocm.docs.amd.com/projects/radeon/en/latest/docs/install/native_linux/install-radeon.html)
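After installation (and a reboot, if prompted), you can optionally confirm that the driver and runtime see the GPU. This is a minimal sanity check using the standard ROCm utilities, assuming they were installed along with the driver:

```bash
# Optional sanity check: both utilities ship with the ROCm stack.
rocm-smi             # should list the Radeon GPU, temperature, and VRAM usage
rocminfo | grep gfx  # should print the GPU's gfx target (e.g. gfx1100 for the 7900 XTX)
```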

### 2. Set Up Conda Environment
We recommend using Miniconda3/Anaconda3 for environment management:

```bash
# Download and install Miniconda (follow the installer prompts)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

# Create environment
conda create --name ktransformers python=3.11
conda activate ktransformers

# Install required libraries
conda install -c conda-forge libstdcxx-ng

# Verify GLIBCXX version (should include 3.4.32)
strings ~/anaconda3/envs/ktransformers/lib/libstdc++.so.6 | grep GLIBCXX
```

> **Note:** Adjust the Anaconda path if your installation directory differs from `~/anaconda3`

### 3. Install PyTorch for ROCm
Install PyTorch with ROCm 6.2.4 support:

```bash
pip3 install torch torchvision torchaudio \
--index-url https://download.pytorch.org/whl/rocm6.2.4
pip3 install packaging ninja cpufeature numpy
```

> **Tip:** For other ROCm versions, visit [PyTorch Previous Versions](https://pytorch.org/get-started/previous-versions/)
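Before building ktransformers, it is worth confirming that the ROCm build of PyTorch actually sees the GPU. ROCm wheels reuse the `torch.cuda` API surface and report the HIP version through `torch.version.hip`; a quick check:

```bash
# Prints the torch version, the HIP/ROCm version it was built against,
# whether the AMD GPU is visible, and its name.
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```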

### 4. Build ktransformers

```bash
# Clone repository
git clone https://github.com/kvcache-ai/ktransformers.git
cd ktransformers
git submodule update --init

# Optional: Compile web interface
# See: api/server/website.md

# Install dependencies
bash install.sh
```
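Once `install.sh` finishes, a minimal import check confirms the Python package installed correctly (this does not by itself exercise the C++ extension or the GPU path):

```bash
python -c "import ktransformers; print('ktransformers imported OK')"
```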

## Running DeepSeek-R1 Models

### Configuration for 24GB VRAM GPUs
Use our optimized configuration for constrained VRAM:

```bash
python ktransformers/local_chat.py \
--model_path deepseek-ai/DeepSeek-R1 \
--gguf_path <path_to_gguf_files> \
--optimize_config_path ktransformers/optimize/optimize_rules/rocm/DeepSeek-V3-Chat.yaml \
--cpu_infer <cpu_cores + 1>
```

> **Beta Note:** The current Q8 linear implementation (used in place of Marlin) is not yet fully optimized; expect performance improvements in future releases.

### Configuration for 40GB+ VRAM GPUs
For better performance on high-VRAM GPUs:

1. Modify `DeepSeek-V3-Chat.yaml` (see the sketch below for one way to apply the change):
```yaml
# Replace all instances of:
KLinearMarlin → KLinearTorch
```

2. Execute with:
```bash
python ktransformers/local_chat.py \
--model_path deepseek-ai/DeepSeek-R1 \
--gguf_path <path_to_gguf_files> \
--optimize_config_path <modified_yaml_path> \
--cpu_infer <cpu_cores + 1>
```
> **Tip:** If you have two 24GB AMD GPUs, apply the same modification to `ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat-multi-gpu.yaml` and pass that file as the optimize config instead.
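The substitution from step 1 can be done in an editor or scripted. A minimal sketch using `sed`, assuming the stock rule file ships at `ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml` (the copy's destination path is illustrative):

```bash
# Copy the stock rule file so the original stays untouched, then swap the op.
cp ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml ./DeepSeek-V3-Chat-highvram.yaml
sed -i 's/KLinearMarlin/KLinearTorch/g' ./DeepSeek-V3-Chat-highvram.yaml
# Pass ./DeepSeek-V3-Chat-highvram.yaml as --optimize_config_path in step 2.
```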

## Known Limitations
- Marlin operations are not supported on the ROCm platform
- The current Q8 linear implementation shows reduced performance (Beta limitation)
39 changes: 39 additions & 0 deletions ktransformers/ktransformers_ext/CMakeLists.txt
@@ -32,6 +32,7 @@ endif()
option(LLAMA_AVX512_FANCY_SIMD "llama: enable AVX512-VL, AVX512-BW, AVX512-DQ, AVX512-VNNI" OFF)
option(KTRANSFORMERS_USE_CUDA "ktransformers: use CUDA" OFF)
option(KTRANSFORMERS_USE_MUSA "ktransformers: use MUSA" OFF)
option(KTRANSFORMERS_USE_ROCM "ktransformers: use ROCM" OFF)

# Architecture specific
# TODO: probably these flags need to be tweaked on some architectures
@@ -201,6 +202,31 @@ endif()
# message(STATUS "Can't found CUDA lib")
# endif()

if (NOT EXISTS $ENV{ROCM_PATH})
if (NOT EXISTS /opt/rocm)
set(ROCM_PATH /usr)
else()
set(ROCM_PATH /opt/rocm)
endif()
else()
set(ROCM_PATH $ENV{ROCM_PATH})
endif()

list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH})
list(APPEND CMAKE_PREFIX_PATH "${ROCM_PATH}/lib64/cmake")

if (NOT EXISTS $ENV{MUSA_PATH})
if (NOT EXISTS /opt/musa)
set(MUSA_PATH /usr/local/musa)
else()
set(MUSA_PATH /opt/musa)
endif()
else()
set(MUSA_PATH $ENV{MUSA_PATH})
endif()

list(APPEND CMAKE_MODULE_PATH "${MUSA_PATH}/cmake")

add_compile_options("$<$<COMPILE_LANGUAGE:CXX>:${ARCH_FLAGS}>")
add_compile_options("$<$<COMPILE_LANGUAGE:C>:${ARCH_FLAGS}>")

@@ -218,6 +244,14 @@ elseif (UNIX)
add_compile_definitions(KTRANSFORMERS_USE_CUDA=1)
endif()

if (KTRANSFORMERS_USE_ROCM)
find_package(HIP REQUIRED)
if(HIP_FOUND)
include_directories("${HIP_INCLUDE_DIRS}")
add_compile_definitions(KTRANSFORMERS_USE_ROCM=1)
endif()
endif()

if (KTRANSFORMERS_USE_MUSA)
if (NOT EXISTS $ENV{MUSA_PATH})
if (NOT EXISTS /opt/musa)
@@ -258,6 +292,11 @@ elseif(UNIX)
endif()
target_link_libraries(${PROJECT_NAME} PRIVATE "$ENV{CUDA_HOME}/lib64/libcudart.so")
endif()
if (KTRANSFORMERS_USE_ROCM)
add_compile_definitions(USE_HIP=1)
target_link_libraries(${PROJECT_NAME} PRIVATE "${ROCM_PATH}/lib/libamdhip64.so")
message(STATUS "Building for HIP")
endif()
if(KTRANSFORMERS_USE_MUSA)
target_link_libraries(${PROJECT_NAME} PRIVATE MUSA::musart)
endif()
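For reference, the new option is consumed like any other CMake flag. `install.sh` normally drives the build, but a hand-driven configure of the C++ extension would look roughly like this (a sketch; the exact flags install.sh passes may differ):

```bash
# ROCM_PATH is read from the environment if set; otherwise the CMakeLists
# falls back to /opt/rocm and then /usr, as shown in the diff above.
export ROCM_PATH=/opt/rocm
cmake -S ktransformers/ktransformers_ext -B build -DKTRANSFORMERS_USE_ROCM=ON
cmake --build build -j
```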
156 changes: 80 additions & 76 deletions ktransformers/ktransformers_ext/cpu_backend/cpuinfer.h
@@ -7,79 +7,83 @@
* @LastEditTime : 2024-08-07 09:47:43
* @Copyright (c) 2024 by KVCache.AI, All Rights Reserved.
**/
#ifndef CPUINFER_CPUINFER_H
#define CPUINFER_CPUINFER_H

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#ifdef KTRANSFORMERS_USE_CUDA
#include "vendors/cuda.h"
#elif KTRANSFORMERS_USE_MUSA
#include "vendors/musa.h"
#endif

#include "backend.h"
#include "task_queue.h"

#include "llama.cpp/ggml-impl.h"

class CPUInfer {
public:
CPUInfer(int thread_num) {
backend_ = new Backend(thread_num - 1);
task_queue_ = new TaskQueue();
for (int i = 0; i < (1 << 16); ++i) {
ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(i);
}
}

~CPUInfer() {
delete backend_;
delete task_queue_;
}

template <typename Func, typename Obj, typename... Args>
void enqueue(Func f, Obj* obj, Args... args) {
task_queue_->enqueue([=]() {
std::invoke(f, *obj, args..., backend_);
});
}

void submit(std::pair<intptr_t, intptr_t> params) {
void (*func)(void*) = (void (*)(void*))params.first;
void* args = (void*)params.second;
*((CPUInfer**)args) = this;
func(args);
}

void sync() {
task_queue_->sync();
}

void submit_with_cuda_stream(intptr_t user_cuda_stream, std::pair<intptr_t, intptr_t> params) {
void (*func)(void*) = (void (*)(void*))params.first;
void* args = (void*)params.second;
*((CPUInfer**)args) = this;
cudaLaunchHostFunc((cudaStream_t)user_cuda_stream, (cudaHostFn_t)func, args);
}

static void sync_(void* cpu_infer_ptr) {
CPUInfer* cpuinfer = (CPUInfer*)cpu_infer_ptr;
cpuinfer->sync();
}

void sync_with_cuda_stream(intptr_t user_cuda_stream) {
cudaLaunchHostFunc((cudaStream_t)user_cuda_stream, (cudaHostFn_t)&sync_, (void*)this);
}

public:
Backend* backend_;
TaskQueue* task_queue_;
};

#endif
#ifndef CPUINFER_CPUINFER_H
#define CPUINFER_CPUINFER_H

#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>
#ifdef KTRANSFORMERS_USE_CUDA
#include "vendors/cuda.h"
#elif KTRANSFORMERS_USE_MUSA
#include "vendors/musa.h"
#elif KTRANSFORMERS_USE_ROCM
#define __HIP_PLATFORM_AMD__
#include "vendors/hip.h"
#endif

#include "backend.h"
#include "task_queue.h"
#include "../vendors/vendor.h"

#include "llama.cpp/ggml-impl.h"

class CPUInfer {
public:
CPUInfer(int thread_num) {
backend_ = new Backend(thread_num - 1);
task_queue_ = new TaskQueue();
for (int i = 0; i < (1 << 16); ++i) {
ggml_table_f32_f16[i] = GGML_COMPUTE_FP16_TO_FP32(i);
}
}

~CPUInfer() {
delete backend_;
delete task_queue_;
}

template <typename Func, typename Obj, typename... Args>
void enqueue(Func f, Obj* obj, Args... args) {
task_queue_->enqueue([=]() {
std::invoke(f, *obj, args..., backend_);
});
}

void submit(std::pair<intptr_t, intptr_t> params) {
void (*func)(void*) = (void (*)(void*))params.first;
void* args = (void*)params.second;
*((CPUInfer**)args) = this;
func(args);
}

void sync() {
task_queue_->sync();
}

void submit_with_cuda_stream(intptr_t user_cuda_stream, std::pair<intptr_t, intptr_t> params) {
void (*func)(void*) = (void (*)(void*))params.first;
void* args = (void*)params.second;
*((CPUInfer**)args) = this;
cudaLaunchHostFunc((cudaStream_t)user_cuda_stream, (cudaHostFn_t)func, args);
}

static void sync_(void* cpu_infer_ptr) {
CPUInfer* cpuinfer = (CPUInfer*)cpu_infer_ptr;
cpuinfer->sync();
}

void sync_with_cuda_stream(intptr_t user_cuda_stream) {
cudaLaunchHostFunc((cudaStream_t)user_cuda_stream, (cudaHostFn_t)&sync_, (void*)this);
}

public:
Backend* backend_;
TaskQueue* task_queue_;
};

#endif
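The new `KTRANSFORMERS_USE_ROCM` branch force-defines `__HIP_PLATFORM_AMD__` before pulling in the HIP vendor header, which presumably maps the `cuda*` calls used above onto their HIP counterparts on AMD hardware. If the platform define ever seems to be picked up wrongly, `hipconfig` reports what the installed HIP toolchain targets (a quick check, assuming the ROCm toolchain is on PATH):

```bash
hipconfig --platform   # expected to print "amd" on a ROCm system
hipconfig --version    # HIP version bundled with the ROCm install
```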
14 changes: 13 additions & 1 deletion ktransformers/ktransformers_ext/cpu_backend/vendors/cuda.h
@@ -1,3 +1,15 @@
#pragma once

#include <cuda_runtime.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <cublas_v2.h>
#include <cuda_bf16.h>
#include <cuda_fp16.h>

#if CUDART_VERSION < 11020
#define CU_DEVICE_ATTRIBUTE_VIRTUAL_MEMORY_MANAGEMENT_SUPPORTED CU_DEVICE_ATTRIBUTE_VIRTUAL_ADDRESS_MANAGEMENT_SUPPORTED
#define CUBLAS_TF32_TENSOR_OP_MATH CUBLAS_TENSOR_OP_MATH
#define CUBLAS_COMPUTE_16F CUDA_R_16F
#define CUBLAS_COMPUTE_32F CUDA_R_32F
#define cublasComputeType_t cudaDataType_t
#endif // CUDART_VERSION < 11020