
Commit a4877f9

Add Christina Koutsou to contributors and add support for CUDA presentation and blog post (#238)
1 parent 8914a6f commit a4877f9

6 files changed: +136 -0 lines changed

_data/contributors.yml

Lines changed: 25 additions & 0 deletions
@@ -374,6 +374,31 @@
      mentors: Vassil Vassilev, Matheus Izvekov
      proposal: /assets/docs/Sahil_Patidar_Proposal_2024_ClangRepl.pdf

- name: "Christina Koutsou"
  photo: christina_koutsou.jpg
  info: "Google Summer of Code 2024 Contributor"

  education: "Integrated Master's in Electrical and Computer Engineering, Aristotle University of Thessaloniki, Greece"
  github: "https://github.com/kchristin22"
  active: 1
  linkedin: "https://www.linkedin.com/in/christina-koutsou-69416b28a/"
  projects:
    - title: "Reverse-mode automatic differentiation of GPU (CUDA) kernels using Clad"
      status: Ongoing
      description: |
        Nowadays, the rise of AI has shed light on the power of GPUs. General-purpose GPU programming is
        becoming more and more popular, and the scientific community seems to be increasingly favoring it over CPU programming.
        Consequently, the implementations of the mathematics and operations needed for such projects are being adapted to the GPU architecture.
        Automatic differentiation is a notable concept in this context, finding applications across diverse domains, from ML to finance to physics.
        Clad is a Clang plugin for automatic differentiation that performs a source-to-source transformation and produces a function capable of
        computing the derivatives of a given function at compile time. This project aims to widen Clad's use range and audience by enabling
        the reverse-mode automatic differentiation of CUDA kernels. The overall goal of the project is to support, using Clad, the differentiation of CUDA
        kernels that may also include typical CUDA built-in objects (e.g. threadIdx, blockDim), which are employed to prevent race conditions.
        The produced kernels will compute the derivative of an output argument specified by the user with respect to an input parameter
        of their choosing. In addition, the user must be able to call these kernels with a custom grid configuration.
      proposal: /assets/docs/Christina_Koutsou_GSoC_2024.pdf
      mentors: Vassil Vassilev, Parth Arora, Alexander Penev

- name: "This could be you!"
  photo: rock.jpg
  info: See <a href="/careers">openings</a> for more info

_data/standing_meetings.yml

Lines changed: 4 additions & 0 deletions
@@ -21,6 +21,10 @@
  link: "[Slides](/assets/presentations/CaaS_Weekly_12_06_2024_Maksym_Andriichuk_Activity.pdf)"
- title: Updates
  link: "[Meeting Notes](https://docs.google.com/document/d/1HMUtM8rRPeTYDqsWZVD-iQityjJj9_GKgu0FHHbiBr0)"
- title: "Enabling reverse-mode automatic differentiation of GPU (CUDA) kernels using Clad"
  date: 2024-05-15 17:00:00 +0200
  speaker: "Christina Koutsou"
  link: "[Slides](/assets/presentations/CaaS_Weekly_15_05_2024_Christina_Enbale_reverse_mode_autodiff_for_kernels_first_presentation.pdf)"
- title: "Integrate a large language model with the Xeus-cpp jupyter kernel"
  date: 2024-06-05 17:00:00 +0200
  speaker: "Tharun Anandh"
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
---
title: "Enabling reverse-mode automatic differentiation of GPU kernels"
layout: post
excerpt: "Nowadays, the rise of general-purpose GPU programming has led more and more of the tools used by the scientific community to be adapted to the GPU architecture. This project aims to allow `Clad` to ride that tide and broaden its use range by enabling reverse-mode automatic differentiation of CUDA kernels."
sitemap: false
author: Christina Koutsou
permalink: blogs/gsoc24_christina_koutsou_project_introductory_blog/
date: 2024-05-18

---
### Overview of the project

Nowadays, the rise of general-purpose GPU programming has led more and more of the tools used by the scientific community to be adapted to the GPU architecture. This project aims to allow `Clad` to ride that tide and broaden its use range by enabling reverse-mode automatic differentiation of CUDA kernels.

Mentors: Vassil Vassilev, Parth Arora, Alexander Penev
[Proposal](/assets/docs/Christina_Koutsou_GSoC_2024.pdf)
[Slides](/assets/presentations/CaaS_Weekly_15_05_2024_Christina_Enbale_reverse_mode_autodiff_for_kernels_first_presentation.pdf)

---
### Current status and end goal

Currently, `Clad` supports differentiation of both host functions (executed by the CPU) and device functions (executed by the GPU). However, since a device function resides in the GPU's memory and cannot be launched directly from the CPU, a bridge between the two is needed. Hence, a global function is used as an intermediary. A global function, also known as a kernel, is executed by the GPU but launched by the CPU. The following example illustrates that interaction:

```cpp
__device__ double fn(double *a) {
  return *a * *a;
}

__global__ void compute(double *d_a, double *d_result) {
  auto fn_grad = clad::gradient(fn, "a");
  fn_grad.execute(d_a, d_result);
}

int main(void) {
  // ... memory allocations and initializations
  compute<<<1, 1>>>(d_a, d_result);
  cudaDeviceSynchronize();
  // copy the result back to the CPU
  cudaMemcpy(result.data(), d_result, N * sizeof(double),
             cudaMemcpyDeviceToHost);
  return 0;
}
```

It is evident that if the device function were a global function instead, the program would be much simpler and more efficient. Kernels are also more widely used in general.

```cpp
__global__ void fn(double *a) {
  *a *= *a;
}

int main(void) {
  // ... memory allocations and initializations

  auto fn_grad = clad::gradient(fn, "a");
  fn_grad.execute(d_a, d_result);
  cudaDeviceSynchronize();
  // copy the result back to the CPU
  cudaMemcpy(result.data(), d_result, N * sizeof(double),
             cudaMemcpyDeviceToHost);
  return 0;
}
```

Thus, the end goal of this project is to compute the gradient of a basic kernel function using Clad.

### Approach

In order to define what a basic kernel looks like and solidify the project's objectives, it is helpful to research some problems that use GPU kernels in their solution. Furthermore, since reverse-mode automatic differentiation is particularly beneficial when the number of inputs exceeds the number of outputs of a function, identifying such problems will be especially advantageous. Gathering such examples provides a concrete framework for the project's goals.

However, it is possible to already identify some problems that will arise during the process. These issues can be broken down into two categories:

1) General Support: the ability to produce and execute a kernel's derivative function without it necessarily being correct. Essentially, this refers to the handling of the kernel calls.
2) Support of the kernel's nature: differentiating the kernel with respect to its characteristics. This refers to correctly handling the function's body and computing its derivative.

###### General Support: Kernel's derivative compilation

What differentiates CUDA kernels from device or host functions is that they cannot be called without a grid configuration. Hence, any call to them must be accompanied by such an expression. When the function to be differentiated is identified as a kernel through its `__global__` attribute, a default configuration expression is created by calling `clang::Sema::ActOnCUDAExecConfigExpr()` and passed to the call expression of the derived function. The configuration is single-threaded to ensure that a single thread computes the gradient.
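
As a rough illustration, the intended behaviour is equivalent to launching the derived kernel with a `<<<1, 1>>>` configuration; the `square_grad` name and `_d_a` buffer below are placeholders, not the actual code Clad generates:

```cpp
__global__ void square(double *a) { *a *= *a; }

// Conceptual shape of the default, single-threaded launch created for the
// derived kernel, so that exactly one thread computes the gradient:
//
//   square_grad<<<1, 1>>>(a, _d_a);
```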

###### General Support: Kernel's derivative execution

As with the kernel's compilation, when the derived kernel is executed it also needs to be provided with a grid configuration. For that purpose, the API can be modified to include overloads of `execute()` and its nested calls, `execute_helper()` and `execute_with_default_args()`. The latter will eventually invoke the derived kernel like so:

```cpp
return f<<<grid_size, block_size, shared_mem_size, stream>>>(
    static_cast<Args>(args)..., static_cast<Rest>(nullptr)...);
```
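
On the user-facing side, a minimal sketch of how the proposed `execute()` overload might be called is shown below; the exact signature (here, leading `dim3` grid and block arguments) is an assumption of this sketch, not a settled API:

```cpp
__global__ void fn(double *a) { *a *= *a; }

// Hypothetical usage: the user-provided grid configuration is forwarded down
// through execute_helper() to execute_with_default_args(), which performs the
// <<<grid, block, shared_mem, stream>>> launch shown above.
void launch_gradient(double *d_a, double *d_result) {
  auto fn_grad = clad::gradient(fn, "a");
  fn_grad.execute(dim3(1), dim3(1), d_a, d_result);
  cudaDeviceSynchronize();
}
```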

###### General Support: Derivation of a kernel call

To further enhance the support for kernel differentiation, we should ensure that differentiating a function that includes a kernel invocation successfully calls the derived kernel. Thus, when the `CUDAKernelCallExpr` is visited at the top level of differentiation, we should store its configuration and reuse it for the pullback kernel call. Specifically, a Visit function for kernel calls will be written that mimics the one used for typical `CallExpr` nodes. However, it will additionally store the specified configuration, retrieved via `getConfig()` on the kernel node, and pass it to the creation of the pullback kernel call expression through `clang::Sema::ActOnCallExpr()`.
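
To illustrate the intended behaviour, consider a host function that launches a kernel; the `add_one_pullback` name below is hypothetical and only meant to show the stored configuration being reused:

```cpp
__global__ void add_one(double *a) { *a += 1.0; }

void host_fn(double *a) {
  add_one<<<4, 128>>>(a);  // original CUDAKernelCallExpr with its own grid config
}

// When host_fn is differentiated, visiting the kernel call is expected to
// store the <<<4, 128>>> configuration (retrieved via getConfig()) and pass it
// to the pullback kernel call, conceptually:
//
//   add_one_pullback<<<4, 128>>>(a, _d_a);
```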

###### Support of kernel's characteristics: Specify output argument to derive

Kernels are void functions, which means that differentiation based on the return statement is not possible. As a result, we need to know the output of the function. This can be accomplished by expanding the API with an overload of the `gradient()` function through which the user can specify the output argument of a void function. This argument will be passed to the differentiation request of that function and stored in the list of variables whose expressions are differentiated.
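
One possible shape of that overload, seen from the user's side, is sketched below; the extra argument naming the output is an assumption made for illustration, not the final interface:

```cpp
__global__ void compute(double *in, double *out, double val) {
  int index = threadIdx.x;
  out[index] = in[index] + val;
}

// Hypothetical call: "out" is explicitly marked as the output of the void
// kernel, and derivatives are requested with respect to "in" and "val".
auto compute_grad = clad::gradient(compute, "in, val", /*output=*/"out");
```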

###### Support of kernel's characteristics: Account for write race conditions in the computation of the derivative value

To account for the multithreaded environment of a kernel, the plus-assign operation on the output's derivative has to be reconsidered: it would produce incorrect results, as every thread would add its derivative value. Hence, this operation should instead be replaced with a plain assignment. Furthermore, if array indexing is involved, the derived variable should also be treated as an array, to avoid write race conditions in case the derivative array of the output is not initialized uniformly.

```cpp
__global__ void compute(double *in, double *out, double val) {
  int index = threadIdx.x;
  out[index] = in[index] + val;
}

// Inside the derived function:
// ...
// double _r_d0 = _d_out[index0];
// *_d_val += _r_d0;   -->   _d_val[index0] = _r_d0;
```

###### Support of kernel's characteristics: Handling of CUDA built-in objects

Another common characteristic of CUDA kernels is the use of built-in objects, e.g. `threadIdx`, as shown above, to ensure that each thread computes its own share of the final result. These nodes should only be cloned when visited during the global initializations and be treated as integer constants when differentiated; in other words, their derivatives should be 0. In addition, support for the shared memory specifier (`__shared__`) must also be included in the project's scope, by not discarding it when visiting a variable or array declaration.
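
The kernel below, written only as an illustration, gathers the constructs in question: the built-in indices should be cloned verbatim into the derivative and contribute a derivative of 0, while the `__shared__` declaration should be preserved rather than discarded:

```cpp
__global__ void scale(double *in, double *out) {
  __shared__ double buf[256];                     // kept in the derivative
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // cloned; derivative is 0
  buf[threadIdx.x] = in[i];
  __syncthreads();
  out[i] = 2.0 * buf[threadIdx.x];
}
```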

---
### About me

Hi, I'm Christina, an Electrical and Computer Engineering major at the Aristotle University of Thessaloniki, Greece. My participation in Google Summer of Code was fueled by a deep appreciation for open-source projects, stemming from my two-year experience with an open-source student team that participates in ESA Education's Fly Your Satellite program. My passion for physics did not lead me down that career path, but I'm trying to channel my enthusiasm into developing tools for the scientific community. Clad offers a great opportunity to do just that and to contribute to research advancements.
648 KB
Binary file not shown.

images/team/christina_koutsou.jpg

63.6 KB
