Performance roadmap #2632

@charleskawczynski

Description

The evolution of performance in our atmosphere model has presented complex challenges for the team.

Initially, ClimateMachine.jl prioritized GPU performance, achieving good speed, but faced several issues:

  • Adding new kernels was labor-intensive.
  • Using these kernels required a fair bit of boilerplate (defining the variable names of the state and fluxes that each kernel had to compute).
  • Built-in assumptions (such as spectral elements in all directions) were difficult to undo.
  • Unit testing was difficult and required lots of boilerplate.

Enter ClimaCore. ClimaCore used and extended Julia broadcasting, which allowed us to write, easily extend, and compose vertical finite difference operators and horizontal spectral element operators. These individual operators could also be unit tested.
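
As a rough analogy for why small, composable operators are easy to test, here is a Python/NumPy sketch (this is not ClimaCore's API; the function names and the first-order stencils are made up for illustration):

```python
import numpy as np

# Hypothetical stand-ins for ClimaCore-style vertical operators:
# each one is a small, independently testable building block.
def face_gradient(center_vals, dz):
    """First-order gradient from cell centers to interior faces."""
    return (center_vals[1:] - center_vals[:-1]) / dz

def center_divergence(face_vals, dz):
    """Divergence from faces back to cell centers (interior only)."""
    return (face_vals[1:] - face_vals[:-1]) / dz

# Because each operator is a plain function, composing them is trivial,
# and each can be unit tested in isolation.
dz = 1.0
phi = np.array([0.0, 1.0, 4.0, 9.0])   # phi = z^2 sampled at cell centers
grad = face_gradient(phi, dz)          # -> [1.0, 3.0, 5.0]
lap = center_divergence(grad, dz)      # -> [2.0, 2.0], i.e. d²(z²)/dz² = 2
```

The composition `center_divergence(face_gradient(...))` is a discrete Laplacian, and each piece can be verified against an analytic solution on its own.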

So, how did we end up with performance issues for complex models?

As we moved from simpler configurations, like the dry baroclinic wave, to increasingly complex models, we found that while the simple models performed reasonably well, we had overlooked a crucial aspect.

While ClimaCore's individual operators aided rapid prototyping, they did not compose performantly. Breaking a single broadcast expression into two smaller ones might improve performance locally, but applied globally this pattern leads to problems.

Many serially executed broadcast expressions for intermediate quantities cause our compute graph to expand and contract, significantly increasing the bandwidth demands of our step! function, a key performance metric. For example, our dry baroclinic wave, with only 5 state variables, requires many intermediate quantities. Storing the results of these computations in preallocated global memory, whether for tendency calculations or for diagnostics (which should be computed only when needed, not during the solve), significantly increases memory bandwidth usage.
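
To make the fused-versus-unfused distinction concrete, here is a small Python/NumPy sketch with a made-up tendency expression (in Julia, dot-fused broadcasting compiles the one-expression form into a single pass; NumPy still allocates temporaries internally, so this is an accounting analogy, not a benchmark):

```python
import numpy as np

# Illustration only: a made-up tendency expression, not ClimaAtmos code.
n = 100_000
rng = np.random.default_rng(0)
a, b, c = rng.random(n), rng.random(n), rng.random(n)

# Unfused: the intermediate is materialized in memory, then read back.
tmp = a * b
tendency_unfused = tmp + c

# Fused: one expression, no stored intermediate.
tendency_fused = a * b + c

# Float64 values moved per grid point (input reads + output writes):
unfused_traffic = (2 + 1) + (2 + 1)  # a,b read; tmp written; tmp,c read; out written
fused_traffic = 3 + 1                # a,b,c read; out written
```

Even in this two-operand toy, materializing the intermediate moves 1.5x more data for identical results; with a long chain of intermediates, the multiplier grows accordingly.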

As of May 21, 2025, our compute graph looks like this:

(ρ,ρe_tot,uₕ,ᶠu₃)_state (4 vars, 5 components)
       |
       v
(ρ,ᶜΦ,ᶜu,ᶜp,T,h_tot,ρe_tot,e_tot,uₕ,ᶠu₃,ᶠu,ᶠu³,ᶜK,ᶜts.ρ,ᶜts.e_int)_cache (15 vars, 19 components)
       |
       v
(ρ,ρe_tot,uₕ,ᶠu₃)_tendencies (4 vars, 5 components)

The top and bottom rows list the prognostic state and its tendencies; the middle row lists the cached intermediate quantities (uₕ and ᶜu have 2 components, and ᶠu has 3 components).

These intermediate variables incur a huge bandwidth penalty (almost 4x, here), and the situation is significantly worse in the case of adding moisture, and worse yet for even more complex models.
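
One plausible accounting of that penalty, assuming each field component crosses global memory once per arrow in the graph above (an assumption for illustration, not a measurement):

```python
# Component counts from the compute graph above.
state = 5   # (ρ, ρe_tot, uₕ, ᶠu₃)
cache = 19  # cached intermediate quantities
tend = 5    # tendencies

# Ideal pipeline: read the state once, write the tendencies once.
ideal = state + tend                    # 10 components moved

# Actual pipeline: the cache is also written, then re-read.
actual = state + cache + cache + tend   # 48 components moved

penalty = actual / ideal                # 4.8x under these assumptions
```

The result, 4.8x under these assumptions, is in the ballpark of the roughly 4x figure quoted above; the exact multiplier depends on how often each cached field is re-read.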

The NVIDIA A100 GPU can perform 9.7 teraflops in Float64, while its memory bandwidth is roughly 2 TB/s. We are, without question, bandwidth limited.

| Feature | A100 80GB PCIe | A100 80GB SXM |
|---|---|---|
| FP64 | 9.7 TFLOPS | 9.7 TFLOPS |
| FP64 Tensor Core | 19.5 TFLOPS | 19.5 TFLOPS |
| FP32 | 19.5 TFLOPS | 19.5 TFLOPS |
| GPU Memory | 80 GB HBM2e | 80 GB HBM2e |
| GPU Memory Bandwidth | 1,935 GB/s | 2,039 GB/s |
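
The machine-balance arithmetic behind the "bandwidth limited" claim, using the PCIe figures from the table:

```python
flops = 9.7e12        # A100 FP64 throughput, FLOP/s
bandwidth = 1.935e12  # A100 80GB PCIe memory bandwidth, bytes/s

balance = flops / bandwidth   # ~5 FLOPs per byte streamed
per_float64 = balance * 8     # ~40 FLOPs per Float64 moved
```

In other words, a kernel must perform roughly 40 Float64 operations for every Float64 it moves through global memory to stay compute-bound; tendency computations come nowhere near that arithmetic intensity, so memory traffic is the cost that matters.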
