Performance roadmap #2632

@charleskawczynski

Description

The evolution of performance in our atmosphere model has presented complex challenges for the team.

Initially, ClimateMachine.jl prioritized GPU performance, achieving good speed, but faced several issues:

  • Adding new kernels was labor-intensive.
  • Using these kernels required a fair bit of boilerplate (defining the variable names of the state and fluxes that each kernel had to compute).
  • Built-in assumptions (such as spectral elements in all directions) were difficult to undo.
  • Unit testing was difficult and required lots of boilerplate.

Enter ClimaCore. ClimaCore used and extended Julia broadcasting, which allowed us to write, easily extend, and compose vertical finite difference operators and horizontal spectral element operators. These individual operators could also be unit tested.
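
As a rough analogy for why small, composable operators are easy to test, here is a Python/NumPy sketch (this is not ClimaCore's API; the function names and the first-order stencils are made up for illustration):

```python
import numpy as np

# Hypothetical stand-ins for ClimaCore-style vertical operators:
# each one is a small, independently testable building block.
def face_gradient(center_vals, dz):
    """First-order gradient from cell centers to interior faces."""
    return (center_vals[1:] - center_vals[:-1]) / dz

def center_divergence(face_vals, dz):
    """Divergence from faces back to cell centers (interior only)."""
    return (face_vals[1:] - face_vals[:-1]) / dz

# Because each operator is a plain function, composing them is trivial,
# and each can be unit tested in isolation.
dz = 1.0
phi = np.array([0.0, 1.0, 4.0, 9.0])   # phi = z^2 sampled at cell centers
grad = face_gradient(phi, dz)          # -> [1.0, 3.0, 5.0]
lap = center_divergence(grad, dz)      # -> [2.0, 2.0], i.e. d²(z²)/dz² = 2
```

The composition `center_divergence(face_gradient(...))` is a discrete Laplacian, and each piece can be verified against an analytic solution on its own.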

So, how did we end up with performance issues for complex models?

As we moved from simpler configurations, like the dry baroclinic wave, to increasingly complex models, we found that while the simple models performed reasonably well, we had overlooked a crucial aspect.

While ClimaCore's individual operators aided rapid prototyping, they did not compose performantly. Breaking a single broadcast expression into two smaller ones might improve performance locally, but applied globally this pattern leads to problems.

Many serially executed broadcast expressions for intermediate quantities cause our compute graph to expand and contract, significantly increasing the bandwidth demands of our step! function, a key performance metric. For example, our dry baroclinic wave, with only 5 state variables, requires many intermediate quantities. Storing the results of these computations in preallocated global memory, whether for tendency calculations or for diagnostics (which should be computed only when needed, not during the solve), significantly increases memory bandwidth usage.
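
To make the fused-versus-unfused distinction concrete, here is a small Python/NumPy sketch with a made-up tendency expression (in Julia, dot-fused broadcasting compiles the one-expression form into a single pass; NumPy still allocates temporaries internally, so this is an accounting analogy, not a benchmark):

```python
import numpy as np

# Illustration only: a made-up tendency expression, not ClimaAtmos code.
n = 100_000
rng = np.random.default_rng(0)
a, b, c = rng.random(n), rng.random(n), rng.random(n)

# Unfused: the intermediate is materialized in memory, then read back.
tmp = a * b
tendency_unfused = tmp + c

# Fused: one expression, no stored intermediate.
tendency_fused = a * b + c

# Float64 values moved per grid point (input reads + output writes):
unfused_traffic = (2 + 1) + (2 + 1)  # a,b read; tmp written; tmp,c read; out written
fused_traffic = 3 + 1                # a,b,c read; out written
```

Even in this two-operand toy, materializing the intermediate moves 1.5x more data for identical results; with a long chain of intermediates, the multiplier grows accordingly.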

As of May 21, 2025, our compute graph looks like this:

(ρ,ρe_tot,uₕ,ᶠu₃)_state (4 vars, 5 components)
       |
       v
(ρ,ᶜΦ,ᶜu,ᶜp,T,h_tot,ρe_tot,e_tot,uₕ,ᶠu₃,ᶠu,ᶠu³,ᶜK,ᶜts.ρ,ᶜts.e_int)_cache (15 vars, 19 components)
       |
       v
(ρ,ρe_tot,uₕ,ᶠu₃)_tendencies (4 vars, 5 components)

The top and bottom rows list the prognostic state and its tendencies; the middle row lists the cached intermediate quantities (uₕ and ᶜu have 2 components, and ᶠu has 3 components).

These intermediate variables incur a huge bandwidth penalty (almost 4x, here), and the situation is significantly worse in the case of adding moisture, and worse yet for even more complex models.
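
One plausible accounting of that penalty, assuming each field component crosses global memory once per arrow in the graph above (an assumption for illustration, not a measurement):

```python
# Component counts from the compute graph above.
state = 5   # (ρ, ρe_tot, uₕ, ᶠu₃)
cache = 19  # cached intermediate quantities
tend = 5    # tendencies

# Ideal pipeline: read the state once, write the tendencies once.
ideal = state + tend                    # 10 components moved

# Actual pipeline: the cache is also written, then re-read.
actual = state + cache + cache + tend   # 48 components moved

penalty = actual / ideal                # 4.8x under these assumptions
```

The result, 4.8x under these assumptions, is in the ballpark of the roughly 4x figure quoted above; the exact multiplier depends on how often each cached field is re-read.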

The NVIDIA A100 GPU can perform 9.7 teraflops in Float64, while its memory bandwidth is roughly 2 TB/s. We are, without question, bandwidth limited.

| Feature | A100 80GB PCIe | A100 80GB SXM |
|---|---|---|
| FP64 | 9.7 TFLOPS | 9.7 TFLOPS |
| FP64 Tensor Core | 19.5 TFLOPS | 19.5 TFLOPS |
| FP32 | 19.5 TFLOPS | 19.5 TFLOPS |
| GPU Memory | 80 GB HBM2e | 80 GB HBM2e |
| GPU Memory Bandwidth | 1,935 GB/s | 2,039 GB/s |
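
The machine-balance arithmetic behind the "bandwidth limited" claim, using the PCIe figures from the table:

```python
flops = 9.7e12        # A100 FP64 throughput, FLOP/s
bandwidth = 1.935e12  # A100 80GB PCIe memory bandwidth, bytes/s

balance = flops / bandwidth   # ~5 FLOPs per byte streamed
per_float64 = balance * 8     # ~40 FLOPs per Float64 moved
```

In other words, a kernel must perform roughly 40 Float64 operations for every Float64 it moves through global memory to stay compute-bound; tendency computations come nowhere near that arithmetic intensity, so memory traffic is the cost that matters.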
