Performance regression in cuDF merge benchmark #935

@pentschev

Description

Running the cuDF benchmark with RAPIDS 22.06 results in the following:

RAPIDS 22.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-16 08:21:54,375 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-16 08:21:54,382 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock     | Throughput
--------------------------------------------------------------------------------
20.70 s        | 294.80 MiB/s
17.62 s        | 346.49 MiB/s
39.32 s        | 155.22 MiB/s
================================================================================
Throughput     | 265.50 MiB +/- 80.79 MiB
Wall-Clock     | 25.88 s +/- 9.59 s
================================================================================
(w1,w2)        | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)        | 110.55 MiB/s 153.32 MiB/s 187.99 MiB/s (12.85 GiB)
(02,01)        | 147.30 MiB/s 173.17 MiB/s 187.13 MiB/s (12.85 GiB)

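For context, the workload measured here is roughly an inner merge of two randomly generated dask-cuDF DataFrames spread across the two GPUs. The sketch below approximates that workload outside the benchmark script, which can be handy for reproducing the timings by hand; the data generation is simplified (built on the client rather than on each worker), the key overlap controlled by frac-match is omitted, and the RMM pool size is an arbitrary placeholder, so treat it as an approximation rather than the code in local_cudf_merge.py.

# Approximate reproduction of the merge workload on GPUs 1,2 over TCP.
import cupy
import cudf
import dask_cudf
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES="1,2",  # matches -d 1,2
        rmm_pool_size="24GB",        # rmm-pool is enabled in the benchmark; size here is arbitrary
        protocol="tcp",
    )
    client = Client(cluster)

    rows_per_chunk = 100_000_000  # matches -c 100_000_000

    def make_frame(nrows):
        # Random int64 key/payload columns; the real benchmark additionally
        # controls key overlap with frac-match, omitted here for brevity.
        return cudf.DataFrame(
            {
                "key": cupy.random.permutation(cupy.arange(nrows, dtype="int64")),
                "payload": cupy.random.permutation(cupy.arange(nrows, dtype="int64")),
            }
        )

    base = dask_cudf.from_cudf(make_frame(2 * rows_per_chunk), npartitions=2)
    other = dask_cudf.from_cudf(make_frame(2 * rows_per_chunk), npartitions=2)

    merged = base.merge(other, on=["key"], how="inner")
    wait(merged.persist())  # this is the step whose wall-clock time is compared above
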
If we roll back one year to RAPIDS 21.06, performance was substantially better:

RAPIDS 21.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
Merge benchmark
-------------------------------
backend        | dask
merge type     | gpu
rows-per-chunk | 100000000
base-chunks    | 2
other-chunks   | 2
broadcast      | default
protocol       | tcp
device(s)      | 1,2
rmm-pool       | True
frac-match     | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock     | Throughput
-------------------------------
15.40 s        | 396.40 MiB/s
7.35 s         | 830.55 MiB/s
8.80 s         | 693.83 MiB/s
===============================
(w1,w2)     | 25% 50% 75% (total nbytes)
-------------------------------
(01,02)     | 325.82 MiB/s 332.85 MiB/s 351.81 MiB/s (12.85 GiB)
(02,01)     | 296.46 MiB/s 321.66 MiB/s 333.66 MiB/s (12.85 GiB)

It isn't clear where this regression comes from, but the most likely candidates are Distributed, cuDF, or Dask-CUDA itself.
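
One way to start narrowing this down is to capture the exact versions of those components in the fast (21.06) and slow (22.06) environments and then rerun the benchmark while changing one of them at a time. A minimal sketch for the version capture, assuming the packages are installed under the distribution names dask, distributed, cudf, and dask-cuda:

# Print versions of the suspected components so the two environments can be diffed.
import importlib.metadata as importlib_metadata

for name in ("dask", "distributed", "cudf", "dask-cuda"):
    try:
        print(f"{name:<12} {importlib_metadata.version(name)}")
    except importlib_metadata.PackageNotFoundError:
        print(f"{name:<12} not found under this distribution name")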
