Running the cuDF benchmark with RAPIDS 22.06 results in the following:
RAPIDS 22.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
2022-06-16 08:21:54,375 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2022-06-16 08:21:54,382 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
================================================================================
Wall-clock | Throughput
--------------------------------------------------------------------------------
20.70 s | 294.80 MiB/s
17.62 s | 346.49 MiB/s
39.32 s | 155.22 MiB/s
================================================================================
Throughput | 265.50 MiB +/- 80.79 MiB
Wall-Clock | 25.88 s +/- 9.59 s
================================================================================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 110.55 MiB/s 153.32 MiB/s 187.99 MiB/s (12.85 GiB)
(02,01) | 147.30 MiB/s 173.17 MiB/s 187.13 MiB/s (12.85 GiB)
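For reference, the workload is essentially a distributed merge on random integer keys across the two GPUs. Below is a minimal sketch of an equivalent setup, not the benchmark's actual code: column names, chunk construction on the client, and the RMM pool size are illustrative assumptions (the real benchmark builds its chunks on the workers).

```python
# Minimal sketch of the kind of workload local_cudf_merge.py measures.
# Column names, sizes, and cluster options are illustrative, not the
# benchmark's exact code.
import cupy as cp
import cudf
import dask_cudf
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, wait

if __name__ == "__main__":
    cluster = LocalCUDACluster(
        CUDA_VISIBLE_DEVICES="1,2",  # matches -d 1,2
        protocol="tcp",              # matches protocol | tcp
        rmm_pool_size="24GB",        # assumption: pool size is not shown in the report
    )
    client = Client(cluster)

    n = 100_000_000  # matches -c 100_000_000 rows per chunk

    def make_frame(seed):
        # Random integer keys plus a payload column, built on the client GPU
        # for simplicity (the benchmark itself creates chunks on the workers).
        cp.random.seed(seed)
        return cudf.DataFrame(
            {
                "key": cp.random.randint(0, n, size=n),
                "payload": cp.arange(n),
            }
        )

    left = dask_cudf.from_cudf(make_frame(1), npartitions=2).persist()
    right = dask_cudf.from_cudf(make_frame(2), npartitions=2).persist()
    wait([left, right])

    # Shuffle-based merge between the two workers; this is the step
    # whose wall-clock time and throughput are reported above.
    merged = left.merge(right, on="key").persist()
    wait(merged)
```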
If we roll back one year to RAPIDS 21.06, performance was substantially better:
RAPIDS 21.06 cuDF benchmark
$ python dask_cuda/benchmarks/local_cudf_merge.py -d 1,2 -c 100_000_000
Merge benchmark
-------------------------------
backend | dask
merge type | gpu
rows-per-chunk | 100000000
base-chunks | 2
other-chunks | 2
broadcast | default
protocol | tcp
device(s) | 1,2
rmm-pool | True
frac-match | 0.3
data-processed | 5.96 GiB
===============================
Wall-clock | Throughput
-------------------------------
15.40 s | 396.40 MiB/s
7.35 s | 830.55 MiB/s
8.80 s | 693.83 MiB/s
===============================
(w1,w2) | 25% 50% 75% (total nbytes)
-------------------------------
(01,02) | 325.82 MiB/s 332.85 MiB/s 351.81 MiB/s (12.85 GiB)
(02,01) | 296.46 MiB/s 321.66 MiB/s 333.66 MiB/s (12.85 GiB)
It isn't clear where this regression comes from; potential candidates include Distributed, cuDF, or Dask-CUDA itself.
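One way to narrow it down would be to time a single-GPU cudf merge of comparable size under both releases; if that is unchanged, the regression more likely lives in Distributed or Dask-CUDA's communication path rather than in cuDF. A rough sketch (not something tried here), with illustrative sizes and column names:

```python
# Rough single-GPU timing to separate the cuDF merge cost from the
# distributed layer. Sizes and column names are illustrative.
import time
import cupy as cp
import cudf

n = 100_000_000  # same order as -c 100_000_000

left = cudf.DataFrame({"key": cp.random.randint(0, n, size=n),
                       "payload": cp.arange(n)})
right = cudf.DataFrame({"key": cp.random.randint(0, n, size=n),
                        "payload": cp.arange(n)})

# Warm-up on a slice to exclude one-time initialization costs.
left.head(1_000_000).merge(right.head(1_000_000), on="key")

start = time.perf_counter()
merged = left.merge(right, on="key")
cp.cuda.runtime.deviceSynchronize()  # ensure the GPU work has finished
print(f"local cudf merge: {time.perf_counter() - start:.2f} s, "
      f"{len(merged):,} result rows")
```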