scheduler.get_comm_cost a significant portion of runtime in merge benchmarks

I've been profiling distributed workflows in an effort to understand where there are potential performance improvements to be made (this is ongoing with @gjoseph92 amongst others). I'm particularly interested in scale-out scenarios, where the number of workers is large. As well as that scenario, I've also been looking at cases where the number of workers is quite small, but dataframes have many partitions: this produces many tasks at a scale where debugging/profiling is a bit more manageable.

The benchmark setup I have builds two dataframes and then merges them on a key column with a specified matching fraction. Each worker gets P partitions with N rows per partition. I use 8 workers. I'm using cudf dataframes (so the merge itself is fast, which means that I notice sequential overheads sooner).
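For readers who want to picture the key columns, here is a minimal sketch of one way to generate two sets of join keys with a given matching fraction. It is not the benchmark code itself: the function name `make_keys`, its parameters, and the use of plain Python lists instead of cudf or pandas columns are all assumptions made for illustration.

```python
import random


def make_keys(n_rows, match_fraction, seed=0):
    """Return (left, right) lists of n_rows unique integer keys each.

    Hypothetical helper: roughly `match_fraction` of the keys in `right`
    also appear in `left`; the rest are drawn from a disjoint range.
    """
    rng = random.Random(seed)
    left = list(range(n_rows))
    n_match = round(n_rows * match_fraction)
    # Keys shared by both sides of the merge.
    matching = rng.sample(left, n_match)
    # Keys that only exist on the right-hand side (outside left's range).
    non_matching = list(range(n_rows, 2 * n_rows - n_match))
    right = matching + non_matching
    # Shuffle so that matching keys are spread across partitions.
    rng.shuffle(left)
    rng.shuffle(right)
    return left, right


left, right = make_keys(n_rows=1000, match_fraction=0.25)
shared = set(left) & set(right)
print(len(left), len(right), len(shared))
```

In a real run each list would become the key column of one partition-sized dataframe, with P such partitions per worker.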

Attached are two speedscope plots (and data) from py-spy-based profiling of the scheduler in a scenario with eight workers, P=100, and N=500,000. In a shuffle, the total number of tasks peaks at about 150,000 according to the dashboard. The second profile is very noisy since I'm using https://github.com/benfred/py-spy/pull/497 to avoid filtering out Python builtins (so that we can see in more detail what is happening). Interestingly, at this scale we don't see much of a GC pause (but I am happy to try out more scenarios that might be relevant to #4987).

In this scenario, a single merge takes around 90s. If I do the minimal thing of making `Scheduler.get_comm_cost` `return 0` immediately, this drops to around 50s (using pandas, it drops from 170s to around 130s). From the detailed profile, we can see that the majority of this time is spent in `set.difference`. I'm sure there's a more reasonable fix that isn't quite such a large hammer.
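As a rough illustration of one less drastic direction, the sketch below compares two equivalent ways of summing the bytes of dependencies that a worker does not yet hold. It is not the actual scheduler code: the `Task` class, both function names, and the assumption that the cost is "missing bytes divided by bandwidth" are hypothetical stand-ins. The second variant avoids building an intermediate set and does work proportional to the number of dependencies, whatever container holds the worker's keys.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical stand-in for the scheduler's per-task state.
    key: str
    nbytes: int


def comm_cost_difference(dependencies, has_what, bandwidth):
    """Build the set of missing dependencies, then sum their sizes."""
    missing = dependencies.difference(has_what)
    return sum(t.nbytes for t in missing) / bandwidth


def comm_cost_membership(dependencies, has_what, bandwidth):
    """Sum sizes with one membership test per dependency, no temporary set."""
    nbytes = sum(t.nbytes for t in dependencies if t not in has_what)
    return nbytes / bandwidth


# A worker holding many keys, and a task with only a few dependencies.
on_worker = {Task(f"x-{i}", 8): None for i in range(1000)}
deps = {Task("x-1", 8), Task("x-5000", 8), Task("y", 100)}
bandwidth = 100e6  # bytes per second, an arbitrary illustrative value

a = comm_cost_difference(deps, on_worker.keys(), bandwidth)
b = comm_cost_membership(deps, on_worker.keys(), bandwidth)
print(a == b)
```

Whether this helps in practice depends on how the real containers are typed and sized, so it would need to be checked against the profile.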

<img width="1459" alt="py-spy-scheduler-100-chunks-per-worker" src="https://user-images.githubusercontent.com/1126981/185178460-ec4546c2-d48b-4adc-9bb7-33f8e35927d6.png">

<img width="1438" alt="py-spy-scheduler-100-chunks-per-worker-detailed" src="https://user-images.githubusercontent.com/1126981/185178541-128263ef-2123-4305-8382-b8a18f418a8c.png">

[merge-scheduler-100-chunks-per-worker-no-filter.json.gz](https://github.com/dask/distributed/files/9364144/merge-scheduler-100-chunks-per-worker-no-filter.json.gz)
[merge-scheduler-100-chunks-per-worker.json.gz](https://github.com/dask/distributed/files/9364145/merge-scheduler-100-chunks-per-worker.json.gz)

(cc @pentschev, @quasiben, and @rjzamora)


scheduler.get_comm_cost a significant portion of runtime in merge benchmarks #6899
