Skip to content

Conversation

@wence-
Copy link
Contributor

@wence- wence- commented Jul 21, 2022

For GPU data, compression is worse rather than better because it
provokes device-to-host transfers when they are unnecessary.

This is a short-term fix for #935, in lieu of hooking up GPU-based
compression algorithms.

For GPU data, compression is worse rather than better because it
provokes device-to-host transfers when they are unnecessary.

This is a short-term fix for rapidsai#935, in lieu of hooking up GPU-based
compression algorithms.
@wence- wence- added 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Jul 21, 2022
@wence- wence- requested a review from a team as a code owner July 21, 2022 18:29
@github-actions github-actions bot added the python python code needed label Jul 21, 2022
@pentschev pentschev added the bug Something isn't working label Jul 21, 2022
@pentschev
Copy link
Member

@charlesbluca @beckernick @VibhuJawa @ayushdg @randerzander just so you're aware of this, in case this shows up somehow in any workflows where you may test TCP performance. I believe this should make things faster for Dask-CUDA workflows and not have any negative impact.

@jakirkham
Copy link
Member

Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here

@pentschev
Copy link
Member

Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here

I can't really explain that, is there any chance we're hitting another condition because TCP will force a D2H/H2D copy? Not sure if you've seen, but this is what this is the comment I left when I found that out: #935 (comment)

@jakirkham
Copy link
Member

jakirkham commented Jul 21, 2022

TCP currently requires host to device copying regardless of whether there is compression or not. So disabling compression wouldn't fix that. If we have some way to send device objects over TCP, let's discuss how we can enable this in Distributed.

@pentschev
Copy link
Member

TCP currently requires host to device copying regardless of whether there is compression or not. So disabling compression wouldn't fix that.

Yes, I know, and this is not just currently, but will always be the case as it needs to go over Ethernet because there's no GPUDirectRDMA in that case. But anyway this is not what I was trying to say, apologies if that was unclear. I'm only imagining this is happening because it hits some other path instead of https://github.com/dask/distributed/blob/3551d1574c9cd72d60197cc84dd75702ebcfec54/distributed/protocol/cuda.py#L28 that you mentioned earlier. The thing is lz4 gets installed by default in Dask now, and that's what caused the behavior change, so I can only imagine that CUDA-specific config is being ignored/overriden.

@wence-
Copy link
Contributor Author

wence- commented Jul 21, 2022

Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here

My commit message might be misleading. I got lost trying to follow where the compression was coming from, so it might not be host/device copies, but rather just that on a fast-ish network compressing big chunks of data is slower than just sending them over the wire.

@jakirkham
Copy link
Member

Gotcha ok that makes more sense. That said, users can still configure this themselves in those cases. Instead of having a different default (based on whether Distributed or Distributed + Dask-CUDA is used), which may be more confusing, why don't we document this and encourage users to explore different settings based on their needs?

@wence-
Copy link
Contributor Author

wence- commented Jul 22, 2022

Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here

I'll have a look, here are some profiled callgraphs of the two options (compression auto vs compression None)

Compression=auto

Compression=None

@wence-
Copy link
Contributor Author

wence- commented Jul 22, 2022

Curious how was it determined compression was happening with GPU data? Asking as this is explicitly disabled in Distributed and has been this way for a while. Would be good to get a better understanding about what is happening here

I'll have a look, here are some profiled callgraphs of the two options (compression auto vs compression None)

tl;dr: using the TCP protocol never calls cuda_dumps/loads so it's not "GPU data" any more, and so those overrides don't kick in.

OK, so that explicit disabling only kicks in when to_frames is called with a serializer argument that includes "cuda". This happens when the UCX comm backend is used which explicitly sets serializers in write and read. In that case, cuda_dumps is called which produces frames that have device buffers in them (which UCX can handle).

In the tcp case, serializers is always None on call, and so tcp.write calls to_frames which calls into dask_dumps (for which there are handlers registered to deal with cudf dataframes and call host_serialize). But now the header of the message doesn't contain compression overrides, with the consequence that the host dataframe buffers are now compressed.

@jakirkham
Copy link
Member

Right though disabling compression won't avoid the DtH/HtD transfers in the TCP case.

Compression is allowed in that case since everything is on host. It just follows Distributed's default.

Certainly users can disable this behavior on their own. We can also add this before our own benchmark scripts as well (if that is important to us).


Would caution against setting this as a default in Dask-CUDA because

  1. It has different behavior than Distributed
  2. Users will not realize adding Dask-CUDA has changed this default somehow
  3. This may not be the right choice for other workflows
  4. It can lead to lengthy debugging by developers building on Dask-CUDA

Here's a recent example of this kind of debugging due to a custom environment variable conda-forge added to scikit-build ( conda-forge/ctng-compiler-activation-feedstock#77 ) ( scikit-build/scikit-build#722 ).

@pentschev
Copy link
Member

Particularly I find frustrating those defaults that are hard to really know beforehand, the compression default itself is a great example, something changed in Dask (pulling lz4 by default) that had to be debugged so I could understand. So here we would be setting yet-another implicit default that may be difficult to debug too (now in two layers), so I agree with your point John.

I'm ok with just setting that as a default for benchmark scripts, for example, if Lawrence is ok with that too.

@pentschev
Copy link
Member

rerun tests

Comment on lines +34 to +37

# Until GPU-based compression is hooked up, turn off compression
# in communication protocols (see https://github.com/rapidsai/dask-cuda/issues/935)
dask.config.config["distributed"]["comm"]["compression"] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this to the benchmark script then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok with that, but we should document this in a good, visible manner. Lately there has been desire to make defaults more performance-friendly for newcomers, and this is a pretty significant drawback at least for this one workflow. This new behavior could cause users to try out Dask and immediately rule it out, as well as GPUs entirely, due to very bad performance that comes from this.

Perhaps people like @beckernick @VibhuJawa @randerzander @ayushdg could voice opinions on this matter too, especially if they are running other workflows without UCX lately that would potentially show if this change is indeed significant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to change the default, would recommend raising this in Distributed for the reasons already discussed above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am okay with changing the default in Dask-CUDA if it is well documented but we should make sure that we don't overwrite a non-default value!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to change the default, would recommend raising this in Distributed for the reasons already discussed above

I do not oppose to raising this in Distributed if someone is interested in driving the conversation. However, the problem with discussing this in a broader aspect is that the current default may make sense from a CPU standpoint, in which case we should still consider having a non-default value in Dask-CUDA if it makes sense for GPU workflows, which it currently seems to be the case.

I am okay with changing the default in Dask-CUDA if it is well documented but we should make sure that we don't overwrite a non-default value!

I agree, we must document it well and ensure we don't overwrite a user-defined value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me do some profiling. I tried to match the performance loss up with a performance model but couldn't make sense of it. Suppose that a point to point message takes $T_p(b) := \alpha + \beta b$ seconds to send $b$ bytes, and that compression takes $T_c(b) = \gamma + \nu b$ seconds to (de)compress $b$ bytes, with a compression factor of $c$. Then sending a raw message costs $T_p(b)$ and sending a compressed message costs $2T_c(b) + T_p(b/c)$. So it's worthwhile to compress whenever $T_p(b) > 2T_c(b) + T_p(b/c)$. So we have $\alpha + \beta b > 2T_c(b) + \alpha + \beta b/c \Leftrightarrow \beta b/c > 2(\gamma + \nu b)$.

Let's rearrange again to get $b(\beta/c - 2\nu) > 2\gamma \Leftrightarrow b > 2\gamma / (\beta/c - 2\nu)$ (If $\gamma = 0$ then compression makes sense if $\beta/c - 2\nu > 0$). In this latter case, that means that it's worthwhile to compress if the "compression bandwidth $C := 1/\nu$" and network bandwidth $B := 1/\beta$ are related by $C > 2 c B$. e.g. for a compression ratio of 2, if we can compress more than four times as fast as we can send over the network, it's worthwhile compressing.

I don't have numbers for $\alpha$, $\beta$, $\gamma$, and $\nu$ but if they were measured you could put an adaptive compression model in and use that.

@codecov-commenter
Copy link

codecov-commenter commented Jul 25, 2022

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (branch-22.08@435dae8). Learn more about missing BASE report.

Files with missing lines Patch % Lines
dask_cuda/__init__.py 0.00% 1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@              Coverage Diff               @@
##             branch-22.08    #957   +/-   ##
==============================================
  Coverage                ?   0.00%           
==============================================
  Files                   ?      16           
  Lines                   ?    2107           
  Branches                ?       0           
==============================================
  Hits                    ?       0           
  Misses                  ?    2107           
  Partials                ?       0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link

github-actions bot commented Sep 4, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@github-actions
Copy link

github-actions bot commented Oct 4, 2022

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team bug Something isn't working inactive-30d non-breaking Non-breaking change python python code needed

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

5 participants