Closed
Description
Describe the issue:
Hi! I seem to be facing this (more_1, more_2) issue in a newer version of Dask, in a fairly specific use case.
I have a large dataset (~30 million rows, 36 features), and I'm trying to use DaskLGBM for predictions. The error output is below. This happens with all versions of Dask from 2023.1 onward (I'm unable to check older ones due to dependency conflicts). I'm not sure whether the problem lies in Dask or in the LightGBM interface.
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/protocol/core.py", line 160, in loads
return msgpack.loads(
File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
ValueError: 2738752243 exceeds max_bin_len(2147483647)
2023-10-10 20:30:38,669 - distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/core.py", line 921, in _handle_comm
result = await result
File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/scheduler.py", line 5518, in add_client
await self.handle_stream(comm=comm, extra={"client": client})
File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/core.py", line 974, in handle_stream
msgs = await comm.read()
File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 253, in read
msg = await from_frames(
File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/comm/utils.py", line 100, in from_frames
res = _from_frames()
File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/comm/utils.py", line 83, in _from_frames
return protocol.loads(
File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/protocol/core.py", line 160, in loads
return msgpack.loads(
File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
ValueError: 2738752243 exceeds max_bin_len(2147483647)
Minimal Complete Verifiable Example:
The bug is easily reproducible on my system.
import pandas as pd
import numpy
import lightgbm as lgb
from dask.dataframe import from_pandas
from dask.distributed import Client, LocalCluster
cluster = LocalCluster()
client = Client(cluster)
data = pd.DataFrame(numpy.zeros((10_000_000, 40), dtype=int))
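# Half zeros, half ones, so that the label has the same 10_000_000 rows as `data`.
label = pd.DataFrame(numpy.concatenate([numpy.zeros(5_000_000, dtype=int), numpy.ones(5_000_000, dtype=int)]))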
x_train = from_pandas(data, npartitions=4)
y_train = from_pandas(label, npartitions=4)
gbm_params = {"objective": "binary"}
clf = lgb.DaskLGBMRegressor(**gbm_params, tree_learner="voting")
clf.fit(x_train, y_train)
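For a sense of scale, the in-memory size of the example frame can be compared with the limit quoted in the error. The sketch below is only back-of-the-envelope arithmetic. It assumes 8-byte integers (what `dtype=int` typically gives on 64-bit Linux), and the helper `frame_nbytes` is a hypothetical name introduced here for illustration. It does not show that this frame is what ends up in the oversized message, only that the frame alone is larger than the limit.

```python
# Limit and offending length, both copied from the ValueError in the traceback.
MSGPACK_MAX_BIN_LEN = 2**31 - 1      # 2147483647
REPORTED_BIN_LEN = 2_738_752_243

# Assumption: each integer cell takes 8 bytes (int64).
INT64_ITEMSIZE = 8


def frame_nbytes(rows: int, cols: int, itemsize: int = INT64_ITEMSIZE) -> int:
    """Raw size in bytes of a dense rows x cols block (hypothetical helper)."""
    return rows * cols * itemsize


# Shape of `data` in the example above: 10_000_000 rows x 40 columns.
example_bytes = frame_nbytes(10_000_000, 40)

print(f"example frame: {example_bytes:,} bytes")
print(f"msgpack limit: {MSGPACK_MAX_BIN_LEN:,} bytes")
print("frame exceeds limit:", example_bytes > MSGPACK_MAX_BIN_LEN)
```

Under these assumptions, a single binary payload carrying a large share of the frame would be enough to cross the limit. Whether that is what actually happens here is not established by the traceback alone.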
Environment:
I'm using JupyterLab 4.0.4.
- Dask version: 2023.7.1
- Python version: 3.10.12
- Operating System: Ubuntu 22.04.2 LTS (WSL2)
- Install method (conda, pip, source): pip