exceeds max_bin_len from Distributed >= 2023.1 with LightGBM #8257

@tmvfb

Describe the issue:

Hi! I seem to be hitting this issue (more_1, more_2) in newer versions of Dask, in a fairly specific use case.

I have a large dataset (~30 million rows, 36 features), and I'm trying to use DaskLGBM for predictions. The error output is below. This happens with all versions of Dask from 2023.1 onward (I'm unable to check older ones due to dependency conflicts). I'm not sure whether the problem lies in Dask or in the LightGBM interface.

distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/protocol/core.py", line 160, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
ValueError: 2738752243 exceeds max_bin_len(2147483647)
2023-10-10 20:30:38,669 - distributed.core - ERROR - Exception while handling op register-client
Traceback (most recent call last):
  File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/core.py", line 921, in _handle_comm
    result = await result
  File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/scheduler.py", line 5518, in add_client
    await self.handle_stream(comm=comm, extra={"client": client})
  File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/core.py", line 974, in handle_stream
    msgs = await comm.read()
  File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/comm/tcp.py", line 253, in read
    msg = await from_frames(
  File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/comm/utils.py", line 100, in from_frames
    res = _from_frames()
  File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/comm/utils.py", line 83, in _from_frames
    return protocol.loads(
  File "/home/tmvfb/.local/lib/python3.10/site-packages/distributed/protocol/core.py", line 160, in loads
    return msgpack.loads(
  File "msgpack/_unpacker.pyx", line 194, in msgpack._cmsgpack.unpackb
ValueError: 2738752243 exceeds max_bin_len(2147483647)
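For reference, the limit quoted in the `ValueError` is `2**31 - 1`, the largest signed 32-bit integer, so any single binary payload of 2 GiB or more is rejected. The snippet below is only a minimal arithmetic sketch of that comparison using the numbers from the traceback; the helper name `bin_len_overshoot` is hypothetical and is not part of msgpack or distributed.

```python
# Limit quoted in the traceback: the largest signed 32-bit integer.
MAX_BIN_LEN = 2**31 - 1  # 2147483647


def bin_len_overshoot(nbytes: int, limit: int = MAX_BIN_LEN) -> int:
    """Return how many bytes `nbytes` is over `limit` (0 if it fits)."""
    return max(0, nbytes - limit)


# Payload size reported in the ValueError above.
reported = 2738752243
over = bin_len_overshoot(reported)

print(f"payload: {reported / 2**30:.2f} GiB")
print(f"over the limit by: {over / 2**20:.0f} MiB")
```

In other words, the message being deserialized on the scheduler side is a single binary blob several hundred MiB larger than the limit allows.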

Minimal Complete Verifiable Example:

The bug is easily reproducible on my system.

import pandas as pd
import numpy
import lightgbm as lgb
from dask.dataframe import from_pandas
from dask.distributed import Client, LocalCluster

cluster = LocalCluster()
client = Client(cluster)

# 10M rows x 40 integer features
data = pd.DataFrame(numpy.zeros((10_000_000, 40), dtype=int))
# One label per row: half zeros, half ones
label = pd.DataFrame(numpy.concatenate([numpy.zeros(5_000_000, dtype=int), numpy.ones(5_000_000, dtype=int)]))
x_train = from_pandas(data, npartitions=4)
y_train = from_pandas(label, npartitions=4)

gbm_params = {"objective": "binary"}
clf = lgb.DaskLGBMRegressor(**gbm_params, tree_learner="voting")
clf.fit(x_train, y_train)
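A rough size estimate for the example above may help when triaging. It assumes `dtype=int` maps to 8-byte integers (typical on Linux, including WSL2) and ignores index and serialization overhead; `frame_nbytes` is a hypothetical helper introduced here for illustration. It does not show where the data ends up in a single message, only that the full frame is larger than the limit while a single partition is not.

```python
# Limit from the ValueError in the traceback.
MAX_BIN_LEN = 2**31 - 1


def frame_nbytes(rows: int, cols: int, itemsize: int = 8) -> int:
    """Approximate size of a dense numeric frame, ignoring index overhead."""
    return rows * cols * itemsize


# The `data` frame from the example: 10M rows x 40 columns, assumed int64.
total = frame_nbytes(10_000_000, 40)
# Even split across npartitions=4.
per_partition = total // 4

print("whole frame exceeds limit:", total > MAX_BIN_LEN)
print("one partition exceeds limit:", per_partition > MAX_BIN_LEN)
```

If this estimate is representative, the in-memory pandas frame as a whole is too large for one message, which would be consistent with the error appearing only for large inputs.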

Environment:

I'm using JupyterLab 4.0.4.

  • Dask version: 2023.7.1
  • Python version: 3.10.12
  • Operating System: Ubuntu 22.04.2 LTS (WSL2)
  • Install method (conda, pip, source): pip
