I'm creating a dummy 80MB single-partition Dask distributed DataFrame and attempting to convert it to a PyArrow Table.
Doing so causes the notebook to emit GC warnings and consistently takes over 20 seconds.
Versions:
PyArrow: 0.12.0
Dask: 1.1.1
Repro:
from dask.distributed import Client, wait, LocalCluster
import pyarrow as pa

# Start a local cluster and connect a client
ip = '0.0.0.0'
cluster = LocalCluster(ip=ip)
client = Client(cluster)

import dask.array as da
import dask.dataframe as dd

n_rows = 5000000
n_keys = 5000000

# Build an ~80MB single-partition DataFrame with a float column and an int column
ddf = dd.concat([
    da.random.random(n_rows).to_dask_dataframe(columns='x'),
    da.random.randint(0, n_keys, size=n_rows).to_dask_dataframe(columns='id'),
], axis=1).persist()

# Convert each partition to a PyArrow Table
def get_arrow(df):
    return pa.Table.from_pandas(df)

%time arrow_tables = ddf.map_partitions(get_arrow).compute()
Result:
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 24% CPU time recently (threshold: 10%)
distributed.utils_perf - WARNING - full garbage collections took 26% CPU time recently (threshold: 10%)
CPU times: user 20.6 s, sys: 1.17 s, total: 21.7 s
Wall time: 22.5 s
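For reference, a local (non-distributed) baseline of the same conversion can help separate the cost of pa.Table.from_pandas itself from any scheduler or serialization overhead. This is a minimal sketch, not part of the original report; it assumes an equivalent in-memory pandas DataFrame built with NumPy:

# Hypothetical local baseline: time pa.Table.from_pandas on an equivalent
# in-memory pandas DataFrame (no Dask, no distributed scheduler involved).
import time
import numpy as np
import pandas as pd
import pyarrow as pa

n_rows = 5000000
n_keys = 5000000

df = pd.DataFrame({
    'x': np.random.random(n_rows),
    'id': np.random.randint(0, n_keys, size=n_rows),
})

start = time.perf_counter()
table = pa.Table.from_pandas(df)
print('local from_pandas: %.3f s' % (time.perf_counter() - start))

If the local conversion finishes in well under a second, that would suggest the extra time in the repro above comes from moving the resulting Arrow tables back through the scheduler rather than from the conversion itself.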