DataFrame replace slow on DataFrames containing strings when using use_inf_as_null option #18176

Closed
spillz opened this issue Nov 8, 2017 · 3 comments
Labels
Benchmark (Performance (ASV) benchmarks); Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate); replace (replace method)


spillz commented Nov 8, 2017

Probably an edge case, but it bit me:

import pandas, numpy as np

# Default options: 'a' and 'b' are integer columns, 'c' holds strings.
df = pandas.DataFrame({'a': [0]*5000 + [1]*5000, 'b': [2]*5000 + [1]*5000, 'c': ['a']*5000 + ['b']*5000})
%timeit df.replace(1, 3, inplace=True)

# Same replacement, column by column, with boolean indexing.
df = pandas.DataFrame({'a': [0]*5000 + [1]*5000, 'b': [2]*5000 + [1]*5000, 'c': ['a']*5000 + ['b']*5000})
def rep(df):
    for c in df.columns:
        df.loc[df[c] == 1, c] = 3
    return df
%timeit rep(df)

# Repeat both timings with use_inf_as_null enabled.
pandas.set_option('use_inf_as_null', True)
df = pandas.DataFrame({'a': [0]*5000 + [1]*5000, 'b': [2]*5000 + [1]*5000, 'c': ['a']*5000 + ['b']*5000})
%timeit df.replace(1, 3, inplace=True)

df = pandas.DataFrame({'a': [0]*5000 + [1]*5000, 'b': [2]*5000 + [1]*5000, 'c': ['a']*5000 + ['b']*5000})
def rep(df):
    for c in df.columns:
        df.loc[df[c] == 1, c] = 3
    return df
%timeit rep(df)

One of these things is not like the other!

1000 loops, best of 3: 1.77 ms per loop
100 loops, best of 3: 5.89 ms per loop
1 loop, best of 3: 2.24 s per loop
100 loops, best of 3: 5.92 ms per loop

Pandas 0.20.1

Contributor

jreback commented Nov 8, 2017

In 0.21.0 this looks OK:

494 us +- 5.83 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)
8.1 ms +- 553 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
523 us +- 33.2 us per loop (mean +- std. dev. of 7 runs, 1000 loops each)
7.54 ms +- 212 us per loop (mean +- std. dev. of 7 runs, 100 loops each)

Can you give it a try? Also, we would take a PR adding asv benchmarks for this.

Contributor

jreback commented Nov 8, 2017

Note that the option was renamed in 0.21.0 (same behavior, just null -> na):

pandas.set_option('use_inf_as_na', True)
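For a script that has to run on either side of the rename, one possible approach is to probe for whichever name the installed pandas registers. This is only a sketch: `inf_option_name` is a hypothetical helper, and it assumes an unknown option raises an error derived from `KeyError` or `AttributeError`. `pd.option_context` is used so the setting does not leak past the block.

```python
import warnings

import pandas as pd


def inf_option_name():
    """Hypothetical helper: return the option name this pandas knows, else None."""
    for name in ("mode.use_inf_as_na", "mode.use_inf_as_null"):
        try:
            with warnings.catch_warnings():
                # Reading a deprecated option may emit a warning.
                warnings.simplefilter("ignore")
                pd.get_option(name)
        except (KeyError, AttributeError):
            continue
        return name
    return None


name = inf_option_name()
if name is not None:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        # option_context restores the previous value when the block exits.
        with pd.option_context(name, True):
            inf_is_missing = bool(pd.isnull(float("inf")))
else:
    # Neither name exists in this pandas version.
    inf_is_missing = None
```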

jreback added the Dtype Conversions, Missing-data, and Performance labels on Nov 8, 2017
jbrockmendel added the replace label on Sep 21, 2020
mroeschke added the Benchmark label and removed the Performance and Dtype Conversions labels on Jun 12, 2021
@mroeschke
Member

Given that this option was deprecated in 2.1 (#53494), I'm going to close this as a won't-fix.
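For anyone landing here after the deprecation, a minimal sketch of doing the conversion explicitly instead of relying on the option: replace infinities with NaN up front, then use the normal missing-data methods. The helper name `inf_to_nan` and the sample frame are illustrative only.

```python
import numpy as np
import pandas as pd


def inf_to_nan(df):
    """Illustrative helper: return a copy with +inf and -inf replaced by NaN."""
    return df.replace([np.inf, -np.inf], np.nan)


df = pd.DataFrame({"a": [0.0, np.inf, 1.0], "c": ["a", "b", "c"]})
cleaned = inf_to_nan(df)

# The infinite entry is now treated as missing without any global option.
missing_mask = cleaned["a"].isna().tolist()  # [False, True, False]
```

This keeps the behavior local to the frames that need it, rather than changing how every `isna` call in the process behaves.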
