jbrockmendel
Member

When frame.columns is non-unique, frame.iloc[n] goes through an unnecessary path that effectively creates frame.values and then looks up [n] on it. That's a lot of casting to access just one row.

Luckily, that case is obsolete, so this rips it right out.
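To illustrate why going through frame.values is costly, here is a minimal sketch (not the code touched by this PR). The column names and the small frame are made up for illustration. It shows that `.values` on a mixed-dtype frame materializes the entire frame in one common dtype, even though only a single row is wanted.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame with three different column dtypes.
df = pd.DataFrame(
    {
        "a": np.arange(5, dtype="i4"),
        "b": np.arange(5, dtype="i8"),
        "c": np.arange(5, dtype="f8"),
    }
)

# .values casts every column to a common dtype and builds one 2D array
# for the whole frame, which is wasteful if only one row is needed.
full = df.values
row_via_values = full[2]

# Positional row access returns the same row as a Series.
row_via_iloc = df.iloc[2].to_numpy()

print(full.dtype, full.shape)
print(row_via_values, row_via_iloc)
```

The cost of the `.values` route grows with the number of rows in the frame, while a direct row lookup only needs to touch one value per column.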

@jreback jreback added the Performance Memory or execution speed performance label Mar 26, 2020
@jreback
Contributor

jreback commented Mar 27, 2020

this only hits the non-unique case I think. do we have any benchmarks?

@jbrockmendel
Member Author

Just added a benchmark:

In [3]: arr = np.arange(10**7).reshape(-1, 10)
In [4]: df = pd.DataFrame(arr)
In [5]: dtypes = ['u1', 'u2', 'u4', 'u8', 'i1', 'i2', 'i4', 'i8', 'f8', 'f4']
In [6]: for i, d in enumerate(dtypes):
   ...:     df[i] = df[i].astype(d)

In [8]: %timeit df.iloc[10000]
126 µs ± 1.29 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- both

In [9]: df.columns = ["A", "A"] + list(df.columns[2:])

In [11]: %timeit df.iloc[10000]
17.5 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)    # <-- master
124 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR
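For anyone who wants to reproduce this outside IPython, the session above can be sketched as a standalone script. This is an approximation under a few assumptions: the helper name `make_frame`, the smaller frame size, and the use of `timeit.timeit` in place of `%timeit` are not from the PR, and the absolute timings will depend on the machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=10**4):
    """Build a frame with ten columns, each cast to a different dtype.

    Hypothetical helper; mirrors the setup in the IPython session above
    but with fewer rows so it runs quickly.
    """
    arr = np.arange(n_rows * 10).reshape(-1, 10)
    df = pd.DataFrame(arr)
    dtypes = ["u1", "u2", "u4", "u8", "i1", "i2", "i4", "i8", "f8", "f4"]
    for i, d in enumerate(dtypes):
        df[i] = df[i].astype(d)
    return df


# One frame with unique column labels, one with a duplicated label.
unique = make_frame()
dup = make_frame()
dup.columns = ["A", "A"] + list(dup.columns[2:])

# Time a single positional row lookup on each frame.
t_unique = timeit.timeit(lambda: unique.iloc[100], number=200)
t_dup = timeit.timeit(lambda: dup.iloc[100], number=200)

print(f"unique columns:     {t_unique / 200 * 1e6:.1f} µs per lookup")
print(f"non-unique columns: {t_dup / 200 * 1e6:.1f} µs per lookup")
```

On a pandas version that includes this change, the two timings should be close; on older versions the non-unique case is expected to be much slower, as in the numbers above.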

@jreback jreback added this to the 1.1 milestone Mar 29, 2020
@jreback jreback added the Indexing Related to indexing on series/frames, not to indexes themselves label Mar 29, 2020
@jreback jreback merged commit 99f2ccb into pandas-dev:master Mar 29, 2020
@jreback
Contributor

jreback commented Mar 29, 2020

thanks

@jbrockmendel jbrockmendel deleted the perf-interleave branch March 29, 2020 15:56