KevsterAmp
Contributor

@KevsterAmp KevsterAmp commented Aug 26, 2024

I added an alternative ndarray, with the same length as the output of _get_values_for_csv, that is passed to write_csv_rows. Testing so far has been crude.

Py_ssize_t i, j = 0, k = len(data_index), N = 100, ncols = len(cols)

Tests

Using the same code as the referenced issue:

import pandas as pd
import pyarrow as pa
import pyarrow.csv as csv
import time

NUM_ROWS = 10000000
NUM_COLS = 20

# Example Multi-Index DataFrame
df = pd.DataFrame(
    {
        f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS)
        for col_idx in range(NUM_COLS)
    }
)
df = df.set_index(["col_0", "col_1"], drop=False)

# Timing Operation A
start_time = time.time()
df.to_csv("file_A.csv", index=False)
end_time = time.time()
print(f"Operation A time: {end_time - start_time} seconds")

# Timing Operation B
start_time = time.time()
df_reset = df.reset_index(drop=True)
df_reset.to_csv("file_B.csv", index=False)
end_time = time.time()
print(f"Operation B time: {end_time - start_time} seconds")

Output before performance improvement

Operation A time: 869.2354643344879 seconds
Operation B time: 42.1906418800354 seconds

Output after performance improvement

Operation A time: 51.408071756362915 seconds
Operation B time: 45.78637385368347 seconds

Operation B resets the index before writing and serves as the baseline for comparison. The change improves the performance of Operation A, from about 869 seconds to about 51 seconds.
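For anyone who wants to try the comparison without writing two multi-gigabyte files, here is a scaled-down sketch of the same benchmark. The row and column counts and the helper timed_to_csv are my own assumptions for illustration, not part of the issue's script. It also times only the to_csv call (the script above includes reset_index in Operation B), and writes to a string instead of a file. Timings on a frame this small will not reproduce the gap reported above; the point is only to show the two code paths and to check that they produce the same CSV.

```python
import time

import pandas as pd

# Assumption: far fewer rows and columns than the 10,000,000 x 20 frame above,
# so the sketch finishes quickly. Timings vary by machine and pandas version.
NUM_ROWS = 10_000
NUM_COLS = 5

df = pd.DataFrame(
    {
        f"col_{col_idx}": range(col_idx * NUM_ROWS, (col_idx + 1) * NUM_ROWS)
        for col_idx in range(NUM_COLS)
    }
)
df = df.set_index(["col_0", "col_1"], drop=False)


def timed_to_csv(frame):
    """Hypothetical helper: return (csv_text, elapsed_seconds) for to_csv(index=False)."""
    start = time.perf_counter()
    # With no path argument, to_csv returns the CSV as a string.
    text = frame.to_csv(index=False)
    return text, time.perf_counter() - start


csv_a, seconds_a = timed_to_csv(df)                         # Operation A: MultiIndex kept
csv_b, seconds_b = timed_to_csv(df.reset_index(drop=True))  # Operation B: index reset first

print(f"Operation A time: {seconds_a} seconds")
print(f"Operation B time: {seconds_b} seconds")
print("Identical output:", csv_a == csv_b)
```

Since index=False drops the index from the output either way, both operations should write the same text, which is why B is a fair baseline for A.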

@KevsterAmp KevsterAmp changed the title from "add alternative ix when self.nlevel is 0" to "PERF: Performance Improvement on DataFrame.to_csv() when index=False" Aug 26, 2024
ix = (
    self.data_index[slicer]._get_values_for_csv(**self._number_format)
    if self.nlevels != 0
    else np.full(end_i - start_i, None)
Member

Can you use np.empty instead?
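A minimal sketch of the difference between the two calls, assuming (as the description suggests) that only the length of the placeholder matters when no index is written. The chunk bounds here are hypothetical values chosen for illustration.

```python
import numpy as np

start_i, end_i = 0, 5  # hypothetical chunk bounds, for illustration only

# What the diff above builds: an object array explicitly filled with None.
ix_full = np.full(end_i - start_i, None)

# The suggested alternative: allocate without filling. With the default
# dtype this is a float64 array whose contents are arbitrary.
ix_empty = np.empty(end_i - start_i)

print(ix_full.dtype, len(ix_full))
print(ix_empty.dtype, len(ix_empty))
```

Both arrays have the same length, so either works as a length-only placeholder; np.empty just skips the fill step. This is only safe if nothing downstream reads the placeholder's values.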

@mroeschke
Member

For your benchmark, could you show before and after timings?

@mroeschke mroeschke added the Performance (Memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Aug 26, 2024
@KevsterAmp
Contributor Author

Output before performance improvement

Operation A time: 869.2354643344879 seconds
Operation B time: 42.1906418800354 seconds

Output (after performance improvement)

Operation A time: 51.408071756362915 seconds
Operation B time: 45.78637385368347 seconds

Operation B resets the index before writing and serves as the baseline for comparison. The change improves the performance of Operation A.

Added output times to the description as well

@mroeschke mroeschke added this to the 3.0 milestone Aug 27, 2024
@mroeschke mroeschke merged commit bd81fef into pandas-dev:main Aug 27, 2024
47 checks passed
@mroeschke
Member

Thanks @KevsterAmp

Successfully merging this pull request may close these issues.

PERF: Significant Performance Difference in DataFrame.to_csv() with and without Index Reset