Skip to content

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Jan 26, 2024

Pending validation but I believe this fixes #57039

Initially I misread this function was branching based off a return value that was always 0, but the actual value lies in the side effects

@WillAyd
Copy link
Member Author

WillAyd commented Jan 26, 2024

@akrherz do you have a quick way of checking this?

@akrherz
Copy link

akrherz commented Jan 26, 2024

@WillAyd I am sorry, I have never built pandas before, but will try it now. I certainly appreciate the help here.

@akrherz
Copy link

akrherz commented Jan 26, 2024

I was able to build pandas, whew.
main branch memray screenshot (showing RSS) with the reproducing leak script
Screenshot from 2024-01-26 09-14-05
running @WillAyd/fix-mem-leak branch
Screenshot from 2024-01-26 09-19-34

LGTM!

@WillAyd WillAyd added the IO CSV read_csv, to_csv label Jan 26, 2024
@mroeschke mroeschke added this to the 2.2.1 milestone Jan 26, 2024
@mroeschke
Copy link
Member

@WillAyd could you add a peakmem ASV benchmark for this case?

@WillAyd
Copy link
Member Author

WillAyd commented Jan 26, 2024

I've added, but in this case it unfortunately would not have made a difference because the extension uses the system free call instead of PyMem_Free, so the Python interpreter does not track this.

Maybe we should use PyMem_Free, although the downside to that would be that it doesn't play nicely with LSAN

@akrherz
Copy link

akrherz commented Jan 26, 2024

I don't wish to waste dev's time here, but I am curious why the introduction of np.nan in my code causes this memory leak to trigger? A terse response of "its complicated" is more than sufficient :) An automated test would not have caught this memory leak, perhaps...

@mroeschke mroeschke merged commit 84cd03a into pandas-dev:main Jan 26, 2024
@mroeschke
Copy link
Member

Thanks @WillAyd

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Jan 26, 2024
@WillAyd WillAyd deleted the fix-mem-leak branch January 26, 2024 18:58
@WillAyd
Copy link
Member Author

WillAyd commented Jan 26, 2024

I don't wish to waste dev's time here, but I am curious why the introduction of np.nan in my code causes this memory leak to trigger? A terse response of "its complicated" is more than sufficient :) An automated test would not have caught this memory leak, perhaps...

I think np.nan is a red herring - it is definitely a mistake in the code to get rid of those free blocks and LSAN detects the leak either way

Thanks again for the quick report!

@akrherz
Copy link

akrherz commented Jan 26, 2024

I think np.nan is a red herring

Thank you, but the memory leak does not happen with that line removed. Regardless, I am thrilled this is fixed :)

mroeschke pushed a commit that referenced this pull request Jan 26, 2024
Backport PR #57084: Fix mem leak in read_csv

Co-authored-by: William Ayd <[email protected]>
pmhatre1 pushed a commit to pmhatre1/pandas-pmhatre1 that referenced this pull request May 7, 2024
* Fix memory leak in read_csv

* whatsnew

* peakmem benchmark
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas 2.2 read_csv(engine="c") leaks memory when code uses np.nan
3 participants