BUG: segfault when using read_csv and repeatedly accessing / renaming DataFrame.columns #46146
Comments
You are modifying an Index, which is an immutable object, by accessing .values and setting it. It is documented that you shouldn't do this.
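As a rough illustration of the supported alternatives to writing through `.values`, the sketch below renames columns without touching the Index's backing array. The DataFrame contents and column names are made up for the example; the original reproducer is not shown on this page, so this is not a reconstruction of it.

```python
import pandas as pd

# Hypothetical data: four columns and one row, similar in shape to the report.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "b", "c", "d"])

# Discouraged pattern (not executed here):
#   df.columns.values[0] = "x"
# It writes into the array behind the Index instead of creating a new Index.

# Option 1: rename returns a new DataFrame with a freshly built Index.
renamed = df.rename(columns={"a": "x", "b": "y"})

# Option 2: assign a complete list of labels; pandas builds a new Index from it.
df.columns = ["x", "y", "c", "d"]

print(list(renamed.columns))
print("x" in df.columns, "a" in df.columns)
```

Both options replace the Index object rather than editing it, so membership checks such as `"x" in df.columns` work against a consistent set of labels.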
@jreback, I've confirmed that using the statement I guess the confusion arose from the fact that I was modifying column names, not the index. However, it turns out the columns object itself is an Index object... Question, though -- why is pandas not using an actual immutable object for storing index data? Right now it seems to be only 'conceptually' immutable (i.e. don't do that, but we won't stop you), which doesn't fulfill the definition of immutable. It's rather "should not be mutated". Also, I know that when you perform certain ops in pandas (such as modifying a value using a slice), pandas emits warnings. No such warnings appear here. A warning would be appreciated at a minimum, although making objects deemed immutable actually so would be the best outcome.
An Index is immutable, full stop.
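A small sketch of what immutability means at the Index API level: `DataFrame.columns` is an `Index`, and item assignment on it is rejected. The column names are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame([[1, 2]], columns=["a", "b"])

# The columns attribute is itself an Index object.
is_index = isinstance(df.columns, pd.Index)

# Item assignment on an Index is refused with a TypeError.
try:
    df.columns[0] = "x"
    raised = False
except TypeError:
    raised = True

print(is_index, raised, list(df.columns))
```

The guard only covers the Index object itself. The discussion in this thread is about going around it through `.values`.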
The definition of immutable means you can't modify it, not that you can and bad things will happen -- this is a consistent and well-accepted principle. Since an index is intended to be immutable, it should actually be implemented as such, and pandas should look into using actual immutable structures for index.
@hans2520 We would certainly take a PR to make things truly immutable (basically we just need to use read-only arrays). Of course it's not so easy, and we have many open issues, so community help is most appreciated.
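To make the "read-only arrays" idea concrete, here is a minimal NumPy-only sketch. It shows the general mechanism of clearing an array's write flag and makes no claim about how pandas does or would implement it; the `labels` array is a stand-in, not an actual Index buffer.

```python
import numpy as np

# Stand-in for the label array behind an Index (hypothetical).
labels = np.array(["a", "b", "c", "d"], dtype=object)

# Mark the array read-only; later in-place writes are refused.
labels.setflags(write=False)

try:
    labels[0] = "x"
    blocked = False
except ValueError:
    # NumPy reports that the assignment destination is read-only.
    blocked = True

print(blocked, labels[0])
```

With the flag cleared, an accidental in-place write fails loudly with an exception instead of silently changing data that other structures depend on.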
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
The crash always happens in the __contains__ method of df.columns, after repeated calls combined with column name reassignments. When this happens, simply accessing df.columns enough times will cause the segfault. The crash is reproducible with every engine type for read_csv (c, python, and python-fwf), with no observable difference between engines when the outer for loop forces thousands of iterations. Without the outer for loop, the python engine performs best, with the lowest chance of reproducing the crash. python-fwf and c perform significantly worse, with python-fwf marginally less likely to crash.
Initially found on Python 3.6 with pandas 1.1.5; also reproduced on Python 3.8 with the latest version of pandas, 1.4.1.
The file contents of the CSV seem to be irrelevant. In my case, it had four columns with one row of data. In this example, the columns will initially be changed to something else (on first iteration of outer loop), but subsequent loop executions will rename it to the same values. The outer loop is not necessary to reproduce the crash, but it makes it guaranteed to happen.
Note, replacing the inner for loop with the commented out lines will cause the issue to become unreproducible, even with the outer loop. On the other hand, simply commenting out the three column value assignments at the start of the outer loop will make the issue much harder to reproduce, but still possible (it can survive 10000 iterations about 80% of the time). When it does segfault in this case, it does so on the final line of the example.
Stack Trace:
Expected Behavior
No crash, regardless of the number of column reassignments / df.columns accesses.
Installed Versions
INSTALLED VERSIONS
commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-1055-aws
Version : #58~18.04.1-Ubuntu SMP Wed Jul 28 03:04:50 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : None
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : 1.0.8
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
zstandard : None