BUG: segfault when using read_csv and repeatedly accessing / renaming DataFrame.columns #46146

Closed
2 of 3 tasks
hans2520 opened this issue Feb 25, 2022 · 5 comments
Labels
Index (Related to the Index class or subclasses), Usage Question

Comments

@hans2520

hans2520 commented Feb 25, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

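from datetime import date

import pandas as pd
from numpy import datetime64

# Map every character that is unsafe in a database identifier to '_'.
UNSAFE_DB_CHARS = {i: '_' for i in b'!#%&()*+,-./:;<=>?@[]^~ "\\`'}

# file_path: path to any small CSV file; its contents seem to be irrelevant.
df = pd.read_csv(file_path)
for i in range(0, 10000):
    df['col1'] = datetime64(str(date.today()))
    df['col2'] = True
    df['col3'] = 'bob' or SYSTEM_USER  # short-circuits to 'bob'
    for idx, col in enumerate(df.columns):
        display_name = col.lower()
        display_name = display_name.replace("’", "_").translate(UNSAFE_DB_CHARS)
        # Writes into the array behind df.columns in place.
        df.columns.values[idx] = display_name
    #df.columns.values[0] = 'col1'
    #df.columns.values[1] = 'col2'
    #df.columns.values[2] = 'col3'
    #df.columns.values[3] = 'col4'
    # Membership test; this calls Index.__contains__.
    'bob' in df.columns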

Issue Description

The crash always occurs in the __contains__ method of df.columns, after repeated calls interleaved with column name reassignments. Once the names have been reassigned this way, simply accessing df.columns enough times will cause the segfault.

The crash reproduces with every read_csv engine (c, python, and python-fwf), with no observable difference between them when the outer for loop forces thousands of iterations. Without the outer for loop, the python engine is the least likely to crash. python-fwf and c crash significantly more often, with python-fwf marginally less likely to crash than c.

Initially found on Python 3.6 with pandas 1.1.5; also reproduced on Python 3.8 with the latest pandas release, 1.4.1.

The contents of the CSV file seem to be irrelevant. In my case, it had four columns and one row of data. In this example, the first iteration of the outer loop changes the column names to something else, and subsequent iterations rename them to the same values. The outer loop is not necessary to reproduce the crash, but it guarantees that the crash happens.

Note that replacing the inner for loop with the commented-out lines makes the issue unreproducible, even with the outer loop. On the other hand, commenting out the three column value assignments at the start of the outer loop makes the issue much harder to reproduce, but still possible (the run survives 10000 iterations about 80% of the time). When it does segfault in this case, it does so on the final line of the example.

Stack Trace:

platform linux -- Python 3.8.10, pytest-7.0.1, pluggy-0.13.1
rootdir: /home/ubuntu/GitHub/shared/testcode, configfile: tox.ini
plugins: console-scripts-0.2.0
collected 12 items / 11 deselected / 1 selected

tests/test_table_validator.py Fatal Python error: Segmentation fault

Current thread 0x00007f73fe144740 (most recent call first):
  File "/home/ubuntu/GitHub/shared/PyNomad/.tox/unit/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5010 in __contains__
  File "/home/ubuntu/GitHub/shared/PyNomad/.tox/unit/lib/python3.8/site-packages/pandas/core/indexing.py", line 2322 in convert_to_index_sliceable
  File "/home/ubuntu/GitHub/shared/PyNomad/.tox/unit/lib/python3.8/site-packages/pandas/core/frame.py", line 3634 in __setitem__
  File "/home/ubuntu/GitHub/shared/testcode/helpers/upload_file_reader.py", line 73 in read_file_data
  File "/home/ubuntu/GitHub/shared/testcode/helpers/upload_file_reader.py", line 43 in get_file_data
  File "/home/ubuntu/GitHub/shared/testcode/tests/test_table_validator.py", line 47 in test_column_not_matched
  File "/usr/lib/python3.8/unittest/case.py", line 633 in _callTestMethod
  File "/usr/lib/python3.8/unittest/case.py", line 676 in run
  File "/usr/lib/python3.8/unittest/case.py", line 736 in __call__
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/unittest.py", line 327 in runtest
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/runner.py", line 168 in pytest_runtest_call
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/runner.py", line 261 in <lambda>
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/runner.py", line 340 in from_call
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/runner.py", line 260 in call_runtest_hook
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/runner.py", line 221 in call_and_report
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/runner.py", line 132 in runtestprotocol
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/runner.py", line 113 in pytest_runtest_protocol
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/main.py", line 322 in _main
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/main.py", line 268 in wrap_session
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/callers.py", line 187 in _multicall
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/manager.py", line 84 in <lambda>
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/manager.py", line 93 in _hookexec
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pluggy/hooks.py", line 286 in __call__
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/config/__init__.py", line 165 in main
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/_pytest/config/__init__.py", line 188 in console_main
  File "/home/ubuntu/GitHub/shared/.tox/unit/lib/python3.8/site-packages/pytest/__main__.py", line 5 in <module>
  File "/usr/lib/python3.8/runpy.py", line 87 in _run_code
  File "/usr/lib/python3.8/runpy.py", line 194 in _run_module_as_main
Segmentation fault (core dumped)

Expected Behavior

No crash, regardless of the number of column reassignments or df.columns accesses.

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.8.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-1055-aws
Version : #58~18.04.1-Ubuntu SMP Wed Jul 28 03:04:50 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 21.1.1
setuptools : 56.0.0
Cython : None
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.0.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : 1.0.8
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
zstandard : None

hans2520 added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Feb 25, 2022
@jreback
Contributor

jreback commented Feb 25, 2022

You are modifying an Index, which is an immutable object, by accessing .values and setting it.

It is documented that you shouldn't do this.
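
To illustrate the point above, here is a minimal sketch (the sample labels are made up for illustration, not taken from the report). Item assignment on an Index is rejected with a TypeError, while string methods return a new Index and leave the original untouched. Going through .values sidesteps that check, which is what the reproducer does.

```python
import pandas as pd

# Hypothetical column labels, for illustration only.
columns = pd.Index(["Col A", "Col B"])

# Direct item assignment on an Index is not supported.
try:
    columns[0] = "col_a"
    assignment_rejected = False
except TypeError:
    assignment_rejected = True

# String methods build a new Index instead of mutating in place.
renamed = columns.str.lower().str.replace(" ", "_", regex=False)

print(assignment_rejected, list(renamed))
```

The original Index still holds "Col A" and "Col B" afterwards; only `renamed` carries the new labels.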

@hans2520
Author

hans2520 commented Feb 25, 2022

@jreback, I've confirmed that renaming the columns with the statement df = df.rename(columns={col: display_name}) does avoid the crash.
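
For reference, a sketch of how the whole renaming loop might look with rename, assuming an in-memory CSV in place of the real file (the sample header and the helper name `sanitize` are hypothetical). Passing a function to `columns=` applies it to every label and returns a new DataFrame, so nothing is written into the existing Index.

```python
import io

import pandas as pd

UNSAFE_DB_CHARS = {i: "_" for i in b'!#%&()*+,-./:;<=>?@[]^~ "\\`'}


def sanitize(name: str) -> str:
    """Lower-case a label and replace characters unsafe for a database."""
    return name.lower().replace("’", "_").translate(UNSAFE_DB_CHARS)


# Hypothetical stand-in for the CSV file: four columns, one row of data.
csv_text = "First Name,Last-Name,E mail,Zip.Code\nA,B,c@d.e,12345\n"
df = pd.read_csv(io.StringIO(csv_text))

# rename() applies sanitize to each column label and returns a new frame.
df = df.rename(columns=sanitize)

print(list(df.columns))
```

Assigning a full list, as in `df.columns = [sanitize(c) for c in df.columns]`, should work as well, since it replaces the Index rather than editing it.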

I guess the confusion arose because I was modifying column names, not the index. However, it turns out that the columns object is itself an Index object...

Question, though -- why is pandas not using an actual immutable object for storing index data? Right now it seems to be only 'conceptually' immutable (i.e. "don't do that, but we won't stop you"). That doesn't fulfill the definition of immutable; it is rather "should not be mutated".

Also, I know that when you perform certain operations in pandas (such as modifying a value using a slice), pandas emits warnings. No such warning appears here. A warning would be appreciated at a minimum, although making objects that are deemed immutable actually immutable would be the best outcome.

jreback added this to the No action milestone Feb 25, 2022
jreback added the Usage Question and Index (Related to the Index class or subclasses) labels and removed the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Feb 25, 2022
jreback closed this as completed Feb 25, 2022
@jreback
Contributor

jreback commented Feb 25, 2022

An Index is immutable, full stop.
If you are modifying it, then bad things can happen.

@hans2520
Author

The definition of immutable means you can't modify it, not that you can and bad things will happen -- this is a consistent and well-accepted principle.

Since an index is intended to be immutable, it should actually be implemented as such, and pandas should look into using actual immutable structures for index.

@jreback
Contributor

jreback commented Feb 26, 2022

@hans2520 we would certainly take a PR to make things truly immutable (basically we just need to use read-only arrays).

Of course, it's not so easy, and we have many open issues, so community help is most appreciated.
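
As a rough illustration of the read-only array idea, here is a sketch using plain NumPy. It is not pandas' implementation, and the labels are made up. Clearing an array's write flag makes in-place assignment raise instead of silently changing the data.

```python
import numpy as np

# Hypothetical labels standing in for the data behind an Index.
labels = np.array(["col1", "col2", "col3"], dtype=object)

# Mark the array read-only; later writes raise ValueError.
labels.setflags(write=False)

try:
    labels[0] = "changed"
    error = None
except ValueError as exc:
    error = str(exc)

print(labels[0], error)
```

With a flag like this set on the underlying array, an assignment such as `df.columns.values[idx] = display_name` would presumably fail loudly rather than corrupt state, which is the behavior asked for earlier in the thread.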
