
BUG: read_csv with memory_map=True on BytesIO object fails #45630

Open
3 tasks done
RehanSD opened this issue Jan 26, 2022 · 3 comments
Labels
Enhancement; Error Reporting (Incorrect or improved errors from pandas); IO CSV (read_csv, to_csv)

Comments

@RehanSD

RehanSD commented Jan 26, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

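import pandas as pd
from io import BytesIO

# Write a small DataFrame to an in-memory bytes buffer.
df = pd.DataFrame([[1, 2]])
bio = BytesIO()
df.to_csv(bio)
bio.seek(0)  # rewind so read_csv starts at the beginning

# Raises UnsupportedOperation: fileno on pandas 1.4.0
pd.read_csv(bio, memory_map=True)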

Issue Description

The read_csv call fails with this error:

UnsupportedOperation                      Traceback (most recent call last)
<ipython-input-7-c255f3bb77b8> in <module>
----> 1 pd.read_csv(bio, memory_map=True)

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    309                     stacklevel=stacklevel,
    310                 )
--> 311             return func(*args, **kwargs)
    312
    313         return wrapper

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    678     kwds.update(kwds_defaults)
    679
--> 680     return _read(filepath_or_buffer, kwds)
    681
    682

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _read(filepath_or_buffer, kwds)
    573
    574     # Create the parser.
--> 575     parser = TextFileReader(filepath_or_buffer, **kwds)
    576
    577     if chunksize or iterator:

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py in __init__(self, f, engine, **kwds)
    931
    932         self.handles: IOHandles | None = None
--> 933         self._engine = self._make_engine(f, self.engine)
    934
    935     def close(self):

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/parsers/readers.py in _make_engine(self, f, engine)
   1215             # "Union[str, PathLike[str], ReadCsvBuffer[bytes], ReadCsvBuffer[str]]"
   1216             # , "str", "bool", "Any", "Any", "Any", "Any", "Any"
-> 1217             self.handles = get_handle(  # type: ignore[call-overload]
   1218                 f,
   1219                 mode,

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors, storage_options)
    680
    681     # memory mapping needs to be the first step
--> 682     handle, memory_map, handles = _maybe_memory_map(
    683         handle,
    684         memory_map,

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py in _maybe_memory_map(handle, memory_map, encoding, mode, errors, decode)
   1085         wrapped = cast(
   1086             BaseBuffer,
-> 1087             _MMapWrapper(handle, encoding, errors, decode),  # type: ignore[arg-type]
   1088         )
   1089     finally:

~/.miniconda3/envs/modin/lib/python3.8/site-packages/pandas/io/common.py in __init__(self, f, encoding, errors, decode)
    959                 continue
    960             self.attributes[attribute] = getattr(f, attribute)()
--> 961         self.mmap = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    962
    963     def __getattr__(self, name: str):

UnsupportedOperation: fileno

The error appears to come from fileno being called on a BytesIO object. This code works in pandas 1.3.4, so I compared the sources and noticed that in _maybe_memory_map, the except clause around instantiating _MMapWrapper was removed from common.py. (Old common.py for reference).
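The root cause can be reproduced without pandas. The following is a minimal standard-library sketch, not pandas code: it shows that BytesIO.fileno() raises UnsupportedOperation, while the same mmap.mmap call that appears in the traceback works on a real file. The variable names are illustrative only.

```python
import io
import mmap
import tempfile

data = b"a,b\n1,2\n"

# BytesIO lives entirely in process memory, so it has no OS-level
# file descriptor and fileno() raises io.UnsupportedOperation.
buf = io.BytesIO(data)
try:
    buf.fileno()
    fileno_error = None
except io.UnsupportedOperation as exc:
    fileno_error = exc

# A real file has a descriptor, so the mmap call from the traceback succeeds.
with tempfile.TemporaryFile() as f:
    f.write(data)
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first_line = mm.readline()
```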

Expected Behavior

I would expect the read_csv to succeed and the DataFrame to be read.

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.8.12.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0
numpy : 1.21.4
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.0.4
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : 4.3.1
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.6.4
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.30.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : 2021.11.1
gcsfs : None
matplotlib : 3.2.2
numba : None
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.16.0
pyarrow : 3.0.0
pyreadstat : None
pyxlsb : None
s3fs : 2021.11.1
scipy : 1.7.3
sqlalchemy : 1.4.27
tables : 3.6.1
tabulate : None
xarray : 0.20.1
xlrd : 2.0.1
xlwt : None
zstandard : None

@RehanSD RehanSD added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jan 26, 2022
@twoertwein
Member

I don't think there is a point in using memory_map with BytesIO. The data is already in memory.

This was purposefully changed in #44777 and is mentioned in the whatsnew for 1.4.0. Silently failing (previous behavior) gives the illusion that it worked, even though it was never using mmap for BytesIO.
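For callers that accept both paths and in-memory buffers, one possible workaround is to request memory mapping only when the source can actually support it. The sketch below is an assumption about how user code might do this; has_file_descriptor is a hypothetical helper, not part of pandas.

```python
import io


def has_file_descriptor(handle) -> bool:
    """Return True if `handle` is backed by an OS-level file descriptor.

    Hypothetical helper: in-memory buffers such as BytesIO and StringIO
    raise io.UnsupportedOperation from fileno(), and objects that are
    not file-like at all may lack the method entirely.
    """
    try:
        handle.fileno()
    except (AttributeError, io.UnsupportedOperation):
        return False
    return True
```

With this helper, a call such as pd.read_csv(src, memory_map=has_file_descriptor(src)) would skip memory mapping for BytesIO, where the data is already in memory, and keep it for open files.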

@RehanSD
Author

RehanSD commented Jan 26, 2022

Thank you @twoertwein! I just wanted to confirm that this was expected behavior! I'm wondering though if it makes sense to include a more descriptive error message so people know what's going on?

@twoertwein
Member

Catching UnsupportedOperation (and maybe AttributeError) and then re-raising with a clearer message might be worth it.
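A rough sketch of that idea, using only the standard library. This is not the pandas implementation: open_mmap is a hypothetical stand-in for the wrapper construction seen in the traceback, and both the choice of ValueError and the message wording are assumptions.

```python
import io
import mmap


def open_mmap(handle):
    """Memory-map `handle`, failing with a descriptive error if it cannot be mapped.

    Hypothetical helper for illustration only.
    """
    try:
        fd = handle.fileno()
    except (AttributeError, io.UnsupportedOperation) as err:
        # Chain the original exception so the low-level cause stays visible.
        raise ValueError(
            f"memory_map=True requires a file with a file descriptor, but got "
            f"{type(handle).__name__}. In-memory buffers cannot be memory-mapped; "
            f"pass memory_map=False instead."
        ) from err
    return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
```

Chaining with `from err` keeps the original UnsupportedOperation in the traceback while the top-level message tells the user what to change.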

@lithomas1 lithomas1 added the Enhancement, Error Reporting (Incorrect or improved errors from pandas), and IO CSV (read_csv, to_csv) labels and removed the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Jan 27, 2022