
Index Dtype Not Preserved During read_fwf #21555

Closed
@jlandercy


Code Sample (copy-pastable, MCVE)

Consider the following code:

import io
import pandas as pd

# Trial FWF file:
data = io.StringIO('x10011\nx10012\nx10013\nx10024\nx20025\nx20026\nx20037\nx20038\n')

# Read and cast:
df1 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int})
# Then index:
df1.set_index(1, inplace=True)

# Read, cast and index at once:
data.seek(0)
df2 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int}, index_col=1)

Problem description

As I understand the documentation about control switches:

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}
Use str or object together with suitable na_values settings to preserve and not interpret dtype.
If converters are specified, they will be applied INSTEAD of dtype conversion.

index_col : int or sequence or False, default None
Column to use as the row labels of the DataFrame.
If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters
at the end of each line, you might consider index_col=False to force pandas to not
use the first column as the index (row names)

Both outputs should be equal, but they are not.

When the index is set in the same call through the index_col switch, the index column is inferred to be int and cast accordingly, so the dtype switch has no effect on it.

>>> df1.index
Index(['001', '001', '001', '002', '002', '002', '003', '003'], dtype='object', name=1)

>>> df2.index
Int64Index([1, 1, 1, 2, 2, 2, 3, 3], dtype='int64', name=1)

>>> df1.equals(df2)
False
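
If a frame has already been read with the integer index shown above, the padded labels can be rebuilt after the fact. The following is only a sketch: it assumes the original field width (3 here, taken from widths=[2,3,1]) is known, and the names int_index and restored are illustrative, not part of pandas.

```python
import pandas as pd

# Stand-in for df2.index as reported above: the leading zeros were lost
# when the column was inferred as int.
int_index = pd.Index([1, 1, 1, 2, 2, 2, 3, 3], name=1)

# Cast back to strings and left-pad with zeros to the known field width.
# Assumption: every label originally had exactly 3 characters.
restored = int_index.astype(str).str.zfill(3)
restored_labels = restored.tolist()
print(restored_labels)
```

This only works because the width is fixed; it cannot recover labels whose original padding is unknown.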

Expected Output

I think the expected output of:

df2 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int}, index_col=1)

should be equal to:

df1 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int})
df1.set_index(1, inplace=True)

Otherwise, there is little point in being able to protect columns from casting with the dtype switch.
For this reason, I think this is a minor bug or inconsistency.

Anyway, as shown in the MCVE above, the problem can be worked around by reading the file first and setting the index afterwards.
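
The workaround can be wrapped in a small helper so that the two steps stay together. This is a minimal sketch, not a pandas API: read_fwf_then_index and its parameters are hypothetical names introduced here for illustration, and it relies only on pd.read_fwf and DataFrame.set_index as used in the MCVE.

```python
import io
import pandas as pd

# Same trial FWF content as in the MCVE.
raw = 'x10011\nx10012\nx10013\nx10024\nx20025\nx20026\nx20037\nx20038\n'


def read_fwf_then_index(text, widths, dtype, index_col):
    """Hypothetical helper: read with the requested dtypes, then set the index.

    The index is set in a separate step, so the column keeps the dtype
    requested through the dtype switch instead of being re-inferred.
    """
    frame = pd.read_fwf(io.StringIO(text), widths=widths, header=None, dtype=dtype)
    return frame.set_index(index_col)


df = read_fwf_then_index(raw, widths=[2, 3, 1],
                         dtype={0: str, 1: str, 2: int}, index_col=1)
labels = df.index.tolist()
print(labels)
```

Using set_index without inplace=True returns a new frame, which keeps the helper free of side effects on intermediate objects.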

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: Bug; Dtype Conversions (unexpected or buggy dtype conversions); Duplicate Report (duplicate issue or pull request); IO Data (IO issues that don't fit into a more specific label)
