Description
Code Sample (copy-pastable, MCVE)
Consider the following code:
import io
import pandas as pd
# Trial FWF file:
data = io.StringIO('x10011\nx10012\nx10013\nx10024\nx20025\nx20026\nx20037\nx20038\n')
# Read and cast:
df1 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int})
# Then index:
df1.set_index(1, inplace=True)
# Read, cast and index at once:
data.seek(0)
df2 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int}, index_col=1)
Problem description
As I understand the documentation about control switches:
dtype
: Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32}
Use str or object together with suitable na_values settings to preserve and not interpret dtype.
If converters are specified, they will be applied INSTEAD of dtype conversion.
index_col
: int or sequence or False, default None
Column to use as the row labels of the DataFrame.
If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters
at the end of each line, you might consider index_col=False to force pandas to not
use the first column as the index (row names)
Both output should be equal but it is not.
When indexing at once using index_col
switch, column is inferred to be int
and casted, making the switch dtype
useless in this case.
>>> df1.index
Index(['001', '001', '001', '002', '002', '002', '003', '003'], dtype='object', name=1)
>>> df2.index
Int64Index([1, 1, 1, 2, 2, 2, 3, 3], dtype='int64', name=1)
>>> df1.equals(df2)
False
Expected Output
I think the expected output of:
df2 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int}, index_col=1)
Should be equal to:
df1 = pd.read_fwf(data, widths=[2,3,1], header=None, dtype={0: str, 1: str, 2: int})
df1.set_index(1, inplace=True)
If not, it just makes no sense to be able to protect columns from casting using dtype
switch.
For this reason, I think it is a kind of slight bug or inconsistency.
Anyway, as provided in MCVE above, there exists a solution to circonvolve the problem.
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 142 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None