read_fwf 'infer' where first hundred lines differ from other lines #15138

Closed
@adamboche

Description

Code Sample, a copy-pastable example if possible

  1     1   -13.120080   0.229   0.484  -0.378  -0.872
  1     2    -1.902843  -0.090   0.256   1.791   0.967
  1     3   -22.050698  -0.176  -0.394   0.922  -0.454
  1     4   -30.349928   0.081  -0.194  -0.327  -0.981
  1     5   -22.204160  -0.168  -0.197   0.984  -0.266
  1     6   -28.001753  -0.065   0.597  -0.203  -0.802
  1     7   -17.247524   0.108   0.194   0.474   0.774
  1     8   -28.014811   0.017   0.994   0.493   0.112
  1     9   -13.325491   0.259   0.189  -1.275   0.149
  1    10   -10.063621   0.327   0.108  -1.784   0.061
...
115    18     5.697000   0.391  -0.027   0.252   1.000
115    19     8.324000  -0.283   0.132   0.227  -0.216
115    20    48.451000   0.070  -0.041   0.379  -0.082
115    21     0.146000   0.677   0.031  -0.561  -0.149
115    22     1.443000  -0.706  -0.033  -0.222   0.035
115    23     4.595000   0.654  -0.081   0.774   0.997
115    24     0.146000  -0.677   0.031   0.561  -0.149
115    25     4.595000   0.654  -0.081   0.774   0.997
115    26     6.769000  -0.363  -0.093  -0.298   0.996
115    27    24.157000  -0.280  -0.324  -0.142  -0.946

Problem description

I have a long fixed-width file (>100k lines) whose head and tail are shown above. I want to read this file with pandas, and pd.read_fwf seems to be the way to do it. The problem is that column inference only reads the first hundred lines. Those lines start with '  1', so it concludes "let's start reading at [2]". The last hundred lines start with '115', so the parser skips the leading '11' and starts each of those lines at '5', and I lose data.
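
To make the failure mode concrete, here is a minimal pure-Python sketch of this kind of inference. It is a simplified stand-in, not pandas' actual code: the helper `infer_colspecs` and its `nrows` parameter are hypothetical names, and the sketch assumes that a column is any run of character positions that is non-blank in at least one sampled line.

```python
def infer_colspecs(lines, nrows=None):
    """Return (start, end) spans of positions that are non-blank in at
    least one of the first `nrows` lines (all lines if nrows is None)."""
    sample = lines if nrows is None else lines[:nrows]
    width = max(len(line) for line in sample)
    occupied = [False] * width
    for line in sample:
        for i, ch in enumerate(line):
            if ch != " ":
                occupied[i] = True
    specs, start = [], None
    # The trailing False closes a span that runs to the end of the line.
    for i, used in enumerate(occupied + [False]):
        if used and start is None:
            start = i
        elif not used and start is not None:
            specs.append((start, i))
            start = None
    return specs


# 100 lines shaped like the head of the file, then one line from the tail.
head = [
    "  1     1   -13.120080   0.229   0.484  -0.378  -0.872",
    "  1    10   -10.063621   0.327   0.108  -1.784   0.061",
] * 50
tail = ["115    18     5.697000   0.391  -0.027   0.252   1.000"]
lines = head + tail

first_100 = infer_colspecs(lines, nrows=100)  # sees only the head
all_rows = infer_colspecs(lines)              # sees every line

start, end = first_100[0]
truncated = lines[-1][start:end]              # what survives of "115"
full_start, full_end = all_rows[0]
complete = lines[-1][full_start:full_end]
```

With only the head sampled, the first span covers just the last character of the first field, so `truncated` holds the trailing digit of `115`. Sampling every line widens that span to the whole field.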

A couple of approaches to solving this issue come to mind, though I'm sure there are others:

  • Don't infer until all lines are scanned
  • Take as an argument the number of lines to be scanned before concluding the format, including the option to scan all (e.g. infer_from_all)
  • Take as an argument which direction to scan -- top to bottom or bottom to top
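
In the meantime, one workaround is to skip inference and pass the column layout explicitly. The sketch below assumes the widths `[3, 6, 13, 8, 8, 8, 8]`, which are read off the sample above and may not hold for the full file. The rows are rebuilt from a few of the sample values so that the snippet is self-contained.

```python
import io

import pandas as pd

# A few rows from the head and tail of the sample, re-rendered in the
# same fixed-width layout (layout assumed from the sample).
rows = [
    (1, 1, -13.120080, 0.229, 0.484, -0.378, -0.872),
    (1, 2, -1.902843, -0.090, 0.256, 1.791, 0.967),
    (115, 26, 6.769000, -0.363, -0.093, -0.298, 0.996),
    (115, 27, 24.157000, -0.280, -0.324, -0.142, -0.946),
]
fmt = "{:3d}{:6d}{:13.6f}{:8.3f}{:8.3f}{:8.3f}{:8.3f}"
text = "\n".join(fmt.format(*row) for row in rows)

# Explicit widths, so nothing depends on which lines are sampled.
widths = [3, 6, 13, 8, 8, 8, 8]
df = pd.read_fwf(io.StringIO(text), widths=widths, header=None)

first_column = df[0].tolist()
```

Because the widths are fixed up front, the leading `11` of `115` is kept no matter where in the file the wider values first appear. Later pandas releases also added an `infer_nrows` argument to `read_fwf`, which is close to the second suggestion above. Check the documentation for your version before relying on it.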

Output of pd.show_versions()

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-107-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
