issue with StataReader for stata files versions 108 and older #12232

Closed
corakingdon opened this issue Feb 4, 2016 · 4 comments
Labels
Compat pandas objects compatability with Numpy or Python functions IO Stata read_stata, to_stata

@corakingdon

I am having an issue with the StataReader class in pandas/io/stata.py.
I am using pandas 0.17.1.

This is the Python code I am trying to run:

import sys
reload(sys).setdefaultencoding('utf-8')  
import pandas as pd
from pandas.io import stata

sr=stata.StataReader(fileName)

where fileName is the path to a Stata file.

The following code is part of the _read_old_header method (which starts on line 1184) of the StataReader class in stata.py. It is called when a StataReader object is initialized:

if self.format_version > 108:
    typlist = [ord(self.path_or_buf.read(1))
        for i in range(self.nvar)]
else:
    typlist = [
        self.OLD_TYPE_MAPPING[
            self._decode_bytes(self.path_or_buf.read(1))
        ] for i in range(self.nvar)
    ]

I get no errors with Stata files newer than version 108, but files of version 105 fail inside _decode_bytes. The code above calls _decode_bytes with only one argument besides self: the string returned by path_or_buf.read(1).
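For anyone unsure which format version a file uses, here is a minimal sketch. It assumes the old binary layout, where the first byte of a .dta file holds the format version, and that newer files (format 117 and later) start with the text `<stata_dta>` instead. The helper name `peek_dta_version` is hypothetical and not part of pandas, and the sketch is written for Python 3.

```python
import io


def peek_dta_version(buf):
    """Return the format version from the first byte of an old-style .dta file.

    Returns None when the file starts with "<", which is assumed here to
    mark the newer tagged format that stores its version elsewhere.
    """
    first = buf.read(1)
    if not first:
        raise ValueError("empty file")
    if first == b"<":
        return None
    # On Python 3, indexing a bytes object yields an int.
    return first[0]


# In-memory stand-in for a version 105 file (only the first byte matters here).
fake_old_file = io.BytesIO(bytes([105, 2, 1, 0]))
print(peek_dta_version(fake_old_file))  # 105
```

With a real file, the same function could be called on `open(fileName, "rb")`.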

Here is the method _decode_bytes (line 896):

def _decode_bytes(self, str, errors=None):
    if compat.PY3 or self._encoding is not None:
        return str.decode(self._encoding, errors)
    else:
        return str

When no third argument is passed (as in the call from _read_old_header), errors defaults to None and is forwarded unchanged to str.decode. That call raises:

TypeError: decode() argument 2 must be string, not None

That is the issue: the string's decode method requires its second argument to be a string, but _decode_bytes passes errors as None by default.
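The failure can be reproduced without a Stata file, since it comes only from passing None as the errors argument to decode. Below is a minimal sketch for Python 3, where the message wording differs slightly from the Python 2 traceback above. `decode_bytes` is a hypothetical standalone stand-in for the method, shown as one possible workaround and not as the change pandas adopted.

```python
def decode_bytes(raw, encoding="ascii", errors=None):
    """Decode raw bytes, forwarding `errors` only when the caller set it."""
    if encoding is None:
        return raw
    if errors is None:
        # Let decode() fall back to its own default error handling.
        return raw.decode(encoding)
    return raw.decode(encoding, errors)


# Passing None explicitly as the second argument reproduces the TypeError.
try:
    b"i".decode("ascii", None)
except TypeError as exc:
    print("TypeError:", exc)

print(decode_bytes(b"i"))  # i
```

The guard keeps the method's signature unchanged while avoiding the explicit None.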

@jorisvandenbossche jorisvandenbossche added the IO Stata read_stata, to_stata label Feb 4, 2016
@jorisvandenbossche
Member

cc @bashtage @kshedden

@kshedden
Contributor

kshedden commented Feb 5, 2016

@ckingdon95 thanks for the detailed report. We don't have any test files that old, and I cannot create one with the latest version of Stata, which is the only one I have access to (see link below). We might need someone to provide a test file to troubleshoot this. Are there any version 108 files floating around on the web?

http://www.stata.com/support/faqs/data-management/save-for-previous-version/

@kshedden
Contributor

kshedden commented Feb 5, 2016

I don't think we need _decode_bytes there; the column types are all ASCII.

_decode_bytes is not covered by any tests and is only called in one place.

Can someone with a version 108 or older file change lines 1204-1208 of stata.py to:

typlist = [self.OLD_TYPE_MAPPING[self.path_or_buf.read(1)] for i in range(self.nvar)]

and report back? If we adopt this we should remove the _decode_bytes method.
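As a rough illustration of the direct lookup suggested above, here is a self-contained sketch. The mapping contents are hypothetical placeholders, not the actual values in pandas, and `read_old_typlist` is an invented helper name. One assumption worth flagging: on Python 3, `read(1)` on a binary buffer returns bytes, so the mapping keys here are bytes; a mapping keyed by str would need the byte converted first.

```python
import io

# Hypothetical mapping for illustration only; keys are raw one-byte type codes.
OLD_TYPE_MAPPING = {b"b": 251, b"i": 252, b"f": 254}


def read_old_typlist(buf, nvar):
    """Read one type-code byte per variable and map it without decoding."""
    return [OLD_TYPE_MAPPING[buf.read(1)] for _ in range(nvar)]


# In-memory stand-in for the type-code section of an old file header.
buf = io.BytesIO(b"ibf")
print(read_old_typlist(buf, 3))  # [252, 251, 254]
```

A type code missing from the mapping raises KeyError, which makes unsupported column types visible immediately.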

@corakingdon
Author

Thanks for the reply! The Stata files I am trying to read can be found at this website: http://econ.worldbank.org/WBSITE/EXTERNAL/EXTDEC/EXTRESEARCH/EXTLSMS/0,,contentMDK:21544648~pagePK:64168445~piPK:64168309~theSitePK:3358997,00.html

@jreback jreback added the Compat pandas objects compatability with Numpy or Python functions label Feb 5, 2016
@jreback jreback added this to the 0.18.0 milestone Feb 5, 2016
@jreback jreback closed this as completed in ca4f738 Feb 8, 2016
cldy pushed a commit to cldy/pandas that referenced this issue Feb 11, 2016
Closes pandas-dev#12232, although the issue may resurface for files
containing double values (I can't determine the old type code for
doubles).

Author: Kerby Shedden <[email protected]>

Closes pandas-dev#12233 from kshedden/old_stata and squashes the following commits:

aba666c [Kerby Shedden] Read old stat files (bugfix)