to_hdf writes data that doesn't match read back #7605

Closed
vm-wylbur opened this issue Jun 29, 2014 · 7 comments

Comments

@vm-wylbur

here's the code:

    records.to_hdf(
        args.output, 'records',
        mode='w', format='fixed', append=False,
        complib='zlib', complevel=7, fletcher32=True)

    r2 = pd.read_hdf(
        path_or_buf=args.output, key='records',
        encoding='utf-8', start=None, stop=None)

    from pandas.util.testing import assert_frame_equal
    assert_frame_equal(records, r2, check_exact=True)

and the traceback:

/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/io/pytables.py:2441: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['dataset', 'record_id', 'DOD', 'CC', 'sex', 'name', 'loc', 'manner_of_death', 'eth', 'social_group', 'occ', 'clean_loc', 'month_of_death', 'year_of_death', 'name_sorted']]

  warnings.warn(ws, PerformanceWarning)
Traceback (most recent call last):
  File "src/import.py", line 59, in <module>
    tools.epilog(args, records, logger)
  File "/Users/pball/git/CO/match/import/src/lib/import_tools.py", line 46, in epilog
    assert_frame_equal(records, r2, check_exact=True)
  File "/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/util/testing.py", line 585, in assert_frame_equal
    check_exact=check_exact)
  File "/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/util/testing.py", line 530, in assert_series_equal
    right.values))
AssertionError: [nan nan nan ..., 'c2681113' 'c12266508' 'c2680757'] is not equal to [nan nan nan ..., 'c2681113' 'c12266508' 'c2680757'].
make: *** [output/input-records.h5] Error 1

I've been trying to figure out why upstream fixes didn't seem to appear downstream, and I finally traced it here: apparently to_hdf is writing data that differs when it's read back. As I've re-run this over the last hour or so, different fields have come up in the AssertionError.

A few things that do not eliminate the error: compression on or off; format='table' or format='fixed'. Changing these arguments does, however, change which field assert_frame_equal flags as unequal.

I have no idea how to reproduce this without my entire dataset, which is unfortunately confidential. I'll fall back to CSV for now, and I hope I'm just doing something horribly dumb that we can fix.

@jreback
Contributor

jreback commented Jun 29, 2014

Please show pd.show_versions() and df.info().

You have odd dtypes; those need to be fixed before the frame will serialize properly.

@vm-wylbur
Author

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418243 entries, 0 to 418242
Data columns (total 18 columns):
dataset            418243 non-null object
record_id          418243 non-null object
DOD                418243 non-null object
CC                 186801 non-null object
sex                383230 non-null object
name               418243 non-null object
muni               418243 non-null float64
loc                335904 non-null object
age_int            321997 non-null float64
manner_of_death    277084 non-null object
eth                87254 non-null object
social_group       94010 non-null object
occ                130903 non-null object
clean_loc          335904 non-null object
month_of_death     418243 non-null object
year_of_death      418243 non-null object
depto              418243 non-null float64
name_sorted        418243 non-null object
dtypes: float64(3), object(15)

INSTALLED VERSIONS
------------------
commit: None
python: 3.3.5.final.0
python-bits: 64
OS: Darwin
OS-release: 13.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.0
nose: None
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 3.0.0-dev
sphinx: 1.2.2
patsy: None
scikits.timeseries: None
dateutil: 2.1
pytz: 2014.3
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.3.2
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented Jun 29, 2014

Any unicode or actual objects? (What I mean is: are the object fields just straight strings?)

@vm-wylbur
Author

The strings are unicode (they're Latin American names). I suspect we're onto something here -- I've been reviewing the fields that come back blank, and they contain some weird characters. Where did those come from? We cleaned that field upstream, but maybe it got un-cleaned somewhere. I'll clean it down to plain old-school ASCII and report back.
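
For the record, a stdlib-only sketch of the kind of ASCII flattening I mean (the real cleanup uses unidecode, which transliterates more gracefully; this version just strips accents and drops anything that won't map to ASCII):

```python
import unicodedata

def to_ascii(s):
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks (and
    # anything else with no ASCII mapping).
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Applied to the name column before writing, e.g.:
#   records["name"] = records["name"].map(to_ascii)
print(to_ascii("Muñoz Peña"))  # -> Munoz Pena
```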

@jreback
Contributor

jreback commented Jun 29, 2014

Might be related to #7244.
There is a decoding bug somewhere, but it's not easily reproducible.
I think the writing is fine;
it's the decoding. If you want to take a stab, you can step through the string conversions
and see if you can figure it out.

There is a try/except around a fast path which, if it fails, falls back to a slower path.
The problem, I think, is that the fast path sometimes succeeds when it shouldn't.

If you want to debug, that would be great!

@vm-wylbur
Author

progress:

  • The bug is that some unicode strings were simply not being read back (as in HDFStore still corrupted reads with utf8 #7244, as you suspected); the fields with weird characters came back blank.
  • Upstream I forced the field into English characters with unidecode, and now the written-and-read-back field equals the original.
  • The other fields are fine. I panicked at the assert_frame_equal() failures, but then I realized that NaN != NaN. I thought the fields with NaNs were also corrupted, but testing them row by row, they're fine.
  • My deadline is looming and I can't dig into the larger bug now, but I can in a week or so. It's worth it: in another project we have a table with hundreds of thousands of Arabic names, and we need to be sure nothing surprising is happening to them.
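
(The NaN point above, for anyone puzzled by it: NaN never compares equal to itself, so element-wise comparison flags matching NaN rows as differences, while Series.equals treats aligned NaNs as equal.)

```python
import numpy as np
import pandas as pd

# IEEE 754: NaN is never equal to itself.
print(np.nan == np.nan)  # False

a = pd.Series([1.0, np.nan])
b = pd.Series([1.0, np.nan])
print((a == b).all())    # False -- the NaN row compares unequal
print(a.equals(b))       # True  -- .equals() treats aligned NaNs as equal
```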

@jreback I'm closing this now, I think you've nailed it. Very much appreciated!

@jreback
Contributor

jreback commented Jun 29, 2014

Great!

Feel free to comment on / update that other issue when you have time!
