to_hdf writes data that doesn't match read back #7605

Closed
vm-wylbur opened this issue Jun 29, 2014 · 7 comments

Comments

@vm-wylbur

here's the code:

    records.to_hdf(
        args.output, 'records',
        mode='w', format='fixed', append=False,
        complib='zlib', complevel=7, fletcher32=True)

    r2 = pd.read_hdf(
        path_or_buf=args.output, key='records',
        encoding='utf-8', start=None, stop=None)

    from pandas.util.testing import assert_frame_equal
    assert_frame_equal(records, r2, check_exact=True)

and the traceback:

/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/io/pytables.py:2441: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block1_values] [items->['dataset', 'record_id', 'DOD', 'CC', 'sex', 'name', 'loc', 'manner_of_death', 'eth', 'social_group', 'occ', 'clean_loc', 'month_of_death', 'year_of_death', 'name_sorted']]

  warnings.warn(ws, PerformanceWarning)
Traceback (most recent call last):
  File "src/import.py", line 59, in <module>
    tools.epilog(args, records, logger)
  File "/Users/pball/git/CO/match/import/src/lib/import_tools.py", line 46, in epilog
    assert_frame_equal(records, r2, check_exact=True)
  File "/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/util/testing.py", line 585, in assert_frame_equal
    check_exact=check_exact)
  File "/Users/pball/miniconda3/lib/python3.3/site-packages/pandas/util/testing.py", line 530, in assert_series_equal
    right.values))
AssertionError: [nan nan nan ..., 'c2681113' 'c12266508' 'c2680757'] is not equal to [nan nan nan ..., 'c2681113' 'c12266508' 'c2680757'].
make: *** [output/input-records.h5] Error 1

I've been trying to figure out why upstream fixes didn't seem to appear downstream, and I finally traced it here: apparently to_hdf is writing data that differs when it's read back. As I've re-run this over the last hour or so, different fields have come up in the AssertionError.

A few things that do not eliminate the error: compression on or off; format='table' or format='fixed'. Changing these arguments does, however, change which field assert_frame_equal flags as unequal.

I have no idea how to reproduce this without my entire dataset, which is unfortunately confidential. I'll fall back to CSV for now, and I hope I'm just doing something horribly dumb that we can fix.

@jreback
Contributor

jreback commented Jun 29, 2014

Please show pd.show_versions() and df.info().

You have odd dtypes; those need to be fixed before the frame will serialize properly.

@vm-wylbur
Author

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418243 entries, 0 to 418242
Data columns (total 18 columns):
dataset            418243 non-null object
record_id          418243 non-null object
DOD                418243 non-null object
CC                 186801 non-null object
sex                383230 non-null object
name               418243 non-null object
muni               418243 non-null float64
loc                335904 non-null object
age_int            321997 non-null float64
manner_of_death    277084 non-null object
eth                87254 non-null object
social_group       94010 non-null object
occ                130903 non-null object
clean_loc          335904 non-null object
month_of_death     418243 non-null object
year_of_death      418243 non-null object
depto              418243 non-null float64
name_sorted        418243 non-null object
dtypes: float64(3), object(15)

INSTALLED VERSIONS
------------------
commit: None
python: 3.3.5.final.0
python-bits: 64
OS: Darwin
OS-release: 13.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.0
nose: None
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 3.0.0-dev
sphinx: 1.2.2
patsy: None
scikits.timeseries: None
dateutil: 2.1
pytz: 2014.3
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.3.2
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented Jun 29, 2014

Any unicode or actual objects? (What I mean is: are the object fields just straight strings?)

@vm-wylbur
Author

The strings are unicode (they're Latin American names). I suspect we're onto something here -- I've been reviewing the fields that come back blank, and they contain some weird characters. Where did those come from? We cleaned that field upstream, but maybe it got un-cleaned somewhere. I'll clean it down to plain old-school ASCII and report back.
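
For the record, a stdlib-only sketch of the kind of ASCII flattening I mean (the real cleanup uses unidecode, which transliterates more gracefully; this version just strips accents and drops anything that won't map to ASCII):

```python
import unicodedata

def to_ascii(s):
    # NFKD splits accented letters into base letter + combining mark;
    # encoding to ASCII with errors="ignore" then drops the marks (and
    # anything else with no ASCII mapping).
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# Applied to the name column before writing, e.g.:
#   records["name"] = records["name"].map(to_ascii)
print(to_ascii("Muñoz Peña"))  # -> Munoz Pena
```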

@jreback
Contributor

jreback commented Jun 29, 2014

Might be related to #7244.
There is a decoding bug somewhere, but it's not easily reproducible.
I think the writing is fine;
it's the decoding. If you want to take a stab, you can step through the string conversions
and see if you can figure it out.

There is a try/except around a fast path which, if it fails, falls back to a slower path.
The problem, I think, is that the fast path sometimes succeeds when it shouldn't.

If you want to debug, that would be great!

@vm-wylbur
Author

progress:

  • The bug is that some unicode strings were simply not being read back (as in HDFStore still corrupted reads with utf8 #7244, as you suspected); the fields with weird characters came back blank.
  • Upstream I forced the field into English characters with unidecode, and now the written-and-read-back field equals the original.
  • The other fields are fine. I panicked at the assert_frame_equal() failures, but then I realized that NaN != NaN. I thought the fields with NaNs were also corrupted, but testing them row by row, they're fine.
  • My deadline is looming and I can't dig into the larger bug now, but I can in a week or so. It's worth it: in another project we have a table with hundreds of thousands of Arabic names, and we need to be sure nothing surprising is happening to them.
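
(The NaN point above, for anyone puzzled by it: NaN never compares equal to itself, so element-wise comparison flags matching NaN rows as differences, while Series.equals treats aligned NaNs as equal.)

```python
import numpy as np
import pandas as pd

# IEEE 754: NaN is never equal to itself.
print(np.nan == np.nan)  # False

a = pd.Series([1.0, np.nan])
b = pd.Series([1.0, np.nan])
print((a == b).all())    # False -- the NaN row compares unequal
print(a.equals(b))       # True  -- .equals() treats aligned NaNs as equal
```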

@jreback I'm closing this now, I think you've nailed it. Very much appreciated!

@jreback
Contributor

jreback commented Jun 29, 2014

Great!

Feel free to comment on / update that other issue when you have time!
