HDFStore still corrupted reads with utf8 #7244

Open
wabu opened this issue May 27, 2014 · 8 comments
Labels
Bug IO HDF5 read_hdf, HDFStore

Comments

@wabu
Contributor

wabu commented May 27, 2014

Reading from HDF5 still fails to return some UTF-8 encoded strings correctly.

I have written a test script that:

  • puts random valid UTF-8 strings into an HDF5 file
  • reads them back with PyTables and decodes them
  • uses HDFStore.select to read them

The HDFStore read often fails, but the data is stored correctly inside the HDF5 file.
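
For reference, here is a minimal, store-independent sketch of the byte-level round trip that such a test relies on. It is not the original test script and does not touch HDFStore or PyTables: it assumes fixed-width, NUL-padded byte fields as a stand-in for a fixed-length string column, and the helper names (random_utf8_string, to_fixed_width, from_fixed_width) are hypothetical. It only illustrates that encoding to UTF-8 and decoding again is lossless, so any mismatch would have to come from the read/decode path.

```python
import random


def random_utf8_string(rng, max_chars=12):
    """Build a random string from code points that UTF-8 can encode.

    NUL is excluded because the padding below uses it, and surrogates
    are excluded because they cannot be encoded as UTF-8.
    """
    chars = []
    for _ in range(rng.randint(1, max_chars)):
        cp = rng.randint(1, 0xFFFF)
        while 0xD800 <= cp <= 0xDFFF:
            cp = rng.randint(1, 0xFFFF)
        chars.append(chr(cp))
    return "".join(chars)


def to_fixed_width(strings, encoding="utf-8"):
    """Encode each string and NUL-pad all of them to a common byte width."""
    encoded = [s.encode(encoding) for s in strings]
    width = max(len(b) for b in encoded)
    return [b.ljust(width, b"\x00") for b in encoded], width


def from_fixed_width(fields, encoding="utf-8"):
    """Strip the NUL padding and decode each field individually."""
    return [f.rstrip(b"\x00").decode(encoding) for f in fields]


rng = random.Random(7244)  # fixed seed so the run is reproducible
originals = [random_utf8_string(rng) for _ in range(1000)]
fields, width = to_fixed_width(originals)
restored = from_fixed_width(fields)
mismatches = [(o, r) for o, r in zip(originals, restored) if o != r]
print(f"{len(mismatches)} mismatches out of {len(originals)}")
```

A fixed seed keeps the comparison deterministic, which makes it easier to turn a failing case into a regression test.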

Here's an example:

                        orig                                         h5ed
0  '7PFKۑw\x1d\x1bxyv\x03;S'   '\x002\x003\x004\x005\x006\x007\x008\x009'
1              '\x12漣\x03_L'  ':\x00;\x00<\x00=\x00>\x00?\x00@\x00A\x00B'
2            'ӉwxIy⣂&QC\x14'   '\x00C\x00D\x00E\x00F\x00G\x00H\x00I\x00J'
3          'u\x01\x10aۋZ9<s'   '\x00T\x00U\x00V\x00W\x00X\x00Y\x00Z\x00['
4       '!\x0b\x06зt\x13QMR'                                      '\x00c'

See #6505 for a previous issue on this topic.

@wabu
Contributor Author

wabu commented May 27, 2014

version info:

python: 3.3.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-61-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_NZ.UTF-8

pandas: 0.14.0rc1-73-g8793356
Cython: 0.19.1
numpy: 1.8.1
scipy: 0.12.0
IPython: 2.0.0-dev
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2013b
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
...

jreback added this to the 0.14.1 milestone May 27, 2014
@jreback
Contributor

jreback commented May 27, 2014

OK, feel free to have a look. AFAICT it correctly decides it can't decode and then uses the vectorized decode. I'm not sure whether the error is that data.astype('U16').astype(object) does the wrong thing when the data is encoded but contains only ASCII (so the 'regular' numpy coercion works), but I am not that familiar with what numpy is actually doing there.
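
To make the two code paths concrete, here is a rough sketch of the pattern, assuming a numpy array of fixed-width bytes. This is not the pandas implementation; decode_with_fallback is a hypothetical name. The sketch relies on numpy's bytes-to-str cast decoding as ASCII, so the bulk astype only succeeds when every element is pure ASCII and raises UnicodeDecodeError otherwise, at which point it falls back to decoding element by element.

```python
import numpy as np


def decode_with_fallback(data, encoding="utf-8"):
    """Decode a fixed-width bytes array into an object array of str.

    Hypothetical helper for illustration only, not pandas' internal code.
    """
    try:
        # Fast path: numpy's S -> U cast treats the bytes as ASCII.
        return data.astype("U").astype(object)
    except UnicodeDecodeError:
        # Slow path: decode every element with the real encoding.
        return np.array([b.decode(encoding) for b in data.tolist()], dtype=object)


ascii_only = np.array([b"abc", b"de"])
with_utf8 = np.array(["漣".encode("utf-8"), b"abc"])

print(decode_with_fallback(ascii_only).tolist())
print(decode_with_fallback(with_utf8).tolist())
```

Note that the fast path is chosen per array rather than per element, so a single non-ASCII value sends the whole array down the slow path.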

@jreback
Contributor

jreback commented May 27, 2014

Maybe make an example that has astype-decodable strings in one section and non-decodable ones in another.

Also, the example needs to pass or fail consistently on these cases. Of course, that means we have to figure out WHY it is failing!

@jreback
Contributor

jreback commented May 27, 2014

This might be the issue here: numpy/numpy#3939

If this is the case, then we may have to either detect when astype can be used or only do the vectorized decode (which is much slower, unfortunately); but it is better to be correct and slow than wrong and fast.

This might be related as well:
numpy/numpy#3258

PyTables 'fixes' this for a VLUnicodeAtom (which is not used in tables), so we may need to do something similar
https://github.com/PyTables/PyTables/blob/develop/tables/atom.py#L1108

@jreback
Contributor

jreback commented Jun 10, 2014

@wabu fix for this?

@wabu
Contributor Author

wabu commented Jun 11, 2014

I still have not dug deep enough into this ...

jreback modified the milestones: 0.15.0, 0.14.1 Jun 13, 2014
jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@OmerJog

OmerJog commented Jun 17, 2019

Is there a fix for this bug?

@TomAugspurger
Contributor

TomAugspurger commented Jun 17, 2019 via email

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022