
Incorrect match for pd.Term with Categorical at read_hdf #11304

Closed

michaelaye opened this issue Oct 12, 2015 · 18 comments
Labels
Bug IO HDF5 read_hdf, HDFStore Needs Info Clarification about behavior needed to assess issue

Comments

@michaelaye
Contributor

In the screenshot below, I am scanning the database for a categorical called classification_id with the value 50ef44b795e6e42cd2000001, but I am getting a data row where the categorical has the value 50ef44b795e6e42cd6000001.

How is this possible? Note that my list of categoricals is huge: more than 4 million entries, with 12 million total rows. (Yes, on average, each classification_id appears 3 times.)

[screenshot 2015-10-12 15 31 25: the on-disk selection returning a row with the mismatched classification_id]
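
Roughly, the selection looks like this (a sketch from memory; the store path is shortened to its basename):

import pandas as pd

dbname = '2015-10-11_planet_four_classifications_queryable_cleaned.h5'
# searching for this classification_id value...
data = pd.read_hdf(dbname, 'df',
                   where='classification_id=="50ef44b795e6e42cd2000001"')
# ...but the returned row's categorical holds 50ef44b795e6e42cd6000001
data.classification_id.values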

On a side note: displaying this one row in the line ending with .values takes a long time, possibly due to the large size of the Categorical. Can that be avoided somehow?

Here's my required meta-data for the bug report:
pandas Version: 0.17.0

INSTALLED VERSIONS

commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: None
numpy: 1.10.0
scipy: 0.16.0
statsmodels: None
IPython: 4.1.0-dev
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None

@michaelaye
Contributor Author

Here's the database store:

<class 'pandas.io.pytables.HDFStore'>
File path: /Users/klay6683/data/planet4/2015-10-11_planet_four_classifications_queryable_cleaned.h5
/df                                        frame_table  (typ->appendable,nrows->12588576,ncols->21,indexers->[index],dc->[classification_id,image_id,image_name,user_name,marking,acquisition_date,local_mars_time])
/df/meta/classification_id/meta            series_table (typ->appendable,nrows->4454898,ncols->1,indexers->[index],dc->[values])                                                                                    
/df/meta/image_id/meta                     series_table (typ->appendable,nrows->105796,ncols->1,indexers->[index],dc->[values])                                                                                     
/df/meta/image_name/meta                   series_table (typ->appendable,nrows->420,ncols->1,indexers->[index],dc->[values])                                                                                        
/df/meta/local_mars_time/meta              series_table (typ->appendable,nrows->237,ncols->1,indexers->[index],dc->[values])                                                                                        
/df/meta/marking/meta                      series_table (typ->appendable,nrows->4,ncols->1,indexers->[index],dc->[values])                                                                                          
/df/meta/user_name/meta                    series_table (typ->appendable,nrows->111593,ncols->1,indexers->[index],dc->[values])                                                                                     
/df/meta/values_block_2/meta               series_table (typ->appendable,nrows->105796,ncols->1,indexers->[index],dc->[values])   

@jreback
Contributor

jreback commented Oct 12, 2015

If you created this before upgrading to PyTables 3.2.2, you should create it again.
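
Something along these lines (a sketch; the file names are placeholders, and pass whatever data_columns you actually need):

import pandas as pd
import tables

print(tables.__version__)   # should be 3.2.2 or newer

# rewrite the store from a fresh read
df = pd.read_hdf('old_store.h5', 'df')
df.to_hdf('new_store.h5', 'df', format='table', data_columns=True)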

@jreback
Contributor

jreback commented Oct 12, 2015

No idea about the display issue; please show something that is easily reproducible.

@michaelaye
Contributor Author

I created this database today, so it's 3.2.2

@jreback
Contributor

jreback commented Oct 12, 2015

show the dtype of the classification_id table itself

@michaelaye
Contributor Author

Like so:

data.dtypes

classification_id          category
created_at           datetime64[ns]
image_id                   category
image_name                 category
image_url                  category
user_name                  category
marking                    category
x_tile                        int64
y_tile                        int64
acquisition_date     datetime64[ns]
local_mars_time            category
x                           float64
y                           float64
image_x                     float64
image_y                     float64
radius_1                    float64
radius_2                    float64
distance                    float64
angle                       float64
spread                      float64
version                     float64
dtype: object

working on an example for display time...

@jreback
Contributor

jreback commented Oct 12, 2015

show store.get_storer('df').table

@michaelaye
Contributor Author

/df/table (Table(12588576,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(10,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": Int32Col(shape=(1,), dflt=0, pos=3),
  "values_block_3": Int64Col(shape=(2,), dflt=0, pos=4),
  "classification_id": Int32Col(shape=(), dflt=0, pos=5),
  "image_id": Int32Col(shape=(), dflt=0, pos=6),
  "image_name": Int16Col(shape=(), dflt=0, pos=7),
  "user_name": Int32Col(shape=(), dflt=0, pos=8),
  "marking": Int8Col(shape=(), dflt=0, pos=9),
  "acquisition_date": Int64Col(shape=(), dflt=0, pos=10),
  "local_mars_time": Int16Col(shape=(), dflt=0, pos=11)}
  byteorder := 'little'
  chunkshape := (3718,)
  autoindex := True
  colindexes := {
    "acquisition_date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "image_name": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "local_mars_time": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "image_id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "marking": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "classification_id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "user_name": Index(6, medium, shuffle, zlib(1)).is_csi=False}

Here's an easy example showing the delay in displaying a single row of data:

import numpy as np
import pandas as pd

items = [str(i) for i in range(4000000)]
s = pd.Series(items, dtype='category')
df = pd.DataFrame({'C': s, 'data': np.random.randn(4000000)})
data = df[df.C == '20']
data.C.values
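
A possible workaround might be to drop the unused categories before displaying (continuing the snippet above; untested at this scale):

# drop the ~4 million unused categories, then display
data.C.cat.remove_unused_categories().values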

@jreback
Contributor

jreback commented Oct 12, 2015

Added #11305 for the rendering, which is a bug.

@jreback
Contributor

jreback commented Oct 12, 2015

Did you create the index after the table? Can you show some sample code?
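
(Creating the index separately would look something like this; the file name is a placeholder:)

import pandas as pd

with pd.HDFStore('store.h5') as store:
    # (re)build the index on a data column after appending
    store.create_table_index('df', columns=['classification_id'], kind='full')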

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Oct 12, 2015
@michaelaye
Contributor Author

I'm doing stuff on temporary files while those object columns are still objects, and only at the very end do I convert every object column to categorical, before creating the database HDF file with one df.to_hdf command. This is the last part of the whole pipeline:

    logging.info('Merging temp files manually.')

    if image_names is None:
        image_names = get_image_names(dbname)

    dbname_base, ext = os.path.splitext(dbname)
    dbnamenew = dbname_base + '_cleaned' + ext
    logging.info('Creating concatenated db file {}'.format(dbnamenew))
    df = []
    for image_name in image_names:
        try:
            df.append(pd.read_hdf(get_temp_fname(image_name), 'df'))
        except OSError:
            continue
        else:
            os.remove(get_temp_fname(image_name))
    df = pd.concat(df, ignore_index=True)

    # change types to category
    to_category = ['image_name', 'classification_id', 'image_id', 'image_url',
                   'user_name', 'marking', 'local_mars_time']
    for col in to_category:
        df[col] = df[col].astype('category')

    df.to_hdf(dbnamenew, 'df',
              format='table',
              data_columns=data_columns)
    logging.info('final database complete.')
    return dbnamenew

@jreback
Contributor

jreback commented Oct 12, 2015

And if you do that search on the frame in memory, does it work correctly?
(You could simulate this by reading the entire frame into memory, then selecting.)

Trying to narrow down whether it's something in the creation or something else.
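
For example (a sketch; the path is a placeholder, the value is the one from the report):

import pandas as pd

cid = '50ef44b795e6e42cd2000001'

on_disk = pd.read_hdf('store.h5', 'df',
                      where='classification_id=="%s"' % cid)
full = pd.read_hdf('store.h5', 'df')            # whole frame in memory
in_mem = full[full.classification_id == cid]

print(on_disk.classification_id.unique())
print(in_mem.classification_id.unique())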

@michaelaye
Contributor Author

Good point. When reading everything into memory, it does not match anything, i.e. it works correctly, because this classification_id value should have been filtered out before. Maybe my filtering is broken, something with setting things on views?

Here's how I filter. (It's a bit complicated, but basically: when one user has created more than one unique classification_id, I have to keep the earlier one.)
FYI, image_name is a higher level of the hierarchy than image_id, i.e. each image_name has hundreds of tiles that are identified via image_id.

This function gets dispatched to several cores:

    def process_image_name(image_name):
        import pandas as pd
        data = pd.read_hdf(dbname, 'df', where='image_name==' + image_name)
        data = remove_duplicates_from_image_name_data(data)
        data.to_hdf(get_temp_fname(image_name), 'df')

and the remover function is:

def remove_duplicates_from_image_name_data(data):
    """remove duplicates from this data.

    Parameters
    ==========
    data: pd.DataFrame
        data filtered for one image_name

    Returns
    =======
    For each `user_name` and `image_id` found in `data` return only the data
    for the first found classification_id. There *should* only be one
    classification_id per user_name and image_id, but sometimes the queue
    presented the same image_id more than once to the same users. This removes
    any later-in-time classification_ids per user_name and image_id.
    """
    def process_user_group(g):
        c_id = g.sort_values(by='created_at').classification_id.iloc[0]
        return g[g.classification_id == c_id]
    return data.groupby(['image_id', 'user_name']).apply(
        process_user_group).reset_index(drop=True)
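
For illustration, the de-duplication on a tiny made-up frame (hypothetical values):

import pandas as pd

data = pd.DataFrame({
    'image_id': ['t1', 't1', 't1'],
    'user_name': ['alice', 'alice', 'bob'],
    'classification_id': ['c2', 'c1', 'c3'],
    'created_at': pd.to_datetime(['2015-01-02', '2015-01-01', '2015-01-01']),
})
# keeps alice's earlier classification c1 (dropping c2) and bob's c3
print(remove_duplicates_from_image_name_data(data))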

I'm now going to try what happens with the unfiltered database file.

@michaelaye
Contributor Author

So, I read in the whole dataframe as above, it works in memory, and then I saved it again, and got exactly the same problem. (But, interestingly, the HDF file is half the size?) That should mean that I'm not doing anything wrong, correct?

@michaelaye
Contributor Author

I saved the dataframe after converting just the classification_id column to string. My on-disk selection works fine in that case.
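
Roughly this (a sketch; the file name is a placeholder and df is the merged frame from the pipeline above):

import pandas as pd

df['classification_id'] = df['classification_id'].astype(str)
df.to_hdf('workaround.h5', 'df', format='table',
          data_columns=['classification_id'])
pd.read_hdf('workaround.h5', 'df',
            where='classification_id=="50ef44b795e6e42cd2000001"')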

@jreback
Contributor

jreback commented Oct 13, 2015

The following replicates in size & scale what you are doing, yes?

In [1]: cats = [ "s%07d" % i for i in xrange(4000000) ]

In [2]: df = DataFrame({'A' : cats + cats + cats })

In [3]: df['B'] = df['A'].astype('category')

In [4]: df.B.cat.codes.dtype
Out[4]: dtype('int32')

In [5]: df[df.B=='s0000005']
Out[5]: 
                A         B
5        s0000005  s0000005
4000005  s0000005  s0000005
8000005  s0000005  s0000005

In [6]: df.to_hdf('test.h5','df',mode='w',data_columns=True,format='table')

In [7]: df[df.B=='s3999999']
Out[7]: 
                 A         B
3999999   s3999999  s3999999
7999999   s3999999  s3999999
11999999  s3999999  s3999999

In [8]: pd.read_hdf('test.h5','df',where='A="s3999999"')
Out[8]: 
                 A         B
3999999   s3999999  s3999999
7999999   s3999999  s3999999
11999999  s3999999  s3999999

In [9]: pd.read_hdf('test.h5','df',where='B="s3999999"')
Out[9]: 
                 A         B
3999999   s3999999  s3999999
7999999   s3999999  s3999999
11999999  s3999999  s3999999

@TomAugspurger
Contributor

@michaelaye is this still an issue?

@mroeschke mroeschke added Bug Needs Info Clarification about behavior needed to assess issue labels May 16, 2020
@rhshadrach
Member

Without a reproducible example or a report back from the user, nothing further can be done to investigate.
