
Incorrect match for pd.Term with Categorical at read_hdf #11304

Closed

michaelaye opened this issue Oct 12, 2015 · 18 comments
Labels
Bug IO HDF5 read_hdf, HDFStore Needs Info Clarification about behavior needed to assess issue

Comments

@michaelaye
Contributor

In the screenshot below, I am scanning the database for a categorical called classification_id with the value 50ef44b795e6e42cd2000001, but I am getting a data row where the categorical has the value 50ef44b795e6e42cd6000001.

How is this possible? Note that my list of categoricals is huge: more than 4 million entries, with 12 million total rows. (Yes, on average, each classification_id appears 3 times.)

[screenshot 2015-10-12 15 31 25: the on-disk selection returning a row with the mismatched classification_id]
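
Roughly, the selection looks like this (a sketch from memory; the store path is shortened to its basename):

import pandas as pd

dbname = '2015-10-11_planet_four_classifications_queryable_cleaned.h5'
# searching for this classification_id value...
data = pd.read_hdf(dbname, 'df',
                   where='classification_id=="50ef44b795e6e42cd2000001"')
# ...but the returned row's categorical holds 50ef44b795e6e42cd6000001
data.classification_id.values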

On a side note: displaying this one row in the line ending with .values takes a long time, possibly due to the large size of the Categorical. Can that be avoided somehow?

Here's my required meta-data for the bug report:
pandas Version: 0.17.0

INSTALLED VERSIONS

commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: None
numpy: 1.10.0
scipy: 0.16.0
statsmodels: None
IPython: 4.1.0-dev
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None

@michaelaye
Contributor Author

Here's the database store:

<class 'pandas.io.pytables.HDFStore'>
File path: /Users/klay6683/data/planet4/2015-10-11_planet_four_classifications_queryable_cleaned.h5
/df                                        frame_table  (typ->appendable,nrows->12588576,ncols->21,indexers->[index],dc->[classification_id,image_id,image_name,user_name,marking,acquisition_date,local_mars_time])
/df/meta/classification_id/meta            series_table (typ->appendable,nrows->4454898,ncols->1,indexers->[index],dc->[values])                                                                                    
/df/meta/image_id/meta                     series_table (typ->appendable,nrows->105796,ncols->1,indexers->[index],dc->[values])                                                                                     
/df/meta/image_name/meta                   series_table (typ->appendable,nrows->420,ncols->1,indexers->[index],dc->[values])                                                                                        
/df/meta/local_mars_time/meta              series_table (typ->appendable,nrows->237,ncols->1,indexers->[index],dc->[values])                                                                                        
/df/meta/marking/meta                      series_table (typ->appendable,nrows->4,ncols->1,indexers->[index],dc->[values])                                                                                          
/df/meta/user_name/meta                    series_table (typ->appendable,nrows->111593,ncols->1,indexers->[index],dc->[values])                                                                                     
/df/meta/values_block_2/meta               series_table (typ->appendable,nrows->105796,ncols->1,indexers->[index],dc->[values])   

@jreback
Contributor

jreback commented Oct 12, 2015

If you created this before upgrading to PyTables 3.2.2, you should create it again.
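
Something along these lines (a sketch; the file names are placeholders, and pass whatever data_columns you actually need):

import pandas as pd
import tables

print(tables.__version__)   # should be 3.2.2 or newer

# rewrite the store from a fresh read
df = pd.read_hdf('old_store.h5', 'df')
df.to_hdf('new_store.h5', 'df', format='table', data_columns=True)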

@jreback
Contributor

jreback commented Oct 12, 2015

No idea about the display issue; please show something that is easily reproducible.

@michaelaye
Contributor Author

I created this database today, so it's 3.2.2

@jreback
Contributor

jreback commented Oct 12, 2015

show the dtype of the classification_id table itself

@michaelaye
Contributor Author

Like so:

data.dtypes

classification_id          category
created_at           datetime64[ns]
image_id                   category
image_name                 category
image_url                  category
user_name                  category
marking                    category
x_tile                        int64
y_tile                        int64
acquisition_date     datetime64[ns]
local_mars_time            category
x                           float64
y                           float64
image_x                     float64
image_y                     float64
radius_1                    float64
radius_2                    float64
distance                    float64
angle                       float64
spread                      float64
version                     float64
dtype: object

working on an example for display time...

@jreback
Contributor

jreback commented Oct 12, 2015

show store.get_storer('df').table

@michaelaye
Contributor Author

/df/table (Table(12588576,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(10,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": Int32Col(shape=(1,), dflt=0, pos=3),
  "values_block_3": Int64Col(shape=(2,), dflt=0, pos=4),
  "classification_id": Int32Col(shape=(), dflt=0, pos=5),
  "image_id": Int32Col(shape=(), dflt=0, pos=6),
  "image_name": Int16Col(shape=(), dflt=0, pos=7),
  "user_name": Int32Col(shape=(), dflt=0, pos=8),
  "marking": Int8Col(shape=(), dflt=0, pos=9),
  "acquisition_date": Int64Col(shape=(), dflt=0, pos=10),
  "local_mars_time": Int16Col(shape=(), dflt=0, pos=11)}
  byteorder := 'little'
  chunkshape := (3718,)
  autoindex := True
  colindexes := {
    "acquisition_date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "image_name": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "local_mars_time": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "image_id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "marking": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "classification_id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "user_name": Index(6, medium, shuffle, zlib(1)).is_csi=False}

Here's an easy example showing the delay in displaying a single row of data:

import numpy as np
import pandas as pd

items = [str(i) for i in range(4000000)]
s = pd.Series(items, dtype='category')
df = pd.DataFrame({'C': s, 'data': np.random.randn(4000000)})
data = df[df.C == '20']
data.C.values
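
A possible workaround might be to drop the unused categories before displaying (continuing the snippet above; untested at this scale):

# drop the ~4 million unused categories, then display
data.C.cat.remove_unused_categories().values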

@jreback
Contributor

jreback commented Oct 12, 2015

Added #11305 for the rendering, which is a bug.

@jreback
Contributor

jreback commented Oct 12, 2015

Did you create the index after the table? Can you show some sample code?
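
(Creating the index separately would look something like this; the file name is a placeholder:)

import pandas as pd

with pd.HDFStore('store.h5') as store:
    # (re)build the index on a data column after appending
    store.create_table_index('df', columns=['classification_id'], kind='full')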

@jreback jreback added the IO HDF5 read_hdf, HDFStore label Oct 12, 2015
@michaelaye
Contributor Author

I'm doing stuff on temporary files while those object columns are still objects, and only at the very end do I convert every object column to categorical, before creating the database HDF file with one df.to_hdf command. This is the last part of the whole pipeline:

    logging.info('Merging temp files manually.')

    if image_names is None:
        image_names = get_image_names(dbname)

    dbname_base, ext = os.path.splitext(dbname)
    dbnamenew = dbname_base + '_cleaned' + ext
    logging.info('Creating concatenated db file {}'.format(dbnamenew))
    df = []
    for image_name in image_names:
        try:
            df.append(pd.read_hdf(get_temp_fname(image_name), 'df'))
        except OSError:
            continue
        else:
            os.remove(get_temp_fname(image_name))
    df = pd.concat(df, ignore_index=True)

    # change types to category
    to_category = ['image_name', 'classification_id', 'image_id', 'image_url',
                   'user_name', 'marking', 'local_mars_time']
    for col in to_category:
        df[col] = df[col].astype('category')

    df.to_hdf(dbnamenew, 'df',
              format='table',
              data_columns=data_columns)
    logging.info('final database complete.')
    return dbnamenew

@jreback
Contributor

jreback commented Oct 12, 2015

And if you do that search on the frame in memory, does it work correctly?
(You could simulate this by reading the entire frame into memory, then selecting.)

Trying to narrow down whether it's something in the creation or something else.
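
For example (a sketch; the path is a placeholder, the value is the one from the report):

import pandas as pd

cid = '50ef44b795e6e42cd2000001'

on_disk = pd.read_hdf('store.h5', 'df',
                      where='classification_id=="%s"' % cid)
full = pd.read_hdf('store.h5', 'df')            # whole frame in memory
in_mem = full[full.classification_id == cid]

print(on_disk.classification_id.unique())
print(in_mem.classification_id.unique())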

@michaelaye
Contributor Author

Good point. When reading everything into memory, it does not match anything, i.e. it works correctly, because this classification_id value should have been filtered out before. Maybe my filtering is broken, something with setting things on views?

Here's how I filter. (It's a bit complicated, but basically: when one user has created more than one unique classification_id, I have to keep the earlier one.)
FYI, image_name is a higher level of the hierarchy than image_id, i.e. each image_name has hundreds of tiles that are identified via image_id.

This function gets dispatched to several cores:

    def process_image_name(image_name):
        import pandas as pd
        data = pd.read_hdf(dbname, 'df', where='image_name==' + image_name)
        data = remove_duplicates_from_image_name_data(data)
        data.to_hdf(get_temp_fname(image_name), 'df')

and the remover function is:

def remove_duplicates_from_image_name_data(data):
    """remove duplicates from this data.

    Parameters
    ==========
    data: pd.DataFrame
        data filtered for one image_name

    Returns
    =======
    For each `user_name` and `image_id` found in `data` return only the data
    for the first found classification_id. There *should* only be one
    classification_id per user_name and image_id, but sometimes the queue
    presented the same image_id more than once to the same users. This removes
    any later-in-time classification_ids per user_name and image_id.
    """
    def process_user_group(g):
        c_id = g.sort_values(by='created_at').classification_id.iloc[0]
        return g[g.classification_id == c_id]
    return data.groupby(['image_id', 'user_name']).apply(
        process_user_group).reset_index(drop=True)
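
For illustration, the de-duplication on a tiny made-up frame (hypothetical values):

import pandas as pd

data = pd.DataFrame({
    'image_id': ['t1', 't1', 't1'],
    'user_name': ['alice', 'alice', 'bob'],
    'classification_id': ['c2', 'c1', 'c3'],
    'created_at': pd.to_datetime(['2015-01-02', '2015-01-01', '2015-01-01']),
})
# keeps alice's earlier classification c1 (dropping c2) and bob's c3
print(remove_duplicates_from_image_name_data(data))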

I'm now going to try what happens with the unfiltered database file.

@michaelaye
Contributor Author

So, I read in the whole dataframe as above, it works in memory, and then I saved it again, and got exactly the same problem. (But, interestingly, the HDF file is half the size?) That should mean that I'm not doing anything wrong, correct?

@michaelaye
Contributor Author

I saved the dataframe after converting just the classification_id column to string. My on-disk selection works fine in that case.
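
Roughly this (a sketch; the file name is a placeholder and df is the merged frame from the pipeline above):

import pandas as pd

df['classification_id'] = df['classification_id'].astype(str)
df.to_hdf('workaround.h5', 'df', format='table',
          data_columns=['classification_id'])
pd.read_hdf('workaround.h5', 'df',
            where='classification_id=="50ef44b795e6e42cd2000001"')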

@jreback
Contributor

jreback commented Oct 13, 2015

The following replicates in size & scale what you are doing, yes?

In [1]: cats = [ "s%07d" % i for i in xrange(4000000) ]

In [2]: df = DataFrame({'A' : cats + cats + cats })

In [3]: df['B'] = df['A'].astype('category')

In [4]: df.B.cat.codes.dtype
Out[4]: dtype('int32')

In [5]: df[df.B=='s0000005']
Out[5]: 
                A         B
5        s0000005  s0000005
4000005  s0000005  s0000005
8000005  s0000005  s0000005

In [6]: df.to_hdf('test.h5','df',mode='w',data_columns=True,format='table')

In [7]: df[df.B=='s3999999']
Out[7]: 
                 A         B
3999999   s3999999  s3999999
7999999   s3999999  s3999999
11999999  s3999999  s3999999

In [8]: pd.read_hdf('test.h5','df',where='A="s3999999"')
Out[8]: 
                 A         B
3999999   s3999999  s3999999
7999999   s3999999  s3999999
11999999  s3999999  s3999999

In [9]: pd.read_hdf('test.h5','df',where='B="s3999999"')
Out[9]: 
                 A         B
3999999   s3999999  s3999999
7999999   s3999999  s3999999
11999999  s3999999  s3999999

@TomAugspurger
Contributor

@michaelaye is this still an issue?

@mroeschke mroeschke added Bug Needs Info Clarification about behavior needed to assess issue labels May 16, 2020
@rhshadrach
Member

Without a reproducible example or a report back from the user, nothing further can be done to investigate.
