# Incorrect match for pd.Term with Categorical at read_hdf #11304
Here's the database store:

```
<class 'pandas.io.pytables.HDFStore'>
File path: /Users/klay6683/data/planet4/2015-10-11_planet_four_classifications_queryable_cleaned.h5
/df                              frame_table  (typ->appendable,nrows->12588576,ncols->21,indexers->[index],dc->[classification_id,image_id,image_name,user_name,marking,acquisition_date,local_mars_time])
/df/meta/classification_id/meta  series_table (typ->appendable,nrows->4454898,ncols->1,indexers->[index],dc->[values])
/df/meta/image_id/meta           series_table (typ->appendable,nrows->105796,ncols->1,indexers->[index],dc->[values])
/df/meta/image_name/meta         series_table (typ->appendable,nrows->420,ncols->1,indexers->[index],dc->[values])
/df/meta/local_mars_time/meta    series_table (typ->appendable,nrows->237,ncols->1,indexers->[index],dc->[values])
/df/meta/marking/meta            series_table (typ->appendable,nrows->4,ncols->1,indexers->[index],dc->[values])
/df/meta/user_name/meta          series_table (typ->appendable,nrows->111593,ncols->1,indexers->[index],dc->[values])
/df/meta/values_block_2/meta     series_table (typ->appendable,nrows->105796,ncols->1,indexers->[index],dc->[values])
```
If you created this before using PyTables 3.2.2, you should create it again.

No idea about the display issue; please show something that is easily reproducible.
I created this database today, so it's 3.2.2 |
Show the dtype of the classification_id table itself.
Like so:

```
>>> data.dtypes
classification_id          category
created_at           datetime64[ns]
image_id                   category
image_name                 category
image_url                  category
user_name                  category
marking                    category
x_tile                        int64
y_tile                        int64
acquisition_date     datetime64[ns]
local_mars_time            category
x                           float64
y                           float64
image_x                     float64
image_y                     float64
radius_1                    float64
radius_2                    float64
distance                    float64
angle                       float64
spread                      float64
version                     float64
dtype: object
```

Working on an example for display time...
show

```
/df/table (Table(12588576,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(10,), dflt=0.0, pos=1),
  "values_block_1": Int64Col(shape=(1,), dflt=0, pos=2),
  "values_block_2": Int32Col(shape=(1,), dflt=0, pos=3),
  "values_block_3": Int64Col(shape=(2,), dflt=0, pos=4),
  "classification_id": Int32Col(shape=(), dflt=0, pos=5),
  "image_id": Int32Col(shape=(), dflt=0, pos=6),
  "image_name": Int16Col(shape=(), dflt=0, pos=7),
  "user_name": Int32Col(shape=(), dflt=0, pos=8),
  "marking": Int8Col(shape=(), dflt=0, pos=9),
  "acquisition_date": Int64Col(shape=(), dflt=0, pos=10),
  "local_mars_time": Int16Col(shape=(), dflt=0, pos=11)}
  byteorder := 'little'
  chunkshape := (3718,)
  autoindex := True
  colindexes := {
    "acquisition_date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "image_name": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "local_mars_time": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "image_id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "marking": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "classification_id": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "user_name": Index(6, medium, shuffle, zlib(1)).is_csi=False}
```

Here's an easy example showing the delay in displaying a single row of data:

```python
items = [str(i) for i in range(4000000)]
s = pd.Series(items, dtype='category')
df = pd.DataFrame({'C': s, 'data': np.random.randn(4000000)})
data = df[df.C == '20']
data.C.values
```
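A possible workaround for the slow display, sketched here as an addition to the thread (smaller data than the report, for speed): the repr of a Categorical also renders its possibly huge category index, so converting the selected values to a plain ndarray first should avoid that cost.

```python
import numpy as np
import pandas as pd

# Sketch of a display workaround (illustrative, not from the original thread):
# build a large categorical column, select one row, then convert the
# categorical values to a plain object ndarray before displaying them,
# so no category index is carried along into the repr.
items = [str(i) for i in range(100000)]
s = pd.Series(items, dtype='category')
df = pd.DataFrame({'C': s, 'data': np.random.randn(len(s))})
data = df[df.C == '20']

dense = np.asarray(data.C)   # plain ndarray of the selected values only
print(dense)
```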
Added #11305 for the rendering, which is a bug.
Did you create the index after the table? Can you show some sample code?
I'm doing stuff on temporary files while those …

```python
logging.info('Merging temp files manually.')
if image_names is None:
    image_names = get_image_names(dbname)
dbname_base, ext = os.path.splitext(dbname)
dbnamenew = dbname_base + '_cleaned' + ext
logging.info('Creating concatenated db file {}'.format(dbnamenew))
df = []
for image_name in image_names:
    try:
        df.append(pd.read_hdf(get_temp_fname(image_name), 'df'))
    except OSError:
        continue
    else:
        os.remove(get_temp_fname(image_name))
df = pd.concat(df, ignore_index=True)
# change types to category
to_category = ['image_name', 'classification_id', 'image_id', 'image_url',
               'user_name', 'marking', 'local_mars_time']
for col in to_category:
    df[col] = df[col].astype('category')
df.to_hdf(dbnamenew, 'df',
          format='table',
          data_columns=data_columns)
logging.info('final database complete.')
return dbnamenew
```
And if you do that search on the frame in memory, does it work correctly? Trying to narrow down if it's the creation somewhere or something else.
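The in-memory version of that search can be sketched with the two ids from the report (the tiny frame here is illustrative): `DataFrame.query` uses a similar expression syntax to the HDF `where`, and on a categorical column it must hit the exact value only, never the near-miss id.

```python
import pandas as pd

# In-memory analogue of the HDFStore search (ids taken from the report,
# surrounding data invented): querying for one categorical value must not
# return the row holding the almost-identical other value.
df = pd.DataFrame({
    'classification_id': pd.Series(
        ['50ef44b795e6e42cd2000001', '50ef44b795e6e42cd6000001'],
        dtype='category'),
    'x': [1.0, 2.0],
})
hit = df.query('classification_id == "50ef44b795e6e42cd2000001"')
assert list(hit.classification_id) == ['50ef44b795e6e42cd2000001']
```

If this exact-match behavior holds in memory but not through `read_hdf`, the problem sits in the HDF query path rather than in the frame itself.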
Good point. When reading everything into memory, it does not match anything, i.e. it works correctly, because this … Here's how I filter. (It's a bit complicated, but basically: when one user has created more than one unique …) This function gets sent to several cores:

```python
def process_image_name(image_name):
    import pandas as pd
    data = pd.read_hdf(dbname, 'df', where='image_name==' + image_name)
    data = remove_duplicates_from_image_name_data(data)
    data.to_hdf(get_temp_fname(image_name), 'df')
```

and the remover function is:

```python
def remove_duplicates_from_image_name_data(data):
    """remove duplicates from this data.

    Parameters
    ==========
    data: pd.DataFrame
        data filtered for one image_name

    Returns
    =======
    For each `user_name` and `image_id` found in `data` return only the data
    for the first found classification_id. There *should* only be one
    classification_id per user_name and image_id, but sometimes the queue
    presented the same image_id more than once to the same users. This removes
    any later-in-time classification_ids per user_name and image_id.
    """
    def process_user_group(g):
        c_id = g.sort_values(by='created_at').classification_id.iloc[0]
        return g[g.classification_id == c_id]

    return data.groupby(['image_id', 'user_name']).apply(
        process_user_group).reset_index(drop=True)
```

I'm going to try now what happens on the unfiltered database file.
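The dedup logic above can be checked on a toy frame (the ids and dates below are invented for illustration): one user classified the same image twice, and only the rows belonging to the earlier `classification_id` should survive.

```python
import pandas as pd

# Toy check of the remover logic (invented data): user u1 saw image i1 under
# two classification_ids; c1 is the earlier one, so its two rows should be
# kept and the later c2 row dropped.
data = pd.DataFrame({
    'image_id': ['i1', 'i1', 'i1'],
    'user_name': ['u1', 'u1', 'u1'],
    'classification_id': ['c2', 'c1', 'c1'],
    'created_at': pd.to_datetime(['2015-10-02', '2015-10-01', '2015-10-01']),
})

def process_user_group(g):
    # first classification_id in time order wins; keep all its rows
    c_id = g.sort_values(by='created_at').classification_id.iloc[0]
    return g[g.classification_id == c_id]

cleaned = data.groupby(['image_id', 'user_name']).apply(
    process_user_group).reset_index(drop=True)
assert set(cleaned.classification_id) == {'c1'}
```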
So, I read in the whole dataframe, as above; it works in memory. Then I saved it again and got exactly the same problem (but, interestingly, at half the size of the HDF file?). That should mean that I'm not doing anything wrong, correct?
I saved the dataframe after converting just the …
The following replicates in size & scale what you are doing, yes?
@michaelaye is this still an issue? |
Without a repeatable example, or a user to report back, nothing further can be done to investigate.
In the screenshot below, I am scanning the database for a categorical called `classification_id` with the value `50ef44b795e6e42cd2000001`, but I am getting a data row where the categorical has the value `50ef44b795e6e42cd6000001`. How is this possible? Note that my list of categoricals is huge, more than 4 million entries, with 12 million total rows. (Yes, on average, each classification_id appears 3 times.)

On a side note: the display of this one row of numpy in the line ending with `.values` takes a lot of time, possibly due to the large size of the Categorical. Can that be avoided somehow?

Here's my required meta-data for the bug report:
```
pandas Version: 0.17.0

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: None
numpy: 1.10.0
scipy: 0.16.0
statsmodels: None
IPython: 4.1.0-dev
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.8
pymysql: None
psycopg2: None
```