BUG: Pandas groupby datetime and column then apply generates ValueError #21651

Closed
ginward opened this issue Jun 27, 2018 · 12 comments · Fixed by #41697
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

@ginward

ginward commented Jun 27, 2018

Code Sample, a copy-pastable example if possible

    | SYM_ROOT | TIME_M                     | BEST_BID | BEST_ASK | increment | genjud_incre | 
    |----------|----------------------------|----------|----------|-----------|--------------| 
    | A        | 2017-01-03 09:30:00.004712 | 45.91    | 46.12    | 0         | 4680         | 
    | AA       | 2017-01-03 09:30:00.004014 | 28.55    | 28.57    | 0         | 4680         | 
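import pandas as pd

# Defined before use so the snippet runs top to bottom.
def group_increment_to_end(x):
    # Keep only the first row of each group.
    return x.iloc[0:1]

df = pd.read_csv('stack.csv')
# The format matches the sample rows above, e.g. 2017-01-03 09:30:00.004712.
df['TIME_M'] = pd.to_datetime(df['TIME_M'], format='%Y-%m-%d %H:%M:%S.%f')
df.groupby(['SYM_ROOT', df['TIME_M'].dt.date]).apply(group_increment_to_end)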

Problem description

I am trying to group my dataframe and then apply a function to each group. SYM_ROOT is a categorical variable, while TIME_M is a datetime variable.

However, I keep getting the following error:

ValueError: Key 2017-01-03 00:00:00 not in level Index([2017-01-03], dtype='object', name=u'TIME_M')

I am referring to this stackoverflow post.

I think the reason is that pandas does not handle the date objects in the grouping key correctly in the groupby call. It wrongly compares the Timestamp 2017-01-03 00:00:00 against the date values in the TIME_M level and raises the error. After separating the date out into its own column, the problem goes away.

But I think it will be beneficial if pandas can recognize the date object correctly in the columns ...
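
A minimal sketch of the workaround described above, assuming the date is stored in its own column before grouping. The column name TRADE_DATE is hypothetical, the frame is rebuilt from the two sample rows instead of reading stack.csv, and dt.normalize() is a substitution that keeps the column as datetime64 rather than Python date objects. groupby(...).head(1) is used here in place of apply; it also keeps the first row of each group.

```python
import pandas as pd

# Two sample rows from the table above (only some columns kept).
df = pd.DataFrame({
    "SYM_ROOT": ["A", "AA"],
    "TIME_M": pd.to_datetime([
        "2017-01-03 09:30:00.004712",
        "2017-01-03 09:30:00.004014",
    ]),
    "BEST_BID": [45.91, 28.55],
})

# Hypothetical helper column: midnight of each timestamp, still datetime64.
df["TRADE_DATE"] = df["TIME_M"].dt.normalize()

# Group by two plain column names and keep the first row of each group.
first_rows = df.groupby(["SYM_ROOT", "TRADE_DATE"]).head(1)
print(first_rows)
```

Whether this matches the exact column layout the reporter ended up with is not stated in the issue, so treat it as one possible shape of the workaround.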

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 2.7.14.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-693.21.1.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.4.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@mroeschke
Member

Simpler example:

In [15]: df = pd.DataFrame({'date': pd.date_range('2010-01-01', freq='12H', periods=5), 'vals': range(5), 'let': list('abcde')})

In [16]: df
Out[16]:
                 date  vals let
0 2010-01-01 00:00:00     0   a
1 2010-01-01 12:00:00     1   b
2 2010-01-02 00:00:00     2   c
3 2010-01-02 12:00:00     3   d
4 2010-01-03 00:00:00     4   e

In [17]: df.groupby([df.let, df.date.dt.date]).apply(lambda x: x.iloc[0:])

In [18]: pd.__version__
Out[18]: '0.24.0.dev0+177.g45e55af'

From the traceback, it looks like the issue is in looking up a Timestamp key in a (Multi?)Index level that holds date objects.

Investigation and PRs welcome!
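
As a rough illustration of the type mismatch suspected above (a sketch only, not a reproduction of the failing lookup): dt.date yields an object-dtype Series of datetime.date values, while dt.normalize() keeps datetime64 values. The variable names are made up for this example.

```python
import datetime

import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2010-01-01 00:00",
    "2010-01-01 12:00",
    "2010-01-02 00:00",
]))

# dt.date converts each Timestamp to a plain datetime.date.
as_dates = stamps.dt.date
# dt.normalize() only resets the time of day, so the values stay Timestamps.
as_midnight = stamps.dt.normalize()

print(as_dates.dtype, type(as_dates.iloc[0]))
print(as_midnight.dtype, type(as_midnight.iloc[0]))
```

The level in the error message has dtype='object', which is consistent with the first form, while the key being looked up is a Timestamp.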

Traceback

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2968             try:
-> 2969                 return self._engine.get_loc(key)
   2970             except KeyError:

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    139
--> 140     cpdef get_loc(self, object val):
    141         if is_definitely_invalid_key(val):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    161         try:
--> 162             return self.mapping.get_item(val)
    163         except (TypeError, ValueError):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
   1489
-> 1490     cpdef get_item(self, object val):
   1491         cdef khiter_t k

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
   1497         else:
-> 1498             raise KeyError(val)
   1499

KeyError: Timestamp('2010-01-01 00:00:00')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    563                 try:
--> 564                     i = level.get_loc(key)
    565                 except KeyError:

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2970             except KeyError:
-> 2971                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2972

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    139
--> 140     cpdef get_loc(self, object val):
    141         if is_definitely_invalid_key(val):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    161         try:
--> 162             return self.mapping.get_item(val)
    163         except (TypeError, ValueError):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
   1489
-> 1490     cpdef get_item(self, object val):
   1491         cdef khiter_t k

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
   1497         else:
-> 1498             raise KeyError(val)
   1499

KeyError: Timestamp('2010-01-01 00:00:00')

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    917             try:
--> 918                 result = self._python_apply_general(f)
    919             except Exception:

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    940             values,
--> 941             not_indexed_same=mutated or self.mutated)
    942

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/groupby/groupby.py in _wrap_applied_output(self, keys, values, not_indexed_same)
   4220             return self._concat_objects(keys, values,
-> 4221                                         not_indexed_same=not_indexed_same)
   4222         elif self.grouper.groupings is not None:

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/groupby/groupby.py in _concat_objects(self, keys, values, not_indexed_same)
   1135                                 levels=group_levels, names=group_names,
-> 1136                                 sort=False)
   1137             else:

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    224                        verify_integrity=verify_integrity,
--> 225                        copy=copy, sort=sort)
    226     return op.get_result()

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    377
--> 378         self.new_axes = self._get_new_axes()
    379

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in _get_new_axes(self)
    457
--> 458         new_axes[self.axis] = self._get_concat_axis()
    459         return new_axes

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in _get_concat_axis(self)
    513             concat_axis = _make_concat_multiindex(indexes, self.keys,
--> 514                                                   self.levels, self.names)
    515

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    566                     raise ValueError('Key {key!s} not in level {level!s}'
--> 567                                      .format(key=key, level=level))
    568

ValueError: Key 2010-01-01 00:00:00 not in level Index([2010-01-01, 2010-01-02, 2010-01-03], dtype='object', name='date')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2968             try:
-> 2969                 return self._engine.get_loc(key)
   2970             except KeyError:

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    139
--> 140     cpdef get_loc(self, object val):
    141         if is_definitely_invalid_key(val):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    161         try:
--> 162             return self.mapping.get_item(val)
    163         except (TypeError, ValueError):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
   1489
-> 1490     cpdef get_item(self, object val):
   1491         cdef khiter_t k

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
   1497         else:
-> 1498             raise KeyError(val)
   1499

KeyError: Timestamp('2010-01-01 00:00:00')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    563                 try:
--> 564                     i = level.get_loc(key)
    565                 except KeyError:

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   2970             except KeyError:
-> 2971                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2972

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    139
--> 140     cpdef get_loc(self, object val):
    141         if is_definitely_invalid_key(val):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
    161         try:
--> 162             return self.mapping.get_item(val)
    163         except (TypeError, ValueError):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
   1489
-> 1490     cpdef get_item(self, object val):
   1491         cdef khiter_t k

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
   1497         else:
-> 1498             raise KeyError(val)
   1499

KeyError: Timestamp('2010-01-01 00:00:00')

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-17-e66fb9399357> in <module>()
----> 1 df.groupby([df.let, df.date.dt.date]).apply(lambda x: x.iloc[0:])

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
    928
    929                 with _group_selection_context(self):
--> 930                     return self._python_apply_general(f)
    931
    932         return result

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
    939             keys,
    940             values,
--> 941             not_indexed_same=mutated or self.mutated)
    942
    943     def _iterate_slices(self):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/groupby/groupby.py in _wrap_applied_output(self, keys, values, not_indexed_same)
   4219         elif isinstance(v, DataFrame):
   4220             return self._concat_objects(keys, values,
-> 4221                                         not_indexed_same=not_indexed_same)
   4222         elif self.grouper.groupings is not None:
   4223             if len(self.grouper.groupings) > 1:

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/groupby/groupby.py in _concat_objects(self, keys, values, not_indexed_same)
   1134                 result = concat(values, axis=self.axis, keys=group_keys,
   1135                                 levels=group_levels, names=group_names,
-> 1136                                 sort=False)
   1137             else:
   1138

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity, sort, copy)
    223                        keys=keys, levels=levels, names=names,
    224                        verify_integrity=verify_integrity,
--> 225                        copy=copy, sort=sort)
    226     return op.get_result()
    227

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity, copy, sort)
    376         self.copy = copy
    377
--> 378         self.new_axes = self._get_new_axes()
    379
    380     def get_result(self):

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in _get_new_axes(self)
    456                 new_axes[i] = ax
    457
--> 458         new_axes[self.axis] = self._get_concat_axis()
    459         return new_axes
    460

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in _get_concat_axis(self)
    512         else:
    513             concat_axis = _make_concat_multiindex(indexes, self.keys,
--> 514                                                   self.levels, self.names)
    515
    516         self._maybe_check_integrity(concat_axis)

/mnt/c/Users/Matt Roeschke/Projects/pandas-mroeschke/pandas/core/reshape/concat.py in _make_concat_multiindex(indexes, keys, levels, names)
    565                 except KeyError:
    566                     raise ValueError('Key {key!s} not in level {level!s}'
--> 567                                      .format(key=key, level=level))
    568
    569                 to_concat.append(np.repeat(i, len(index)))

ValueError: Key 2010-01-01 00:00:00 not in level Index([2010-01-01, 2010-01-02, 2010-01-03], dtype='object', name='date')

@mroeschke mroeschke added Bug Datetime Datetime data dtype Groupby labels Jun 27, 2018
@ginward ginward changed the title Pandas groupby datetime and column then apply generates ValueError BUG: Pandas groupby datetime and column then apply generates ValueError Jun 27, 2018
@uds5501
Contributor

uds5501 commented Jun 29, 2018

@mroeschke did you mean

In [17]: df.groupby([df.let, df.date, df.date]).apply(lambda x: x.iloc[0:])

@mroeschke
Member

No, the code above runs properly and reproduces the error (although I used a confusing column name, date, in my example dataframe).

@uds5501
Contributor

uds5501 commented Jun 29, 2018

@mroeschke I see that now, sorry, it was a syntax error on my side!

@sagarchaturvedi1

I am also facing this issue. My column contained date strings, which I parsed using the dateutil parser. I created two new columns, one for the date and one for the time. Then I grouped by an id column and the date, and tried to sort each group by time. Here is the code:

agg_data = data.groupby(["_id","updateDate"]).apply(lambda x: x.sort_values(["updateTime"]))

I get the following error:
ValueError: Key 2018-06-09 00:00:00 not in level Index([2018-05-11, 2018-05-12, 2018-05-13, 2018-05-14, 2018-05-15, 2018-05-16,
2018-05-17, 2018-05-18, 2018-05-19, 2018-05-20, 2018-05-21, 2018-05-22,
2018-05-23, 2018-05-24, 2018-05-25, 2018-05-26, 2018-05-27, 2018-05-28,
2018-05-29, 2018-05-30, 2018-05-31, 2018-06-01, 2018-06-02, 2018-06-03,
2018-06-04, 2018-06-05, 2018-06-06, 2018-06-07, 2018-06-08, 2018-06-09,
2018-06-10, 2018-06-11, 2018-06-12, 2018-06-13, 2018-06-14, 2018-06-15,
2018-06-16, 2018-06-17, 2018-06-18, 2018-06-19, 2018-06-20, 2018-06-21,
2018-06-22, 2018-06-23, 2018-06-24, 2018-06-25, 2018-06-26, 2018-06-27,
2018-06-28, 2018-06-29, 2018-06-30, 2018-07-01, 2018-07-02, 2018-07-03,
2018-07-04, 2018-07-05, 2018-07-06, 2018-07-07, 2018-07-08, 2018-07-09,
2018-07-10, 2018-07-11, 2018-07-12, 2018-07-13, 2018-07-14, 2018-07-15,
2018-07-16, 2018-07-17, 2018-07-18, 2018-07-19, 2018-07-20, 2018-07-21],
dtype='object', name='updateDate')

Output of pd.show_versions() -

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1065-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.8.0
pip: 18.0
setuptools: 40.2.0
Cython: 0.28.5
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.1.6
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@rrtaylor

I am also getting this error with the following example:

    BY_MONTH = pd.Grouper(key='date', freq='M', axis=1)
    df = pd.DataFrame({
        'date': pd.date_range(start='2000-01-01', freq='D', periods=100),
        'value': range(100)
    })
    ts = df.groupby(('value', BY_MONTH))['value'].mean()

I can confirm that this was working in 0.22.0 but started failing in 0.23.0.

@TomAugspurger
Contributor

@richardbrks your example may be correctly raising. I believe we had a change that made tuples always refer to a label. If you do df.groupby(['value', BY_MONTH]) you'll get your expected output.
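
A minimal sketch of the list form suggested above. It makes two assumptions that are not in the original example: the axis=1 argument is dropped, and the frequency is passed as a pd.offsets.MonthEnd() object instead of the 'M' string, since the accepted month-end alias differs between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range(start="2000-01-01", periods=100),  # daily by default
    "value": range(100),
})

# Month-end bins on the 'date' column.
by_month = pd.Grouper(key="date", freq=pd.offsets.MonthEnd())

# A list means "several keys"; a tuple would be read as a single label.
ts = df.groupby(["value", by_month])["value"].mean()
print(ts.head())
```

Each value occurs once in this frame, so every (value, month) group holds a single row and its mean is the value itself.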

@hwalinga
Contributor

This runs fine on the current master. So I think this can be closed.

@jreback
Contributor

jreback commented May 24, 2020

If you would like to check whether we already have a test for this, or add one if we don't, that would be great.

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug Groupby Datetime Datetime data dtype labels May 24, 2020
@maxzinkus

This seems not to be working in 1.0.5

@hwalinga
Contributor

@maxzinkus Yeah, this only started working in 1.1.0.

@emc5ud emc5ud removed their assignment Dec 31, 2020
@Rasori
Contributor

Rasori commented Jan 31, 2021

I can confirm that the issue is not reproducible with or after version 1.1.0.
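
A regression test along the lines requested earlier in the thread might look like the sketch below. The test name and assertions are hypothetical and are not taken from the pull request that closed this issue; the frame is the simpler example from above with the timestamps written out explicitly.

```python
import pandas as pd


def test_groupby_apply_with_date_key():
    # Same data as the simpler example: 12-hour steps over three days.
    df = pd.DataFrame({
        "date": pd.to_datetime([
            "2010-01-01 00:00",
            "2010-01-01 12:00",
            "2010-01-02 00:00",
            "2010-01-02 12:00",
            "2010-01-03 00:00",
        ]),
        "vals": range(5),
        "let": list("abcde"),
    })

    # Group by a column and by datetime.date objects derived from 'date'.
    result = df.groupby([df["let"], df["date"].dt.date]).apply(
        lambda x: x.iloc[0:]
    )

    # The call should not raise, and every original row should survive.
    assert len(result) == 5
    assert sorted(result["vals"]) == [0, 1, 2, 3, 4]
```

The assertions deliberately avoid checking the shape of the result index, because how group keys are attached to the output of apply has changed between pandas releases.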

@mroeschke mroeschke mentioned this issue May 28, 2021
@mroeschke mroeschke added this to the 1.3 milestone May 28, 2021