
Quantile function fails when performing groupby on Time Zone Aware Timestamps #33168


Closed
jlandercy opened this issue Mar 31, 2020 · 8 comments · Fixed by #47575
Assignees
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions Timezones Timezone data dtype
Milestone

Comments

@jlandercy

Code Sample, a copy-pastable example if possible

Maybe not a high-priority bug, but I have the feeling it can be fixed easily; I just don't have enough understanding of how it should be fixed. Please find below the MCVE to reproduce it:

import numpy as np
import pandas as pd

# Sample Dataset:
n = 200
c = np.random.choice([0,1,2], size=(n,))
d = np.random.randn(n)
t = pd.date_range(start='2020-04-19 00:00:00', freq='1T', periods=n, tz='UTC')
df = pd.DataFrame([r for r in zip(c, t, d)], columns=['category', 'timestamp', 'value'])
df['rtime'] = df['timestamp'].dt.floor('1H')

# Failing operation:
df.groupby('rtime').quantile([0.1, 0.5, 0.9])

Problem description

The traceback of the error is a bit laconic, and I don't have enough experience with the pandas source code to work out all the details of this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-aff2c9a206a6> in <module>
----> 1 df.groupby('rtime').quantile([0.1,0.2])

/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
   1926                     interpolation=interpolation,
   1927                 )
-> 1928                 for qi in q
   1929             ]
   1930             result = concat(results, axis=0, keys=q)

/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in <listcomp>(.0)
   1926                     interpolation=interpolation,
   1927                 )
-> 1928                 for qi in q
   1929             ]
   1930             result = concat(results, axis=0, keys=q)

/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in _get_cythonized_result(self, how, cython_dtype, aggregate, needs_values, needs_mask, needs_ngroups, result_is_index, pre_processing, post_processing, **kwargs)
   2289                 func = partial(func, ngroups)
   2290 
-> 2291             func(**kwargs)  # Call func to modify indexer values in place
   2292 
   2293             if result_is_index:

pandas/_libs/groupby.pyx in pandas._libs.groupby.__pyx_fused_cpdef()

TypeError: No matching signature found

I have found similar issues on GitHub with the same exception, but the exception is too generic to confirm they describe the same underlying problem. Additionally, I may have found a simple corner-case issue with TZ-aware timestamps.

I had a hard time reproducing the error while building the MCVE; I finally found out that it is related to the presence of an extra column holding time-zone-aware timestamps.

Maybe the fix is just a matter of updating the function signature to accept TZ-aware timestamps.

The problem can be worked around using one of the following approaches:

df.groupby('rtime')['value'].quantile([0.1,0.2])

Or:

df['timestamp'] = df['timestamp'].dt.tz_convert(None)
df.groupby('rtime').quantile([0.1,0.2])

Or:

df.pop('timestamp')
df.groupby('rtime').quantile([0.1,0.2])

All of which strongly suggests that it is the presence of the TZ-aware timestamp column that makes the quantile function fail.

Expected Output

Groupby operations on a DataFrame holding time-zone-aware timestamps should behave the same way as they do with TZ-naive timestamps.

Note: Thank you for building such a great tool; pandas is first-class middleware. Your efforts are strongly appreciated. Let me know how I can help; I would be happy to understand how this can be corrected.

Output of pd.show_versions()

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.1
setuptools : 46.1.3
Cython : 0.29.14
pytest : 5.3.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 0.999999999
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : None
tabulate : 0.8.3
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8
numba : None

@mroeschke
Member

Very similar to #6409; this data type is probably not supported for many groupby operations.

Thanks for the report! PRs and investigations welcome

@mroeschke mroeschke added Bug Groupby Timezones Timezone data dtype labels Mar 31, 2020
@jlandercy
Author

Hello @mroeschke, thank you for tagging this issue. Could you clarify in what sense issue #33168 is similar to #6409?

I am open to investigating this issue further, but I really lack insight into how pandas works under the hood. Mainly, I can see that this call fails:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L2267

And the function is defined as follows:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L2251

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L2239

But then I am totally lost: where should I look next? It seems to me that groupby is a really complex method.

@mroeschke
Member

#33168 and #6409 are similar in the sense that datetime data types are not supported in groupby.

A lot of our aggregation functions for cython are located here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby.pyx

For this issue, and similarly described in #6409, we will need to convert these dates to integers before passing them to Cython, and then wrap them back in their original data type.
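The conversion described above could be sketched roughly as follows. This is a minimal illustration of the idea, not the actual pandas internals; `quantile_via_int64` is a made-up helper name:

```python
import numpy as np
import pandas as pd

def quantile_via_int64(values: pd.Series, q: float) -> pd.Timestamp:
    # Sketch of the idea: view tz-aware datetimes as int64 nanoseconds,
    # run a plain numeric quantile on the integers, then wrap the result
    # back into a tz-aware Timestamp.
    ints = values.astype("int64")       # nanoseconds since the Unix epoch (UTC)
    base = int(ints.min())              # work relative to the minimum so the
    offsets = (ints - base).to_numpy()  # float math inside np.quantile stays exact
    result_ns = base + int(np.quantile(offsets, q))
    return pd.Timestamp(result_ns, unit="ns", tz=values.dt.tz)

s = pd.Series(pd.date_range("2020-04-19", periods=5, freq="1min", tz="UTC"))
quantile_via_int64(s, 0.5)  # Timestamp('2020-04-19 00:02:00+0000', tz='UTC')
```

The real fix would have to do this per-column inside the groupby machinery, but the round-trip (datetime → int64 → numeric kernel → datetime) is the core of it.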

@simonjayhawkins
Member

At first glance, it appears this code sample gave the correct results in 0.25.3:

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'0.25.3'
>>>
>>> # Sample Dataset:
... n = 200
>>> c = np.random.choice([0, 1, 2], size=(n,))
>>> d = np.random.randn(n)
>>> t = pd.date_range(start="2020-04-19 00:00:00", freq="1T", periods=n, tz="UTC")
>>> df = pd.DataFrame([r for r in zip(c, t, d)], columns=["category", "timestamp", "value"])
>>> df["rtime"] = df["timestamp"].dt.floor("1H")
>>>
>>> # Failing operation:
... df.groupby("rtime").quantile([0.1, 0.5, 0.9])
                               category           timestamp     value
rtime
2020-04-19 00:00:00+00:00 0.1       0.0 2020-04-19 00:05:54 -1.363624
                          0.5       1.0 2020-04-19 00:29:30 -0.101755
                          0.9       2.0 2020-04-19 00:53:06  0.882355
2020-04-19 01:00:00+00:00 0.1       0.0 2020-04-19 01:05:54 -1.265425
                          0.5       1.0 2020-04-19 01:29:30  0.022835
                          0.9       2.0 2020-04-19 01:53:06  1.470477
2020-04-19 02:00:00+00:00 0.1       0.0 2020-04-19 02:05:54 -1.313590
                          0.5       1.0 2020-04-19 02:29:30  0.049966
                          0.9       2.0 2020-04-19 02:53:06  1.162317
2020-04-19 03:00:00+00:00 0.1       0.0 2020-04-19 03:01:54 -1.321574
                          0.5       1.0 2020-04-19 03:09:30  0.054089
                          0.9       2.0 2020-04-19 03:17:06  1.055515
>>> df.groupby("rtime").quantile([0.1, 0.5, 0.9]).index.get_level_values(0)
DatetimeIndex(['2020-04-19 00:00:00+00:00', '2020-04-19 00:00:00+00:00',
               '2020-04-19 00:00:00+00:00', '2020-04-19 01:00:00+00:00',
               '2020-04-19 01:00:00+00:00', '2020-04-19 01:00:00+00:00',
               '2020-04-19 02:00:00+00:00', '2020-04-19 02:00:00+00:00',
               '2020-04-19 02:00:00+00:00', '2020-04-19 03:00:00+00:00',
               '2020-04-19 03:00:00+00:00', '2020-04-19 03:00:00+00:00'],
              dtype='datetime64[ns, UTC]', name='rtime', freq=None)
>>>

@mroeschke
Member

This looks okay on master. Could use a test

In [2]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: # Sample Dataset:
   ...: n = 200
   ...: c = np.random.choice([0,1,2], size=(n,))
   ...: d = np.random.randn(n)
   ...: t = pd.date_range(start='2020-04-19 00:00:00', freq='1T', periods=n, tz='UTC')
   ...: df = pd.DataFrame([r for r in zip(c, t, d)], columns=['category', 'timestamp', 'value'])
   ...: df['rtime'] = df['timestamp'].dt.floor('1H')
   ...:
   ...: # Failing operation:
   ...: df.groupby('rtime').quantile([0.1, 0.5, 0.9])
Out[2]:
                               category     value
rtime
2020-04-19 00:00:00+00:00 0.1       0.0 -1.546211
                          0.5       1.0 -0.285830
                          0.9       2.0  1.125414
2020-04-19 01:00:00+00:00 0.1       0.0 -1.077983
                          0.5       1.0  0.067612
                          0.9       2.0  0.858393
2020-04-19 02:00:00+00:00 0.1       0.0 -1.291133
                          0.5       1.0 -0.060443
                          0.9       2.0  1.216735
2020-04-19 03:00:00+00:00 0.1       0.0 -0.590608
                          0.5       1.0 -0.063747
                          0.9       2.0  0.949927
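
A regression test along these lines could look something like the sketch below. The test name and exact assertions are my own (the real test should follow pandas' testing conventions), and it assumes a pandas version in which the fix has landed:

```python
import numpy as np
import pandas as pd

def test_groupby_quantile_with_tz_aware_column():
    # Rebuild the MCVE from the report: two full hours of minute data,
    # with a TZ-aware 'timestamp' column alongside the numeric ones.
    n = 120
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "category": rng.choice([0, 1, 2], size=n),
        "timestamp": pd.date_range("2020-04-19", freq="1min", periods=n, tz="UTC"),
        "value": rng.standard_normal(n),
    })
    df["rtime"] = df["timestamp"].dt.floor("1h")

    # Must not raise, even though a TZ-aware column is present.
    result = df.groupby("rtime").quantile([0.1, 0.5, 0.9])

    # Two hourly groups x three quantiles = six rows.
    assert result.shape[0] == 6
    assert "value" in result.columns

test_groupby_quantile_with_tz_aware_column()
```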

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Jul 31, 2021
@dannyi96
Contributor

Hi @mroeschke, is this issue for adding a test case still open? If so, can I pick it up?

@mroeschke
Member

Hi @mroeschke, is this issue for adding a test case still open? If so, can I pick it up?

Go for it!

@dannyi96
Contributor

take

dannyi96 added a commit to dannyi96/pandas that referenced this issue Jul 1, 2022
dannyi96 added a commit to dannyi96/pandas that referenced this issue Jul 2, 2022
@jreback jreback added this to the 1.5 milestone Jul 3, 2022