
Quantile function fails when performing groupby on Time Zone Aware Timestamps #33168


Closed
jlandercy opened this issue Mar 31, 2020 · 8 comments · Fixed by #47575
Assignees
Labels
good first issue Groupby Needs Tests Unit test(s) needed to prevent regressions Timezones Timezone data dtype
Milestone

Comments

@jlandercy

Code Sample, a copy-pastable example if possible

Maybe not a high-priority bug, but I have the feeling it can be fixed easily; I just don't have enough understanding of how it should be fixed. Please find below the MCVE to reproduce it:

import numpy as np
import pandas as pd

# Sample Dataset:
n = 200
c = np.random.choice([0,1,2], size=(n,))
d = np.random.randn(n)
t = pd.date_range(start='2020-04-19 00:00:00', freq='1T', periods=n, tz='UTC')
df = pd.DataFrame([r for r in zip(c, t, d)], columns=['category', 'timestamp', 'value'])
df['rtime'] = df['timestamp'].dt.floor('1H')

# Failing operation:
df.groupby('rtime').quantile([0.1, 0.5, 0.9])

Problem description

The traceback of the error is a bit laconic, and I don't have enough experience with the pandas source code to work out all the details of this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-aff2c9a206a6> in <module>
----> 1 df.groupby('rtime').quantile([0.1,0.2])

/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in quantile(self, q, interpolation)
   1926                     interpolation=interpolation,
   1927                 )
-> 1928                 for qi in q
   1929             ]
   1930             result = concat(results, axis=0, keys=q)

/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in <listcomp>(.0)
   1926                     interpolation=interpolation,
   1927                 )
-> 1928                 for qi in q
   1929             ]
   1930             result = concat(results, axis=0, keys=q)

/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in _get_cythonized_result(self, how, cython_dtype, aggregate, needs_values, needs_mask, needs_ngroups, result_is_index, pre_processing, post_processing, **kwargs)
   2289                 func = partial(func, ngroups)
   2290 
-> 2291             func(**kwargs)  # Call func to modify indexer values in place
   2292 
   2293             if result_is_index:

pandas/_libs/groupby.pyx in pandas._libs.groupby.__pyx_fused_cpdef()

TypeError: No matching signature found

I have found similar issues on GitHub with the same exception, but the exception is too generic to confirm they describe the same underlying problem. Additionally, I may have found a simple corner-case issue with TZ-aware timestamps.

I had a hard time reproducing the error while building the MCVE; I finally found out that it is related to the presence of an extra column holding time-zone-aware timestamps.

Maybe the fix is just a matter of updating the function signature to accept TZ-aware timestamps.

The problem can be worked around using one of the following approaches:

df.groupby('rtime')['value'].quantile([0.1,0.2])

Or:

df['timestamp'] = df['timestamp'].dt.tz_convert(None)
df.groupby('rtime').quantile([0.1,0.2])

Or:

df.pop('timestamp')
df.groupby('rtime').quantile([0.1,0.2])

All of which strongly suggests that it is the presence of the TZ-aware timestamp column that makes the quantile function fail.

Expected Output

Groupby operations on a DataFrame holding time-zone-aware timestamps should behave the same way as they do with TZ-naive timestamps.

Note: Thank you for building such a great tool; pandas is first-class middleware. Your efforts are strongly appreciated. Let me know how I can help; I would be happy to understand how this can be corrected.

Output of pd.show_versions()

commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-91-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 9.0.1
setuptools : 46.1.3
Cython : 0.29.14
pytest : 5.3.2
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 0.999999999
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.13.0
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.15
tables : None
tabulate : 0.8.3
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8
numba : None

@mroeschke
Member

Very similar to #6409; this data type is probably not supported for many groupby operations.

Thanks for the report! PRs and investigations welcome

@mroeschke mroeschke added Bug Groupby Timezones Timezone data dtype labels Mar 31, 2020
@jlandercy
Author

Hello @mroeschke, thank you for tagging this issue. Could you clarify in what sense issue #33168 is similar to #6409?

I am open to investigating this issue further, but I really lack insight into how pandas works under the hood. Mainly, I can see that this call fails:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L2267

And the function is defined as follows:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L2251

https://github.com/pandas-dev/pandas/blob/master/pandas/core/groupby/groupby.py#L2239

But then I am totally lost: where should I look next? It seems to me that groupby is a really complex method.

@mroeschke
Member

#33168 and #6409 are similar in the sense that datetime data types are not supported in groupby.

A lot of our aggregation functions for cython are located here: https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/groupby.pyx

For this issue, and similarly described in #6409, we will need to convert these dates to integers before passing them to Cython, and then wrap them back in their original data type.
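The conversion described above could be sketched roughly as follows. This is a minimal illustration of the idea, not the actual pandas internals; `quantile_via_int64` is a made-up helper name:

```python
import numpy as np
import pandas as pd

def quantile_via_int64(values: pd.Series, q: float) -> pd.Timestamp:
    # Sketch of the idea: view tz-aware datetimes as int64 nanoseconds,
    # run a plain numeric quantile on the integers, then wrap the result
    # back into a tz-aware Timestamp.
    ints = values.astype("int64")       # nanoseconds since the Unix epoch (UTC)
    base = int(ints.min())              # work relative to the minimum so the
    offsets = (ints - base).to_numpy()  # float math inside np.quantile stays exact
    result_ns = base + int(np.quantile(offsets, q))
    return pd.Timestamp(result_ns, unit="ns", tz=values.dt.tz)

s = pd.Series(pd.date_range("2020-04-19", periods=5, freq="1min", tz="UTC"))
quantile_via_int64(s, 0.5)  # Timestamp('2020-04-19 00:02:00+0000', tz='UTC')
```

The real fix would have to do this per-column inside the groupby machinery, but the round-trip (datetime → int64 → numeric kernel → datetime) is the core of it.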

@simonjayhawkins
Member

At first glance, it appears this code sample gave the correct results in 0.25.3:

>>> import numpy as np
>>> import pandas as pd
>>> pd.__version__
'0.25.3'
>>>
>>> # Sample Dataset:
... n = 200
>>> c = np.random.choice([0, 1, 2], size=(n,))
>>> d = np.random.randn(n)
>>> t = pd.date_range(start="2020-04-19 00:00:00", freq="1T", periods=n, tz="UTC")
>>> df = pd.DataFrame([r for r in zip(c, t, d)], columns=["category", "timestamp", "value"])
>>> df["rtime"] = df["timestamp"].dt.floor("1H")
>>>
>>> # Failing operation:
... df.groupby("rtime").quantile([0.1, 0.5, 0.9])
                               category           timestamp     value
rtime
2020-04-19 00:00:00+00:00 0.1       0.0 2020-04-19 00:05:54 -1.363624
                          0.5       1.0 2020-04-19 00:29:30 -0.101755
                          0.9       2.0 2020-04-19 00:53:06  0.882355
2020-04-19 01:00:00+00:00 0.1       0.0 2020-04-19 01:05:54 -1.265425
                          0.5       1.0 2020-04-19 01:29:30  0.022835
                          0.9       2.0 2020-04-19 01:53:06  1.470477
2020-04-19 02:00:00+00:00 0.1       0.0 2020-04-19 02:05:54 -1.313590
                          0.5       1.0 2020-04-19 02:29:30  0.049966
                          0.9       2.0 2020-04-19 02:53:06  1.162317
2020-04-19 03:00:00+00:00 0.1       0.0 2020-04-19 03:01:54 -1.321574
                          0.5       1.0 2020-04-19 03:09:30  0.054089
                          0.9       2.0 2020-04-19 03:17:06  1.055515
>>> df.groupby("rtime").quantile([0.1, 0.5, 0.9]).index.get_level_values(0)
DatetimeIndex(['2020-04-19 00:00:00+00:00', '2020-04-19 00:00:00+00:00',
               '2020-04-19 00:00:00+00:00', '2020-04-19 01:00:00+00:00',
               '2020-04-19 01:00:00+00:00', '2020-04-19 01:00:00+00:00',
               '2020-04-19 02:00:00+00:00', '2020-04-19 02:00:00+00:00',
               '2020-04-19 02:00:00+00:00', '2020-04-19 03:00:00+00:00',
               '2020-04-19 03:00:00+00:00', '2020-04-19 03:00:00+00:00'],
              dtype='datetime64[ns, UTC]', name='rtime', freq=None)
>>>

@mroeschke
Member

This looks okay on master. Could use a test

In [2]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: # Sample Dataset:
   ...: n = 200
   ...: c = np.random.choice([0,1,2], size=(n,))
   ...: d = np.random.randn(n)
   ...: t = pd.date_range(start='2020-04-19 00:00:00', freq='1T', periods=n, tz='UTC')
   ...: df = pd.DataFrame([r for r in zip(c, t, d)], columns=['category', 'timestamp', 'value'])
   ...: df['rtime'] = df['timestamp'].dt.floor('1H')
   ...:
   ...: # Failing operation:
   ...: df.groupby('rtime').quantile([0.1, 0.5, 0.9])
Out[2]:
                               category     value
rtime
2020-04-19 00:00:00+00:00 0.1       0.0 -1.546211
                          0.5       1.0 -0.285830
                          0.9       2.0  1.125414
2020-04-19 01:00:00+00:00 0.1       0.0 -1.077983
                          0.5       1.0  0.067612
                          0.9       2.0  0.858393
2020-04-19 02:00:00+00:00 0.1       0.0 -1.291133
                          0.5       1.0 -0.060443
                          0.9       2.0  1.216735
2020-04-19 03:00:00+00:00 0.1       0.0 -0.590608
                          0.5       1.0 -0.063747
                          0.9       2.0  0.949927
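
A regression test along these lines could look something like the sketch below. The test name and exact assertions are my own (the real test should follow pandas' testing conventions), and it assumes a pandas version in which the fix has landed:

```python
import numpy as np
import pandas as pd

def test_groupby_quantile_with_tz_aware_column():
    # Rebuild the MCVE from the report: two full hours of minute data,
    # with a TZ-aware 'timestamp' column alongside the numeric ones.
    n = 120
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "category": rng.choice([0, 1, 2], size=n),
        "timestamp": pd.date_range("2020-04-19", freq="1min", periods=n, tz="UTC"),
        "value": rng.standard_normal(n),
    })
    df["rtime"] = df["timestamp"].dt.floor("1h")

    # Must not raise, even though a TZ-aware column is present.
    result = df.groupby("rtime").quantile([0.1, 0.5, 0.9])

    # Two hourly groups x three quantiles = six rows.
    assert result.shape[0] == 6
    assert "value" in result.columns

test_groupby_quantile_with_tz_aware_column()
```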

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Jul 31, 2021
@dannyi96
Contributor

Hi @mroeschke, is this issue for adding a test case still open? If so, can I pick it up?

@mroeschke
Member

Hi @mroeschke, is this issue for adding a test case still open? If so, can I pick it up?

Go for it!

@dannyi96
Contributor

take

dannyi96 added a commit to dannyi96/pandas that referenced this issue Jul 1, 2022
dannyi96 added a commit to dannyi96/pandas that referenced this issue Jul 2, 2022
@jreback jreback added this to the 1.5 milestone Jul 3, 2022