qcut does not handle infinite values correctly #11113

chrish42 opened this issue Sep 15, 2015 · 3 comments
Labels
Bug, cut

Comments

@chrish42
Contributor

Calling qcut on a pandas Series that contains infinite values should be a well-defined operation, but it produces wrong results or raises non-obvious exceptions. I'm using the following snippet to test:

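import numpy as np
import pandas as pd
n = 1  # number of infinite values appended; try 1, 2 and 3
data = list(range(10)) + [np.inf] * n  # list() is needed on Python 3
s = pd.Series(data, index=data)
pd.qcut(s, [0.1, 0.9])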

When called with n=1, it produces the following result:

0.000000       NaN
1.000000    [1, 9]
2.000000    [1, 9]
3.000000    [1, 9]
4.000000    [1, 9]
5.000000    [1, 9]
6.000000    [1, 9]
7.000000    [1, 9]
8.000000    [1, 9]
9.000000    [1, 9]
inf            NaN
dtype: category
Categories (1, object): [[1, 9]]

I don't think that the 0 value and the inf should get assigned to NaN bins. When called with n=2, it now produces:

0.000000           NaN
1.000000           NaN
2.000000    [1.1, inf]
3.000000    [1.1, inf]
4.000000    [1.1, inf]
5.000000    [1.1, inf]
6.000000    [1.1, inf]
7.000000    [1.1, inf]
8.000000    [1.1, inf]
9.000000    [1.1, inf]
inf         [1.1, inf]
inf         [1.1, inf]
dtype: category
Categories (1, object): [[1.1, inf]]

Again, the binning looks suspicious to me... And when called with n >= 3, I get the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-29-db4904bb94b0> in <module>()
      1 data = range(10) + [np.inf] * 3
      2 s = pd.Series(data, index=data)
----> 3 pd.qcut(s, [0.1, 0.9])

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in qcut(x, q, labels, retbins, precision)
    167     bins = algos.quantile(x, quantiles)
    168     return _bins_to_cuts(x, bins, labels=labels, retbins=retbins,precision=precision,
--> 169                          include_lowest=True)
    170 
    171 

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _bins_to_cuts(x, bins, right, labels, retbins, precision, name, include_lowest)
    201                 try:
    202                     levels = _format_levels(bins, precision, right=right,
--> 203                                             include_lowest=include_lowest)
    204                 except ValueError:
    205                     increases += 1

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_levels(bins, prec, right, include_lowest)
    240         levels = []
    241         for a, b in zip(bins, bins[1:]):
--> 242             fa, fb = fmt(a), fmt(b)
    243 
    244             if a != b and fa == fb:

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in <lambda>(v)
    236 def _format_levels(bins, prec, right=True,
    237                    include_lowest=False):
--> 238     fmt = lambda v: _format_label(v, precision=prec)
    239     if right:
    240         levels = []

C:\Anaconda\lib\site-packages\pandas\tools\tile.pyc in _format_label(x, precision)
    274                     return '%d' % (-whole - 1)
    275                 else:
--> 276                     return '%d' % (whole + 1)
    277 
    278             if 'e' in val:

TypeError: %d format: a number is required, not numpy.float64

... which doesn't look related to the cause at first glance. What is happening here is that the value passed to _format_label() and then to the % operator is a NaN, which the %d format doesn't support.
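To make the source of that NaN concrete, here is a minimal sketch. It assumes the quantile edge is obtained by linearly interpolating between the two neighbouring sorted values; `lerp_quantile` is a hypothetical helper written only for illustration and is not the pandas implementation.

```python
import math


def lerp_quantile(sorted_values, q):
    """Hypothetical helper: quantile via linear interpolation between neighbours."""
    idx = q * (len(sorted_values) - 1)   # fractional position in the sorted data
    lo = math.floor(idx)
    frac = idx - lo
    if frac == 0:
        # The position falls exactly on an element, no interpolation needed.
        return sorted_values[lo]
    a, b = sorted_values[lo], sorted_values[lo + 1]
    # If a and b are both inf, (b - a) is inf - inf, which is NaN.
    return a + (b - a) * frac


for n in (1, 2, 3):
    data = list(range(10)) + [math.inf] * n
    print(n, lerp_quantile(data, 0.9))
```

Under this assumption the upper edge is 9 for n=1, inf for n=2, and NaN for n=3, because with three infinities both neighbours of the 0.9 position are inf and inf - inf is NaN. That would be consistent with the bins shown above and with a NaN reaching the label formatter.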

@toobaz added the Missing-data label Jan 7, 2019
@ron819

ron819 commented Jan 9, 2019

@chrish42 Are you still facing this issue?

@jbrockmendel added the quantile label Nov 1, 2019
@mroeschke added the Bug and cut labels and removed the quantile label Apr 5, 2020
@mroeschke removed the Missing-data label Apr 18, 2021
@fangchenli
Member

import numpy as np
import pandas as pd
n = 1
data = list(range(10)) + [np.inf] * n
s = pd.Series(data, index=data)
result = pd.qcut(s, [0.1, 0.9])

result:

/opt/homebrew/Caskroom/miniforge/base/envs/pandas-dev/lib/python3.8/site-packages/numpy/lib/function_base.py:4011: RuntimeWarning: invalid value encountered in multiply
  lerp_interpolation = asanyarray(add(a, diff_b_a*t, out=out))
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    result = pd.qcut(s, [0.1, 0.9])
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 376, in qcut
    fac, bins = _bins_to_cuts(
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 441, in _bins_to_cuts
    labels = _format_labels(
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/reshape/tile.py", line 583, in _format_labels
    return IntervalIndex.from_breaks(breaks, closed=closed)
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/indexes/interval.py", line 255, in from_breaks
    array = IntervalArray.from_breaks(
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 415, in from_breaks
    return cls.from_arrays(breaks[:-1], breaks[1:], closed, copy=copy, dtype=dtype)
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 492, in from_arrays
    return cls._simple_new(
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 335, in _simple_new
    result._validate()
  File "/Users/fangchenli/Workspace/pandas-fangchenli/pandas/core/arrays/interval.py", line 601, in _validate
    raise ValueError(msg)
ValueError: missing values must be missing in the same location both left and right sides

@HansBambel

I am facing the same issue:
pd.qcut([1,2,3,4,5,-np.inf, np.inf], q=3, duplicates="drop")
results in ValueError: missing values must be missing in the same location both left and right sides

I was expecting the first and last bins to contain -np.inf and np.inf respectively. This was working in pandas 1.1.5.
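Until this is fixed, one possible workaround is sketched below. It is not equivalent to qcut: it assumes it is acceptable to compute the interior quantile edges from the finite values only and then extend the outer edges to -inf and +inf before calling pd.cut. The helper name `qcut_with_inf` is hypothetical.

```python
import numpy as np
import pandas as pd


def qcut_with_inf(values, q):
    """Hypothetical workaround: quantile bins whose outer edges are +/-inf."""
    s = pd.Series(values, dtype="float64")
    finite = s[np.isfinite(s)]
    # Interior quantile probabilities for q equal-sized bins, e.g. [1/3, 2/3].
    probs = np.linspace(0, 1, q + 1)[1:-1]
    # np.unique drops duplicate edges and keeps them sorted.
    inner = np.unique(finite.quantile(probs).to_numpy())
    edges = np.concatenate(([-np.inf], inner, [np.inf]))
    # include_lowest=True so that -inf itself lands in the first bin.
    return pd.cut(s, bins=edges, include_lowest=True)


result = qcut_with_inf([1, 2, 3, 4, 5, -np.inf, np.inf], q=3)
print(result)
```

With this sketch -np.inf falls in the first bin and np.inf in the last, and no value is assigned NaN. Note that the bin sizes are balanced over the finite values only, so the infinities make the outer bins slightly larger.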
