Skip to content

Conversation

realead
Copy link
Contributor

@realead realead commented Jan 1, 2021

  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

The goal of this PR is to make this comment obsolete:

// we implicitly convert signed int to unsigned int, thus potential overflows
// for operations (<<,*,+) don't trigger undefined behavior, also >>-operator
// is implementation defined for signed ints if sign-bit is set.
// because we never really "get" the keys, there will be no convertion from
// unsigend int to (signed) int (which would be implementation defined behavior)
// this holds also for 64-, 16- and 8-bit integers

see also this discussion for even more details.

The issue is:

  • the current way (inherited from khash-original) is too clever (and suprising)
  • it is wrong as the (unsigned) keys are converted back to signed values (which is implementation defined behavior) for example here:
    result_keys[i] = <{{dtype}}>table.keys[k]

Using signed integer, one must take into account that hash-function with signed integers might have undefined/implementation defined behavior, thus we cast signed integers to unsigned counterpart in the (64bit-) hash function now.

@jreback jreback added the Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff label Jan 1, 2021
@jreback jreback added this to the 1.3 milestone Jan 1, 2021
@jreback
Copy link
Contributor

jreback commented Jan 1, 2021

this looks fine. I assume that nothing in the current impl breaks because of this (e.g. converting int -> uint is safe as far as our tests go). any perf implications?

@realead
Copy link
Contributor Author

realead commented Jan 2, 2021

@jreback I expect no perf changes and ran asv only for a subset of tests ( -b hash_functions -b ^algorithms -b ^indexing) and there were no changes (albeit hash_functions.NumericSeriesIndexing seems to be somewhat flacky on my machine - but I have observed this behavior in other PRs as well).

@jreback jreback merged commit 39e1261 into pandas-dev:master Jan 3, 2021
@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

thanks @realead

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants