-
Notifications
You must be signed in to change notification settings - Fork 22
Faster isin comparisons #7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must admit I am confused by what is here.
Please forgive my ignorance if I've got everything back-to-front.
I note that the network objects could be addressed as n.network_address.packed
and n.netmask.packed
to get 8- or 16-byte byte-strings which could be ingested by numpy.
for i, ip in enumerate(pyips): | ||
if ip in network: | ||
mask[i] = True | ||
mask |= self._isin_network(network) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there no way to short-circuit here? Or is that something for a future cythonized method...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why loop to create networks
and then loop again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, I don't think there's anything to short-circut. We may not know whether an element of self
is in any of the other
networks until the last network.
Some context, I expected self
to be large and networks
to be small. So I'm OK looping over networks
.
cyberpandas/ip_array.py
Outdated
return mask | ||
|
||
def _isin_network(self, other): | ||
# type: (Union[ipaddress.IPv4Network,ipaddress.IPv6Network]) -> ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return type here is bool
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is elementwise, like pandas.Seres.isin
In [9]: pd.Series(['a', 'b', 'c']).isin(['a', 'b'])
Out[9]:
0 True
1 True
2 False
dtype: bool
cyberpandas/ip_array.py
Outdated
net_hi = int(other.broadcast_address) | ||
|
||
if isinstance(other, ipaddress.IPv4Network): | ||
# We know we only need the lower 64 bits |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicitly an IP4 address (where the hi part is non-zero) cannot be in
an IP6 address?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. The largest IPv4 address is
In [13]: IPArray([2 ** 32 - 1]).data
Out[13]:
array([(0, 4294967295)],
dtype=[('hi', '>u8'), ('lo', '>u8')])
So being IPv4 implies the hi bit is 0 (the inverse is not true, since some IPv6 addresses are less than 2 ** 64 -1).
cyberpandas/ip_array.py
Outdated
return mask | ||
|
||
def _isin_network(self, other): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should also accept list/array of IPs rather than networks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My plan is for isin
to accept either
- A single network / network string like
'192.168.0.0/28'
, or a list like of those - A list of IP Addresses
The first would call into this method. The second would call into an _is_ipaddresses(self, addresses)
method that hasn't been written yet..
cyberpandas/ip_array.py
Outdated
if isinstance(other, ipaddress.IPv4Network): | ||
# We know we only need the lower 64 bits | ||
return (self.data['hi'] == 0) & ((net_lo <= self.data['lo']) & | ||
(self.data['lo'] <= net_hi)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't entirely look like the condition (self.data['lo'] & int(other.netmask)) == int(other.network)
from the linked SO article.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should update that comment. I changed it to use what IPv4Network.__contains__
does.
# dealing with another address
else:
# address
return (int(self.network_address) <= int(other._ip) <=
int(self.broadcast_address))
Which I think is equivalent to the bit masking linked in the SO article.
cyberpandas/ip_array.py
Outdated
hi_mask = (self.data['hi'] > 0) & hi_mask_lo & hi_mask_hi | ||
# This seems incorrect... | ||
lo_mask = self.data['lo'] < (net_hi - net_lo) | ||
return hi_mask & lo_mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, this maybe should be
(self.data['lo'] & int(other.netmask) & lo_mask) == (int(other.network) & lo_mask) & (
(self.data['hi'] & int(other.netmask)>>64) == (int(other.network)>>64)
(lo_mask
is 0xffffffff
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you mean
(self.data['lo'] & int(other.netmask) & lo_mask) == (int(other.network_address) & lo_mask) & (
(self.data['hi'] & int(other.netmask)>>64) == (int(other.network_address)>>64))
with .network_address
instead of network
? That faills the test for
s = ip.IPArray([u'2001:db8::1', u'2001:db9::1'])
result = s.isin([u'2001:db8::0/96'])
It says both are in the network, while '2001:db9::1'
should not be (I think).
I'll take a closer look at your approach though. This stuff is confusing for me :/
I forgot that I had already written code for comparisons of |
# above by 'broadcast_address'. | ||
# IPArray handles comparisons between arrays of addresses, and NumPy | ||
# handles broadcasting. | ||
net_lo = type(self)([other.network_address]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this fail on attempt to compare IP4 and IP6?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we're in an IPArray, we don't care about IPv4 vs. IPv6. Everything is 128 bits.
net_lo = type(self)([other.network_address]) | ||
net_hi = type(self)([other.broadcast_address]) | ||
|
||
return (net_lo <= self) & (self <= net_hi) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this using ipaddress
's internal comparisons, or is this implemented by us elsewhere? I like that it ended up as a simple one-liner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented by us in
cyberpandas/cyberpandas/ip_array.py
Line 185 in 52e28f7
def __lt__(self, other): |
Amazing what a night of sleep can do to change your perspective on a problem :)
OK, liking it. |
No description provided.