Allows for merging of SparseDataFrames, and fixes __array__ interface #19488

Closed: wants to merge 25 commits

Commits
5b48613
Hrm I'm trying here but nothing seems to work
hexgnu Jan 26, 2018
555fb91
First pass at fixing issues with SparseDataFrame merging
hexgnu Feb 1, 2018
89677b0
Fix linting errors
hexgnu Feb 2, 2018
97bd19b
One more linting error
hexgnu Feb 2, 2018
b04ac64
Fixing more test failures
hexgnu Feb 2, 2018
d24e464
Add more tests and fix some existing ones
hexgnu Feb 2, 2018
a796280
Only allow for sparse merging if everything is sparse
hexgnu Feb 2, 2018
be09289
Some more tests
hexgnu Feb 2, 2018
1aef901
Merge remote-tracking branch 'upstream/master' into dense_array_sparse
hexgnu Feb 5, 2018
42680ba
Merge branch 'dense_array_sparse' into fix_for_merging_sparse_frames
hexgnu Feb 5, 2018
2084ed3
Fix problem with __array__ not showing dense values
hexgnu Feb 5, 2018
77c41b7
Fix some linting errors
hexgnu Feb 5, 2018
25fd08a
I think I fixed it
hexgnu Feb 5, 2018
cd583f7
Don't assume that the dtype is int64
hexgnu Feb 5, 2018
45e7cd3
Get rid of WTF from code
hexgnu Feb 5, 2018
6522d6b
Missed one point where assumed int64
hexgnu Feb 5, 2018
029d37b
Typo fix should be len(values.shape)
hexgnu Feb 5, 2018
730c152
This will fix windows bug on appveyor
hexgnu Feb 6, 2018
171f5dd
Linting error
hexgnu Feb 6, 2018
bde2588
Add whatsnew entry for everything
hexgnu Feb 6, 2018
bfe3065
Add docstring to is_sparse_join_units
hexgnu Feb 6, 2018
d13daa7
WIP for merging sparse frames
hexgnu Feb 7, 2018
3ddba8b
Merge remote-tracking branch 'upstream/master' into fix_for_merging_s…
hexgnu Feb 7, 2018
4fa7dec
Allow for indexes to where on SparseArray
hexgnu Feb 12, 2018
353171b
Rely on ABCSparseArray over SparseArray
hexgnu Feb 12, 2018
doc/source/whatsnew/v0.23.0.txt (1 addition, 1 deletion)
@@ -538,7 +538,7 @@ Reshaping
- Bug in :func:`DataFrame.merge` in which merging using ``Index`` objects as vectors raised an Exception (:issue:`19038`)
- Bug in :func:`DataFrame.stack`, :func:`DataFrame.unstack`, :func:`Series.unstack` which were not returning subclasses (:issue:`15563`)
- Bug in timezone comparisons, manifesting as a conversion of the index to UTC in ``.concat()`` (:issue:`18523`)
-
- Bug in :func:`SparseDataFrame.merge` which raises error (:issue:`13665`)
Contributor: This can go under enhancements, saying "Merging sparse DataFrames is now supported."

Contributor (author): Oh, it's an enhancement, alrighty. Thanks for the guidance.



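For context, here is a minimal dense sketch of the merge this entry describes, mirroring the frames used in the PR's test_merge_two_sparse_frames further below (frame names are illustrative; under this PR the same call is intended to work after converting each frame with .to_sparse(fill_value=...)):

```python
import numpy as np
import pandas as pd

# Illustrative frames: evens 0..198 and multiples of three 0..297
# share exactly the multiples of 6 up to 198.
dense_evens = pd.DataFrame({'A': list(range(0, 200, 2)),
                            'B': np.random.randint(0, 100, size=100)})
dense_threes = pd.DataFrame({'A': list(range(0, 300, 3)),
                             'B': np.random.randint(0, 100, size=100)})

# Inner merge keeps only the keys present in both frames.
merged = dense_evens.merge(dense_threes, how='inner', on='A')
print(len(merged))           # 34 shared keys: 0, 6, ..., 198
print(list(merged.columns))  # ['A', 'B_x', 'B_y']
```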
Categorical
(diff truncated)
pandas/core/internals.py (17 additions, 2 deletions)
@@ -2918,14 +2918,15 @@ def make_block(values, placement, klass=None, ndim=None, dtype=None,
# GH#19265 pyarrow is passing this
warnings.warn("fastpath argument is deprecated, will be removed "
"in a future release.", DeprecationWarning)

if klass is None:
dtype = dtype or values.dtype
klass = get_block_type(values, dtype)

elif klass is DatetimeTZBlock and not is_datetimetz(values):
return klass(values, ndim=ndim,
placement=placement, dtype=dtype)

Contributor: You might want to install a flake8 plugin for your editor ;)

Contributor (author): ok ok, installing now. I fully admit that I'm awful when it comes to linting failures and should just tool the fix. Might as well install Ctrl-P while I'm at it.

return klass(values, ndim=ndim, placement=placement)

# TODO: flexible with index=None and/or items=None
@@ -5120,14 +5121,28 @@ def concatenate_block_managers(mgrs_indexers, axes, concat_axis, copy):
elif is_uniform_join_units(join_units):
b = join_units[0].block.concat_same_type(
[ju.block for ju in join_units], placement=placement)
elif is_sparse_join_units(join_units):
Contributor: Do you ever go down the initial if with sparse arrays?

values = concatenate_join_units(join_units, concat_axis, copy=copy)
Contributor: this is a mess

Contributor (author): yes, yes it is. I'll work on cleaning this up tomorrow, since there are too many branches in this code.

values = values[0]
Contributor: I don't understand this line... A comment maybe? Or is it incorrect?

block = join_units[0].block

if block:
fill_value = block.fill_value
else:
fill_value = np.nan
array = SparseArray(values, fill_value=fill_value)
b = make_block(array, klass=SparseBlock, placement=placement)
else:
b = make_block(
concatenate_join_units(join_units, concat_axis, copy=copy),
placement=placement)
placement=placement
)
blocks.append(b)

return BlockManager(blocks, axes)

def is_sparse_join_units(join_units):
Contributor: Docstring would be nice, at least noting that this is true if any blocks are sparse.

return any(type(ju.block) is SparseBlock for ju in join_units)
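The helper under review can be sketched in isolation with the docstring the reviewer asks for. SparseBlock, DenseBlock, and JoinUnit here are minimal stand-ins for the pandas internals, not the real classes:

```python
from collections import namedtuple

# Minimal stand-ins for pandas internals (illustrative only).
class SparseBlock:
    pass

class DenseBlock:
    pass

JoinUnit = namedtuple('JoinUnit', ['block'])

def is_sparse_join_units(join_units):
    """Return True if any of the join units hold a SparseBlock."""
    return any(type(ju.block) is SparseBlock for ju in join_units)

print(is_sparse_join_units([JoinUnit(SparseBlock()), JoinUnit(DenseBlock())]))  # True
print(is_sparse_join_units([JoinUnit(DenseBlock())]))                           # False
```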

def is_uniform_join_units(join_units):
"""
(diff truncated)
pandas/core/reshape/merge.py (7 additions, 1 deletion)
@@ -38,6 +38,8 @@
concatenate_block_managers)
from pandas.util._decorators import Appender, Substitution

from pandas.core.sparse.array import SparseArray

from pandas.core.sorting import is_int64_overflow_possible
import pandas.core.algorithms as algos
import pandas.core.sorting as sorting
@@ -731,7 +733,11 @@ def _maybe_add_join_keys(self, result, left_indexer, right_indexer):
if mask.all():
key_col = rvals
else:
key_col = Index(lvals).where(~mask, rvals)
# Might need to be IntIndex not Index
Contributor: don't do this, use _values

if isinstance(lvals, SparseArray):
key_col = Index(lvals.get_values()).where(~mask, rvals)
Contributor (author): I'm not sure if this has memory or performance issues, but this is the best solution I could come up with. The other option would be to look at using lvals.sp_index and implement a where on it that works.

One thing I have noticed is that IntIndex doesn't act quite like Index, which makes doing these masks tricky in sparse land.

Contributor: It'd be nice to avoid get_values if possible.

else:
key_col = Index(lvals).where(~mask, rvals)

if result._is_label_reference(name):
result[name] = key_col
Expand Down
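The masked key combination at issue can be exercised on its own with plain arrays. lvals, rvals, and mask below are illustrative stand-ins for the values handled inside _maybe_add_join_keys:

```python
import numpy as np
import pandas as pd

# Stand-ins for the merge internals: left keys with gaps, right keys as fallback.
lvals = np.array([1.0, np.nan, 3.0, np.nan])
rvals = np.array([10.0, 20.0, 30.0, 40.0])
mask = np.isnan(lvals)  # positions where the left key is missing

# Keep left values where ~mask holds, otherwise take the right values,
# the same shape as: key_col = Index(lvals).where(~mask, rvals)
key_col = pd.Index(lvals).where(~mask, rvals)
print(list(key_col))  # [1.0, 20.0, 3.0, 40.0]
```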
pandas/core/sparse/frame.py (5 additions, 0 deletions)
@@ -28,6 +28,8 @@
import pandas.core.ops as ops
import pandas.core.common as com

from collections import Counter

_shared_doc_kwargs = dict(klass='SparseDataFrame')


@@ -73,6 +75,9 @@ def __init__(self, data=None, index=None, columns=None, default_kind=None,
if columns is None:
raise Exception("cannot pass a series w/o a name or columns")
data = {columns[0]: data}
elif isinstance(data, BlockManager):
if default_fill_value is None:
default_fill_value, _ = Counter([b.fill_value for b in data.blocks]).most_common(1)[0]
Contributor (author): I don't know if this is kosher or not...

Basically, I kept running into this issue: if you create a SparseDataFrame from a bunch of SparseSeries / SparseArrays that have fill_values like 1 or 0, it doesn't persist that to default_fill_value. That seems like an enhancement and I could take it out of this PR, but it helped me test.

Contributor: This seems like a recipe for surprise. I'd hate for df.reindex() to do something different based on the types of blocks that SparseDataFrame happened to be initialized with. If a user explicitly sets default_fill_value that's one thing, but inferring it from the data seems problematic.

Contributor (author): I can take it out; it didn't add much, tbh. It was just making my testing easier ;)


if default_fill_value is None:
default_fill_value = np.nan
(diff truncated)
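The fill_value inference discussed above boils down to a most-common vote over the blocks' fill values. A self-contained sketch, where FakeBlock is a hypothetical stand-in exposing only the attribute the expression uses:

```python
from collections import Counter

class FakeBlock:
    """Hypothetical stand-in exposing only the fill_value attribute."""
    def __init__(self, fill_value):
        self.fill_value = fill_value

blocks = [FakeBlock(0), FakeBlock(0), FakeBlock(1)]

# Same expression as in the diff: take the most frequent fill_value.
default_fill_value, count = Counter(
    [b.fill_value for b in blocks]).most_common(1)[0]
print(default_fill_value, count)  # 0 2
```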
pandas/tests/reshape/merge/test_merge.py (29 additions, 0 deletions)
@@ -7,6 +7,7 @@
import numpy as np
import random
import re
import itertools

import pandas as pd
from pandas.compat import lrange, lzip
@@ -1800,3 +1801,31 @@ def test_merge_on_indexes(self, left_df, right_df, how, sort, expected):
how=how,
sort=sort)
tm.assert_frame_equal(result, expected)

class TestMergeSparseDataFrames(object):
# Cannot seem to get 0 or 1 working with sparse data frame
@pytest.mark.parametrize('fill_value,how', itertools.product([np.nan], ['left', 'right', 'outer', 'inner']))
def test_merge_two_sparse_frames(self, fill_value, how):
dense_evens = pd.DataFrame({'A': list(range(0, 200, 2)), 'B': np.random.randint(0,100, size=100)})
dense_threes = pd.DataFrame({'A': list(range(0, 300, 3)), 'B': np.random.randint(0,100, size=100)})

dense_merge = dense_evens.merge(dense_threes, how=how, on='A')

# If you merge two dense frames together it tends to default to float64 not the original dtype
dense_merge['B_x'] = dense_merge['B_x'].astype(np.int64, errors='ignore')
Contributor (author): This seems kind of bizarre to me, and I couldn't find an issue for it, but basically: if you merge the two dense frames defined above, the dtype goes from int64 to float64. I think I know where the code is that's doing that, so I could fix it, but I didn't want to get too sidetracked in this work.

Contributor: That's since np.nan is the missing value indicator, which is a float. Doing the merge induces missing values for how=left/right/outer, so the ints are cast to floats.
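The reviewer's explanation is easy to reproduce with small frames (names are illustrative):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
right = pd.DataFrame({'A': [2, 3], 'C': [30, 40]})

# how='outer' introduces keys missing on one side; np.nan (a float)
# fills the gaps, so the int64 value columns are upcast to float64.
merged = left.merge(right, how='outer', on='A')
print(merged['B'].dtype, merged['C'].dtype)  # float64 float64
```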

dense_merge['B_y'] = dense_merge['B_y'].astype(np.int64, errors='ignore')

sparse_evens = dense_evens.to_sparse(fill_value=fill_value)
sparse_threes = dense_threes.to_sparse(fill_value=fill_value)

sparse_merge = sparse_evens.merge(sparse_threes, how=how, on='A')

assert sparse_merge.default_fill_value is fill_value

tm.assert_sp_frame_equal(dense_merge.to_sparse(fill_value=fill_value), sparse_merge, exact_indices=False, check_dtype=False)


@pytest.mark.parametrize('fill_value,how', itertools.product([0, 1, np.nan, None], ['left', 'right', 'outer', 'inner']))
def test_merge_dense_sparse_frames(self, fill_value, how):
"pass"

pandas/tests/sparse/frame/test_frame.py (10 additions, 7 deletions)
@@ -222,27 +222,30 @@ class Unknown:
'"Unknown" for data argument'):
SparseDataFrame(Unknown())

def test_constructor_preserve_attr(self):
@pytest.mark.parametrize('fill_value', [0, 1, np.nan, None])
def test_constructor_preserve_attr(self, fill_value):
# GH 13866
arr = pd.SparseArray([1, 0, 3, 0], dtype=np.int64, fill_value=0)
arr = pd.SparseArray([1, 0, 3, 0], dtype=np.int64, fill_value=fill_value)
assert arr.dtype == np.int64
assert arr.fill_value == 0
assert arr.fill_value == fill_value

df = pd.SparseDataFrame({'x': arr})
assert df['x'].dtype == np.int64
assert df['x'].fill_value == 0
assert df['x'].fill_value == fill_value
assert df.default_fill_value == fill_value

s = pd.SparseSeries(arr, name='x')
assert s.dtype == np.int64
assert s.fill_value == 0
assert s.fill_value == fill_value

df = pd.SparseDataFrame(s)
assert df['x'].dtype == np.int64
assert df['x'].fill_value == 0
assert df['x'].fill_value == fill_value

df = pd.SparseDataFrame({'x': s})
assert df['x'].dtype == np.int64
assert df['x'].fill_value == 0
assert df['x'].fill_value == fill_value


def test_constructor_nan_dataframe(self):
# GH 10079
(diff truncated)