BUG: DataFrame.sparse.from_spmatrix hard codes an invalid fill_value for certain subtypes #59064

Merged

Changes from 9 commits

2 changes: 1 addition & 1 deletion doc/source/user_guide/sparse.rst
@@ -188,7 +188,7 @@ Use :meth:`DataFrame.sparse.from_spmatrix` to create a :class:`DataFrame` with s
sp_arr = csr_matrix(arr)
sp_arr

sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr, fill_value=0)
sdf.head()
sdf.dtypes

2 changes: 1 addition & 1 deletion doc/source/whatsnew/v3.0.0.rst
@@ -584,7 +584,7 @@ Reshaping
Sparse
^^^^^^
- Bug in :class:`SparseDtype` for equal comparison with na fill value. (:issue:`54770`)
-
- Bug in :meth:`DataFrame.sparse.from_spmatrix` which hard coded an invalid ``fill_value`` for certain subtypes. (:issue:`59063`)

ExtensionArray
^^^^^^^^^^^^^^
30 changes: 24 additions & 6 deletions pandas/core/arrays/sparse/accessor.py
@@ -265,7 +265,9 @@ def _validate(self, data) -> None:
raise AttributeError(self._validation_msg)

@classmethod
def from_spmatrix(cls, data, index=None, columns=None) -> DataFrame:
def from_spmatrix(
cls, data, index=None, columns=None, fill_value=None
) -> DataFrame:
"""
Create a new DataFrame from a scipy sparse matrix.

@@ -276,6 +278,22 @@ def from_spmatrix(cls, data, index=None, columns=None) -> DataFrame:
index, columns : Index, optional
Row and column labels to use for the resulting DataFrame.
Defaults to a RangeIndex.
fill_value : scalar, optional
The scalar value not stored in the columns. By default, this
depends on the dtype of ``data``.

=========== ==========
dtype na_value
=========== ==========
float ``np.nan``
complex ``np.nan``
int ``0``
bool ``False``
datetime64 ``pd.NaT``
timedelta64 ``pd.NaT``
=========== ==========

The default value may be overridden by specifying a ``fill_value``.

Returns
-------
@@ -292,11 +310,11 @@ def from_spmatrix(cls, data, index=None, columns=None) -> DataFrame:
--------
>>> import scipy.sparse
>>> mat = scipy.sparse.eye(3, dtype=float)
>>> pd.DataFrame.sparse.from_spmatrix(mat)
>>> pd.DataFrame.sparse.from_spmatrix(mat, fill_value=0.0)
0 1 2
0 1.0 0 0
1 0 1.0 0
2 0 0 1.0
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
"""
from pandas._libs.sparse import IntIndex

@@ -313,7 +331,7 @@ def from_spmatrix(cls, data, index=None, columns=None) -> DataFrame:
indices = data.indices
indptr = data.indptr
array_data = data.data
dtype = SparseDtype(array_data.dtype, 0)
dtype = SparseDtype(array_data.dtype, fill_value)

Member

Can you use na_value_for_dtype instead of introducing a new argument?

Contributor Author

Hi Matthew. Thank you for your speedy response and taking the time to review my PR.

na_value_for_dtype is already called indirectly in this implementation: when fill_value is left at its default of None and passed to the SparseDtype constructor, the constructor falls back to it:

if fill_value is None:
fill_value = na_value_for_dtype(dtype)
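
For illustration, a minimal sketch of that fallback (the dtypes are chosen to match the defaults table documented for SparseDtype):

    import pandas as pd

    # With no explicit fill_value, SparseDtype falls back to na_value_for_dtype.
    print(pd.SparseDtype("float64").fill_value)         # nan
    print(pd.SparseDtype("int64").fill_value)           # 0
    print(pd.SparseDtype("bool").fill_value)            # False
    print(pd.SparseDtype("datetime64[ns]").fill_value)  # NaT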

A fill_value parameter is typically only needed when constructing a sparse-format object from a dense one, since we need to identify the non-zero elements in the data to correctly set attributes such as the data, index, and index pointer (for sparse array formats like BSR, CSR, and CSC). In this case we are simply reading those attributes from a CSC matrix, so you are right that a fill_value parameter is not strictly required to solve the bug.

However, in addition to fixing the bug without adding overhead, a fill_value parameter gives the user flexibility in certain use cases: for example, converting a SparseArray to a np.ndarray with np.asarray uses fill_value to populate the missing elements. This is along the lines of np.ma.core.MaskedArray which, albeit not a sparse implementation, has a filled method that uses a custom fill_value to convert to a np.ndarray.
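
As a minimal illustration of that analogy (the -1.0 fill value here is purely for demonstration):

    import numpy as np
    import pandas as pd

    # A SparseArray that treats -1.0 as the "not stored" value.
    arr = pd.arrays.SparseArray([1.0, -1.0, -1.0, 2.0], fill_value=-1.0)
    np.asarray(arr)  # array([ 1., -1., -1.,  2.]) -- fill_value populates the gaps

    # The MaskedArray counterpart mentioned above:
    masked = np.ma.MaskedArray([1.0, 0.0, 0.0, 2.0], mask=[False, True, True, False])
    masked.filled(-1.0)  # array([ 1., -1., -1.,  2.])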

Would you be OK with me keeping it?

The current test failures also seem unrelated to the PR and have been around since #59027.

Member

The core maintainers have been cautious about expanding the API's surface area and signatures unless necessary or for consistency reasons. I would prefer that this bug be solved without adding a new keyword argument first; you could then open a new issue about adding a new keyword here that has opt-in from more core team members.

Contributor Author

I understand and yes, the cautious approach makes total sense given the size of pandas.

The issues that I see arising as a result of removing the fill_value parameter and implementing your change are:

  • DataFrame.sparse.from_spmatrix().sparse.to_coo, shown as an example in the user guide, will break for float and complex subtypes because it will raise a ValueError. The original motivation for the check seems to have been that a custom fill_value is lost when converting to a COO matrix: scipy.sparse._coo.coo_matrix and the other sparse formats have no analogous attribute and instead use False, 0, 0., and 0. + 0.j when returning a dense representation as a np.ndarray or np.matrix, which was considered unexpected behaviour at the time. 🤷 We can remove the check because the constructor called in to_coo uses the ijv format rather than instantiating directly from a single 2-D np.ndarray, so the returned COO matrix will be correct regardless of the fill_value (see the sketch after this list). The affected user-guide example is:

        sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
        sdf.head()
        sdf.dtypes

        All sparse formats are supported, but matrices that are not in :mod:`COOrdinate <scipy.sparse>` format will be converted, copying data as needed.
        To convert back to sparse SciPy matrix in COO format, you can use the :meth:`DataFrame.sparse.to_coo` method:

        .. ipython:: python

           sdf.sparse.to_coo()

    and the check in to_coo that would be removed is:

        if sp_arr.fill_value != 0:
            raise ValueError("fill value must be 0 when converting to COO matrix")
  • All of the tests that rely on a DataFrame.sparse.from_spmatrix invocation, apart from the one I added to test the changes (test_from_spmatrix_fill_value), assume a fill_value of 0. Their expectations will all have to be changed and will then depend on na_value_for_dtype for the fill_value to use, and the docstring example will have to change.
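
A rough sketch of the ijv point above (not pandas' actual to_coo implementation): only the stored values and their positions are passed to the COO constructor, so no fill_value is ever materialised into a dense array.

    import numpy as np
    import scipy.sparse

    # ijv (row, col, value) triplets describing only the stored entries.
    rows = np.array([0, 1, 2])
    cols = np.array([0, 1, 2])
    data = np.array([1.0, 1.0, 1.0])

    coo = scipy.sparse.coo_matrix((data, (rows, cols)), shape=(3, 3))
    coo.toarray()
    # array([[1., 0., 0.],
    #        [0., 1., 0.],
    #        [0., 0., 1.]])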

Would you like me to make the changes above, close the PR, or keep the fill_value parameter as it is currently implemented in my branch?

Member

I think the changes above would be appropriate.

Contributor Author

Excellent. It should be rather straightforward and quick in that case! I will make the changes now.

arrays = []
for i in range(n_columns):
sl = slice(indptr[i], indptr[i + 1])
7 changes: 4 additions & 3 deletions pandas/core/dtypes/dtypes.py
@@ -1666,7 +1666,7 @@ class SparseDtype(ExtensionDtype):
"""
Dtype for data stored in :class:`SparseArray`.

`SparseDtype` is used as the data type for :class:`SparseArray`, enabling
SparseDtype is used as the data type for :class:`SparseArray`, enabling
more efficient storage of data that contains a significant number of
repetitive values typically represented by a fill value. It supports any
scalar dtype as the underlying data type of the non-fill values.
@@ -1677,19 +1677,20 @@ class SparseDtype(ExtensionDtype):
The dtype of the underlying array storing the non-fill value values.
fill_value : scalar, optional
The scalar value not stored in the SparseArray. By default, this
depends on `dtype`.
depends on ``dtype``.

=========== ==========
dtype na_value
=========== ==========
float ``np.nan``
complex ``np.nan``
int ``0``
bool ``False``
datetime64 ``pd.NaT``
timedelta64 ``pd.NaT``
=========== ==========

The default value may be overridden by specifying a `fill_value`.
The default value may be overridden by specifying a ``fill_value``.

Attributes
----------
4 changes: 3 additions & 1 deletion pandas/core/dtypes/missing.py
@@ -618,6 +618,8 @@ def na_value_for_dtype(dtype: DtypeObj, compat: bool = True):
nan
>>> na_value_for_dtype(np.dtype("float64"))
nan
>>> na_value_for_dtype(np.dtype("complex128"))
nan
>>> na_value_for_dtype(np.dtype("bool"))
False
>>> na_value_for_dtype(np.dtype("datetime64[ns]"))
@@ -629,7 +631,7 @@
elif dtype.kind in "mM":
unit = np.datetime_data(dtype)[0]
return dtype.type("NaT", unit)
elif dtype.kind == "f":
elif dtype.kind in "fc":
return np.nan
elif dtype.kind in "iu":
if compat:
28 changes: 24 additions & 4 deletions pandas/tests/arrays/sparse/test_accessor.py
@@ -105,14 +105,16 @@ def test_accessor_raises(self):

@pytest.mark.parametrize("format", ["csc", "csr", "coo"])
@pytest.mark.parametrize("labels", [None, list(string.ascii_letters[:10])])
@pytest.mark.parametrize("dtype", ["float64", "int64"])
@pytest.mark.parametrize("dtype", ["complex128", "float64", "int64"])
def test_from_spmatrix(self, format, labels, dtype):
sp_sparse = pytest.importorskip("scipy.sparse")

sp_dtype = SparseDtype(dtype, np.array(0, dtype=dtype).item())

mat = sp_sparse.eye(10, format=format, dtype=dtype)
result = pd.DataFrame.sparse.from_spmatrix(mat, index=labels, columns=labels)
result = pd.DataFrame.sparse.from_spmatrix(
mat, index=labels, columns=labels, fill_value=0
)
expected = pd.DataFrame(
np.eye(10, dtype=dtype), index=labels, columns=labels
).astype(sp_dtype)
@@ -124,7 +126,7 @@ def test_from_spmatrix_including_explicit_zero(self, format):

mat = sp_sparse.random(10, 2, density=0.5, format=format)
mat.data[0] = 0
result = pd.DataFrame.sparse.from_spmatrix(mat)
result = pd.DataFrame.sparse.from_spmatrix(mat, fill_value=0)
dtype = SparseDtype("float64", 0.0)
expected = pd.DataFrame(mat.todense()).astype(dtype)
tm.assert_frame_equal(result, expected)
@@ -139,10 +141,28 @@ def test_from_spmatrix_columns(self, columns):
dtype = SparseDtype("float64", 0.0)

mat = sp_sparse.random(10, 2, density=0.5)
result = pd.DataFrame.sparse.from_spmatrix(mat, columns=columns)
result = pd.DataFrame.sparse.from_spmatrix(mat, columns=columns, fill_value=0)
expected = pd.DataFrame(mat.toarray(), columns=columns).astype(dtype)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize(
"dtype, fill_value",
[("bool", False), ("float64", np.nan), ("complex128", np.nan)],
)
@pytest.mark.parametrize("format", ["csc", "csr", "coo"])
def test_from_spmatrix_fill_value(self, format, dtype, fill_value):
sp_sparse = pytest.importorskip("scipy.sparse")

sp_dtype = SparseDtype(dtype, fill_value)

sp_mat = sp_sparse.eye(10, format=format, dtype=dtype)
result = pd.DataFrame.sparse.from_spmatrix(sp_mat, fill_value=fill_value)
mat = np.eye(10, dtype=dtype)
expected = pd.DataFrame(
np.ma.array(mat, mask=(mat == 0)).filled(fill_value)
).astype(sp_dtype)
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize(
"colnames", [("A", "B"), (1, 2), (1, pd.NA), (0.1, 0.2), ("x", "x"), (0, 0)]
)
3 changes: 3 additions & 0 deletions pandas/tests/dtypes/test_missing.py
@@ -697,6 +697,9 @@ def test_array_equivalent_index_with_tuples():
("f2", np.nan),
("f4", np.nan),
("f8", np.nan),
# Complex
("c8", np.nan),
("c16", np.nan),
# Object
("O", np.nan),
# Interval
4 changes: 2 additions & 2 deletions pandas/tests/indexing/test_loc.py
@@ -1292,7 +1292,7 @@ def test_loc_getitem_range_from_spmatrix(self, spmatrix_t, dtype):
# diagonal cells are ones, meaning the last two columns are purely sparse.
rows, cols = 5, 7
spmatrix = spmatrix_t(np.eye(rows, cols, dtype=dtype), dtype=dtype)
df = DataFrame.sparse.from_spmatrix(spmatrix)
df = DataFrame.sparse.from_spmatrix(spmatrix, fill_value=0)

# regression test for GH#34526
itr_idx = range(2, rows)
Expand All @@ -1314,7 +1314,7 @@ def test_loc_getitem_sparse_frame(self):
# GH34687
sp_sparse = pytest.importorskip("scipy.sparse")

df = DataFrame.sparse.from_spmatrix(sp_sparse.eye(5))
df = DataFrame.sparse.from_spmatrix(sp_sparse.eye(5), fill_value=0)
result = df.loc[range(2)]
expected = DataFrame(
[[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]],