-
-
Notifications
You must be signed in to change notification settings - Fork 18.7k
BUG: DataFrame.sparse.from_spmatrix
hard codes an invalid fill_value
for certain subtypes
#59064
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
mroeschke
merged 15 commits into
pandas-dev:main
from
christophertitchen:fix-sparse-frame-accessor
Jun 27, 2024
Merged
Changes from 9 commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
cd922b6
BUG: :bug: :sparkles: Add fill_value param to from_spmatrix method.
christophertitchen 9323e43
ENH: :sparkles: Set explicit fill_value of NaN for complex floats.
christophertitchen 212a664
TST: :white_check_mark: Fix failing tests.
christophertitchen 81b33f5
TST: :white_check_mark: Add tests for from_spmatrix method.
christophertitchen f5f1479
DOC: :memo: Add what's new entry.
christophertitchen 57367aa
TST: :white_check_mark: Fix failing tests for sparse getitem.
christophertitchen e90963a
TST: :white_check_mark: Remove test for 256-bit complex float.
christophertitchen eb222a6
DOC: :memo: Update example in docstring for from_spmatrix method.
christophertitchen 221c4aa
DOC: :memo: Update some docstrings and sparse user guide.
christophertitchen 0d07c30
DOC: :pencil2: Update dtype docstring.
christophertitchen ccba29e
BUG: :rewind: :bug: Revert fill_value change and fix to_coo method.
christophertitchen d09171e
TST: :rewind: :white_check_mark: Fix and add sparse accessor tests.
christophertitchen b813453
TST: :rewind: :white_check_mark: Fix and add sparse getitem tests.
christophertitchen 499db2f
DOC: :rewind: :memo: Revert fill_value change to sparse user guide.
christophertitchen 9eb3dac
CLN: :pencil2: Fix instantiation of np.ma.array in test.
christophertitchen File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use
na_value_for_dtype
instead of introducing a new argument?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Matthew. Thank you for your speedy response and taking the time to review my PR.
na_value_for_dtype
is indirectly called in this implementation with the default argument offill_value
asNone
when passed to theSparseDtype
constructor.pandas/pandas/core/dtypes/dtypes.py
Lines 1746 to 1747 in a5e812d
A
fill_value
parameter is typically only needed when constructing a sparse format object from a dense format object, as we need to identify non-zero elements in the data to set attributes like the data, index, and index pointer (for sparse array formats like BSR, CSR, and CSC) in the sparse format object correctly. In this case, we are simply accessing these attributes from a CSC matrix, so you are right in that afill_value
parameter is not strictly required to solve the bug.However, in addition to fixing the bug and not adding overhead, it is handy to have a
fill_value
parameter to give the user flexibility in certain use cases, for example, converting aSparseArray
to anp.ndarray
usingnp.asarray
will usefill_value
to populate the missing elements. This is along the lines of how, albeit not a sparse implementation,np.ma.core.MaskedArray
has afilled
method that performs the same functionality of using a customfill_value
to convert to anp.ndarray
.Would you be OK with me keeping it?
The current test failures also seem unrelated to the PR and have been around since #59027.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core maintainers have been cautious about expanding the APIs surface areas and signatures unless necessary or for consistency issues. I would prefer if this bug was solved without adding a new keyword argument first, and then you could open a new issue about adding a new keyword here that has opt-in from more core team memebers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand and yes, the cautious approach makes total sense given the size of pandas.
The issues that I see arising as a result of removing the
fill_value
parameter and implementing your change are:DataFrame.sparse.from_spmatrix().sparse.to_coo
, shown as an example in the user guide, will break forfloat
andcomplex
subtypes as it will raise aValueError
. It seems like the original thought of adding this was because a customfill_value
will be lost upon converting to a COO matrix asscipy.sparse._coo.coo_matrix
and other sparse formats do not have an analogous attribute and instead useFalse
,0
,0.
, and0. + 0.j
when returning a dense representation of them as anp.ndarray
ornp.matrix
, which was considered unexpected behaviour at the time. 🤷 We can remove this check as the constructor called into_coo
is using the ijv format and not directly instantiating using a single 2-Dnp.ndarray
, so the COO matrix returned will be correct regardless of thefill_value
.pandas/doc/source/user_guide/sparse.rst
Lines 191 to 200 in b1e5f06
pandas/pandas/core/arrays/sparse/accessor.py
Lines 396 to 397 in b1e5f06
DataFrame.sparse.from_spmatrix
invocation, apart from the one that I added to test the changes,test_from_spmatrix_fill_value
, assume afill_value
of0
. The expectations will all have to be changed and will subsequently have a dependency onna_value_for_dtype
for thefill_value
to use, and the docstring example will have to change.Would you like me to make the changes above, close the PR, or keep the
fill_value
parameter as it is currently implemented in my branch?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes above would be appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent. It should be rather straightforward and quick in that case! I will make the changes now.