This repository was archived by the owner on Dec 22, 2019. It is now read-only.
forked from pandas-dev/pandas
-
Notifications
You must be signed in to change notification settings - Fork 0
Secondtry #1
Open
lowerthansound
wants to merge
31
commits into
master
Choose a base branch
from
secondtry
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Secondtry #1
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The test fixed another thing, but got broken by this PR due to its expected result being based indirectly in the bad upcasting of dtypes.
The sorting behaviour of append was pretty messy (still is, but to a lesser extent). When passed a dict or series, the append function would reindexing both self and other, and then send these to the concat function. When passed a list of series or dicts, the list would be transformed to a dframe and, if its columns were contained in self, we would reshape it to match those of self. Other types would be sent directly to concat though. The concat function, by itself, has a pretty consistent sorting for when sort=True, but doesn't seem so consistent for when sort=False or sort=None (I haven't dig much deeper, but pandas/core/indexes/api.py:_union_indexes sorts some arrays even if the sort argument is False. So, to solve this mess, I tried for the simplest approach. I'm leaving concat untouched and focusing only on append. Do the concatenating as usual and, if the sort parameter was False, we can reindex the resulting dframe to match what we would expect from an unsorted axis. Also, I needed to do some juggling because of the sort=None that is being deprecated. When 'other' is a dict or series and sort=None, we can set sort=False, and it will work like before. If 'other' is something else, we just avoid doing the reindex in the end. I haven't checked all cases, so I'm not sure if this is exact the same working as before. But hopefully, we can provide more tests to pinpoint the sorting behaviour of append better, and this will be enough by now.
There was a test that checked whether an exception was raised when appending to a series whose index could not be joined together. This test checked exclusively for TypeError, and this was what the old code produced during a dframe reindex at pandas/core/frame.py:6198. With the new code, however, the reindex is only being called at the end, and so, the sorts of error that occur tend to vary. For example, we got AttribteError, ValueError and TypeError. While this is not perfect, the situation didn't change much, it will only break for someone who was catching errors like this with a `except TypeError` clause. Despite this, the message that appeared was cryptic and still is. A better solution would involve informing the user of the reason such error ocurred, and keep the TypeError, to let things backward-compatible.
Lots of tests have been created. Tests scope: - Input types - Bad coercion to float (difficult to imagine, but I feel it was tested fully) - Rows index - Index name - Dtypes (+coercing) - Columns index - Index name - Dtypes (+ coercing) - Check duplicates - Sorting behavior \* Some tests not yet implemented: - Checks for duplicates in the rows index (linked with verify_integrity) - Ignoring rows index (ignore_index) --> Remember to use a modified assert_index_equal (or assert_frame_equal) - Behavior related to ignore_index (e.g. raising when unnamed Series is passed as arg) \*: The sorting tests forget about sort=None completely
Because concat does not have a clear reindexing behavior on the columns, I've implemented this behavior separately on DataFrame.append. Actually, the reindexing is implemented at pandas/code/indexes/api.py and is called from DataFrame.append. The main catch here is that any column types are allowed to be concatenated and, when sort is not possible, it raises an error. Another thing that was added was xfail marks on the tests. These represents parts of the code that haven't been implemented yet or fixes that are better on another PR. The behavior for sort=None isn't totally sorted out yet.
The resulting index dtype was being inferred (not always?) from the indexes values. This caused a subtle problem where we inferred in a place the user didn't want to. For example: >>> df1.columns Index([0, 1, 2], dtype='object') >>> df1.append(df1, sort=False).columns Int64Index([0, 1, 2], dtype='int64') This commit solves this problem, but it raises a question for empty indexes, should they be considered when calculating the final dtype or not? My intuition says that the user usually doesn't know about the index type, specially if it's empty, so we may avoid friction ignoring empty indexes in the calculation. The same argument may be used for column dtypes after an append where the resulting DataFrame has exactly one row.
TODO: This shall be reversed later, or be made a bit more strict. My best choice is: ignore when it is empty of dtype object, consider if it is empty of another dtype. May interact somewhat with the result float64 of reindex.
Will be better made in a future version.
When there were duplicates on the columns index, sort was allowed and duplicates were allowed if the indexes had the same values (as found by idx.tolist()). Now, considering that pandas doesn't allow to sort the index when there are duplicate values (DataFrame.reindex fails) and that searching for the same values is counter-productive and prone to fail, depending on the different types of indexes, the behavior was modified to this: - When sort=True and there are duplicates in at least one index, an error is raised and append stops. - Dframes with duplicate indexes are only considered to be joined when the indexes share the same identity (that is, they are the same object comparable with `idx1 is idx2`) Some other improvements to the code have also been made and I believe it is better in a general mode.
Handle columns index duplicates
86f60b0
to
2617449
Compare
Also trying to use Index without arguments
This reflects in the columns index of DataFrame.append, it will ignore empty indexes (of dtype object)! Some tests are not passing, but this is due to columns dtypes, not indexes.
222d0b4
to
a8f03b7
Compare
Implement sort=None behavior and regression tests The default value for sort (None) had a complex behavior and this commit aimed to reproduce it. When the defaul valuet change, it will be wise to remove what was added in this commit. Along with some preparation code that was already present in `append` (calculating `_obj_type` and `_item_type`).
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
git diff upstream/master -u -- "*.py" | flake8 --diff