
API: honor copy=True when passing dict to DataFrame #38939

Merged 69 commits on Mar 31, 2021

Changes from 21 commits
cbc97f0
ENH: allow non-consolidation in constructors
jbrockmendel Oct 5, 2020
a135d96
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Oct 5, 2020
5c94129
mypy fixup
jbrockmendel Oct 5, 2020
d653c54
ENH: allow non-consolidation in constructors
jbrockmendel Oct 5, 2020
e71e319
Merge branch 'myway-init-no-consolidate' of github.com:jbrockmendel/p…
jbrockmendel Jan 3, 2021
cafa718
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 3, 2021
c706ad6
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 3, 2021
3f9195e
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 4, 2021
1cba671
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 4, 2021
396daba
BUG: respect copy=False in constructing DataFrame from dict
jbrockmendel Jan 4, 2021
11ae1c9
whatsnew
jbrockmendel Jan 4, 2021
b505267
clean test
jbrockmendel Jan 4, 2021
a1e9b68
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 4, 2021
b70d997
fixed xfail
jbrockmendel Jan 4, 2021
09213e0
update whatsnew
jbrockmendel Jan 4, 2021
e895de0
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 4, 2021
31bda58
de-kludge
jbrockmendel Jan 4, 2021
37a2c0c
remove no-longer-used msg
jbrockmendel Jan 4, 2021
b93d7d5
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 10, 2021
aa667a6
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 12, 2021
185cd99
fix broken test
jbrockmendel Jan 12, 2021
701356f
Consolidate in tm, update whatsnew
jbrockmendel Jan 20, 2021
948ac67
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 21, 2021
46f2fcf
always copy when data is None
jbrockmendel Jan 21, 2021
0bbfec0
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Jan 23, 2021
590c820
update exception message
jbrockmendel Jan 23, 2021
17a693c
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Jan 27, 2021
fb8f32d
update exception message
jbrockmendel Jan 27, 2021
7835184
typo fixup
jbrockmendel Jan 28, 2021
2136289
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 2, 2021
187499c
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 2, 2021
6f30beb
Merge branch 'master' of https://github.com/pandas-dev/pandas into my…
jbrockmendel Feb 4, 2021
e80d57c
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 11, 2021
b0a6abd
CI: fix broken asv
jbrockmendel Feb 11, 2021
bf942ae
revert
jbrockmendel Feb 11, 2021
95e30a5
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 11, 2021
510f697
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 13, 2021
48e359e
Default to copy=True for dict data
jbrockmendel Feb 13, 2021
048e826
troubleshoot docbuild
jbrockmendel Feb 16, 2021
6a9c9f0
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 17, 2021
a17c728
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 18, 2021
5b3d419
update whatsnew
jbrockmendel Feb 18, 2021
f961378
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 18, 2021
fcee44b
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 27, 2021
0c60ae8
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 27, 2021
1468e59
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Feb 27, 2021
b6d8b70
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 3, 2021
8b66b11
skip for ArrayManager
jbrockmendel Mar 3, 2021
54cacfc
Update doc/source/whatsnew/v1.3.0.rst
jbrockmendel Mar 6, 2021
5ea7a75
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 6, 2021
e11ea68
requested edits
jbrockmendel Mar 6, 2021
65d01c7
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 8, 2021
41c4e7a
test test_df_mod_zero_df with and without copy
jbrockmendel Mar 8, 2021
7260a72
collect copy-adjusting code in one place
jbrockmendel Mar 9, 2021
52344bb
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 14, 2021
3ddc3d3
update docstring
jbrockmendel Mar 14, 2021
e6bae0f
whatsnew, comment
jbrockmendel Mar 14, 2021
7cab084
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 15, 2021
e8e3d84
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 15, 2021
5c44953
mypy fixup
jbrockmendel Mar 15, 2021
e32f630
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 16, 2021
abd890a
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 16, 2021
1b7f7ca
Update pandas/core/frame.py
jbrockmendel Mar 16, 2021
b326b5f
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 22, 2021
b7aed5d
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 26, 2021
4d20fe7
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 30, 2021
ad5485a
add versionchanged
jbrockmendel Mar 30, 2021
6bed6ac
Merge branch 'master' into myway-init-no-consolidate
jbrockmendel Mar 30, 2021
98b6dff
update bc .values has changed to DTA/TDA
jbrockmendel Mar 30, 2021
15 changes: 15 additions & 0 deletions doc/source/whatsnew/v1.3.0.rst
@@ -39,6 +39,21 @@ For example:
``'table'`` option that performs the windowing operation over an entire :class:`DataFrame`.
See :ref:`window.overview` for performance and functional benefits. (:issue:`15095`, :issue:`38995`)

.. _whatsnew_130.dataframe_honors_copy_with_dict:

DataFrame constructor honors ``copy=False`` with dict
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When passing a dictionary to :class:`DataFrame` with (the default) ``copy=False``,
a copy will no longer be made (:issue:`32960`)

.. ipython:: python

arr = np.array([1, 2, 3])
df = pd.DataFrame({"A": arr, "B": arr.copy()})
arr[0] = 0
assert df.iloc[0, 0] == 0
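
For contrast, a minimal sketch (assuming the keyword handling on this branch) of
explicitly requesting a copy, which insulates the frame from later mutation of the
input array:

.. ipython:: python

    arr2 = np.array([1, 2, 3])
    # copy=True forces the dict values to be copied on construction
    df2 = pd.DataFrame({"A": arr2, "B": arr2.copy()}, copy=True)
    arr2[0] = 0
    assert df2.iloc[0, 0] == 1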

.. _whatsnew_130.enhancements.other:

Other enhancements
2 changes: 1 addition & 1 deletion pandas/_testing/__init__.py
@@ -473,7 +473,7 @@ def getPeriodData(nper=None):
# make frame
def makeTimeDataFrame(nper=None, freq="B"):
data = getTimeSeriesData(nper, freq)
return DataFrame(data)
return DataFrame(data)._consolidate()
Member

This change should in theory no longer be needed? (assuming this was done to ensure a consolidated dataframe when the default was changed to not copy, which would otherwise result in a non-consolidated dataframe)

(and same for the 2 cases just below)
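
A rough way to check that reasoning is to inspect the block layout of a dict-constructed
frame (a sketch relying on internal, non-public attributes; ``_mgr.nblocks`` and
``_mgr.is_consolidated()`` are assumed to be available):

    import numpy as np
    import pandas as pd

    data = {col: np.arange(5.0) for col in "ABC"}
    df = pd.DataFrame(data)

    # A consolidated manager packs same-dtype columns into a single 2D block,
    # while a non-consolidated one keeps one block per column.
    print(df._mgr.nblocks)            # 1 if consolidated, 3 if not
    print(df._mgr.is_consolidated())  # internal flag; assumed here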



def makeDataFrame() -> DataFrame:
2 changes: 1 addition & 1 deletion pandas/conftest.py
@@ -695,7 +695,7 @@ def float_frame():

[30 rows x 4 columns]
"""
return DataFrame(tm.getSeriesData())
return DataFrame(tm.getSeriesData())._consolidate()


# ----------------------------------------------------------------
2 changes: 1 addition & 1 deletion pandas/core/frame.py
@@ -534,7 +534,7 @@ def __init__(
)

elif isinstance(data, dict):
mgr = init_dict(data, index, columns, dtype=dtype)
mgr = init_dict(data, index, columns, dtype=dtype, copy=copy)
elif isinstance(data, ma.MaskedArray):
import numpy.ma.mrecords as mrecords

4 changes: 3 additions & 1 deletion pandas/core/groupby/groupby.py
@@ -1763,7 +1763,9 @@ def describe(self, **kwargs):
result = self.apply(lambda x: x.describe(**kwargs))
if self.axis == 1:
return result.T
return result.unstack()
# FIXME: not being consolidated breaks
# test_describe_with_duplicate_output_column_names
return result._consolidate().unstack()

@final
def resample(self, rule, *args, **kwargs):
34 changes: 30 additions & 4 deletions pandas/core/internals/construction.py
@@ -78,6 +78,7 @@ def arrays_to_mgr(
columns,
dtype: Optional[DtypeObj] = None,
verify_integrity: bool = True,
consolidate: bool = True,
):
"""
Segregate Series based on type and coerce into matrices.
@@ -104,7 +105,9 @@
# from BlockManager perspective
axes = [columns, index]

return create_block_manager_from_arrays(arrays, arr_names, axes)
return create_block_manager_from_arrays(
arrays, arr_names, axes, consolidate=consolidate
)


def masked_rec_array_to_mgr(
@@ -153,7 +156,13 @@ def masked_rec_array_to_mgr(
# DataFrame Constructor Interface


def init_ndarray(values, index, columns, dtype: Optional[DtypeObj], copy: bool):
def init_ndarray(
values,
index,
columns,
dtype: Optional[DtypeObj],
copy: bool,
):
# input must be a ndarray, list, Series, index

if isinstance(values, ABCSeries):
@@ -235,7 +244,14 @@ def init_ndarray(values, index, columns, dtype: Optional[DtypeObj], copy: bool):
return create_block_manager_from_blocks(block_values, [columns, index])


def init_dict(data: Dict, index, columns, dtype: Optional[DtypeObj] = None):
def init_dict(
data: Dict,
index,
columns,
*,
dtype: Optional[DtypeObj] = None,
copy: bool = True,
):
"""
Segregate Series based on type and coerce into matrices.
Needs to handle a lot of exceptional cases.
@@ -269,6 +285,8 @@ def init_dict(data: Dict, index, columns, dtype: Optional[DtypeObj] = None):
val = construct_1d_arraylike_from_scalar(np.nan, len(index), nan_dtype)
arrays.loc[missing] = [val] * missing.sum()

arrays = list(arrays)

else:
keys = list(data.keys())
columns = data_names = Index(keys)
@@ -279,7 +297,15 @@
arrays = [
arr if not is_datetime64tz_dtype(arr) else arr.copy() for arr in arrays
]
return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)

if copy:
# arrays_to_mgr (via form_blocks) won't make copies for EAs
arrays = [x if not is_extension_array_dtype(x) else x.copy() for x in arrays]
# TODO: can we get rid of the dt64tz special case above?

return arrays_to_mgr(
arrays, data_names, index, columns, dtype=dtype, consolidate=copy
)


def nested_data_to_arrays(
74 changes: 57 additions & 17 deletions pandas/core/internals/managers.py
@@ -41,7 +41,7 @@
import pandas.core.algorithms as algos
from pandas.core.arrays.sparse import SparseDtype
from pandas.core.base import PandasObject
from pandas.core.construction import extract_array
from pandas.core.construction import ensure_wrapped_if_datetimelike, extract_array
from pandas.core.indexers import maybe_convert_indices
from pandas.core.indexes.api import Index, ensure_index
from pandas.core.internals.blocks import (
@@ -955,6 +955,8 @@ def fast_xs(self, loc: int) -> ArrayLike:
else:
result = np.empty(n, dtype=dtype)

result = ensure_wrapped_if_datetimelike(result)

for blk in self.blocks:
# Such assignment may incorrectly coerce NaT to None
# result[blk.mgr_locs] = blk._slice((slice(None), loc))
@@ -1665,7 +1667,9 @@ def fast_xs(self, loc):
# Constructor Helpers


def create_block_manager_from_blocks(blocks, axes: List[Index]) -> BlockManager:
def create_block_manager_from_blocks(
blocks, axes: List[Index], consolidate: bool = True
) -> BlockManager:
try:
if len(blocks) == 1 and not isinstance(blocks[0], Block):
# if blocks[0] is of length 0, return empty blocks
@@ -1682,7 +1686,8 @@ def create_block_manager_from_blocks(blocks, axes: List[Index]) -> BlockManager:
]

mgr = BlockManager(blocks, axes)
mgr._consolidate_inplace()
if consolidate:
mgr._consolidate_inplace()
return mgr

except ValueError as e:
@@ -1692,7 +1697,10 @@ def create_block_manager_from_blocks(blocks, axes: List[Index]) -> BlockManager:


def create_block_manager_from_arrays(
arrays, names: Index, axes: List[Index]
arrays,
names: Index,
axes: List[Index],
consolidate: bool = True,
) -> BlockManager:
assert isinstance(names, Index)
assert isinstance(axes, list)
@@ -1702,12 +1710,13 @@ def create_block_manager_from_arrays(
# Note: just calling extract_array breaks tests that patch PandasArray._typ.
arrays = [x if not isinstance(x, ABCPandasArray) else x.to_numpy() for x in arrays]
try:
blocks = _form_blocks(arrays, names, axes)
blocks = _form_blocks(arrays, names, axes, consolidate)
mgr = BlockManager(blocks, axes)
mgr._consolidate_inplace()
return mgr
except ValueError as e:
raise construction_error(len(arrays), arrays[0].shape, axes, e)
if consolidate:
mgr._consolidate_inplace()
return mgr


def construction_error(tot_items, block_shape, axes, e=None):
@@ -1734,7 +1743,7 @@ def construction_error(tot_items, block_shape, axes, e=None):
# -----------------------------------------------------------------------


def _form_blocks(arrays, names: Index, axes) -> List[Block]:
def _form_blocks(arrays, names: Index, axes, consolidate: bool) -> List[Block]:
# put "leftover" items in float bucket, where else?
# generalize?
items_dict: DefaultDict[str, List] = defaultdict(list)
@@ -1760,23 +1769,31 @@ def _form_blocks(arrays, names: Index, axes) -> List[Block]:

blocks: List[Block] = []
if len(items_dict["FloatBlock"]):
float_blocks = _multi_blockify(items_dict["FloatBlock"])
float_blocks = _multi_blockify(
items_dict["FloatBlock"], consolidate=consolidate
)
blocks.extend(float_blocks)

if len(items_dict["ComplexBlock"]):
complex_blocks = _multi_blockify(items_dict["ComplexBlock"])
complex_blocks = _multi_blockify(
items_dict["ComplexBlock"], consolidate=consolidate
)
blocks.extend(complex_blocks)

if len(items_dict["TimeDeltaBlock"]):
timedelta_blocks = _multi_blockify(items_dict["TimeDeltaBlock"])
timedelta_blocks = _multi_blockify(
items_dict["TimeDeltaBlock"], consolidate=consolidate
)
blocks.extend(timedelta_blocks)

if len(items_dict["IntBlock"]):
int_blocks = _multi_blockify(items_dict["IntBlock"])
int_blocks = _multi_blockify(items_dict["IntBlock"], consolidate=consolidate)
blocks.extend(int_blocks)

if len(items_dict["DatetimeBlock"]):
datetime_blocks = _simple_blockify(items_dict["DatetimeBlock"], DT64NS_DTYPE)
datetime_blocks = _simple_blockify(
items_dict["DatetimeBlock"], DT64NS_DTYPE, consolidate=consolidate
)
blocks.extend(datetime_blocks)

if len(items_dict["DatetimeTZBlock"]):
@@ -1787,11 +1804,15 @@ def _form_blocks(arrays, names: Index, axes) -> List[Block]:
blocks.extend(dttz_blocks)

if len(items_dict["BoolBlock"]):
bool_blocks = _simple_blockify(items_dict["BoolBlock"], np.bool_)
bool_blocks = _simple_blockify(
items_dict["BoolBlock"], np.bool_, consolidate=consolidate
)
blocks.extend(bool_blocks)

if len(items_dict["ObjectBlock"]) > 0:
object_blocks = _simple_blockify(items_dict["ObjectBlock"], np.object_)
object_blocks = _simple_blockify(
items_dict["ObjectBlock"], np.object_, consolidate=consolidate
)
blocks.extend(object_blocks)

if len(items_dict["CategoricalBlock"]) > 0:
@@ -1830,11 +1851,14 @@ def _form_blocks(arrays, names: Index, axes) -> List[Block]:
return blocks


def _simple_blockify(tuples, dtype) -> List[Block]:
def _simple_blockify(tuples, dtype, consolidate: bool) -> List[Block]:
"""
return a single array of a block that has a single dtype; if dtype is
not None, coerce to this dtype
"""
if not consolidate:
return _tuples_to_blocks_no_consolidate(tuples, dtype=dtype)

values, placement = _stack_arrays(tuples, dtype)

# TODO: CHECK DTYPE?
@@ -1845,8 +1869,12 @@ def _simple_blockify(tuples, dtype) -> List[Block]:
return [block]


def _multi_blockify(tuples, dtype: Optional[Dtype] = None):
def _multi_blockify(tuples, dtype: Optional[Dtype] = None, consolidate: bool = True):
""" return an array of blocks that potentially have different dtypes """

if not consolidate:
return _tuples_to_blocks_no_consolidate(tuples, dtype=dtype)

# group by dtype
grouper = itertools.groupby(tuples, lambda x: x[2].dtype)

@@ -1861,6 +1889,18 @@ def _multi_blockify(tuples, dtype: Optional[Dtype] = None):
return new_blocks


def _tuples_to_blocks_no_consolidate(tuples, dtype: Optional[DtypeObj]) -> List[Block]:
# tuples produced within _form_blocks are of the form (placement, whatever, array)
if dtype is not None:
return [
make_block(
np.atleast_2d(x[2].astype(dtype, copy=False)), placement=x[0], ndim=2
)
for x in tuples
]
return [make_block(np.atleast_2d(x[2]), placement=x[0], ndim=2) for x in tuples]


def _stack_arrays(tuples, dtype):

# fml
4 changes: 1 addition & 3 deletions pandas/tests/arithmetic/test_numeric.py
@@ -536,9 +536,7 @@ def test_df_mod_zero_df(self):
# GH#3590, modulo as ints
df = pd.DataFrame({"first": [3, 4, 5, 8], "second": [0, 0, 0, 3]})

# this is technically wrong, as the integer portion is coerced to float
# ###
first = Series([0, 0, 0, 0], dtype="float64")
first = Series([0, 0, 0, 0], dtype="int64")
second = Series([np.nan, np.nan, np.nan, 0])
expected = pd.DataFrame({"first": first, "second": second})
result = df % df
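
The switch from ``float64`` to ``int64`` above appears to follow from the columns no
longer being consolidated into a single integer block at construction time: only
``second`` divides by zero and picks up NaN, so ``first`` can keep its integer dtype.
A minimal sketch of that expectation (assuming this branch's behavior):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"first": [3, 4, 5, 8], "second": [0, 0, 0, 3]})
    result = df % df

    # "first" % "first" never divides by zero, so it can stay int64 on this branch;
    # "second" % "second" hits 0 % 0 and upcasts to float64 with NaN.
    print(result.dtypes)
    print(result["second"].tolist())  # [nan, nan, nan, 0.0]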
2 changes: 1 addition & 1 deletion pandas/tests/frame/test_arithmetic.py
@@ -1305,7 +1305,7 @@ def test_strings_to_numbers_comparisons_raises(self, compare_operators_no_eq_ne)
f(df, 0)

def test_comparison_protected_from_errstate(self):
missing_df = tm.makeDataFrame()
missing_df = tm.makeDataFrame()._consolidate()
Contributor

you are already doing this on the creation (I understand but find this fragile)

Member Author

> you are already doing this on the creation

I don't think so

> (I understand but find this fragile)

I agree. Silver lining: finding the existing fragility.

Contributor

actually you are doing this on creation, maybe you recently added it. Prefer NOT to do this in the tests proper (in pandas/testing is ok)

Member Author

updated to do the consolidation in tm.makeDataFrame

missing_df.iloc[0]["A"] = np.nan
with np.errstate(invalid="ignore"):
expected = missing_df.values < 0