This repository was archived by the owner on Apr 10, 2024. It is now read-only.

Commit 801259d

Make goals section leaner / more concise per comments
1 parent 94d0281 commit 801259d

2 files changed: 34 additions and 45 deletions

source/goals.rst

Lines changed: 18 additions & 31 deletions
@@ -51,37 +51,24 @@ familiar with some of these internal details, particular around performance and
 memory use, and so the degree to which users are impacted will vary quite a
 lot.
 
-Key areas of work
-=================
-
-Possible changes or improvements to pandas's internals fall into a number of
-different buckets to be explored in great detail:
-
-* **Decoupling from NumPy while preserving interoperability**: by eliminating
-  the presumption that pandas objects internally must contain data stored in
-  NumPy ``ndarray`` objects, we will be able to bring more consistency to
-  pandas's semantics and enable the core developers to extend pandas more
-  cleanly with new data types, data structures, and computational semantics.
-* **Exposing a pandas Cython and/or C/C++ API to other Python library
-  developers**: the internals of Series and DataFrame are only weakly
-  accessible in other developers' native code. At minimum, we wish to better
-  enable developers to construct the precise data structures / memory
-  representation that fill the insides of Series and DataFrame.
-* **Improving user control and visibility of memory use**: pandas's memory use,
-  as a result of its internal implementation, can frequently be opaque to the
-  user or outright unpredictable.
-* **Improving performance and system utilization**: We aim to improve both the
-  micro (operations that take < 1 ms) and macro (all other operations)
-  performance of pandas across the board. As part of this, we aim to make it
-  easier for pandas's core developers to leverage multicore systems to
-  accelerate computations (without running into any of Python's well-known
-  concurrency limitations)
-* **Removal of deprecated / underutilized functionality**: As the Python data
-  ecosystem has grown, a number of areas of pandas (e.g. plotting and datasets
-  with more than 2 dimensions) may be better served by other open source
-  projects. Also, functionality that has been explicitly deprecated or
-  discouraged from use (like the ``.ix`` indexing operator) would ideally be
-  removed.
+Goals
+=====
+
+Some high-level goals of the pandas 2.0 plan include the following:
+
+* Fixing long-standing limitations or inconsistencies in missing data: null
+  values in integer and boolean data, and a more consistent notion of null /
+  NA.
+* Improved performance and utilization of multicore systems
+* Better user control / visibility of memory usage (which can be opaque and
+  difficult to control)
+* Clearer semantics around non-NumPy data types, and permitting new pandas-only
+  data types to be added
+* Exposing a "libpandas" C/C++ API to other Python library developers: the
+  internals of Series and DataFrame are only weakly accessible in other
+  developers' native code. This has been a limitation for scikit-learn and
+  other projects requiring C or Cython-level access to pandas object data.
+* Removal of deprecated functionality
 
 Non-goals / FAQ
 ===============
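
Note on the first item in the new Goals list (null values in integer and boolean data): the limitation can be reproduced with the default constructors. The snippet below is only an illustrative sketch and is not part of the commit; the variable names are made up for the example, and it assumes NumPy and pandas are installed and that default (NumPy-backed) dtype inference is in effect.

```python
import numpy as np
import pandas as pd

# Integer data without nulls keeps an integer dtype.
ints = pd.Series([1, 2, 3])

# Adding a single null forces an upcast to float64, because the
# NumPy integer dtypes have no representation for a missing value.
with_null = pd.Series([1, 2, None])

# Boolean data with a null falls back to the generic object dtype.
bools = pd.Series([True, False, None])

print(ints.dtype)
print(with_null.dtype)
print(bools.dtype)
```

The upcast is silent, which is one reason the goals call for a more consistent notion of null / NA.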

source/internal-architecture.rst

Lines changed: 16 additions & 14 deletions
@@ -288,32 +288,34 @@ Preserving NumPy interoperability
 Some types of intended interoperability between NumPy and pandas are as
 follows:
 
-* Users can obtain a ``numpy.ndarray`` (possibly a view depending on the
-  internal block structure, more on this soon) in constant time and without
-  copying the actual data. This has a couple other implications
+* **Access to internal data**: Users can obtain a ``numpy.ndarray``
+  (possibly a view depending on the internal block structure, more on this
+  soon) in constant time and without copying the actual data. This has a couple
+  other implications
 
   * Changes made to this array will be reflected in the source pandas object.
   * If you write C extension code (possibly in Cython) and respect pandas's
     missing data details, you can invoke certain kinds of fast custom code on
     pandas data (but it's somewhat inflexible -- see the latest discussion on
     adding a native code API to pandas).
 
-* NumPy ufuncs (like ``np.sqrt`` or ``np.log``) can be invoked on
+* **Ufuncs**: NumPy ufuncs (like ``np.sqrt`` or ``np.log``) can be invoked on
   pandas objects like Series and DataFrame
 
-* ``numpy.asarray`` will always yield some array, even if it discards metadata
-  or has to create a new array. For example ``asarray`` invoked on
-  ``pandas.Categorical`` yields a reconstructed array (rather than either the
-  categories or codes internal arrays)
+* **Array protocol**: ``numpy.asarray`` will always yield some array, even if
+  it discards metadata or has to create a new array. For example ``asarray``
+  invoked on ``pandas.Categorical`` yields a reconstructed array (rather than
+  either the categories or codes internal arrays)
 
-* Many NumPy methods designed to work on subclasses (or duck-typed classes) of
-  ``ndarray`` may be used. For example ``numpy.sum`` may be used on a Series
-  even though it does not invoke NumPy's internal C sum algorithm. This means
-  that a Series may be used as an interchangeable argument in a large set of
-  functions that only know about NumPy arrays.
+* **Interchangeability**: Many NumPy methods designed to work on subclasses (or
+  duck-typed classes) of ``ndarray`` may be used. For example ``numpy.sum`` may
+  be used on a Series even though it does not invoke NumPy's internal C sum
+  algorithm. This means that a Series may be used as an interchangeable
+  argument in a large set of functions that only know about NumPy arrays.
 
 By and large, I think much of this can be preserved, but there will be some API
-breakage.
+breakage. In particular, interchangeability is not something we can or should
+guarantee.
 
 If we add more composite data structures (Categorical can be thought of as
 one existing composite data structure) to pandas or alternate non-NumPy data
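
Note on the four interoperability categories named in the hunk above: each can be exercised from Python. The snippet below is a rough sketch rather than anything taken from the commit; the variable names are illustrative, and it assumes NumPy and pandas are installed. It deliberately avoids writing to the array returned by ``.values``, since whether that array is a writable view depends on the pandas version and configuration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=["a", "b", "c"])

# Access to internal data: .values hands back a numpy.ndarray.
raw = s.values

# Ufuncs: np.sqrt accepts a Series and returns a Series with the same index.
roots = np.sqrt(s)

# Array protocol: asarray on a Categorical rebuilds the full array of
# values; it returns neither the categories nor the integer codes.
cat = pd.Categorical(["x", "y", "x"])
rebuilt = np.asarray(cat)

# Interchangeability: np.sum works on a Series passed in place of an ndarray.
total = np.sum(s)

print(type(raw).__name__, list(roots.index), list(rebuilt), total)
```

The last case is the one the commit flags as not guaranteed going forward, so code that passes pandas objects to NumPy-only functions is the most likely to be affected.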
