Adding introduction, goals, scope and use cases to the RFC #27
@@ -2,19 +2,186 @@

## Introduction

This document defines a Python data frame API.

A data frame is a programming interface for expressing data manipulations over a
data structure consisting of rows and columns. Columns are named, and values in a
column share a common data type. This definition is intentionally left broad.

## History
## History and data frame implementations

Data frame libraries in several programming languages exist, such as

Data frame libraries in several programming languages exist, such as
-> Dataframe libraries in several programming languages exist, such as
develop -> developed
Do we leave the standardization of the end-user API as potential future work for us, or do we not plan on doing any of that?
Good question, and one that many readers will have. I think it would be good to explicitly state that this is out of scope for this version of the standard, but may be in scope for a future version. As a rationale for why it is also important: one of the longer-term goals should be (I think) to make the learning curve for users less steep when switching from one library to another one.
The structure of:
- Goals
- Scope
- Out-of-scope and non-goals
is a little inconsistent, I'd suggest to make it symmetric (and add rationales as I just did in my array API scope PR), then this kind of thing may be easier to address.
There is a verb missing in the first sentence ("to describe" ?)
I'd state here that an API designed for interactive usage is out of scope.
### Data frame library authors
-> ### Dataframe library authors
This is a very heavy handed statement. Could we reword it to something a bit friendlier of:
We encourage data frame libraries in Python to implement the API defined in this document in their libraries
Numpy as well? It's used by dataframe libraries for their implementation
Would include Mars (https://github.com/mars-project/mars) here as well.
Should we add Database and Big Data systems?
Good point. I don't think we are planning to engage with the developers of PostgreSQL, MySQL... For now I'm adding big data systems, and also Python libraries to access databases, which I guess we're more likely to engage with. But I'm open to further changes if there are different points of view.
### Data frame power users
-> ### Dataframe power users
@@ -1,7 +1,184 @@
# Use cases

## Introduction

This section discusses the use cases considered for the standard data frame API.

The goals and scope of this API are defined in the [goals](01_purpose_and_scope.html#Goals)
and [scope](01_purpose_and_scope.html#Scope) sections.

The target audience and stakeholders are presented in the
[stakeholders](01_purpose_and_scope.html#Stakeholders) section.


## Types of use cases

The following types of use cases can be accomplished with the standard Python data frame
API defined in this document:

- Downstream library receiving a data frame as a parameter
- Converting a data frame from one implementation to another (try to clarify)

Other types of use cases not related to data interchange will be added later.


## Concrete use cases

In this section we define concrete examples of the types of use cases defined above.

### Plotting library receiving data as a data frame

One use case we facilitate with the API defined in this document is a plotting library
receiving the data to be plotted as a data frame object.

Consider the case of a scatter plot that will be drawn from the data contained in a
data frame structure. For example, consider this data:

| petal length | petal width |
|--------------|-------------|
| 1.4          | 0.2         |
| 1.7          | 0.4         |
| 1.3          | 0.2         |
| 1.5          | 0.1         |

If we consider a pure Python implementation, we could for example receive the information
as two lists, one for the _petal length_ and one for the _petal width_.

```python
petal_length = [1.4, 1.7, 1.3, 1.5]
petal_width = [0.2, 0.4, 0.2, 0.1]

def scatter_plot(x: list, y: list):
    """
    Generate a scatter plot with the information provided in `x` and `y`.
    """
    ...
```

When we consider data frames, we would like to provide them directly to the `scatter_plot`
function, and we would like the plotting library to be agnostic of which specific library
is used when calling the function. The code should work whether a pandas, Dask, Vaex or
other current or future implementation is used.

An implementation of the `scatter_plot` function could be:

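```python
# `dataframe` is a placeholder for any object implementing the standard API,
# so the annotation is written as a string to keep the snippet importable.
def scatter_plot(data: "dataframe", x_column: str, y_column: str):
    """
    Generate a scatter plot with the columns `x_column` and `y_column` of `data`.
    """
    ...
```
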
The API documented here describes what the developer of the plotting library can expect
from the object `data`, and in which ways they can interact with the data frame object
to extract the desired information.
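
As a rough sketch of what this could look like from the plotting library's side, the
snippet below assumes a hypothetical minimal interface with a single `get_column(name)`
method. The `ColumnSource` protocol, the `get_column` method, the `DictFrame` toy class
and the `scatter_points` helper are all illustrative assumptions, not part of the API
defined in this document.

```python
from typing import Protocol, Sequence


class ColumnSource(Protocol):
    """Hypothetical minimal interface a plotting library might rely on."""

    def get_column(self, name: str) -> Sequence[float]:
        ...


class DictFrame:
    """Toy data frame backed by a dict of lists, for illustration only."""

    def __init__(self, columns: dict):
        self._columns = columns

    def get_column(self, name: str) -> Sequence[float]:
        return self._columns[name]


def scatter_points(data: ColumnSource, x_column: str, y_column: str) -> list:
    """Return the (x, y) pairs that a scatter plot would draw."""
    xs = data.get_column(x_column)
    ys = data.get_column(y_column)
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    return list(zip(xs, ys))


# Any object exposing `get_column` works, regardless of which library made it.
points = scatter_points(
    DictFrame({"petal length": [1.4, 1.7, 1.3, 1.5],
               "petal width": [0.2, 0.4, 0.2, 0.1]}),
    "petal length",
    "petal width",
)
```

The plotting code only depends on the assumed interface, so it never needs to import or
check for a specific data frame library.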

Seaborn plots are an example of this. The
[scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) function
accepts a parameter `data`, which is expected to be a `DataFrame`.

When providing a pandas `DataFrame`, the following code generates the intended scatter plot:

```python
import pandas
import seaborn

pandas_df = pandas.DataFrame({'bill': [15, 32, 28],
                              'tip': [2, 5, 3]})

seaborn.scatterplot(data=pandas_df, x='bill', y='tip')
```

But if we instead provide a Vaex data frame, then an exception occurs:

```python
import vaex

vaex_df = vaex.from_pandas(pandas_df)

seaborn.scatterplot(data=vaex_df, x='bill', y='tip')
```

This is caused by Seaborn expecting a pandas `DataFrame` object. While Vaex
provides an interface very similar to pandas, it does not implement 100% of its
API, and Seaborn is trying to use parts that differ.

With the definition of the standard API, Seaborn developers should be able to
expect a generic data frame, and data frames from any library implementing the standard
data frame API could be plotted with the previous example (Vaex, cuDF, Ibis, Dask, Modin, etc.).


### Change object from one implementation to another

Another considered use case is transforming the data from one implementation to another.

As an example, consider we are using Dask data frames, given that our data is too big to
fit in memory, and we are working over a cluster. At some point in our pipeline, we have
reduced the size of the data frame we are working on by filtering and grouping, and
we are interested in transforming the data frame from Dask to pandas, to use some
functionality that pandas implements but Dask does not.

Since Dask knows how the data in the data frame is represented, one option could be to
implement a `.to_pandas()` method in the Dask data frame. Another option could be to
implement this in pandas, in a `.from_dask()` method.

As the ecosystem grows, this solution implies that every implementation could end up
having a long list of functions or methods:

- `to_pandas()` / `from_pandas()`
- `to_vaex()` / `from_vaex()`
- `to_modin()` / `from_modin()`
- `to_dask()` / `from_dask()`
- ...
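
To make the growth concrete, here is a small back-of-the-envelope calculation. It assumes
that every library wants a direct conversion to every other library, and compares that
with a single importer per library built on some shared interface. The helper names are
hypothetical and only serve the illustration.

```python
def pairwise_converters(n_libraries: int) -> int:
    """Directed conversions needed when each library converts to every other one."""
    return n_libraries * (n_libraries - 1)


def shared_interface_importers(n_libraries: int) -> int:
    """Importers needed when each library only reads one shared interface."""
    return n_libraries


libraries = ["pandas", "vaex", "modin", "dask"]
n = len(libraries)

pairwise = pairwise_converters(n)        # grows quadratically with n
shared = shared_interface_importers(n)   # grows linearly with n
```

With the four libraries listed above, this gives 12 pairwise conversions against 4
importers, and the gap widens with every new implementation.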

With a standard Python data frame API, every library could simply implement a method to
import a standard data frame. And since data frame libraries are expected to implement
this API, that would be enough to convert any data frame into that library's implementation.

So, the list above would be reduced to a single function or method in each implementation:

- `from_dataframe()`

Note that the function `from_dataframe()` is for illustration, and not proposed as part
of the standard at this point.
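
A minimal sketch of this idea follows, using two toy classes that stand in for different
data frame libraries. The shared methods `column_names()` and `get_column()` are
assumptions made for this example only, and `from_dataframe()` is used purely for
illustration, as noted above.

```python
class RowFrame:
    """Toy library A: stores data as a list of row dicts."""

    def __init__(self, rows):
        self._rows = list(rows)

    def column_names(self):
        return list(self._rows[0]) if self._rows else []

    def get_column(self, name):
        return [row[name] for row in self._rows]

    @classmethod
    def from_dataframe(cls, other):
        # Only the assumed shared methods are used, never `other`'s internals.
        names = other.column_names()
        columns = [other.get_column(name) for name in names]
        return cls(dict(zip(names, values)) for values in zip(*columns))


class ColumnFrame:
    """Toy library B: stores data as a dict of column lists."""

    def __init__(self, columns):
        self._columns = {name: list(values) for name, values in columns.items()}

    def column_names(self):
        return list(self._columns)

    def get_column(self, name):
        return list(self._columns[name])

    @classmethod
    def from_dataframe(cls, other):
        return cls({name: other.get_column(name) for name in other.column_names()})


rows = RowFrame([{"bill": 15, "tip": 2}, {"bill": 32, "tip": 5}, {"bill": 28, "tip": 3}])
cols = ColumnFrame.from_dataframe(rows)       # A -> B
round_trip = RowFrame.from_dataframe(cols)    # B -> A
```

Neither class knows about the other; each one only implements the shared methods and a
single importer, which is the reduction described above.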
Comment on lines +139 to +144

A dataframe protocol similar to wesm/dataframe-protocol#1 is a prerequisite to this being possible in my mind. Without having a data exchange protocol defined as part of the spec / goal how can we define

At this point the data exchange protocol is what we're trying to define. This use case tries to illustrate why such a data exchange protocol is needed. Do you think I should clarify this is the goal for the use cases? Or am I not understanding you?

I don't think that it was made clear that a dataframe data exchange protocol was in scope in this document. The only mention of a protocol is in talking about Apache Arrow as far as I can tell.

I think it's good as it is. We are talking about use cases in this document, not the implementation right? So we can loosely define what


Every pair of data frame libraries could benefit from this conversion. But we can go
deeper with an actual example: the conversion from an xarray `DataArray` to a pandas
`DataFrame`, and the other way round.

Even though xarray is not a data frame library but a library for multidimensional labeled
structures, in cases where a 2-D structure is used, the data can be converted to and from
a data frame.

Currently, xarray implements a `.to_pandas()` method to convert a `DataArray` to a
pandas `DataFrame`:

```python
import xarray

xarray_data = xarray.DataArray([[15, 2], [32, 5], [28, 3]],
                               dims=('diners', 'features'),
                               coords={'features': ['bill', 'tip']})

pandas_df = xarray_data.to_pandas()
```

To convert the pandas data frame to an xarray object, both libraries have
implementations. Note that the two lines below are not strictly equivalent:
`DataFrame.to_xarray()` returns an xarray `Dataset` with one variable per column,
while the `DataArray` constructor returns a single 2-D `DataArray`:

```python
pandas_df.to_xarray()         # returns an xarray Dataset
xarray.DataArray(pandas_df)   # returns a 2-D DataArray
```

Other data frame implementations may or may not implement a way to convert to xarray,
and passing a data frame to the `DataArray` constructor may or may not work.

Pandas, xarray and other libraries could all implement the standard data frame API.
They could convert other representations via a single
`to_dataframe()` function or method. And they could be converted to other
representations that implement that function automatically.

This would make conversions very simple, not only among data frame libraries, but
also among other libraries whose data can be expressed as tabular data, such as
xarray, SQLAlchemy and others.
I prefer `dataframe` as one word, like `database`. I do not want to start a holy war, and I realize there are historical reasons to call it `data frame`, but `data base` was common even throughout the 90s. https://groups.google.com/g/alt.usage.english/c/jRB0g0zK85Q?pli=1