From 991ab82866028b9eeb536c752cb9dd0cda9f2ec6 Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Mon, 17 Aug 2020 00:56:08 +0100 Subject: [PATCH 1/7] Adding purpose, goals and use cases --- spec/01_purpose_and_scope.md | 116 ++++++++++++++++++++++++++++++++++- spec/02_use_cases.md | 102 ++++++++++++++++++++++++++++++ 2 files changed, 217 insertions(+), 1 deletion(-) diff --git a/spec/01_purpose_and_scope.md b/spec/01_purpose_and_scope.md index d7b41950..e3131d5b 100644 --- a/spec/01_purpose_and_scope.md +++ b/spec/01_purpose_and_scope.md @@ -2,19 +2,133 @@ ## Introduction +This document defines a Python data frame API. + +A data frame is a programming interface for expressing data manipulations over a +data structure consisting of rows and columns. Columns are named, and values in a +column share a common data type. This definition is intentionally left broad. ## History +In 2009 [pandas](https://pandas.pydata.org/) became the first major Python data frame +library to be open sourced. Its popularity has been growing, and as of 2020 the pandas +website has around one million and a half visitors per month. + +pandas is rich in features, and its public API contains more than 2,000 objects. In +recent years, the number of existing Python data frame libraries has been growing. Most +of the new libraries offer some advantages compared to pandas (distributed, out-of-core +or GPU computing for example). And most of them provide a public API very similar to pandas, +to make the transition easy to users. The main libraries in this category would be +[Dask](https://dask.org/), [Vaex](https://vaex.io/), [Modin](https://github.com/modin-project/modin), +[cuDF](https://github.com/rapidsai/cudf) and [Koalas](https://github.com/databricks/koalas). + +There are other libraries that did not base their public API in pandas, the most popular +one is [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) which bases +its API in the Spark data frame. 
+ + +## Goals + +Given the growing Python data frame ecosystem, and its complexity, this document provides +a standard Python data frame API. Until recently, pandas has been a de-facto standard for +Python data frames. But currently there are a growing number of not only data frame libraries, +but also libraries that interact with data frames (visualization, statistical or machine learning +libraries for example). Interactions among libraries are becoming complex, and the pandas +public API is suboptimal as a standard, for its size, complexity, and implementation details +it exposes. + +The goal of the API described in this document is to provide a standard interface that encapsulates +implementation details of data frame libraries. This will allow users and third-party libraries to +write code that interacts with a standard data frame, and not with specific implementations. + +The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting +specific users (data analysts, data scientists, quants, etc.) can be implemented on top of the +standard API. The standard API is targeted to software engineers, who will build code and libraries +using the API specification following proper software engineering techniques. +See the [use cases](02_use_cases.html) section for details on the exact use cases considered. -## Scope (includes out-of-scope / non-goals) +## Scope +It is in the scope of this document the different elements of the API: + +- Data structures and Python classes +- Functions and methods +- Expected returns of the different operations +- Data types (Python and low-level types) + +The scope of this document is limited to generic data frames, and not data frames specific to +certain domains. + + +### Out-of-scope and non-goals + +Implementation details of the data frames and execution of operations. 
This includes: + +- How data is represented and stored (whether the data is in memory, disk, distributed) +- Expectations on when the execution is happening (in an eager or lazy way) ## Stakeholders +This section provides the list of stakeholders considered for the definition of this API. + + +### Data frame library authors + +Authors of data frame libraries in Python are expected to implement the API defined +in this document in their libraries. + +The list of known Python data frame libraries at the time of writing this document is next: + +- [pandas](https://pandas.pydata.org/) +- [Dask](https://dask.org/) +- [cuDF](https://github.com/rapidsai/cudf) +- [Modin](https://github.com/modin-project/modin) +- [Vaex](https://vaex.io/) +- [Turi Create](https://github.com/apple/turicreate) +- [Koalas](https://github.com/databricks/koalas) +- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) +- [Grizzly](https://github.com/weld-project/weld#grizzly) +- [Mars](https://docs.pymars.org/en/latest/) +- [StaticFrame](https://static-frame.readthedocs.io/en/latest/) +- [dexplo](https://github.com/dexplo/dexplo/) +- [datatable](https://github.com/h2oai/datatable) +- [Eland](https://github.com/elastic/eland) + + +### Downstream library authors + +Authors of libraries that consume data frames. They can use the API defined in this document +to know how the data contained in a data frame can be consumed, and which operations are implemented. + +A non-exhaustive list of downstream library categories is next: + +- Plotting and visualization (Matplotlib, Bokeh, Altair, Plotly, etc.) +- Statistical libraries (statsmodels) +- Machine learning libraries (scikit-learn) + + +### Upstream library authors + +Authors of libraries that provide functionality used by data frames. 
+ +A non-exhaustive list of upstream categories is next: + +- Data interchange protocols (Apache Arrow, NumPy's protocol buffer) +- Mathematical computational libraries (MKL) +- Task schedulers (Dask, Ray) + + +### Data frame power users + +This group considers power users of data frames. For example, developers of applications that +use data frames. Or authors of libraries that provide specialized data frame APIs to be built +on top of the standard API. +Basic users of data frame are not considered direct users of this standard data frame API. This +could include for example users analyzing data in a Jupyter notebook using a data frame implementation. ## High-level API overview diff --git a/spec/02_use_cases.md b/spec/02_use_cases.md index 648f17c8..e35b923d 100644 --- a/spec/02_use_cases.md +++ b/spec/02_use_cases.md @@ -1,7 +1,109 @@ # Use cases +## Introduction + +This section discusses the use cases considered for the standard data frame API. + +The goals and scope of this API are defined in the [goals](01_purpose_and_scope.html#Goals), +and [scope](01_purpose_and_scope.html#Scope) sections. + +The target audience and stakeholders are presented in the +[stakeholders](01_purpose_and_scope.html#Stakeholders) section. + + ## Types of use cases +The next types of use cases can be accomplished by the use of the standard Python data frame +API defined in this document: + +- Downstream library receiving a data frame as a parameter +- Converting a data frame from one implementation to another +- Other types of uses cases not related to data interchange will be added later ## Concrete use cases + +In this section we define concrete examples of the types of use cases defined above. + +### Plotting library receiving data as a data frame + +One use case we facilitate with the API defined in this document is a plotting library +receiving the data to be plotted as a data frame object. 
+
+Consider the case of a scatter plot that will be drawn from the data contained in a
+data frame structure. For example, consider this data:
+
+| petal length | petal width |
+|--------------|-------------|
+| 1.4          | 0.2         |
+| 1.7          | 0.4         |
+| 1.3          | 0.2         |
+| 1.5          | 0.1         |
+
+If we consider a pure Python implementation, we could for example receive the information
+as two lists, one for the _petal length_ and one for the _petal width_.
+
+```python
+petal_length = [1.4, 1.7, 1.3, 1.5]
+petal_width = [0.2, 0.4, 0.2, 0.1]
+
+def scatter_plot(x: list, y: list):
+    """
+    Generate a scatter plot with the information provided in `x` and `y`.
+    """
+    ...
+```
+
+When we consider data frames, we would like to provide them directly to the `scatter_plot`
+function. And we would like the plotting library to be agnostic of which specific library
+is used when calling the function. We would like the code to work whether a pandas,
+Dask, Vaex or other current or future implementation is used.
+
+An implementation of the `scatter_plot` function could be:
+
+```python
+def scatter_plot(data: dataframe, x_column: str, y_column: str):
+    """
+    Generate a scatter plot with the information provided in `x` and `y`.
+    """
+    ...
+```
+
+The API documented here describes what the developer of the plotting library can expect
+from the object `data`, and in which ways it can interact with the data frame object to
+extract the desired information.
+
+
+### Change object from one implementation to another
+
+Another considered use case is transforming the data from one implementation to another.
+
+As an example, consider that we are using Dask data frames, given that our data is too big to
+fit in memory, and we are working on a cluster. At some point in our pipeline, we
+reduced the size of the data frame we are working on, by filtering and grouping. And
+we are interested in transforming the data frame from Dask to pandas, to use some
+functionalities that pandas implements but Dask does not.
+ +Since Dask knows how the data in the data frame is represented, one option could be to +implement a `.to_pandas()` method in the Dask data frame. Another option could be to +implement this in pandas, in a `.from_dask()` method. + +As the ecosystem grows, this solution implies that every implementation could end up +having a long list of methods: + +- `.to_pandas()` / `.from_pandas()` +- `.to_vaex()` / `.from_vaex()` +- `.to_modin()` / `.from_modin()` +- `.to_dask()` / `.from_dask()` +- ... + +With a standard Python data frame API, every library could simply implement a method to +import a standard data frame. And since data frame libraries are expected to implement +this API, that would be enough to transform any data frame to one implementation. + +So, the list above would be reduced to a single method in each implementation: + +- `.from_dataframe()` + +Note that the method `.from_dataframe()` is for illustration, and not proposed as part +of the standard. From cc92c2cc6bc1dcee2a3b0b77c722d3cfa906f53f Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Tue, 18 Aug 2020 22:02:34 +0100 Subject: [PATCH 2/7] Changes after first review --- spec/01_purpose_and_scope.md | 135 ++++++++++++++++++++++++----------- spec/02_use_cases.md | 97 ++++++++++++++++++++++--- 2 files changed, 180 insertions(+), 52 deletions(-) diff --git a/spec/01_purpose_and_scope.md b/spec/01_purpose_and_scope.md index e3131d5b..c3e59543 100644 --- a/spec/01_purpose_and_scope.md +++ b/spec/01_purpose_and_scope.md @@ -8,24 +8,59 @@ A data frame is a programming interface for expressing data manipulations over a data structure consisting of rows and columns. Columns are named, and values in a column share a common data type. This definition is intentionally left broad. 
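The broad definition above — named columns whose values share a common data type — can be illustrated with a minimal sketch. Everything in it (class and accessor names included) is hypothetical and not part of any proposal:

```python
# Illustrative only: a minimal structure satisfying the broad definition of a
# data frame (named columns; values within a column share a data type).
# None of these names are part of the proposed standard.

class TinyFrame:
    def __init__(self, columns):
        # `columns` maps column names to lists of homogeneously typed values.
        for name, values in columns.items():
            if len({type(value) for value in values}) > 1:
                raise TypeError(f"column {name!r} mixes data types")
        self._columns = dict(columns)

    @property
    def column_names(self):
        return list(self._columns)

    def column(self, name):
        return list(self._columns[name])


frame = TinyFrame({'petal length': [1.4, 1.7, 1.3, 1.5],
                   'petal width': [0.2, 0.4, 0.2, 0.1]})
```

Note that each column is homogeneously typed while the frame as a whole may mix types across columns, which is part of what distinguishes a data frame from a plain two-dimensional array.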
+## History and data frame implementations -## History +Data frame libraries in several programming language exist, such as +[R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame), +[Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html), +[Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others. -In 2009 [pandas](https://pandas.pydata.org/) became the first major Python data frame -library to be open sourced. Its popularity has been growing, and as of 2020 the pandas -website has around one million and a half visitors per month. +In Python, the most popular data frame library is [pandas](https://pandas.pydata.org/). +pandas was initially develop at a hedge fund, with a focus on +[panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series. +It was open sourced in 2009, and since then it has been growing in popularity, including +many other domains outside time series and financial data. While still rich in time series +functionality, today is considered a general-purpose data frame library. The original +`Panel` class that gave name to the library was deprecated in 2017 and removed in 2019, +to focus on the main `DataFrame` class. -pandas is rich in features, and its public API contains more than 2,000 objects. In -recent years, the number of existing Python data frame libraries has been growing. Most -of the new libraries offer some advantages compared to pandas (distributed, out-of-core -or GPU computing for example). And most of them provide a public API very similar to pandas, -to make the transition easy to users. The main libraries in this category would be -[Dask](https://dask.org/), [Vaex](https://vaex.io/), [Modin](https://github.com/modin-project/modin), -[cuDF](https://github.com/rapidsai/cudf) and [Koalas](https://github.com/databricks/koalas). 
+Internally, pandas is implemented on top of NumPy, which is used to store the data +and to perform many of the operations. Some parts of pandas are writen in Cython. -There are other libraries that did not base their public API in pandas, the most popular -one is [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) which bases -its API in the Spark data frame. +As of 2020 the pandas website has around one million and a half visitors per month. + +Other libraries emerged in the last years, to address some of the limitations of pandas. +But in most cases, the libraries implemented a public API very similar to pandas, to +make the transition to their libraries easier. Next, there is a short description of +the main data frame libraries in Python. + +[Dask](https://dask.org/) is a task scheduler built in Python, which implements a data +frame interface. Dask data frame use pandas internally in the workers, and it provides +an API similar to pandas, adapted to its distributed and lazy nature. + +[Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses hdf5 to +create memory maps that avoid loading data sets to memory. Some parts of Vaex are +implemented in C++. + +[Modin](https://github.com/modin-project/modin) is another distributed data frame +library based originally on [Ray](https://github.com/ray-project/ray). But built in +a more modular way, that allows it to also use Dask as a scheduler, or replace the +pandas-like public API by a SQLite-like one. + +[cuDF](https://github.com/rapidsai/cudf) is a GPU data frame library built on top +of Apache Arrow and RAPIDS. It provides an API similar to pandas. + +[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a data +frame library that uses Spark as a backend. PySpark public API is based on the +original Spark API, and not in pandas. 
+
+[Koalas](https://github.com/databricks/koalas) is a data frame library built on
+top of PySpark that provides a pandas-like API.
+
+[Ibis](https://ibis-project.org/) is a data frame library with multiple SQL backends.
+It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API to
+SQL statements, executed by the backends. It supports conventional DBMS, as well
+as big data systems such as Apache Impala or BigQuery.
 
 
 ## Goals
@@ -36,7 +71,8 @@ Python data frames. But currently there are a growing number of not only data fr
 but also libraries that interact with data frames (visualization, statistical or machine learning
 libraries for example). Interactions among libraries are becoming complex, and the pandas
 public API is suboptimal as a standard, for its size, complexity, and implementation details
-it exposes.
+it exposes (for example, using NumPy data types or `NaN`).
+
 
 The goal of the API described in this document is to provide a standard interface that encapsulates
 implementation details of data frame libraries. This will allow users and third-party libraries to
@@ -44,17 +80,20 @@ write code that interacts with a standard data frame, and not with specific impl
 
 The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting
 specific users (data analysts, data scientists, quants, etc.) can be implemented on top of the
-standard API. The standard API is targeted to software engineers, who will build code and libraries
-using the API specification following proper software engineering techniques.
+standard API. The standard API is targeted to software developers, who will write reusable code
+(as opposed to users performing fast interactive analysis of data).
+
+See the [scope](#Scope) section for detailed information on what is in scope, and the
+[use cases](02_use_cases.html) section for details on the exact use cases considered.
-See the [use cases](02_use_cases.html) section for details on the exact use cases considered.
 
 
 ## Scope
 
-It is in the scope of this document the different elements of the API:
+The different elements of the API are in the scope of this document. This includes signatures
+and semantics. To be more specific:
 
 - Data structures and Python classes
-- Functions and methods
+- Functions, methods, attributes and other API elements
 - Expected returns of the different operations
 - Data types (Python and low-level types)
 
@@ -68,7 +107,12 @@ Implementation details of the data frames and execution of operations. This incl
 
 - How data is represented and stored (whether the data is in memory, disk, distributed)
 - Expectations on when the execution is happening (in an eager or lazy way)
+- Other execution details
 
+The API defined in this document needs to be used by libraries as diverse as Ibis, Dask,
+Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory.
+Any decision that involves assumptions on where the data is stored, or where execution
+happens is out of the scope of this document.
 
 ## Stakeholders
 
@@ -82,20 +126,21 @@ in this document in their libraries.
The list of known Python data frame libraries at the time of writing this document is next: -- [pandas](https://pandas.pydata.org/) -- [Dask](https://dask.org/) - [cuDF](https://github.com/rapidsai/cudf) -- [Modin](https://github.com/modin-project/modin) -- [Vaex](https://vaex.io/) -- [Turi Create](https://github.com/apple/turicreate) -- [Koalas](https://github.com/databricks/koalas) -- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) +- [Dask](https://dask.org/) +- [datatable](https://github.com/h2oai/datatable) +- [dexplo](https://github.com/dexplo/dexplo/) +- [Eland](https://github.com/elastic/eland) - [Grizzly](https://github.com/weld-project/weld#grizzly) +- [Ibis](https://ibis-project.org/) +- [Koalas](https://github.com/databricks/koalas) - [Mars](https://docs.pymars.org/en/latest/) +- [Modin](https://github.com/modin-project/modin) +- [pandas](https://pandas.pydata.org/) +- [PySpark](https://spark.apache.org/docs/latest/api/python/index.html) - [StaticFrame](https://static-frame.readthedocs.io/en/latest/) -- [dexplo](https://github.com/dexplo/dexplo/) -- [datatable](https://github.com/h2oai/datatable) -- [Eland](https://github.com/elastic/eland) +- [Turi Create](https://github.com/apple/turicreate) +- [Vaex](https://vaex.io/) ### Downstream library authors @@ -105,9 +150,9 @@ to know how the data contained in a data frame can be consumed, and which operat A non-exhaustive list of downstream library categories is next: -- Plotting and visualization (Matplotlib, Bokeh, Altair, Plotly, etc.) -- Statistical libraries (statsmodels) -- Machine learning libraries (scikit-learn) +- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly) +- Statistical libraries (e.g. statsmodels) +- Machine learning libraries (e.g. scikit-learn) ### Upstream library authors @@ -116,19 +161,27 @@ Authors of libraries that provide functionality used by data frames. 
 A non-exhaustive list of upstream categories is next:
 
-- Data interchange protocols (Apache Arrow, NumPy's protocol buffer)
-- Mathematical computational libraries (MKL)
-- Task schedulers (Dask, Ray)
+- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
+- Task schedulers (e.g. Dask, Ray)
 
 
 ### Data frame power users
 
-This group considers power users of data frames. For example, developers of applications that
-use data frames. Or authors of libraries that provide specialized data frame APIs to be built
-on top of the standard API.
-Basic users of data frame are not considered direct users of this standard data frame API. This
-could include for example users analyzing data in a Jupyter notebook using a data frame implementation.
+This group considers developers of reusable code that use data frames. For example, developers of
+applications that use data frames. Or authors of libraries that provide specialized data frame
+APIs to be built on top of the standard API.
+
+People using data frames in an interactive way are considered out of scope. These users include data
+analysts, data scientist and other users that are key for data frames. But this type of user may need
+shortcuts, or libraries that take decisions for them to save them time. For example, automatic type
+inference, or excesive use of very compact syntax like Python squared brackets / `__getitem__`.
+Standardizing on such practices can be extremely difficult, and it is out of scope.
+
+With the development of a standard API that targets developers writing reusable code we expect
+to also serve data analysts and other interactive users, although in an indirect way: by providing a
+standard API that other libraries can be built on top of, including libraries with the syntactic
+sugar required for fast analysis of data.
 ## High-level API overview
 
 
diff --git a/spec/02_use_cases.md b/spec/02_use_cases.md
index e35b923d..4fa9abde 100644
--- a/spec/02_use_cases.md
+++ b/spec/02_use_cases.md
@@ -17,8 +17,9 @@ The next types of use cases can be accomplished by the use of the standard Pytho
 API defined in this document:
 
 - Downstream library receiving a data frame as a parameter
-- Converting a data frame from one implementation to another
-- Other types of uses cases not related to data interchange will be added later
+- Converting a data frame from one implementation to another (try to clarify)
+
+Other types of use cases not related to data interchange will be added later.
 
 
 ## Concrete use cases
@@ -73,6 +74,40 @@ The API documented here describes what the developer of the plotting library can
 from the object `data`, and in which ways it can interact with the data frame object to
 extract the desired information.
 
+An example of this is Seaborn. For example, its
+[scatterplot](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) function accepts a
+parameter `data`, which is expected to be a `DataFrame`.
+
+When providing a pandas `DataFrame`, the following code generates the intended scatter plot:
+
+```python
+import pandas
+import seaborn
+
+pandas_df = pandas.DataFrame({'bill': [15, 32, 28],
+                              'tip': [2, 5, 3]})
+
+seaborn.scatterplot(data=pandas_df, x='bill', y='tip')
+```
+
+But if we instead provide a Vaex data frame, then an exception occurs:
+
+```python
+import vaex
+
+vaex_df = vaex.from_pandas(pandas_df)
+
+seaborn.scatterplot(data=vaex_df, x='bill', y='tip')
+```
+
+This is caused by Seaborn expecting a pandas `DataFrame` object. And while Vaex
+provides an interface very similar to pandas, it does not implement 100% of its
+API, and Seaborn is trying to use parts that differ.
+
+With the definition of the standard API, Seaborn developers should be able to
+expect a generic data frame.
And data frames from any library implementing the standard
+API could be plotted with the previous example (Vaex, cuDF, Ibis, Dask, Modin, etc.).
+
 
 ### Change object from one implementation to another
 
@@ -89,21 +124,61 @@ implement a `.to_pandas()` method in the Dask data frame. Another option could b
 implement this in pandas, in a `.from_dask()` method.
 
 As the ecosystem grows, this solution implies that every implementation could end up
-having a long list of methods:
+having a long list of functions or methods:
 
-- `.to_pandas()` / `.from_pandas()`
-- `.to_vaex()` / `.from_vaex()`
-- `.to_modin()` / `.from_modin()`
-- `.to_dask()` / `.from_dask()`
+- `to_pandas()` / `from_pandas()`
+- `to_vaex()` / `from_vaex()`
+- `to_modin()` / `from_modin()`
+- `to_dask()` / `from_dask()`
 - ...
 
 With a standard Python data frame API, every library could simply implement a method to
 import a standard data frame. And since data frame libraries are expected to implement
 this API, that would be enough to transform any data frame to one implementation.
 
-So, the list above would be reduced to a single method in each implementation:
+So, the list above would be reduced to a single function or method in each implementation:
+
+- `from_dataframe()`
+
+Note that the function `from_dataframe()` is for illustration, and not proposed as part
+of the standard at this point.
+
+Every pair of data frame libraries could benefit from this conversion. But we can go
+deeper with an actual example. Consider the conversion from an xarray `DataArray` to a pandas
+`DataFrame`, and the other way round.
+
+Even if xarray is not a data frame library, but a miltidimensional labeled structure,
+in cases where a 2-D structure is used, the data can be converted from and to a data frame.
+
+Currently, xarray implements a `.to_pandas()` method to convert a `DataArray` to a
+pandas `DataFrame`:
+
+```python
+import xarray
+
+xarray_data = xarray.DataArray([[15, 2], [32, 5], [28, 3]],
+                               dims=('diners', 'features'),
+                               coords={'features': ['bill', 'tip']})
+
+pandas_df = xarray_data.to_pandas()
+```
+
+To convert the pandas data frame to an xarray `DataArray`, both libraries have
+implementations. Both lines below are equivalent:
+
+```python
+pandas_df.to_xarray()
+xarray.DataArray(pandas_df)
+```
+
+Other data frame implementations may or may not implement a way to convert to xarray.
+And passing a data frame to the `DataArray` constructor may or may not work.
 
-- `.from_dataframe()`
+The standard data frame API would allow pandas, xarray and other libraries to
+implement the standard API. They could convert other representations via a single
+`to_dataframe()` function or method. And they could be converted to other
+representations that implement that function automatically.
 
-Note that the method `.from_dataframe()` is for illustration, and not proposed as part
-of the standard.
+This would make conversions very simple, not only among data frame libraries, but
+also among other libraries whose data can be expressed as tabular data, such as
+xarray, SQLAlchemy and others.

From 3d998f9498beef6f82ee0425742d06be862f6197 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Tue, 25 Aug 2020 12:03:13 +0100
Subject: [PATCH 3/7] Apply suggestions from code review

Co-authored-by: Hyukjin Kwon
---
 spec/01_purpose_and_scope.md | 7 +++----
 spec/02_use_cases.md         | 2 +-
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/spec/01_purpose_and_scope.md b/spec/01_purpose_and_scope.md
index c3e59543..e841eb6a 100644
--- a/spec/01_purpose_and_scope.md
+++ b/spec/01_purpose_and_scope.md
@@ -25,7 +25,7 @@ functionality, today is considered a general-purpose data frame library. The ori
 to focus on the main `DataFrame` class.
 Internally, pandas is implemented on top of NumPy, which is used to store the data
-and to perform many of the operations. Some parts of pandas are writen in Cython.
+and to perform many of the operations. Some parts of pandas are written in Cython.
@@ -173,9 +173,9 @@ applications that use data frames. Or authors of libraries that provide speciali
 APIs to be built on top of the standard API.
 
 People using data frames in an interactive way are considered out of scope. These users include data
-analysts, data scientist and other users that are key for data frames. But this type of user may need
+analysts, data scientists and other users that are key for data frames. But this type of user may need
 shortcuts, or libraries that take decisions for them to save them time. For example, automatic type
-inference, or excesive use of very compact syntax like Python squared brackets / `__getitem__`.
+inference, or excessive use of very compact syntax like Python square brackets / `__getitem__`.
 Standardizing on such practices can be extremely difficult, and it is out of scope.
 
 With the development of a standard API that targets developers writing reusable code we expect
@@ -205,4 +205,3 @@ sugar required for fast analysis of data.
 
 ## References
 
-
diff --git a/spec/02_use_cases.md b/spec/02_use_cases.md
index 4fa9abde..b681830d 100644
--- a/spec/02_use_cases.md
+++ b/spec/02_use_cases.md
@@ -147,7 +147,7 @@ Every pair of data frame libraries could benefit from this conversion. But we ca
 deeper with an actual example. Consider the conversion from an xarray `DataArray` to a pandas
 `DataFrame`, and the other way round.
 
-Even if xarray is not a data frame library, but a miltidimensional labeled structure,
+Even if xarray is not a data frame library, but a multidimensional labeled structure,
 in cases where a 2-D structure is used, the data can be converted from and to a data frame.
Currently, xarray implements a `.to_pandas()` method to convert a `DataArray` to a From d047f578989035c8ab0cdc6c532c8160868bd6ec Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Sat, 29 Aug 2020 14:34:39 +0100 Subject: [PATCH 4/7] Addressing comments from reviews --- spec/01_purpose_and_scope.md | 139 ++++++++++++++++++++++------------- spec/02_use_cases.md | 56 +++++++------- 2 files changed, 114 insertions(+), 81 deletions(-) diff --git a/spec/01_purpose_and_scope.md b/spec/01_purpose_and_scope.md index c3e59543..69a7e190 100644 --- a/spec/01_purpose_and_scope.md +++ b/spec/01_purpose_and_scope.md @@ -2,25 +2,25 @@ ## Introduction -This document defines a Python data frame API. +This document defines a Python dataframe API. -A data frame is a programming interface for expressing data manipulations over a +A dataframe is a programming interface for expressing data manipulations over a data structure consisting of rows and columns. Columns are named, and values in a column share a common data type. This definition is intentionally left broad. -## History and data frame implementations +## History and dataframe implementations Data frame libraries in several programming language exist, such as [R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame), [Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html), [Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others. -In Python, the most popular data frame library is [pandas](https://pandas.pydata.org/). -pandas was initially develop at a hedge fund, with a focus on +In Python, the most popular dataframe library is [pandas](https://pandas.pydata.org/). +pandas was initially developed at a hedge fund, with a focus on [panel data](https://en.wikipedia.org/wiki/Panel_data) and financial time series. 
 It was open sourced in 2009, and since then it has been growing in popularity, including
 many other domains outside time series and financial data. While still rich in time series
-functionality, today is considered a general-purpose data frame library. The original
+functionality, today it is considered a general-purpose dataframe library. The original
 `Panel` class that gave name to the library was deprecated in 2017 and removed in 2019,
 to focus on the main `DataFrame` class.
 
 Internally, pandas is implemented on top of NumPy, which is used to store the data
 and to perform many of the operations. Some parts of pandas are written in Cython.
 
 As of 2020 the pandas website has around one million and a half visitors per month.
 
 Other libraries emerged in the last years, to address some of the limitations of pandas.
 But in most cases, the libraries implemented a public API very similar to pandas, to
 make the transition to their libraries easier. Next, there is a short description of
-the main data frame libraries in Python.
+the main dataframe libraries in Python.
 
-[Dask](https://dask.org/) is a task scheduler built in Python, which implements a data
-frame interface. Dask data frame use pandas internally in the workers, and it provides
+[Dask](https://dask.org/) is a task scheduler built in Python, which implements a
+dataframe interface. Dask dataframes use pandas internally in the workers, and it provides
 an API similar to pandas, adapted to its distributed and lazy nature.
 
 [Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses hdf5 to
 create memory maps that avoid loading data sets to memory. Some parts of Vaex are
 implemented in C++.
 
-[Modin](https://github.com/modin-project/modin) is another distributed data frame
+[Modin](https://github.com/modin-project/modin) is another distributed dataframe
 library based originally on [Ray](https://github.com/ray-project/ray). But built in
 a more modular way, that allows it to also use Dask as a scheduler, or replace the
 pandas-like public API by a SQLite-like one.
-[cuDF](https://github.com/rapidsai/cudf) is a GPU data frame library built on top +[cuDF](https://github.com/rapidsai/cudf) is a GPU dataframe library built on top of Apache Arrow and RAPIDS. It provides an API similar to pandas. -[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a data -frame library that uses Spark as a backend. PySpark public API is based on the +[PySpark](https://spark.apache.org/docs/latest/api/python/index.html) is a +dataframe library that uses Spark as a backend. PySpark public API is based on the original Spark API, and not in pandas. -[Koalas](https://github.com/databricks/koalas) is a data frame library built on +[Koalas](https://github.com/databricks/koalas) is a dataframe library built on top of PySpark that provides a pandas-like API. -[Ibis](https://ibis-project.org/) is a data frame library with multiple SQL backends. +[Ibis](https://ibis-project.org/) is a dataframe library with multiple SQL backends. It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API to SQL statements, executed by the backends. It supports conventional DBMS, as well as big data systems such as Apache Impala or BigQuery. - -## Goals - -Given the growing Python data frame ecosystem, and its complexity, this document provides -a standard Python data frame API. Until recently, pandas has been a de-facto standard for -Python data frames. But currently there are a growing number of not only data frame libraries, -but also libraries that interact with data frames (visualization, statistical or machine learning +Given the growing Python dataframe ecosystem, and its complexity, this document provides +a standard Python dataframe API. Until recently, pandas has been a de-facto standard for +Python dataframes. But currently there are a growing number of not only dataframe libraries, +but also libraries that interact with dataframes (visualization, statistical or machine learning libraries for example). 
Interactions among libraries are becoming complex, and the pandas public API is suboptimal as a standard, for its size, complexity, and implementation details it exposes (for example, using NumPy data types or `NaN`). -The goal of the API described in this document is to provide a standard interface that encapsulates -implementation details of data frame libraries. This will allow users and third-party libraries to -write code that interacts with a standard data frame, and not with specific implementations. - -The defined API does not aim to be a convenient API for all users of data frames. Libraries targeting -specific users (data analysts, data scientists, quants, etc.) can be implemented on top of the -standard API. The standard API is targeted to software developers, who will write reusable code -(as opposed as users performing fast interactive analysis of data). - -See the [scope](#Scope) section for detailed information on what is in scope, and the -[use cases](02_use_cases.html) section for details on the exact use cases considered. - - ## Scope It is in the scope of this document the different elements of the API. This includes signatures @@ -97,22 +81,69 @@ and semantics. To be more specific: - Expected returns of the different operations - Data types (Python and low-level types) -The scope of this document is limited to generic data frames, and not data frames specific to +The scope of this document is limited to generic dataframes, and not dataframes specific to certain domains. -### Out-of-scope and non-goals +### Goals + +The goal of the API described in this document is to provide a standard interface that encapsulates +implementation details of dataframe libraries. This will allow users and third-party libraries to +write code that interacts with a standard dataframe, and not with specific implementations. 
+ +The main goals for the API defined in this document are: + +- Provide a common API for dataframes so software can be developed to communicate with it +- Provide a common API for dataframes to build user interfaces on top of it, for example + libraries for interactive use or specific domains and industries +- Simplify interactions between the projects of the ecosystem, for example, software that + receives data as a dataframe +- Make conversion of data among different implementations easier +- Help user transition from one dataframe library to another + +See the [use cases](02_use_cases.html) section for details on the exact use cases considered. -Implementation details of the data frames and execution of operations. This includes: + +### Out-of-scope + +#### Execution details + +Implementation details of the dataframes and execution of operations. This includes: - How data is represented and stored (whether the data is in memory, disk, distributed) - Expectations on when the execution is happening (in an eager or lazy way) - Other execution details -The API defined in this document needs to be used by libraries as diverse as Ibis, Dask, -Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory. -Any decision that involves assumptions on where the data is stored, or where execution -happens are out of the scope of this document. +**Rationale:** The API defined in this document needs to be used by libraries as diverse as Ibis, +Dask, Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory. +Any decision that involves assumptions on where the data is stored, or where execution happens +could prevent implementation from adopting the standard. + +#### High level APIs + +It is out of scope to provide an API designed for interactive use. While interactive use +is a key aspect of dataframes, an API designed for interactive use can be built on top +of the API defined in this document. 
+
+Domain or industry specific APIs are also out of scope, but can benefit from the standard
+to better interact with the different dataframe implementations.
+
+**Rationale:** Interactive or domain specific users are key in the Python dataframe ecosystem.
+But the amount and diversity of users make it unfeasible to standardize every dataframe feature
+that is currently used. In particular, functionality built as syntactic sugar for convenience in
+interactive use, or heavily overloaded functionality, creates very complex APIs. For example, the pandas dataframe
+constructor, which accepts a huge number of formats, or its `__getitem__` (e.g. `df[something]`)
+which is heavily overloaded. Implementations can provide convenient functionality like this one
+for the users they are targeting, but it is out-of-scope for the standard, so the standard is
+simple and easy to adopt.
+
+
+### Non-goals
+
+- Build an API that is appropriate for all users
+- Have a single dataframe implementation for Python
+- Standardize functionalities specific to a domain or industry
+
 
 ## Stakeholders
 
@@ -121,10 +152,10 @@ This section provides the list of stakeholders considered for the definition of
 
 ### Data frame library authors
 
-Authors of data frame libraries in Python are expected to implement the API defined
-in this document in their libraries.
+We encourage dataframe libraries in Python to implement the API defined in this document
+in their libraries.
 
-The list of known Python data frame libraries at the time of writing this document is next:
+The list of known Python dataframe libraries at the time of writing this document is next:
 
 - [cuDF](https://github.com/rapidsai/cudf)
 - [Dask](https://dask.org/)
@@ -145,8 +176,8 @@ The list of known Python data frame libraries at the time of writing this docume
 
 ### Downstream library authors
 
-Authors of libraries that consume data frames. 
They can use the API defined in this document
-to know how the data contained in a data frame can be consumed, and which operations are implemented.
+Authors of libraries that consume dataframes. They can use the API defined in this document
+to know how the data contained in a dataframe can be consumed, and which operations are implemented.
 
 A non-exhaustive list of downstream library categories is next:
 
@@ -157,23 +188,25 @@ A non-exhaustive list of downstream library categories is next:
 
 ### Upstream library authors
 
-Authors of libraries that provide functionality used by data frames.
+Authors of libraries that provide functionality used by dataframes.
 
 A non-exhaustive list of upstream categories is next:
 
 - Data formats, protocols and libraries for data analytics (e.g. Apache Arrow)
-- Task schedulers (e.g. Dask, Ray)
+- Task schedulers (e.g. Dask, Ray, Mars)
+- Big data systems (e.g. Spark, Hive, Impala, Presto)
+- Libraries for database access (e.g. SQLAlchemy)
 
 
 ### Data frame power users
 
-This group considers developers of reusable code that use data frames. For example, developers of
-applications that use data frames. Or authors of libraries that provide specialized data frame
+This group considers developers of reusable code that use dataframes. For example, developers of
+applications that use dataframes. Or authors of libraries that provide specialized dataframe
 APIs to be built on top of the standard API.
 
-People using data frames in an interactive way are considered out of scope. These users include data
-analysts, data scientist and other users that are key for data frames. But this type of user may need
+People using dataframes in an interactive way are considered out of scope. These users include data
+analysts, data scientists and other users that are key for dataframes. But this type of user may need
 shortcuts, or libraries that take decisions for them to save them time. 
For example
automatic type inference, or excessive use of very compact syntax like Python square
brackets / `__getitem__`. Standardizing on such practices can be extremely difficult,
and it is out of scope.

diff --git a/spec/02_use_cases.md b/spec/02_use_cases.md
index 4fa9abde..725e742d 100644
--- a/spec/02_use_cases.md
+++ b/spec/02_use_cases.md
@@ -2,7 +2,7 @@
 
 ## Introduction
 
-This section discusses the use cases considered for the standard data frame API.
+This section discusses the use cases considered for the standard dataframe API.
 The goals and scope of this API are defined in the [goals](01_purpose_and_scope.html#Goals),
 and [scope](01_purpose_and_scope.html#Scope) sections.
@@ -13,11 +13,11 @@ The target audience and stakeholders are presented in the
 
 ## Types of use cases
 
-The next types of use cases can be accomplished by the use of the standard Python data frame
+The next types of use cases can be accomplished by the use of the standard Python dataframe
 API defined in this document:
 
-- Downstream library receiving a data frame as a parameter
-- Converting a data frame from one implementation to another (try to clarify)
+- Downstream library receiving a dataframe as a parameter
+- Converting a dataframe from one implementation to another (try to clarify)
 
 Other types of use cases not related to data interchange will be added later.
 
@@ -26,13 +26,13 @@
 
 In this section we define concrete examples of the types of use cases defined above.
 
-### Plotting library receiving data as a data frame
+### Plotting library receiving data as a dataframe
 
 One use case we facilitate with the API defined in this document is a plotting library
-receiving the data to be plotted as a data frame object.
+receiving the data to be plotted as a dataframe object.
 
 Consider the case of a scatter plot, that will be plotted with the data contained in a
-data frame structure. 
For example, consider this data:
+dataframe structure. For example, consider this data:
 
 | petal length | petal width |
 |--------------|-------------|
@@ -55,7 +55,7 @@ def scatter_plot(x: list, y: list):
 ...
 ```
 
-When we consider data frames, we would like to provide them directly to the `scatter_plot`
+When we consider dataframes, we would like to provide them directly to the `scatter_plot`
 function. And we would like the plotting library to be agnostic of what specific library
 will be used when calling the function. We would like the code to work whether a pandas,
 Dask, Vaex or other current or future implementation is used.
@@ -71,7 +71,7 @@ def scatter_plot(data: dataframe, x_column: str, y_column: str):
 ```
 
 The API documented here describes what the developer of the plotting library can expect
-from the object `data`. In which ways can interact with the data frame object to extract
+from the object `data`. In which ways it can interact with the dataframe object to extract
 the desired information.
 
 An example of this is Seaborn plots. For example, the
@@ -90,7 +90,7 @@ pandas_df = pandas.DataFrame({'bill': [15, 32, 28],
 seaborn.scatterplot(data=pandas_df, x='bill', y='tip')
 ```
 
-But if we instead provide a Vaex data frame, then an exception occurs:
+But if we instead provide a Vaex dataframe, then an exception occurs:
 
 ```python
 import vaex
@@ -105,7 +105,7 @@ provides an interface very similar to pandas, it does not implement 100% of its
 API, and Seaborn is trying to use parts that differ.
 
 With the definition of the standard API, Seaborn developers should be able to
-expect a generic data frame. And any library implementing the standard data frame
+expect a generic dataframe. And any library implementing the standard dataframe
 API could be plotted with the previous example (Vaex, cuDF, Ibis, Dask, Modin, etc.). 
@@ -113,14 +113,14 @@ API could be plotted with the previous example (Vaex, cuDF, Ibis, Dask, Modin, e Another considered use case is transforming the data from one implementation to another. -As an example, consider we are using Dask data frames, given that our data is too big to +As an example, consider we are using Dask dataframes, given that our data is too big to fit in memory, and we are working over a cluster. At some point in our pipeline, we -reduced the size of the data frame we are working on, by filtering and grouping. And -we are interested in transforming the data frame from Dask to pandas, to use some +reduced the size of the dataframe we are working on, by filtering and grouping. And +we are interested in transforming the dataframe from Dask to pandas, to use some functionalities that pandas implements but Dask does not. -Since Dask knows how the data in the data frame is represented, one option could be to -implement a `.to_pandas()` method in the Dask data frame. Another option could be to +Since Dask knows how the data in the dataframe is represented, one option could be to +implement a `.to_pandas()` method in the Dask dataframe. Another option could be to implement this in pandas, in a `.from_dask()` method. As the ecosystem grows, this solution implies that every implementation could end up @@ -132,9 +132,9 @@ having a long list of functions or methods: - `to_dask()` / `from_dask()` - ... -With a standard Python data frame API, every library could simply implement a method to -import a standard data frame. And since data frame libraries are expected to implement -this API, that would be enough to transform any data frame to one implementation. +With a standard Python dataframe API, every library could simply implement a method to +import a standard dataframe. And since dataframe libraries are expected to implement +this API, that would be enough to transform any dataframe to one implementation. 
So, the list above would be reduced to a single function or method in each implementation:

- `from_dataframe()`

Note that the function `from_dataframe()` is for illustration, and not proposed as part
of the standard at this point.

-Every pair of data frame libraries could benefit from this conversion. But we can go
+Every pair of dataframe libraries could benefit from this conversion. But we can go
 deeper with an actual example. The conversion from an xarray `DataArray` to a pandas
 `DataFrame`, and the other way round.
 
-Even if xarray is not a data frame library, but a miltidimensional labeled structure,
-in cases where a 2-D is used, the data can be converted from and to a data frame.
+Even if xarray is not a dataframe library, but a multidimensional labeled structure,
+in cases where 2-D data is used, the data can be converted from and to a dataframe.
 
 Currently, xarray implements a `.to_pandas()` method to convert a `DataArray` to a
 pandas `DataFrame`:
 
@@ -163,7 +163,7 @@ xarray_data = xarray.DataArray([[15, 2], [32, 5], [28, 3]],
 pandas_df = xarray_data.to_pandas()
 ```
 
-To convert the pandas data frame to an xarray `Data Array`, both libraries have
+To convert the pandas dataframe to an xarray `DataArray`, both libraries have
 implementations. Both lines below are equivalent:
 
 ```python
 pandas_df.to_xarray()
 xarray.DataArray(pandas_df)
 ```
 
-Other data frame implementations may or may not implement a way to convert to xarray.
-And passing a data frame to the `DataArray` constructor may or may not work.
+Other dataframe implementations may or may not implement a way to convert to xarray.
+And passing a dataframe to the `DataArray` constructor may or may not work.
 
-The standard data frame API would allow pandas, xarray and other libraries to
+The standard dataframe API would allow pandas, xarray and other libraries to
 implement the standard API. 
They could convert other representations via a single
-`to_dataframe()` function or method. And they could be converted to other
+`from_dataframe()` function or method. And they could be converted to other
 representations that implement that function automatically.
 
-This would make conversions very simple, not only among data frame libraries, but
+This would make conversions very simple, not only among dataframe libraries, but
 also among other libraries whose data can be expressed as tabular data, such as
 xarray, SQLAlchemy and others.

From e9472c8ad961ff934d56e1ea9136d6faa91dfda0 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Tue, 1 Sep 2020 17:49:25 +0100
Subject: [PATCH 5/7] Addressing comments from reviews

---
 spec/01_purpose_and_scope.md | 31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/spec/01_purpose_and_scope.md b/spec/01_purpose_and_scope.md
index 7d976b21..b1a327ef 100644
--- a/spec/01_purpose_and_scope.md
+++ b/spec/01_purpose_and_scope.md
@@ -10,7 +10,7 @@ column share a common data type. This definition is intentionally left broad.
 
 ## History and dataframe implementations
 
-Data frame libraries in several programming language exist, such as
+Dataframe libraries in several programming languages exist, such as
 [R](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/data.frame),
 [Scala](https://docs.databricks.com/spark/latest/dataframes-datasets/introduction-to-dataframes-scala.html),
 [Julia](https://juliadata.github.io/DataFrames.jl/stable/) and others.
@@ -73,6 +73,10 @@ certain domains.
 
 ## Scope
 
+In the first iteration of the API standard, the scope is limited to creating a data exchange
+protocol. In future iterations the scope will be broader, including elements to operate with
+the data.
+
 It is in the scope of this document the different elements of the API. This includes signatures
 and semantics. 
To be more specific: @@ -87,18 +91,25 @@ certain domains. ### Goals -The goal of the API described in this document is to provide a standard interface that encapsulates +The goal of the first iteration is to provide a data exchange protocol, so consumers of dataframes +can interact with a standard interface to access their data. + +The goal of the of future iterations will be to provide a standard interface that encapsulates implementation details of dataframe libraries. This will allow users and third-party libraries to -write code that interacts with a standard dataframe, and not with specific implementations. +write code that interacts and operates with a standard dataframe, and not with specific implementations. The main goals for the API defined in this document are: -- Provide a common API for dataframes so software can be developed to communicate with it +- Make conversion of data among different implementations easier +- Let third party libraries consuming dataframes receive dataframes from any implementations + +In the future, besides a data exchange protocol, the standard aims to include common operations +done with dataframe, with the next goals in mind: + +- Provide a common API for dataframes so software using dataframes can work with all + implementations - Provide a common API for dataframes to build user interfaces on top of it, for example libraries for interactive use or specific domains and industries -- Simplify interactions between the projects of the ecosystem, for example, software that - receives data as a dataframe -- Make conversion of data among different implementations easier - Help user transition from one dataframe library to another See the [use cases](02_use_cases.html) section for details on the exact use cases considered. @@ -150,7 +161,7 @@ simple and easy to adopt. This section provides the list of stakeholders considered for the definition of this API. 
-### Data frame library authors +### Dataframe library authors We encourage dataframe libraries in Python to implement the API defined in this document in their libraries. @@ -192,13 +203,13 @@ Authors of libraries that provide functionality used by dataframes. A non-exhaustive list of upstream categories is next: -- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow) +- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow, NumPy) - Task schedulers (e.g. Dask, Ray, Mars) - Big data systems (e.g. Spark, Hive, Impala, Presto) - Libraries for database access (e.g. SQLAlchemy) -### Data frame power users +### Dataframe power users This group considers developers of reusable code that use dataframes. For example, developers of From 837f87d577ca6f1ff99bb609170908673a18d4e5 Mon Sep 17 00:00:00 2001 From: Marc Garcia Date: Thu, 3 Sep 2020 11:01:35 +0100 Subject: [PATCH 6/7] Apply suggestions from code review Co-authored-by: Maarten Breddels --- spec/01_purpose_and_scope.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/spec/01_purpose_and_scope.md b/spec/01_purpose_and_scope.md index b1a327ef..797be277 100644 --- a/spec/01_purpose_and_scope.md +++ b/spec/01_purpose_and_scope.md @@ -94,14 +94,14 @@ certain domains. The goal of the first iteration is to provide a data exchange protocol, so consumers of dataframes can interact with a standard interface to access their data. -The goal of the of future iterations will be to provide a standard interface that encapsulates +The goal of future iterations will be to provide a standard interface that encapsulates implementation details of dataframe libraries. This will allow users and third-party libraries to write code that interacts and operates with a standard dataframe, and not with specific implementations. 
The main goals for the API defined in this document are:
 
 - Make conversion of data among different implementations easier
-- Let third party libraries consuming dataframes receive dataframes from any implementations
+- Let third party libraries consume dataframes from any implementation
 
 In the future, besides a data exchange protocol, the standard aims to include common operations
 done with dataframe, with the next goals in mind:
 
From 293c652f984d4c6c681d147d3e5b69d08e2e2c59 Mon Sep 17 00:00:00 2001
From: Marc Garcia
Date: Fri, 4 Sep 2020 18:26:33 +0100
Subject: [PATCH 7/7] Apply suggestions from code review

Co-authored-by: Devin Petersohn
---
 spec/01_purpose_and_scope.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/spec/01_purpose_and_scope.md b/spec/01_purpose_and_scope.md
index 797be277..3f1731d1 100644
--- a/spec/01_purpose_and_scope.md
+++ b/spec/01_purpose_and_scope.md
@@ -35,15 +35,15 @@ make the transition to their libraries easier. Next, there is a short descriptio
 the main dataframe libraries in Python.
 
 [Dask](https://dask.org/) is a task scheduler built in Python, which implements a
-dataframe interface. Dask dataframe use pandas internally in the workers, and it provides
+dataframe interface. Dask dataframe uses pandas internally in the workers, and it provides
 an API similar to pandas, adapted to its distributed and lazy nature.
 
 [Vaex](https://vaex.io/) is an out-of-core alternative to pandas. Vaex uses hdf5 to
 create memory maps that avoid loading data sets to memory. Some parts of Vaex are
 implemented in C++.
 
-[Modin](https://github.com/modin-project/modin) is another distributed dataframe
-library based originally on [Ray](https://github.com/ray-project/ray). 
But built in
+[Modin](https://github.com/modin-project/modin) is a distributed dataframe
+library originally based on [Ray](https://github.com/ray-project/ray), but built in
 a more modular way, that allows it to also use Dask as a scheduler, or replace
 the pandas-like public API by a SQLite-like one.
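
The data interchange idea running through these patches (each library implementing a single `from_dataframe()` entry point instead of pairwise converters) can be sketched in plain Python. This is a minimal, purely illustrative sketch: the names `__dataframe__`, `from_dataframe`, and the two toy classes are assumptions for illustration, not part of the API proposed in this spec.

```python
# Hypothetical sketch of the interchange pattern discussed in the use-cases
# section: a producer exposes its data through a small standard interface,
# and any consumer rebuilds its own structure from that interface.

class StandardDataFrame:
    """Toy stand-in for the standard dataframe interface."""
    def __init__(self, columns):
        self._columns = dict(columns)  # column name -> list of values

    def column_names(self):
        return list(self._columns)

    def get_column(self, name):
        return list(self._columns[name])

class ToyDataFrame:
    """Toy 'implementation-specific' dataframe (think pandas, Vaex, ...)."""
    def __init__(self, columns):
        self.columns = dict(columns)

    def __dataframe__(self):
        # Export the data through the standard interface.
        return StandardDataFrame(self.columns)

def from_dataframe(obj):
    """Build a plain dict-of-lists from any object exposing __dataframe__."""
    std = obj.__dataframe__()
    return {name: std.get_column(name) for name in std.column_names()}

# A consumer only needs from_dataframe(), regardless of where `df` came from.
df = ToyDataFrame({'bill': [15, 32, 28], 'tip': [2, 5, 3]})
data = from_dataframe(df)
```

With this shape, the long `to_pandas()` / `from_dask()` matrix from the conversion use case collapses to one exporter method per producer and one importer function per consumer.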