Description
I'd just like to make another case for the expressions syntax, and to see if a simpler version of the proposal I'd previously put together might be acceptable.
The current syntax isn't sufficient for new libraries
Great Tables (a new project by Posit) has recently introduced native Polars support - the way they did it really caught my attention. They didn't just repeat their pandas support with Polars functions - instead, they really make expressions a main focus: https://posit-dev.github.io/great-tables/blog/polars-styling/. The whole thing's worth reading, but I really want to draw your attention to:
As it turns out, polars expressions make styling tables very straightforward. The same polars code that you would use to select or filter combines with Great Tables to highlight, circle, or bolden text.
In this post, I’ll show how Great Tables uses polars expressions to make delightful tables, like the one below.
I was expecting this to happen, and I expect it'll happen a whole load more. If new libraries lean in to the expressions syntax, then the Standard will be dead on arrival.
If we want to promote the Standard, we need to keep up with the times. This requires extra work, but so does anything worthwhile.
The current rules break method chaining
Let's take the following:
- join lineitem and supplier on 'a' (left join)
- we only keep rows where column 'a' plus column 'b' is greater than 0
- double the value of column 'a' and only keep that and column 'd'
You might expect to be able to do this with:
(
lineitem.join(supplier, on="a", how="left")
.filter((lineitem.col("a") + lineitem.col("b")) > 0)
.assign(lineitem.col("a") * 2)
.select("a", "d")
)
However, it will raise, because lineitem.col('a') was derived from a different dataframe than lineitem.join(supplier, on='a', how='left'), and that's not allowed. (Yes, I'm aware that you can work around this with temporary variables, but my point is: method chaining is very popular among dataframe users and devs - are we sure we don't want to support it?)
With expressions, though, there's no issue:
(
lineitem.join(supplier, on="a", how="left")
.filter((pdx.col("a") + pdx.col("b")) > 0)
.select(pdx.col("a") * 2, pdx.col("d"))
)
You also don't need the extra assign statement.
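To make the difference concrete, here is a toy, pure-Python sketch of why a deferred expression survives method chaining. Frame, Expr and col below are hypothetical stand-ins I've made up for illustration; this is not the Standard's API, and not how Polars implements expressions. The only point is that an expression is a recipe which gets resolved against whichever frame it is handed to, so it never belongs to a "parent" dataframe.

```python
class Expr:
    """Hypothetical deferred expression: a recipe for a value, not data."""

    def __init__(self, fn):
        # fn maps one row (a dict of column name -> value) to a value
        self.fn = fn

    def __add__(self, other):
        other = _lift(other)
        return Expr(lambda row: self.fn(row) + other.fn(row))

    def __mul__(self, other):
        other = _lift(other)
        return Expr(lambda row: self.fn(row) * other.fn(row))

    def __gt__(self, other):
        other = _lift(other)
        return Expr(lambda row: self.fn(row) > other.fn(row))


def _lift(value):
    """Wrap plain Python values so they can be combined with Expr objects."""
    return value if isinstance(value, Expr) else Expr(lambda row: value)


def col(name):
    """Refer to a column by name only; no dataframe is involved yet."""
    return Expr(lambda row: row[name])


class Frame:
    """Hypothetical eager frame, stored as a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def filter(self, predicate):
        # The predicate is evaluated against *this* frame's rows.
        return Frame(r for r in self.rows if predicate.fn(r))

    def select(self, **exprs):
        # Each expression is evaluated against *this* frame's rows.
        return Frame(
            {name: e.fn(r) for name, e in exprs.items()} for r in self.rows
        )


frame = Frame([
    {"a": 1, "b": 2, "d": "x"},
    {"a": -5, "b": 1, "d": "y"},
])

# filter() returns a new Frame, yet the expressions passed to select()
# still work, because they were never tied to the original frame.
result = (
    frame
    .filter((col("a") + col("b")) > 0)
    .select(a=col("a") * 2, d=col("d"))
)
print(result.rows)  # [{'a': 2, 'd': 'x'}]
```

A column object pulled out of one specific dataframe carries that dataframe's identity with it, which is exactly what trips up the chained version above.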
It's not a zero-cost abstraction
The current syntax is also not a zero-cost abstraction on Polars - trying to use only two objects (Column, DataFrame) to represent four (Series, Expr, DataFrame, LazyFrame) means that the resulting code isn't going to be as efficient as it could be:
df: pl.DataFrame
df.filter((df['a']+df['b'])>0)
is less efficient than
df: pl.DataFrame
df.filter((pl.col('a')+pl.col('b'))>0)
and the current API, in the persisted case, resolves to the first one. I don't see a way around this, unfortunately.
Telling people "you can use the standard if you want, but it'll be more efficient to use the Polars API directly" is a recipe for people just using Polars and forgetting about the Standard. I'm calling it.
The way forwards
We don't necessarily need to separate DataFrame from LazyFrame. But I'm once again making the case for Expr being separate from Column.
@shwina @kkraus14 if I made a simpler version of this summer's proposal, would you be open to reconsidering this? I'm tagging you two specifically because, as far as I remember, everyone else was positive about it.
Alternatives
We need to do something here. I don't want my name on a standard that is just a "pandas minus".