Expressions - another attempt #346

Closed
@MarcoGorelli

I'd just like to make another case for the expressions syntax, and to see whether a simpler version of the proposal I'd previously put together might be acceptable.

The current syntax isn't sufficient for new libraries

Great Tables (a new project by Posit) has recently introduced native Polars support, and the way they did it really caught my attention. They didn't just replicate their pandas support using Polars functions; instead, they made expressions a main focus: https://posit-dev.github.io/great-tables/blog/polars-styling/ . The whole thing's worth reading, but I particularly want to draw your attention to this passage:

As it turns out, polars expressions make styling tables very straightforward. The same polars code that you would use to select or filter combines with Great Tables to highlight, circle, or bolden text.

In this post, I’ll show how Great Tables uses polars expressions to make delightful tables, like the one below.

I was expecting this to happen, and I expect it'll happen a whole load more. If new libraries lean in to the expressions syntax, then the Standard will be dead on arrival.

If we want to promote the Standard, we need to keep up with the times. This requires extra work, but so does anything worthwhile.

The current rules break method chaining

Let's take the following:

  • left-join lineitem and supplier on 'a'
  • keep only the rows where column 'a' plus column 'b' is greater than 0
  • double the values of column 'a', and keep only that and column 'd'

You might expect to be able to do this with:

(
    lineitem.join(supplier, on="a", how="left")
    .filter((lineitem.col("a") + lineitem.col("b")) > 0)
    .assign(lineitem.col("a") * 2)
    .select("a", "d")
)

However, it will raise, because lineitem.col('a') was derived from a different dataframe than lineitem.join(supplier, on='a', how='left'), and that's not allowed. (Yes, I'm aware that you can work around this with temporary variables, but my point is: method chaining is very popular among dataframe users and devs. Are we sure we don't want to support it?)
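To make the failure concrete, here is a toy model of the "a column must come from the dataframe it is used with" rule. Everything in it is an assumption made for illustration: the `Column` and `DataFrame` classes, the `with_columns` method (a stand-in for any operation, such as a join, that returns a new frame), and the choice of `ValueError` are all hypothetical and are not the Standard's actual implementation.

```python
class Column:
    """Toy column that remembers which dataframe it was taken from."""

    def __init__(self, parent, values):
        self.parent = parent
        self.values = list(values)

    def __add__(self, other):
        summed = [x + y for x, y in zip(self.values, other.values)]
        return Column(self.parent, summed)

    def __gt__(self, scalar):
        return Column(self.parent, [x > scalar for x in self.values])


class DataFrame:
    """Toy dataframe backed by a dict of equal-length lists."""

    def __init__(self, data):
        self.data = {name: list(vals) for name, vals in data.items()}

    def col(self, name):
        return Column(self, self.data[name])

    def with_columns(self, extra):
        # Hypothetical stand-in for a join: returns a *new* frame.
        return DataFrame({**self.data, **extra})

    def filter(self, mask):
        # The rule under discussion: the mask must belong to this exact frame.
        if mask.parent is not self:
            raise ValueError("column was derived from a different dataframe")
        return DataFrame(
            {
                name: [v for v, keep in zip(vals, mask.values) if keep]
                for name, vals in self.data.items()
            }
        )


lineitem = DataFrame({"a": [1, -5, 3], "b": [1, 1, 1]})
joined = lineitem.with_columns({"d": [10, 20, 30]})

# Chaining-style usage: the mask is built from `lineitem`, not `joined`.
try:
    joined.filter((lineitem.col("a") + lineitem.col("b")) > 0)
    chained_call_raised = False
except ValueError:
    chained_call_raised = True

# The workaround: name the intermediate frame and take columns from it.
filtered = joined.filter((joined.col("a") + joined.col("b")) > 0)
```

In a method chain there is no name for the intermediate frame, so only the first form can be written, and that is the one that raises.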

With expressions, though, there's no issue:

(
    lineitem.join(supplier, on="a", how="left")
    .filter((pdx.col("a") + pdx.col("b")) > 0)
    .select(pdx.col("a") * 2, pdx.col("d"))
)

You also don't need the extra assign call.
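A minimal sketch of why this works, under the assumption that an expression is just a deferred recipe that gets evaluated against whichever frame it is handed to. The `Expr`, `col`, and `DataFrame` names below are hypothetical toys written for this example (with `with_columns` standing in for the join); they are not the `pdx` namespace or Polars itself.

```python
import operator


class Expr:
    """Toy deferred expression: evaluated only when given a frame."""

    def __init__(self, evaluate, name):
        self._evaluate = evaluate
        self.name = name  # output column name, inherited from the left operand

    def evaluate(self, frame):
        return self._evaluate(frame)

    def _binary(self, other, op):
        def run(frame):
            left = self.evaluate(frame)
            if isinstance(other, Expr):
                right = other.evaluate(frame)
            else:
                right = [other] * len(left)  # broadcast a scalar
            return [op(x, y) for x, y in zip(left, right)]

        return Expr(run, self.name)

    def __add__(self, other):
        return self._binary(other, operator.add)

    def __mul__(self, other):
        return self._binary(other, operator.mul)

    def __gt__(self, other):
        return self._binary(other, operator.gt)


def col(name):
    # No dataframe is referenced here; the lookup happens at evaluation time.
    return Expr(lambda frame: frame.data[name], name)


class DataFrame:
    """Toy dataframe backed by a dict of equal-length lists."""

    def __init__(self, data):
        self.data = {name: list(vals) for name, vals in data.items()}

    def with_columns(self, extra):
        # Hypothetical stand-in for a join: returns a new frame.
        return DataFrame({**self.data, **extra})

    def filter(self, predicate):
        mask = predicate.evaluate(self)
        return DataFrame(
            {
                name: [v for v, keep in zip(vals, mask) if keep]
                for name, vals in self.data.items()
            }
        )

    def select(self, *exprs):
        return DataFrame({e.name: e.evaluate(self) for e in exprs})


lineitem = DataFrame({"a": [1, -5, 3], "b": [1, 1, 1]})
result = (
    lineitem.with_columns({"d": [10, 20, 30]})
    .filter((col("a") + col("b")) > 0)
    .select(col("a") * 2, col("d"))
)
```

Because `col("a")` never refers to a particular dataframe, the same expression is valid at every step of the chain, and no intermediate variable is needed.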

It's not a zero-cost abstraction

The current syntax is also not a zero-cost abstraction on Polars - trying to use only two objects (Column, DataFrame) to represent four (Series, Expr, DataFrame, LazyFrame) means that the resulting code isn't going to be as efficient as it could be:

df: pl.DataFrame
df.filter((df['a']+df['b'])>0)

is less efficient than

df: pl.DataFrame
df.filter((pl.col('a')+pl.col('b'))>0)

and the current API, in the persisted case, resolves to the first one. I don't see a way around this, unfortunately.

Telling people "you can use the standard if you want, but it'll be more efficient to use the Polars API directly" is a recipe for people just using Polars and forgetting about the Standard. I'm calling it.

The way forwards

We don't necessarily need to separate DataFrame from LazyFrame. But I'm once again making the case for Expr being separate from Column.

@shwina @kkraus14 if I made a simpler version of this summer's proposal, would you be open to reconsidering this? I'm tagging you two specifically because, as far as I remember, everyone else was positive about it.

Alternatives

We need to do something here. I don't want my name on a standard which is just a "pandas minus".
