Skip to content

cumulative functions api design #25

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
semio opened this issue Aug 31, 2016 · 15 comments
Closed

cumulative functions api design #25

semio opened this issue Aug 31, 2016 · 15 comments
Labels

Comments

@semio
Copy link
Owner

semio commented Aug 31, 2016

current format of accumulate:

procedure: accumulate
ingredients:
    - sg-datapoints
options:
    op:
        gdp: "aagr"  # run aagr() on gdp datapoints

but problem is:

  1. functions like aagr() can be run in multiple timeframes, e.g n years growth rate where n is parameter.
  2. we should only calculate these statistics within country, so we should always group them before accumulate.

To fix these problems we should add more options to the procedure, such as

procedure: accumulate
ingredients:
    - sg-datapoints
options:
    groupby:
        - geo
    op:
        gdp:
            func: aagr
            param:
                window: 10

EDIT: If we always need to groupby before accumulate. maybe combine accumulate procedure into groupby?

@jheeffer
Copy link

jheeffer commented Oct 4, 2016

I think there's no groupby for accumulative functions, just the column over which to accumulate (let's call that accumulate_over. Then automatically you 'groupby' all the other key columns.

Starting data

gender geo year co2
male swe 2015 5
female swe 2015 7
male swe 2016 8
female swe 2016 9
male chn 2014 4
female chn 2014 3
male chn 2015 8
female chn 2015 5

Cumulative sum over year

gender geo year co2
male swe 2015 5
female swe 2015 7
male swe 2016 13
female swe 2016 18
male chn 2014 4
female chn 2014 3
male chn 2015 12
female chn 2015 8

You could even imagine a cumulative sum over geo, taking alphabetical order:

gender geo year co2
male chn 2014 4
female chn 2014 3
male chn 2015 8
female chn 2015 5
male swe 2015 13 (8+5)
female swe 2015 12 (5+7)
male swe 2016 8
female swe 2016 9

But a cumulative function over something non-chronological seems kind of nonsensical? I can't think of a use case for that.

Furthermore, I couldn't think of what 'Cumulative sum over time and group by only geo' would mean. The only thing to me that makes sense is: First group by year and geo, summing co2, collapsing gender. Then do cumulative sum over time. But that's just first a groupby and then cumsum.

The groupby you mentioned could be seen as the opposite of accumulate_over. The union of groupby and accumulate_over should be the complete set of keys and they should be disjoint (mutual exclusive). I.e. groupby would contain all the keys which aren't used to accumulate_over and vice versa.

group_by: ['geo','gender'],
accumulate_over: 'time'

Having both would be redundant though. My preference goes to accumulate_over. It seems more clear to me?

So I think we can have the following configuration

{
  procedure: 'accumulate',
  ingredients: 'sg-datapoints'
  result: 'sg-datapoints-accumulated'
  options: {
    accumulate_over: 'time',
    accumulate: {
      co2_emission: 'cumsum',
      income: {
        function: 'aagr',
        window: 10
      }
    }
  }
}

accumulate procedure can only be done on datapoints
accumulate_over must be of type time and be a dimension of the datapoint
accumulate must contain only indicators

Questions:

  1. Do you agree with this procedure configuration and the logic behind it?
  2. Do you like it better to have a separate parameters object for functions with parameters?
  3. The name, accumulate.. is there a better name for these kinds of 'grouping' functions, with a window? Basically, cumsum is just a window-sum with an unlimited window? windowop? I'm not sure. accumulate seems too narrow, e.g. aagr isn't cumulative at all.

Description of cumulative functions

accumulate applies a cumulative function on separate subsets of rows.

  1. The subsets of rows are defined by rows that have the same key, if the concept in accumulate_over is disregarded.
  2. The subsets of rows are then sorted by the accumulate_over concept.
  3. The cumulative function is applied to each subset of rows.

E.g.

Start with the below table and procedure: Cumulative sum of co2 over year

{
  procedure: 'accumulate',
  options: {
    accumulate_over: 'time',
    accumulate: {
      co2: 'cumsum',
    }
  }
}
gender geo year co2
male swe 2015 5
female swe 2015 7
male swe 2016 8
female swe 2016 9
male chn 2014 4
female chn 2014 3
male chn 2015 8
female chn 2015 5

The rows that cumsum will be applied to are the ones where the key is the same, when not regarding accumulate_over. I.e. where geo and gender are the same, regardless of year.
For the ease of overview the following table was first sorted by year and then the other two keys, so that the subsets occur together in the table, sorted by year.

gender geo year co2
female chn 2014 3
female chn 2015 5
female swe 2015 7
female swe 2016 9
male chn 2014 4
male chn 2015 8
male swe 2015 5
male swe 2016 8

Then, the cumsum function is applied to co2 in each group

gender geo year co2
female chn 2014 3
female chn 2015 8
female swe 2015 7
female swe 2016 16
male chn 2014 4
male chn 2015 12
male swe 2015 5
male swe 2016 13

@jheeffer
Copy link

jheeffer commented Oct 4, 2016

Windowed function over multiple columns

This is doable, following the logic described above. First sort by the accumulate_over columns, then sort by the other key columns. Form groups where the 'other key' columns are equal. Apply cumulative function per group.

In this case the order of accumulate_over columns matters.

image

{
  procedure: 'accumulate',
  options: {
    accumulate_over: ['year','geo'] || ['geo','year'],
    accumulate: {
      co2: 'cumsum',
    }
  }
}

image

However I have no clue when you would use something like this, so it doesn't need to be supported (until we find a use case).

@semio
Copy link
Owner Author

semio commented Oct 17, 2016

Thanks for the detail explanation! l understand there are a few problems for the current api:

  1. accumulate should be able to apply to any of the columns
  2. There should be no groupby for accumulate function, we just first groupby and then apply accumulative function.
  3. accumulate is too narrow, and aagr is not accumulative at all.

I agree your points and so I am thinking that if we can have a more general way to do applying functions to a ingredient. In pandas, there are a few ways to do this:

  • dataframe.apply(function): apply function to each row or column
  • dataframe.groupby.apply(function): apply function to each group
  • dataframe.rolling.apply(function): apply function to each rolling windows
  • and there are a few more, you can find it in the doc

I think we can follow this setup and use the name apply. Because accumulate_over columns are used in sorting the data before running the function, we can change to name to sort_by.

{
  procedure: 'apply',
  ingredients: ['cdiac-datapoints']
  options: {
    sort_by: ['year','geo'] || ['geo','year'],
    apply: {
      co2: 'cumsum'
    }
  }
}

As discussed in #4, we can return a grouped datapoint without aggregate. So we can run groupby first and apply a function to the grouped result:

[
  {
    procedure: 'groupby',
    ingredients: ['cdiac-datapoints'],
    options: {
      groupby: 'geo'
    },
    result: 'cdiac-datapoints-grouped'
  },
  {
    procedure: 'apply',
    ingredients: ['cdiac-datapoints-grouped'],
    options: {
      sort_by: ['year','geo'] || ['geo','year'],
      apply: {
        co2: 'cumsum'
      }
    }
  }
]

For windowed functions, we can treat it as a function that have a window parameter. I think we don't need an other rolling procedure for now.

[
  {
    procedure: 'groupby',
    ingredients: ['cdiac-datapoints'],
    options: {
      groupby: 'geo'
    },
    result: 'cdiac-datapoints-grouped'
  },
  {
    procedure: 'apply',
    ingredients: ['cdiac-datapoints-grouped'],
    options: {
      sort_by: ['year','geo'] || ['geo','year'],
      apply: {
        co2: {
          function: 'aagr',
          window: 10
        }
      }
    }
  }
]

Basically above is same as you recommended, but I just change the name to apply. Do you think it's OK?

@jheeffer
Copy link

jheeffer commented Dec 8, 2016

I'll read into this a bit more, I think we really should try to follow pandas here, they though this out pretty well I think. Maybe we can support a subset of their window functions: http://pandas.pydata.org/pandas-docs/stable/api.html#window
Also use the same options as they do for their functions. It will make the mapping from recipe to pandas code a lot easier too I guess.

@jheeffer
Copy link

jheeffer commented Dec 8, 2016

In general I have more and more the feeling we're creating a declarative layer on top of the pandas library, with some extra functions (trend bridges) and input formats (ddf, dictionaries). Not necessarily a bad thing though, but good to have in mind.

@jheeffer
Copy link

jheeffer commented Dec 8, 2016

Okay, I dove into the pandas functions.

First I noticed from the docs and in your examples above is that your conception of groupby is different from mine. Probably because of our backgrounds (me SQL and you python/pandas).

In pandas, running groupby on a dataframe returns a 'grouping' object. You can then apply different functions on that grouping. You can either transform, aggregate or filter per group.
In SQL, running groupby on a table returns a table. There is no intermediate result or functions like transform or filter. There is table as result and only aggregate functions. All columns in the result are either in group_by or aggregate functions.

I like the additional functionality of groupby in pandas though. However, we cannot have a grouping object be the output of a procedure. For simplicities sake, all procedures should always output ingredients (dataframes). Otherwise you're going into typing ingredients, which adds a new layer of complexity.
So any groupby + function should be in one procedure I think.

Similar for window functions, both rolling windows and expanding windows. The definition of the window and the functions to apply should be in one procedure.

On the note of expanding windows: A cumsum is very similar to sum over expanding window, except for the handling of NaN's (see note in this section).

@jheeffer
Copy link

jheeffer commented Dec 8, 2016

Now, let's define declarative procedures for these : )

@jheeffer
Copy link

jheeffer commented Dec 9, 2016

One group by with aggregate, transform, filter option (only one allowed)
One window with aggregate option. Window size integer or time value is rolling, window size 'expanding' is expanding window.
If it's not clear, I'll add examples tomorrow.

@semio
Copy link
Owner Author

semio commented Dec 9, 2016

I see, thanks for making it clear :)

yes, it'd be good if you can write the specs, would make it easier for me to implement them :)

@jheeffer
Copy link

jheeffer commented Dec 9, 2016

Both groupby and window define a group of rows on which to apply a certain function. They differ in how they define that group (static grouping vs changing window), what type of functions they allow on this group (aggregate, transform or filter vs just aggregate) and how the results of the functions are saved (back to the group vs to one position in the group (edge/center of window))

procedure group functions result
groupby static group aggregate
transform
filter
at once applies to all rows in group
window changing group
(rolling/expanding)
aggregate updates one row in group
(typically window edge or center)

groupby

- procedure: groupby
  ingredient: population_and_foo_by_year_country_age_gender_education
  options:
    groupby: ["year","country"] || country
    aggregate:
      population: sum
      foo:
        function: bar
        param1: baz
    transform:
      population: foo
    filter:
      population: foo
  result: population_by_year_country
  • Only one of aggregate,transform or filter can be used in one procedure.
  • Any columns not mentioned in groupby or functions are dropped.

window

- procedure: window
  ingredient: immigration_surplus_by_year_country_gender
  options:
    window: 
      column: year # column which window is created from ("on" parameter in df.rolling)
      size: 5 || "expanding" # if positive integer or time offset, rolling window, if expanding, expanding window
      min_periods: 1 #optional, as in pandas
      center: false # optional, as in pandas
    aggregate:
      immigration_surplus: sum
      foo:
        function: bar
        param1: baz
  result: cumulative_immigration_surplus
  • In multidimensional data, like datapoints, you want to apply the rolling function over groupings of the keys, except for the window column. In pandas implementation, you probably need some groupby to support multidimensional rolling: (not tested)
df.groupby(by=["country","gender"]).expanding(on="year").sum()

advanced: lambda functions

Both groupby and window have this function section, where we define what functions to run on what column.
In pandas, you are allowed to declare your own functions for groupby and window. Should we allow this too?

    transform:
      foo: "lambda x: (x - x.mean()) / x.std()"
  • My main concern: Does this pose any security risks?

@semio
Copy link
Owner Author

semio commented Dec 9, 2016

groupby and window looks good to me, I will try to implement them.

For lambda functions, yes, security is the main concern. This is mainly because we need to use eval() to run the code, and eval can run any code provided. There are ways to limit the functions which can call by eval, but I think they are still not secure enough. See here.

I will dig into the lambda issue a bit later, see if there is a good way to do this.

@semio
Copy link
Owner Author

semio commented Dec 12, 2016

Because running python code in eval() is not safe and there is no other way to run python codes in recipe, I think we'd better just provide some pre-defined functions for recipe users. If we want the ability to create custom functions, we could make a function definition block in recipe and parse it. Just don't include python codes in the recipe.

But of course, if we choose to trust the people writing recipes, we can enable this feature in recipe.

semio added a commit that referenced this issue Dec 12, 2016
related issue: #25

see the test_groupby.yaml for an example recipe
@semio semio changed the title accumulate() should accept more parameters cumulative functions api design Dec 13, 2016
semio added a commit that referenced this issue Dec 13, 2016
related issue: #25

Note: there is a bug using groupby with rolling on specific column for now, so
we are not using the `on` parameter in rolling.
pandas-dev/pandas#13966
@semio
Copy link
Owner Author

semio commented Dec 13, 2016

on window function: there is a bug using groupby with rolling on specific column for now, so I don't use the on parameter in rolling. (see the link in my last commit messages)

this should have little impact to us. Because we have sorted our dimensions and we will group all other keys. So the order will be the same as if we are rolling on the target column.

@jheeffer
Copy link

How does the script choose which column to roll on then?

@semio
Copy link
Owner Author

semio commented Dec 13, 2016

We choose by grouping by all keys except for the column to roll on.

For example, the dataframe have index column ['geo', 'gender', 'year']

the column to roll is 'year'

then:

  1. firstly group the dataframe by ['geo', 'gender'] (which is ['geo', 'gender', 'year'].remove('year'))
  2. then each group have index 'year'
  3. not using the on parameter equals to rolling on index. In our case, because the index is 'year' column, so rolling on index is same as rolling on 'year' column

@semio semio closed this as completed Jan 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants