cumulative functions api design #25

semio · 2016-08-31T07:01:23Z

current format of accumulate:

procedure: accumulate
ingredients:
    - sg-datapoints
options:
    op:
        gdp: "aagr"  # run aagr() on gdp datapoints

but problem is:

functions like aagr() can be run in multiple timeframes, e.g n years growth rate where n is parameter.
we should only calculate these statistics within country, so we should always group them before accumulate.

To fix these problems we should add more options to the procedure, such as

procedure: accumulate
ingredients:
    - sg-datapoints
options:
    groupby:
        - geo
    op:
        gdp:
            func: aagr
            param:
                window: 10

EDIT: If we always need to groupby before accumulate. maybe combine accumulate procedure into groupby?

The text was updated successfully, but these errors were encountered:

jheeffer · 2016-10-04T20:20:20Z

I think there's no groupby for accumulative functions, just the column over which to accumulate (let's call that accumulate_over. Then automatically you 'groupby' all the other key columns.

Starting data

gender	geo	year	co2
male	swe	2015	5
female	swe	2015	7
male	swe	2016	8
female	swe	2016	9
male	chn	2014	4
female	chn	2014	3
male	chn	2015	8
female	chn	2015	5

Cumulative sum over year

gender	geo	year	co2
male	swe	2015	5
female	swe	2015	7
male	swe	2016	13
female	swe	2016	18
male	chn	2014	4
female	chn	2014	3
male	chn	2015	12
female	chn	2015	8

You could even imagine a cumulative sum over geo, taking alphabetical order:

gender	geo	year	co2
male	chn	2014	4
female	chn	2014	3
male	chn	2015	8
female	chn	2015	5
male	swe	2015	13 (8+5)
female	swe	2015	12 (5+7)
male	swe	2016	8
female	swe	2016	9

But a cumulative function over something non-chronological seems kind of nonsensical? I can't think of a use case for that.

Furthermore, I couldn't think of what 'Cumulative sum over time and group by only geo' would mean. The only thing to me that makes sense is: First group by year and geo, summing co2, collapsing gender. Then do cumulative sum over time. But that's just first a groupby and then cumsum.

The groupby you mentioned could be seen as the opposite of accumulate_over. The union of groupby and accumulate_over should be the complete set of keys and they should be disjoint (mutual exclusive). I.e. groupby would contain all the keys which aren't used to accumulate_over and vice versa.

group_by: ['geo','gender'],
accumulate_over: 'time'

Having both would be redundant though. My preference goes to accumulate_over. It seems more clear to me?

So I think we can have the following configuration

{
  procedure: 'accumulate',
  ingredients: 'sg-datapoints'
  result: 'sg-datapoints-accumulated'
  options: {
    accumulate_over: 'time',
    accumulate: {
      co2_emission: 'cumsum',
      income: {
        function: 'aagr',
        window: 10
      }
    }
  }
}

accumulate procedure can only be done on datapoints
accumulate_over must be of type time and be a dimension of the datapoint
accumulate must contain only indicators

Questions:

Do you agree with this procedure configuration and the logic behind it?
Do you like it better to have a separate parameters object for functions with parameters?
The name, accumulate.. is there a better name for these kinds of 'grouping' functions, with a window? Basically, cumsum is just a window-sum with an unlimited window? windowop? I'm not sure. accumulate seems too narrow, e.g. aagr isn't cumulative at all.

Description of cumulative functions

accumulate applies a cumulative function on separate subsets of rows.

The subsets of rows are defined by rows that have the same key, if the concept in accumulate_over is disregarded.
The subsets of rows are then sorted by the accumulate_over concept.
The cumulative function is applied to each subset of rows.

E.g.

Start with the below table and procedure: Cumulative sum of co2 over year

{
  procedure: 'accumulate',
  options: {
    accumulate_over: 'time',
    accumulate: {
      co2: 'cumsum',
    }
  }
}

gender	geo	year	co2
male	swe	2015	5
female	swe	2015	7
male	swe	2016	8
female	swe	2016	9
male	chn	2014	4
female	chn	2014	3
male	chn	2015	8
female	chn	2015	5

The rows that cumsum will be applied to are the ones where the key is the same, when not regarding accumulate_over. I.e. where geo and gender are the same, regardless of year.
For the ease of overview the following table was first sorted by year and then the other two keys, so that the subsets occur together in the table, sorted by year.

gender	geo	year	co2
female	chn	2014	3
female	chn	2015	5
female	swe	2015	7
female	swe	2016	9
male	chn	2014	4
male	chn	2015	8
male	swe	2015	5
male	swe	2016	8

Then, the cumsum function is applied to co2 in each group

gender	geo	year	co2
female	chn	2014	3
female	chn	2015	8
female	swe	2015	7
female	swe	2016	16
male	chn	2014	4
male	chn	2015	12
male	swe	2015	5
male	swe	2016	13

jheeffer · 2016-10-04T21:38:55Z

Windowed function over multiple columns

This is doable, following the logic described above. First sort by the accumulate_over columns, then sort by the other key columns. Form groups where the 'other key' columns are equal. Apply cumulative function per group.

In this case the order of accumulate_over columns matters.

{
  procedure: 'accumulate',
  options: {
    accumulate_over: ['year','geo'] || ['geo','year'],
    accumulate: {
      co2: 'cumsum',
    }
  }
}

However I have no clue when you would use something like this, so it doesn't need to be supported (until we find a use case).

semio · 2016-10-17T12:52:54Z

Thanks for the detail explanation! l understand there are a few problems for the current api:

accumulate should be able to apply to any of the columns
There should be no groupby for accumulate function, we just first groupby and then apply accumulative function.
accumulate is too narrow, and aagr is not accumulative at all.

I agree your points and so I am thinking that if we can have a more general way to do applying functions to a ingredient. In pandas, there are a few ways to do this:

dataframe.apply(function): apply function to each row or column
dataframe.groupby.apply(function): apply function to each group
dataframe.rolling.apply(function): apply function to each rolling windows
and there are a few more, you can find it in the doc

I think we can follow this setup and use the name apply. Because accumulate_over columns are used in sorting the data before running the function, we can change to name to sort_by.

{
  procedure: 'apply',
  ingredients: ['cdiac-datapoints']
  options: {
    sort_by: ['year','geo'] || ['geo','year'],
    apply: {
      co2: 'cumsum'
    }
  }
}

As discussed in #4, we can return a grouped datapoint without aggregate. So we can run groupby first and apply a function to the grouped result:

[
  {
    procedure: 'groupby',
    ingredients: ['cdiac-datapoints'],
    options: {
      groupby: 'geo'
    },
    result: 'cdiac-datapoints-grouped'
  },
  {
    procedure: 'apply',
    ingredients: ['cdiac-datapoints-grouped'],
    options: {
      sort_by: ['year','geo'] || ['geo','year'],
      apply: {
        co2: 'cumsum'
      }
    }
  }
]

For windowed functions, we can treat it as a function that have a window parameter. I think we don't need an other rolling procedure for now.

[
  {
    procedure: 'groupby',
    ingredients: ['cdiac-datapoints'],
    options: {
      groupby: 'geo'
    },
    result: 'cdiac-datapoints-grouped'
  },
  {
    procedure: 'apply',
    ingredients: ['cdiac-datapoints-grouped'],
    options: {
      sort_by: ['year','geo'] || ['geo','year'],
      apply: {
        co2: {
          function: 'aagr',
          window: 10
        }
      }
    }
  }
]

Basically above is same as you recommended, but I just change the name to apply. Do you think it's OK?

jheeffer · 2016-12-08T10:13:45Z

I'll read into this a bit more, I think we really should try to follow pandas here, they though this out pretty well I think. Maybe we can support a subset of their window functions: http://pandas.pydata.org/pandas-docs/stable/api.html#window
Also use the same options as they do for their functions. It will make the mapping from recipe to pandas code a lot easier too I guess.

jheeffer · 2016-12-08T10:24:43Z

In general I have more and more the feeling we're creating a declarative layer on top of the pandas library, with some extra functions (trend bridges) and input formats (ddf, dictionaries). Not necessarily a bad thing though, but good to have in mind.

jheeffer · 2016-12-08T20:45:17Z

Okay, I dove into the pandas functions.

First I noticed from the docs and in your examples above is that your conception of groupby is different from mine. Probably because of our backgrounds (me SQL and you python/pandas).

In pandas, running groupby on a dataframe returns a 'grouping' object. You can then apply different functions on that grouping. You can either transform, aggregate or filter per group.
In SQL, running groupby on a table returns a table. There is no intermediate result or functions like transform or filter. There is table as result and only aggregate functions. All columns in the result are either in group_by or aggregate functions.

I like the additional functionality of groupby in pandas though. However, we cannot have a grouping object be the output of a procedure. For simplicities sake, all procedures should always output ingredients (dataframes). Otherwise you're going into typing ingredients, which adds a new layer of complexity.
So any groupby + function should be in one procedure I think.

Similar for window functions, both rolling windows and expanding windows. The definition of the window and the functions to apply should be in one procedure.

On the note of expanding windows: A cumsum is very similar to sum over expanding window, except for the handling of NaN's (see note in this section).

jheeffer · 2016-12-08T20:49:37Z

Now, let's define declarative procedures for these : )

jheeffer · 2016-12-09T01:05:59Z

One group by with aggregate, transform, filter option (only one allowed)
One window with aggregate option. Window size integer or time value is rolling, window size 'expanding' is expanding window.
If it's not clear, I'll add examples tomorrow.

semio · 2016-12-09T07:35:17Z

I see, thanks for making it clear :)

yes, it'd be good if you can write the specs, would make it easier for me to implement them :)

jheeffer · 2016-12-09T09:19:32Z

Both groupby and window define a group of rows on which to apply a certain function. They differ in how they define that group (static grouping vs changing window), what type of functions they allow on this group (aggregate, transform or filter vs just aggregate) and how the results of the functions are saved (back to the group vs to one position in the group (edge/center of window))

procedure	group	functions	result
groupby	static group	aggregate transform filter	at once applies to all rows in group
window	changing group (rolling/expanding)	aggregate	updates one row in group (typically window edge or center)

groupby

- procedure: groupby
  ingredient: population_and_foo_by_year_country_age_gender_education
  options:
    groupby: ["year","country"] || country
    aggregate:
      population: sum
      foo:
        function: bar
        param1: baz
    transform:
      population: foo
    filter:
      population: foo
  result: population_by_year_country

Only one of aggregate,transform or filter can be used in one procedure.
Any columns not mentioned in groupby or functions are dropped.

window

- procedure: window
  ingredient: immigration_surplus_by_year_country_gender
  options:
    window: 
      column: year # column which window is created from ("on" parameter in df.rolling)
      size: 5 || "expanding" # if positive integer or time offset, rolling window, if expanding, expanding window
      min_periods: 1 #optional, as in pandas
      center: false # optional, as in pandas
    aggregate:
      immigration_surplus: sum
      foo:
        function: bar
        param1: baz
  result: cumulative_immigration_surplus

In multidimensional data, like datapoints, you want to apply the rolling function over groupings of the keys, except for the window column. In pandas implementation, you probably need some groupby to support multidimensional rolling: (not tested)

df.groupby(by=["country","gender"]).expanding(on="year").sum()

advanced: lambda functions

Both groupby and window have this function section, where we define what functions to run on what column.
In pandas, you are allowed to declare your own functions for groupby and window. Should we allow this too?

    transform:
      foo: "lambda x: (x - x.mean()) / x.std()"

My main concern: Does this pose any security risks?

semio · 2016-12-09T16:58:50Z

groupby and window looks good to me, I will try to implement them.

For lambda functions, yes, security is the main concern. This is mainly because we need to use eval() to run the code, and eval can run any code provided. There are ways to limit the functions which can call by eval, but I think they are still not secure enough. See here.

I will dig into the lambda issue a bit later, see if there is a good way to do this.

semio · 2016-12-12T00:49:53Z

Because running python code in eval() is not safe and there is no other way to run python codes in recipe, I think we'd better just provide some pre-defined functions for recipe users. If we want the ability to create custom functions, we could make a function definition block in recipe and parse it. Just don't include python codes in the recipe.

But of course, if we choose to trust the people writing recipes, we can enable this feature in recipe.

related issue: #25 see the test_groupby.yaml for an example recipe

related issue: #25 Note: there is a bug using groupby with rolling on specific column for now, so we are not using the `on` parameter in rolling. pandas-dev/pandas#13966

semio · 2016-12-13T02:20:02Z

on window function: there is a bug using groupby with rolling on specific column for now, so I don't use the on parameter in rolling. (see the link in my last commit messages)

this should have little impact to us. Because we have sorted our dimensions and we will group all other keys. So the order will be the same as if we are rolling on the target column.

jheeffer · 2016-12-13T07:36:23Z

How does the script choose which column to roll on then?

semio · 2016-12-13T11:48:11Z

We choose by grouping by all keys except for the column to roll on.

For example, the dataframe have index column ['geo', 'gender', 'year']

the column to roll is 'year'

then:

firstly group the dataframe by ['geo', 'gender'] (which is ['geo', 'gender', 'year'].remove('year'))
then each group have index 'year'
not using the on parameter equals to rolling on index. In our case, because the index is 'year' column, so rolling on index is same as rolling on 'year' column

semio added the recipe label Aug 31, 2016

jheeffer mentioned this issue Oct 4, 2016

recipe procedure: groupby #4

Closed

semio added a commit that referenced this issue Dec 12, 2016

new: groupby procedure

2c51b6d

related issue: #25 see the test_groupby.yaml for an example recipe

semio changed the title ~~accumulate() should accept more parameters~~ cumulative functions api design Dec 13, 2016

semio added a commit that referenced this issue Dec 13, 2016

new procedure: windows

3b14540

related issue: #25 Note: there is a bug using groupby with rolling on specific column for now, so we are not using the `on` parameter in rolling. pandas-dev/pandas#13966

semio referenced this issue Dec 13, 2016

release the dimension check

9eed0e8

semio closed this as completed Jan 6, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cumulative functions api design #25

cumulative functions api design #25

semio commented Aug 31, 2016 •

edited

Loading

jheeffer commented Oct 4, 2016 •

edited

Loading

Uh oh!

jheeffer commented Oct 4, 2016 •

edited

Loading

Uh oh!

semio commented Oct 17, 2016 •

edited

Loading

Uh oh!

jheeffer commented Dec 8, 2016

Uh oh!

jheeffer commented Dec 8, 2016 •

edited

Loading

Uh oh!

jheeffer commented Dec 8, 2016 •

edited

Loading

Uh oh!

jheeffer commented Dec 8, 2016

Uh oh!

jheeffer commented Dec 9, 2016 •

edited

Loading

Uh oh!

semio commented Dec 9, 2016

Uh oh!

jheeffer commented Dec 9, 2016 •

edited

Loading

Uh oh!

semio commented Dec 9, 2016 •

edited

Loading

Uh oh!

semio commented Dec 12, 2016 •

edited

Loading

Uh oh!

semio commented Dec 13, 2016 •

edited

Loading

Uh oh!

jheeffer commented Dec 13, 2016

Uh oh!

semio commented Dec 13, 2016 •

edited

Loading

Uh oh!

cumulative functions api design #25

cumulative functions api design #25

Comments

semio commented Aug 31, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

jheeffer commented Oct 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jheeffer commented Oct 4, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

semio commented Oct 17, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jheeffer commented Dec 8, 2016

Uh oh!

jheeffer commented Dec 8, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jheeffer commented Dec 8, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jheeffer commented Dec 8, 2016

Uh oh!

jheeffer commented Dec 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

semio commented Dec 9, 2016

Uh oh!

jheeffer commented Dec 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

groupby

window

advanced: lambda functions

Uh oh!

semio commented Dec 9, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

semio commented Dec 12, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

semio commented Dec 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jheeffer commented Dec 13, 2016

Uh oh!

semio commented Dec 13, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

semio commented Aug 31, 2016 •

edited

Loading

jheeffer commented Oct 4, 2016 •

edited

Loading

jheeffer commented Oct 4, 2016 •

edited

Loading

semio commented Oct 17, 2016 •

edited

Loading

jheeffer commented Dec 8, 2016 •

edited

Loading

jheeffer commented Dec 8, 2016 •

edited

Loading

jheeffer commented Dec 9, 2016 •

edited

Loading

jheeffer commented Dec 9, 2016 •

edited

Loading

semio commented Dec 9, 2016 •

edited

Loading

semio commented Dec 12, 2016 •

edited

Loading

semio commented Dec 13, 2016 •

edited

Loading

semio commented Dec 13, 2016 •

edited

Loading