cumulative functions api design #25
Comments
I think there's no groupby for accumulative functions, just the column over which to accumulate (let's call that Starting data
Cumulative sum over
You could even imagine a cumulative sum over
But a cumulative function over something non-chronological seems kind of nonsensical? I can't think of a use case for that. Furthermore, I can't think of what 'cumulative sum over time and group by only geo' would mean. The only thing that makes sense to me is: first group by year and geo, summing co2 and collapsing gender, then do a cumulative sum over time. But that's just a groupby followed by a cumsum. The
group_by: ['geo','gender'],
accumulate_over: 'time'
Having both would be redundant though. My preference goes to accumulate_over; it seems clearer to me. So I think we can have the following configuration:
{
procedure: 'accumulate',
ingredients: 'sg-datapoints',
result: 'sg-datapoints-accumulated',
options: {
accumulate_over: 'time',
accumulate: {
co2_emission: 'cumsum',
income: {
function: 'aagr',
window: 10
}
}
}
}
The accumulate procedure can only be done on datapoints.
Questions:
Description of cumulative functions
E.g. start with the below table and procedure: cumulative sum of co2 over year
{
procedure: 'accumulate',
options: {
accumulate_over: 'time',
accumulate: {
co2: 'cumsum',
}
}
}
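A minimal pandas sketch of what this configuration might compute. The table, the `accumulate` helper and its parameter names are hypothetical, and treating every other key (`geo` here) as an implicit group is an assumption, not something the recipe format defines:

```python
import pandas as pd

# Hypothetical datapoints keyed by geo and time (values are made up,
# rows deliberately out of chronological order).
df = pd.DataFrame({
    "geo": ["swe", "swe", "swe", "usa", "usa", "usa"],
    "time": [2002, 2000, 2001, 2001, 2000, 2002],
    "co2": [3.0, 1.0, 2.0, 20.0, 10.0, 30.0],
})


def accumulate(frame, accumulate_over, column, keys):
    """Cumulative sum of `column` along `accumulate_over`, per group of `keys`."""
    # Sort first so the running total follows chronological order.
    ordered = frame.sort_values(keys + [accumulate_over]).reset_index(drop=True)
    ordered[column] = ordered.groupby(keys)[column].cumsum()
    return ordered


result = accumulate(df, accumulate_over="time", column="co2", keys=["geo"])
print(result)
```

The running total restarts for each `geo`, which is the behaviour the "group by all other keys, then accumulate over time" reading implies.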
The rows that
Then, the
|
Windowed function over multiple columns This is doable, following the logic described above. First sort by the In this case the order of
However I have no clue when you would use something like this, so it doesn't need to be supported (until we find a use case). |
Thanks for the detailed explanation! I understand there are a few problems with the current API:
I agree with your points, so I am wondering whether we can have a more general way to apply functions to an ingredient. In pandas, there are a few ways to do this:
I think we can follow this setup and use the name {
procedure: 'apply',
ingredients: ['cdiac-datapoints'],
options: {
sort_by: ['year','geo'] || ['geo','year'],
apply: {
co2: 'cumsum'
}
}
}
As discussed in #4, we can return a grouped datapoint without aggregating. So we can run groupby first and then apply a function to the grouped result:
[
{
procedure: 'groupby',
ingredients: ['cdiac-datapoints'],
options: {
groupby: 'geo'
},
result: 'cdiac-datapoints-grouped'
},
{
procedure: 'apply',
ingredients: ['cdiac-datapoints-grouped'],
options: {
sort_by: ['year','geo'] || ['geo','year'],
apply: {
co2: 'cumsum'
}
}
}
]
For windowed functions, we can treat it as a function that has a
[
{
procedure: 'groupby',
ingredients: ['cdiac-datapoints'],
options: {
groupby: 'geo'
},
result: 'cdiac-datapoints-grouped'
},
{
procedure: 'apply',
ingredients: ['cdiac-datapoints-grouped'],
options: {
sort_by: ['year','geo'] || ['geo','year'],
apply: {
co2: {
function: 'aagr',
window: 10
}
}
}
}
]
Basically the above is the same as what you recommended, but I just changed the name to |
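For reference, a rough pandas equivalent of the two-step recipe above (group by `geo`, then apply a windowed function to `co2`). The thread never defines `aagr`; this sketch assumes it means the average of year-over-year growth rates over the window, and it uses a window of 2 rather than 10 only so that the made-up data stays small:

```python
import pandas as pd

# Made-up datapoints; rows are deliberately out of order.
df = pd.DataFrame({
    "geo": ["usa", "swe", "swe", "usa", "swe", "usa"],
    "year": [2001, 2000, 2002, 2000, 2001, 2002],
    "co2": [100.0, 100.0, 121.0, 200.0, 110.0, 150.0],
})


def aagr(series, window):
    # ASSUMPTION: "aagr" = mean of period-over-period growth rates
    # across the last `window` periods.
    growth = series / series.shift(1) - 1
    return growth.rolling(window).mean()


# sort_by: ['geo', 'year'], then apply the function within each geo group.
ordered = df.sort_values(["geo", "year"]).reset_index(drop=True)
ordered["co2_aagr"] = ordered.groupby("geo")["co2"].transform(
    lambda s: aagr(s, window=2)
)
print(ordered)
```

Rows without a full window of growth rates come out as NaN, which follows the pandas default of `min_periods` equal to the window size.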
I'll read into this a bit more. I think we really should try to follow pandas here; they thought this out pretty well. Maybe we can support a subset of their window functions: http://pandas.pydata.org/pandas-docs/stable/api.html#window |
In general I have more and more the feeling we're creating a declarative layer on top of the pandas library, with some extra functions (trend bridges) and input formats (ddf, dictionaries). Not necessarily a bad thing though, but good to have in mind. |
Okay, I dove into the pandas functions. The first thing I noticed, from the docs and from your examples above, is that your conception of groupby is different from mine, probably because of our backgrounds (me SQL, you Python/pandas). In pandas, running groupby on a dataframe returns a 'grouping' object. You can then apply different functions to that grouping: you can transform, aggregate or filter per group.
I like the additional functionality of groupby in pandas. However, we cannot have a grouping object be the output of a procedure. For simplicity's sake, all procedures should always output ingredients (dataframes). Otherwise you get into typing ingredients, which adds a new layer of complexity.
The same goes for window functions, both rolling windows and expanding windows: the definition of the window and the functions to apply should be in one procedure. On the note of expanding windows: a cumsum is very similar to a sum over an expanding window, except for the handling of NaNs (see note in this section). |
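To make the NaN remark concrete, here is a small sketch comparing `cumsum` with a sum over an expanding window on a made-up series containing one missing value:

```python
import pandas as pd

# Made-up series with a missing value in the middle.
s = pd.Series([1.0, float("nan"), 3.0])

# cumsum skips the NaN when adding, but leaves NaN at that position.
cumulative = s.cumsum()

# An expanding sum (default min_periods=1) reports the sum of the values
# seen so far, so the NaN position gets the running total instead.
expanding = s.expanding().sum()

print(pd.DataFrame({"value": s, "cumsum": cumulative, "expanding_sum": expanding}))
```

Both end at the same total; they differ only in what is reported at the position of the missing value.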
Now, let's define declarative procedures for these : ) |
One group by with aggregate, transform, filter option (only one allowed) |
I see, thanks for making it clear :) yes, it'd be good if you can write the specs, would make it easier for me to implement them :) |
Both groupby and window define a group of rows on which to apply a certain function. They differ in how they define that group (static grouping vs changing window), what type of functions they allow on this group (aggregate, transform or filter vs just aggregate) and how the results of the functions are saved (back to the group vs to one position in the group (edge/center of window))
groupby
- procedure: groupby
  ingredient: population_and_foo_by_year_country_age_gender_education
  options:
    groupby: ["year","country"] || country
    aggregate:
      population: sum
      foo:
        function: bar
        param1: baz
    transform:
      population: foo
    filter:
      population: foo
  result: population_by_year_country
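The three option types in the groupby spec map onto the three things pandas lets you do with a grouping. A small sketch with made-up data; the filter threshold is arbitrary and only there for illustration:

```python
import pandas as pd

# Made-up population datapoints by year, country and gender.
df = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000],
    "country": ["swe", "swe", "usa", "usa"],
    "gender": ["f", "m", "f", "m"],
    "population": [4, 5, 140, 142],
})

grouped = df.groupby(["year", "country"])

# aggregate: one row per group (gender is collapsed).
aggregated = grouped["population"].sum().reset_index()

# transform: same shape as the input, each row gets its group's result.
transformed = df.assign(population=grouped["population"].transform("sum"))

# filter: keep or drop whole groups based on a per-group condition.
filtered = grouped.filter(lambda g: g["population"].sum() > 100)

print(aggregated, transformed, filtered, sep="\n\n")
```

Only `aggregate` changes the set of keys of the result, which is one reason for allowing just one of the three options per procedure.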
window
- procedure: window
  ingredient: immigration_surplus_by_year_country_gender
  options:
    window:
      column: year  # column the window is created from ("on" parameter in df.rolling)
      size: 5 || "expanding"  # positive integer or time offset: rolling window; "expanding": expanding window
      min_periods: 1  # optional, as in pandas
      center: false  # optional, as in pandas
    aggregate:
      immigration_surplus: sum
      foo:
        function: bar
        param1: baz
  result: cumulative_immigration_surplus
df.groupby(by=["country","gender"]).expanding(on="year").sum()
advanced: lambda functions
Both groupby and window have this function section, where we define what functions to run on what column.
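One way the window spec above could be interpreted, as a sketch only. The `apply_window` helper, its parameters and the `ALLOWED` set are hypothetical. It sorts by the window column instead of passing it to pandas, and it restricts function names to a small whitelist:

```python
import pandas as pd

ALLOWED = {"sum", "mean", "min", "max"}  # hypothetical whitelist


def apply_window(frame, keys, column, size, aggregate, min_periods=1, center=False):
    """Apply windowed functions per group of `keys`, ordered by `column`."""
    ordered = frame.sort_values(keys + [column]).reset_index(drop=True)

    def windowed(series, func):
        if func not in ALLOWED:
            raise ValueError(f"unsupported window function: {func}")
        if size == "expanding":
            win = series.expanding(min_periods=min_periods)
        else:
            win = series.rolling(size, min_periods=min_periods, center=center)
        return getattr(win, func)()

    for value_column, func in aggregate.items():
        ordered[value_column] = ordered.groupby(keys)[value_column].transform(
            lambda s, f=func: windowed(s, f)
        )
    return ordered


# Made-up immigration surplus datapoints.
df = pd.DataFrame({
    "country": ["swe", "swe", "swe", "usa", "usa", "usa"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "immigration_surplus": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

expanding = apply_window(df, ["country"], "year", "expanding",
                         {"immigration_surplus": "sum"})
rolling = apply_window(df, ["country"], "year", 2,
                       {"immigration_surplus": "sum"})
print(expanding, rolling, sep="\n\n")
```

With `size: "expanding"` this reproduces a per-country cumulative sum; with an integer size it becomes a rolling sum over that many rows.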
|
groupby and window look good to me; I will try to implement them. For lambda functions, yes, security is the main concern. This is mainly because we need to use I will dig into the lambda issue a bit later and see if there is a good way to do this. |
Because running Python code through eval() is not safe, and there is no other way to run Python code in a recipe, I think we'd better just provide some pre-defined functions for recipe users. If we want the ability to create custom functions, we could add a function definition block to the recipe and parse it, just without including Python code in the recipe. But of course, if we choose to trust the people writing recipes, we can enable this feature in recipes. |
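A sketch of the "pre-defined functions" idea: recipe strings are looked up in a fixed registry rather than evaluated. The registry contents, the `resolve` helper and the `scaled_sum` function with its `factor` parameter are all hypothetical names chosen for illustration:

```python
def _sum(values):
    return sum(values)


def _mean(values):
    return sum(values) / len(values)


def _scaled_sum(values, factor=1):
    return factor * sum(values)


# Only names listed here can be used from a recipe (hypothetical registry).
FUNCTIONS = {"sum": _sum, "mean": _mean, "scaled_sum": _scaled_sum}


def resolve(spec):
    """Turn a recipe entry into a callable without using eval().

    `spec` is either a function name ("sum") or a mapping such as
    {"function": "scaled_sum", "factor": 2}.
    """
    if isinstance(spec, str):
        name, params = spec, {}
    else:
        params = dict(spec)
        name = params.pop("function")
    if name not in FUNCTIONS:
        raise ValueError(f"unknown function in recipe: {name!r}")
    func = FUNCTIONS[name]
    return lambda values: func(values, **params)


total = resolve("sum")([1, 2, 3])
scaled = resolve({"function": "scaled_sum", "factor": 2})([1, 2, 3])
print(total, scaled)
```

Anything not in the registry is rejected, so a recipe cannot smuggle in arbitrary code, at the cost of every new function needing a code change.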
related issue: #25 see the test_groupby.yaml for an example recipe
related issue: #25 Note: there is a bug using groupby with rolling on specific column for now, so we are not using the `on` parameter in rolling. pandas-dev/pandas#13966
On the window function: there is currently a bug when using groupby with rolling on a specific column, so I don't use the `on` parameter. This should have little impact on us, because we have sorted our dimensions and we group by all other keys, so the order will be the same as if we were rolling on the target column. |
How does the script choose which column to roll on then? |
We choose by grouping by all keys except the column to roll on. For example, if the dataframe has index columns ['geo', 'gender', 'year'] and the column to roll on is 'year', then:
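A minimal sketch of that approach, assuming a made-up `value` column and a rolling sum of size 2. It is an illustration of the idea, not the code from the repository:

```python
import pandas as pd

# Made-up datapoints indexed by geo, gender and year, rows out of order.
df = pd.DataFrame({
    "geo": ["swe", "swe", "swe", "swe", "swe", "swe"],
    "gender": ["m", "f", "f", "m", "f", "m"],
    "year": [2001, 2000, 2002, 2000, 2001, 2002],
    "value": [20.0, 1.0, 3.0, 10.0, 2.0, 30.0],
}).set_index(["geo", "gender", "year"]).sort_index()

roll_on = "year"

# Group by every index level except the one we roll on.
group_keys = [name for name in df.index.names if name != roll_on]

# Because the index is sorted, rows inside each group are already in
# year order, so a plain positional rolling window follows the years.
rolled = df.groupby(level=group_keys)["value"].transform(
    lambda s: s.rolling(2, min_periods=1).sum()
)
print(rolled)
```

This relies on the data having exactly one row per year within each group; with gaps in the years, a positional window would no longer correspond to a fixed time span.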
|
current format of accumulate:
but problem is:
To fix these problems we should add more options to the procedure, such as
EDIT: If we always need to groupby before accumulate, maybe combine the accumulate procedure into groupby?