Commit a8373d2

docs: refine AggregateUDFImpl::is_ordered_set_aggregate documentation (#17805)
Going through some tickets related to ordered set aggregates and got a little confused on DataFusion's support for them. As I understand it, #13511 made `WITHIN GROUP` mandatory for ordered set aggregate functions, of which we support only two so far:

- `approx_percentile_cont`
  - Technically `approx_median` shares some internals with `approx_percentile_cont` but itself isn't an ordered set aggregation
- `approx_percentile_cont_with_weight` (which uses `approx_percentile_cont` internally)

This was then amended in #16999 to make it optional, at least via the SQL API; it is still mandatory on the DataFrame API:

https://github.com/apache/datafusion/blob/bbb5cc79de3d037d0b06572ff417de7c3d9fe437/datafusion/functions-aggregate/src/approx_percentile_cont.rs#L53-L58

I'm updating the doc here to try to clarify things to my understanding, as a followup to the original doc update: #17744
1 parent 22c4214 commit a8373d2
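
As a rough illustration of the flag discussed in the commit message, here is a minimal sketch using a stand-in trait. It is not DataFusion's actual `AggregateUDFImpl`; the names `AggregateFlags`, `Sum`, `PercentileCont` and `check_within_group` are hypothetical, and the real planner check may look quite different. It only shows the documented contract: the default is `false`, `WITHIN GROUP` is rejected when the flag is `false`, and the clause stays optional when the flag is `true`.

```rust
/// Hypothetical stand-in for the relevant slice of `AggregateUDFImpl`.
trait AggregateFlags {
    fn name(&self) -> &str;

    /// Mirrors the documented default: not an ordered-set aggregate.
    fn is_ordered_set_aggregate(&self) -> bool {
        false
    }
}

/// Ordinary aggregate: keeps the default of `false`.
struct Sum;

/// Ordered-set aggregate: opts in by overriding the default.
struct PercentileCont;

impl AggregateFlags for Sum {
    fn name(&self) -> &str {
        "sum"
    }
}

impl AggregateFlags for PercentileCont {
    fn name(&self) -> &str {
        "percentile_cont"
    }

    fn is_ordered_set_aggregate(&self) -> bool {
        true
    }
}

/// Hypothetical planner-side check (an assumption, not DataFusion code):
/// `WITHIN GROUP` is accepted only for ordered-set aggregates, and a call
/// without it is always allowed.
fn check_within_group(
    func: &dyn AggregateFlags,
    has_within_group: bool,
) -> Result<(), String> {
    if has_within_group && !func.is_ordered_set_aggregate() {
        return Err(format!(
            "WITHIN GROUP is not supported for {}",
            func.name()
        ));
    }
    Ok(())
}
```

Under these assumptions, `check_within_group(&PercentileCont, true)` and `check_within_group(&PercentileCont, false)` both succeed, while `check_within_group(&Sum, true)` returns an error.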

1 file changed, +43 -12 lines changed

datafusion/expr/src/udaf.rs

Lines changed: 43 additions & 12 deletions
@@ -746,21 +746,52 @@ pub trait AggregateUDFImpl: Debug + DynEq + DynHash + Send + Sync {
         true
     }
 
-    /// If this function is ordered-set aggregate function, return true
-    /// otherwise, return false
+    /// If this function is an ordered-set aggregate function, return `true`.
+    /// Otherwise, return `false` (default).
     ///
-    /// Ordered-set aggregate functions require an explicit `ORDER BY` clause
-    /// because the calculation performed by these functions is dependent on the
-    /// specific sequence of the input rows, unlike other aggregate functions
-    /// like `SUM`, `AVG`, or `COUNT`.
+    /// Ordered-set aggregate functions allow specifying a sort order that affects
+    /// how the function calculates its result, unlike other aggregate functions
+    /// like `SUM` or `COUNT`. For example, `percentile_cont` is an ordered-set
+    /// aggregate function that calculates the exact percentile value from a list
+    /// of values; the output of calculating the `0.75` percentile depends on if
+    /// you're calculating on an ascending or descending list of values.
     ///
-    /// An example of an ordered-set aggregate function is `percentile_cont`
-    /// which computes a specific percentile value from a sorted list of values, and
-    /// is only meaningful when the input data is ordered.
+    /// Setting this to return `true` affects only SQL parsing & planning; it allows
+    /// use of the `WITHIN GROUP` clause to specify this order, for example:
     ///
-    /// In SQL syntax, ordered-set aggregate functions are used with the
-    /// `WITHIN GROUP (ORDER BY ...)` clause to specify the ordering of the input
-    /// data.
+    /// ```sql
+    /// -- Ascending
+    /// SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY c1 ASC) FROM table;
+    /// -- Default ordering is ascending if not explicitly specified
+    /// SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY c1) FROM table;
+    /// -- Descending
+    /// SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY c1 DESC) FROM table;
+    /// ```
+    ///
+    /// This calculates the `0.75` percentile of the column `c1` from `table`,
+    /// according to the specific ordering. The column specified in the `WITHIN GROUP`
+    /// ordering clause is taken as the column to calculate values on; specifying
+    /// the `WITHIN GROUP` clause is optional so these queries are equivalent:
+    ///
+    /// ```sql
+    /// -- If no WITHIN GROUP is specified then default ordering is implementation
+    /// -- dependent; in this case ascending for percentile_cont
+    /// SELECT percentile_cont(c1, 0.75) FROM table;
+    /// SELECT percentile_cont(0.75) WITHIN GROUP (ORDER BY c1 ASC) FROM table;
+    /// ```
+    ///
+    /// Aggregate UDFs can define their default ordering if the function is called
+    /// without the `WITHIN GROUP` clause, though a default of ascending is the
+    /// standard practice.
+    ///
+    /// Note that setting this to `true` does not guarantee input sort order to
+    /// the aggregate function; it expects the function to handle ordering the
+    /// input values themselves (e.g. `percentile_cont` must buffer and sort
+    /// the values internally). That is, DataFusion does not introduce any kind
+    /// of sort into the plan for these functions.
+    ///
+    /// Setting this to `false` disallows calling this function with the `WITHIN GROUP`
+    /// clause.
     fn is_ordered_set_aggregate(&self) -> bool {
         false
     }
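
The new doc comment notes that an ordered-set aggregate has to buffer and sort its own input, and that the `0.75` percentile differs between ascending and descending order. The following is a minimal, standard-library-only sketch of that idea. It assumes the usual linear-interpolation definition of `percentile_cont`; `PercentileBuffer` and `SortOrder` are hypothetical names, and this is not DataFusion's implementation.

```rust
/// Sort direction, as would be requested via
/// `WITHIN GROUP (ORDER BY ... ASC | DESC)`.
#[derive(Clone, Copy)]
enum SortOrder {
    Ascending,
    Descending,
}

/// Hypothetical accumulator: input arrives unsorted, so every value is buffered.
struct PercentileBuffer {
    values: Vec<f64>,
}

impl PercentileBuffer {
    fn new() -> Self {
        Self { values: Vec::new() }
    }

    /// Append a batch of input values in whatever order they arrive.
    fn update(&mut self, batch: &[f64]) {
        self.values.extend_from_slice(batch);
    }

    /// Sort the buffered values in the requested direction, then linearly
    /// interpolate at `percentile` (expected to be within 0.0..=1.0).
    /// Returns `None` for empty input or an out-of-range percentile.
    fn evaluate(&self, percentile: f64, order: SortOrder) -> Option<f64> {
        if self.values.is_empty() || !(0.0..=1.0).contains(&percentile) {
            return None;
        }

        // The sort happens here, inside the aggregate, not in the query plan.
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        if matches!(order, SortOrder::Descending) {
            sorted.reverse();
        }

        // Fractional position of the percentile within the sorted list.
        let rank = percentile * (sorted.len() - 1) as f64;
        let lower = rank.floor() as usize;
        let upper = rank.ceil() as usize;
        let fraction = rank - lower as f64;

        Some(sorted[lower] + fraction * (sorted[upper] - sorted[lower]))
    }
}
```

For the buffered values `10, 20, 30, 40`, evaluating at `0.75` gives `32.5` in ascending order and `17.5` in descending order, which is the dependence on direction that the doc comment describes.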
