Commit 9e062df

alamb and Jefffrey authored
Improve documentation for Signature, Volatility, and TypeSignature (#17264)
* Improve documentation for Signature, Volatility, and TypeSignature
* Update datafusion/expr-common/src/signature.rs
  Co-authored-by: Jeffrey Vo <[email protected]>
* Update datafusion/expr-common/src/signature.rs
  Co-authored-by: Jeffrey Vo <[email protected]>
* clarify immutable
* clarify ScalarUdfImpl
* examples of when stable functions will be inlined
* Reduce duplication between signature and TypeSignature

---------

Co-authored-by: Jeffrey Vo <[email protected]>
1 parent cea3ada commit 9e062df

2 files changed: +103 -39 lines changed


datafusion/expr-common/src/signature.rs

Lines changed: 83 additions & 33 deletions
@@ -15,8 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.

-//! Signature module contains foundational types that are used to represent signatures, types,
-//! and return types of functions in DataFusion.
+//! Function signatures: [`Volatility`], [`Signature`] and [`TypeSignature`]

 use std::fmt::Display;
 use std::hash::Hash;
@@ -44,42 +43,90 @@ pub const TIMEZONE_WILDCARD: &str = "+TZ";
 /// valid length. It exists to avoid the need to enumerate all possible fixed size list lengths.
 pub const FIXED_SIZE_LIST_WILDCARD: i32 = i32::MIN;

-/// A function's volatility, which defines the functions eligibility for certain optimizations
+/// How a function's output changes with respect to a fixed input
+///
+/// The volatility of a function determines eligibility for certain
+/// optimizations. You should always define your function to have the strictest
+/// possible volatility to maximize performance and avoid unexpected
+/// results.
+///
 #[derive(Debug, PartialEq, Eq, PartialOrd, Ord, Clone, Copy, Hash)]
 pub enum Volatility {
-    /// An immutable function will always return the same output when given the same
-    /// input. DataFusion will attempt to inline immutable functions during planning.
+    /// Always returns the same output when given the same input.
+    ///
+    /// DataFusion will inline immutable functions during planning.
+    ///
+    /// For example, the `abs` function is immutable, so `abs(-1)` will be
+    /// evaluated and replaced with `1` during planning rather than invoking
+    /// the function at runtime.
     Immutable,
-    /// A stable function may return different values given the same input across different
-    /// queries but must return the same value for a given input within a query. An example of
-    /// this is the `Now` function. DataFusion will attempt to inline `Stable` functions
-    /// during planning, when possible.
-    /// For query `select col1, now() from t1`, it might take a while to execute but
-    /// `now()` column will be the same for each output row, which is evaluated
-    /// during planning.
+    /// May return different values given the same input across different
+    /// queries but must return the same value for a given input within a query.
+    ///
+    /// For example, the `now()` function is stable, because the query `select
+    /// col1, now() from t1`, will return different results each time it is run,
+    /// but within the same query, the output of the `now()` function has the
+    /// same value for each output row.
+    ///
+    /// DataFusion will inline `Stable` functions when possible. For example,
+    /// `Stable` functions are inlined when planning a query for execution, but
+    /// not in View definitions or prepared statements.
     Stable,
-    /// A volatile function may change the return value from evaluation to evaluation.
-    /// Multiple invocations of a volatile function may return different results when used in the
-    /// same query. An example of this is the random() function. DataFusion
-    /// can not evaluate such functions during planning.
-    /// In the query `select col1, random() from t1`, `random()` function will be evaluated
-    /// for each output row, resulting in a unique random value for each row.
+    /// May change the return value from evaluation to evaluation.
+    ///
+    /// Multiple invocations of a volatile function may return different results
+    /// when used in the same query on different rows. An example of this is the
+    /// `random()` function.
+    ///
+    /// DataFusion can not evaluate such functions during planning or push these
+    /// predicates into scans. In the query `select col1, random() from t1`,
+    /// `random()` function will be evaluated for each output row, resulting in
+    /// a unique random value for each row.
     Volatile,
 }

-/// A function's type signature defines the types of arguments the function supports.
+/// The types of arguments for which a function has implementations.
+///
+/// [`TypeSignature`] **DOES NOT** define the types that a user query could call the
+/// function with. DataFusion will automatically coerce (cast) argument types to
+/// one of the supported function signatures, if possible.
 ///
-/// Functions typically support only a few different types of arguments compared to the
-/// different datatypes in Arrow. To make functions easy to use, when possible DataFusion
-/// automatically coerces (add casts to) function arguments so they match the type signature.
+/// # Overview
+/// Functions typically provide implementations for a small number of different
+/// argument [`DataType`]s, rather than all possible combinations. If a user
+/// calls a function with arguments that do not match any of the declared types,
+/// DataFusion will attempt to automatically coerce (add casts to) function
+/// arguments so they match the [`TypeSignature`]. See the [`type_coercion`] module
+/// for more details
 ///
-/// For example, a function like `cos` may only be implemented for `Float64` arguments. To support a query
-/// that calls `cos` with a different argument type, such as `cos(int_column)`, type coercion automatically
-/// adds a cast such as `cos(CAST int_column AS DOUBLE)` during planning.
+/// # Example: Numeric Functions
+/// For example, a function like `cos` may only provide an implementation for
+/// [`DataType::Float64`]. When users call `cos` with a different argument type,
+/// such as `cos(int_column)`, type coercion automatically adds a cast such
+/// as `cos(CAST int_column AS DOUBLE)` during planning.
 ///
-/// # Data Types
+/// [`type_coercion`]: crate::type_coercion
 ///
-/// ## Timestamps
+/// ## Example: Strings
+///
+/// There are several different string types in Arrow, such as
+/// [`DataType::Utf8`], [`DataType::LargeUtf8`], and [`DataType::Utf8View`].
+///
+/// Some functions may have specialized implementations for these types, while others
+/// may be able to handle only one of them. For example, a function that
+/// only works with [`DataType::Utf8View`] would have the following signature:
+///
+/// ```
+/// # use arrow::datatypes::DataType;
+/// # use datafusion_expr_common::signature::{TypeSignature};
+/// // Declares the function must be invoked with a single argument of type `Utf8View`.
+/// // if a user calls the function with `Utf8` or `LargeUtf8`, DataFusion will
+/// // automatically add a cast to `Utf8View` during planning.
+/// let type_signature = TypeSignature::Exact(vec![DataType::Utf8View]);
+///
+/// ```
+///
+/// # Example: Timestamps
 ///
 /// Types to match are represented using Arrow's [`DataType`]. [`DataType::Timestamp`] has an optional variable
 /// timezone specification. To specify a function can handle a timestamp with *ANY* timezone, use
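
The coercion behavior described in the `TypeSignature` doc comments above (the `cos(int_column)` and `Utf8View` examples) can be pictured with a small standalone model. This is only a sketch: `DataType`, `PlannedArg`, `can_cast`, and `coerce_exact` are hypothetical stand-ins written for illustration, not Arrow's or DataFusion's real types, and the cast rules are assumed rather than taken from the `type_coercion` module.

```rust
// Toy model only: stand-ins for Arrow's `DataType` and for the idea behind
// `TypeSignature::Exact`. None of these names are DataFusion's real API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Int64,
    Float64,
    Utf8,
    LargeUtf8,
    Utf8View,
}

/// An argument as the planner would see it after coercion.
#[derive(Debug, PartialEq, Eq)]
pub enum PlannedArg {
    /// The argument already has a type the function implements.
    AsIs(DataType),
    /// The planner wrapped the argument in a cast.
    Cast { from: DataType, to: DataType },
}

/// Assumed cast rules for this sketch (the real rules are more extensive).
fn can_cast(from: DataType, to: DataType) -> bool {
    use DataType::*;
    matches!(
        (from, to),
        (Int64, Float64) | (Utf8, Utf8View) | (LargeUtf8, Utf8View)
    )
}

/// Match the actual argument types against an exact signature, adding casts
/// where the types differ but a cast is allowed.
pub fn coerce_exact(
    exact: &[DataType],
    actual: &[DataType],
) -> Result<Vec<PlannedArg>, String> {
    if exact.len() != actual.len() {
        return Err(format!(
            "expected {} arguments, got {}",
            exact.len(),
            actual.len()
        ));
    }
    exact
        .iter()
        .zip(actual)
        .map(|(&want, &have)| {
            if want == have {
                Ok(PlannedArg::AsIs(have))
            } else if can_cast(have, want) {
                Ok(PlannedArg::Cast { from: have, to: want })
            } else {
                Err(format!("cannot coerce {:?} to {:?}", have, want))
            }
        })
        .collect()
}
```

In this model, calling `coerce_exact(&[DataType::Float64], &[DataType::Int64])` yields a single `Cast`, which corresponds to the planner rewriting `cos(int_column)` into `cos(CAST int_column AS DOUBLE)`. The signature lists what the function implements, not everything a query may pass.
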
@@ -130,8 +177,9 @@ pub enum TypeSignature {
     Exact(Vec<DataType>),
     /// One or more arguments belonging to the [`TypeSignatureClass`], in order.
     ///
-    /// [`Coercion`] contains not only the desired type but also the allowed casts.
-    /// For example, if you expect a function has string type, but you also allow it to be casted from binary type.
+    /// [`Coercion`] contains not only the desired type but also the allowed
+    /// casts. For example, if you expect a function has string type, but you
+    /// also allow it to be casted from binary type.
     ///
     /// For functions that take no arguments (e.g. `random()`) see [`TypeSignature::Nullary`].
     Coercible(Vec<Coercion>),
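
The `Coercible` comment above says that a `Coercion` carries both a desired type and the casts it allows, for example a string argument that may also be cast from binary. A minimal sketch of that idea follows. `TypeClass`, this toy `Coercion` struct, and `signature_accepts` are hypothetical names for illustration and do not reproduce DataFusion's actual `Coercion` or `TypeSignatureClass` definitions.

```rust
// Toy model only: illustrates "desired type plus allowed casts".
// These definitions are assumptions for this sketch, not DataFusion's API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeClass {
    String,
    Binary,
    Integer,
}

/// One argument slot: the class the function wants, plus the classes
/// the planner is allowed to cast from.
pub struct Coercion {
    pub desired: TypeClass,
    pub allowed_sources: Vec<TypeClass>,
}

impl Coercion {
    /// True if `actual` is the desired class or may be cast to it.
    pub fn accepts(&self, actual: TypeClass) -> bool {
        actual == self.desired || self.allowed_sources.contains(&actual)
    }
}

/// Check every argument, in order, against its slot.
pub fn signature_accepts(signature: &[Coercion], actual: &[TypeClass]) -> bool {
    signature.len() == actual.len()
        && signature.iter().zip(actual).all(|(slot, &arg)| slot.accepts(arg))
}
```

With one slot that desires `String` and allows `Binary`, this model accepts a string or binary argument and rejects an integer one, which matches the example in the doc comment.
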
@@ -206,7 +254,7 @@ impl TypeSignature {
 /// just listing specific DataTypes. For example, TypeSignatureClass::Timestamp matches any timestamp
 /// type regardless of timezone or precision.
 ///
-/// Used primarily with TypeSignature::Coercible to define function signatures that can accept
+/// Used primarily with [`TypeSignature::Coercible`] to define function signatures that can accept
 /// arguments that can be coerced to a particular class of types.
 #[derive(Debug, Clone, Eq, PartialEq, PartialOrd, Hash)]
 pub enum TypeSignatureClass {
@@ -736,10 +784,12 @@ impl Hash for ImplicitCoercion {
     }
 }

-/// Defines the supported argument types ([`TypeSignature`]) and [`Volatility`] for a function.
+/// Provides information necessary for calling a function.
+///
+/// - [`TypeSignature`] defines the argument types that a function has implementations
+/// for.
 ///
-/// DataFusion will automatically coerce (cast) argument types to one of the supported
-/// function signatures, if possible.
+/// - [`Volatility`] defines how the output of the function changes with the input.
 #[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Hash)]
 pub struct Signature {
     /// The data types that the function accepts. See [TypeSignature] for more information.
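
The `Volatility` doc comments in this file amount to a small decision rule about when a call can be evaluated once during planning. The sketch below restates that rule as standalone Rust. The `Volatility` enum here only mirrors the variant names and order from the diff, while `PlanContext` and `can_inline` are hypothetical helpers invented for this illustration and are not part of DataFusion.

```rust
// Toy model only: mirrors the variant names in the diff but is not the real
// `datafusion_expr_common::signature::Volatility`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Volatility {
    Immutable,
    Stable,
    Volatile,
}

/// Hypothetical planning contexts, named after the cases the doc comment lists.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PlanContext {
    /// Planning a query that is about to be executed.
    Execution,
    /// Storing a plan for later use (a view definition or prepared statement).
    Deferred,
}

/// Returns true if a call with constant arguments may be evaluated once
/// during planning and replaced by its result.
pub fn can_inline(volatility: Volatility, context: PlanContext) -> bool {
    match (volatility, context) {
        // `abs(-1)` can always be replaced with `1`.
        (Volatility::Immutable, _) => true,
        // `now()` is fixed within one query, but not across stored plans.
        (Volatility::Stable, PlanContext::Execution) => true,
        (Volatility::Stable, PlanContext::Deferred) => false,
        // `random()` must be evaluated for every row at runtime.
        (Volatility::Volatile, _) => false,
    }
}
```

Because the variants are declared from strictest to loosest and the enum derives `Ord`, `Volatility::Immutable < Volatility::Stable < Volatility::Volatile` holds in this model. That ordering is one way to read the advice to declare the strictest volatility a function can honestly claim.
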

datafusion/expr/src/udf.rs

Lines changed: 20 additions & 6 deletions
@@ -486,18 +486,32 @@ pub trait ScalarUDFImpl: Debug + DynEq + DynHash + Send + Sync {
         ))
     }

-    /// Returns the function's [`Signature`] for information about what input
-    /// types are accepted and the function's Volatility.
+    /// Returns a [`Signature`] describing the argument types for which this
+    /// function has an implementation, and the function's [`Volatility`].
+    ///
+    /// See [`Signature`] for more details on argument type handling
+    /// and [`Self::return_type`] for computing the return type.
+    ///
+    /// [`Volatility`]: datafusion_expr_common::signature::Volatility
     fn signature(&self) -> &Signature;

-    /// What [`DataType`] will be returned by this function, given the types of
-    /// the arguments.
+    /// [`DataType`] returned by this function, given the types of the
+    /// arguments.
+    ///
+    /// # Arguments
+    ///
+    /// `arg_types` Data types of the arguments. The implementation of
+    /// `return_type` can assume that some other part of the code has coerced
+    /// the actual argument types to match [`Self::signature`].
     ///
     /// # Notes
     ///
     /// If you provide an implementation for [`Self::return_field_from_args`],
-    /// DataFusion will not call `return_type` (this function). In such cases
-    /// is recommended to return [`DataFusionError::Internal`].
+    /// DataFusion will not call `return_type` (this function). While it is
+    /// valid to put [`unimplemented!()`] or [`unreachable!()`], it is
+    /// recommended to return [`DataFusionError::Internal`] instead, which
+    /// reduces the severity of symptoms if bugs occur (an error rather than a
+    /// panic).
     ///
     /// [`DataFusionError::Internal`]: datafusion_common::DataFusionError::Internal
     fn return_type(&self, arg_types: &[DataType]) -> Result<DataType>;
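
The note on `return_type` recommends returning an internal error rather than panicking when the method is not expected to be called. The following standalone sketch shows the shape of that choice. `PlanError`, the toy `DataType`, and `MyUdf` are hypothetical names for illustration, and the methods only imitate the roles of `return_type` and `return_field_from_args` without using the real trait or its argument types.

```rust
// Toy model only: `PlanError`, `DataType`, and `MyUdf` are assumptions made
// for this sketch and do not come from DataFusion.
#[derive(Debug, PartialEq, Eq)]
pub enum PlanError {
    /// Stands in for an "internal error" variant: a bug, not a user mistake.
    Internal(String),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Int64,
    Float64,
}

pub struct MyUdf;

impl MyUdf {
    /// The method that is not supposed to be called for this function.
    /// Returning an error (instead of `unimplemented!()`) means a stray
    /// call surfaces as a failed query rather than a crashed process.
    pub fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType, PlanError> {
        Err(PlanError::Internal(
            "return_type should not be called; use return_field_from_args".to_string(),
        ))
    }

    /// The method the planner is expected to use. It can assume the
    /// argument types were already coerced to match the signature.
    pub fn return_field_from_args(
        &self,
        arg_types: &[DataType],
    ) -> Result<DataType, PlanError> {
        arg_types
            .first()
            .copied()
            .ok_or_else(|| PlanError::Internal("expected one argument".to_string()))
    }
}
```

A caller that reaches `return_type` by mistake gets an `Err` it can report and recover from, which is the "error rather than a panic" outcome the doc comment describes.
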
