@@ -111,14 +111,40 @@ def materialize(
:param lazy:
Whether this task is lazy or not.

- Unlike a normal task, lazy tasks always get executed. However, if a lazy
- task produces a lazy table (e.g. a SQL query), the table store checks if
- the same query has been executed before. If this is the case, then the
- query doesn't get executed, and instead, the table gets copied from the cache.
+ Unlike a normal task, lazy tasks always get executed. However, before the table
+ returned by a lazy task gets materialized, the table store checks whether
+ the same table has been materialized before. If this is the case, then the
+ table doesn't get materialized again, and instead, it gets copied from the cache.
+
+ This is efficient for tasks that return SQL queries, because the query
+ is only generated, but not executed again, if the resulting table is cache-valid.
+
+ The same works for :py:class:`ExternalTableReference <pydiverse.pipedag.container.ExternalTableReference>`,
+ where the "query" is just the identifier of the table in the store.
+
+ .. Note:: For tasks returning an ``ExternalTableReference``, pipedag cannot automatically
+ know whether the external table has changed or not. This should be controlled via a cache function
+ passed via the ``cache`` argument of ``materialize``.
+ See :py:class:`ExternalTableReference <pydiverse.pipedag.container.ExternalTableReference>`
+ for an example.
+
+
+ For tasks returning a Polars DataFrame, the output is deemed cache-valid
+ if the hash of the resulting DataFrame matches the hash from the previous run.
+ So, even though the task always gets executed, downstream tasks can remain cache-valid
+ if the DataFrame is the same as before. This is useful for small tasks that are hard to
+ implement using only LazyFrames, but where the DataFrame generation is cheap.
+
+
+
+ In both cases, you don't need to manually bump the ``version`` of a lazy task.
+
+ .. Warning:: A task returning a Polars LazyFrame should `not` be marked as lazy.
+ Use ``version=AUTO_VERSION`` instead. See :py:class:`AUTO_VERSION`.
+ .. Warning:: A task returning a Pandas DataFrame should `not` be marked as lazy.
+ No hashing is implemented for Pandas DataFrames, so the task will always
+ be deemed cache-invalid, and thus, cache-invalidate all downstream tasks.

- This behaviour is very useful, because you don't need to manually bump
- the `version` of a lazy task. This only works because for lazy tables
- generating the query is very cheap compared to executing it.
:param group_node_tag:
Set a tag that may add this task to a configuration based group node.
:param nout:
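
To illustrate the behaviour the new docstring describes for lazy tasks, here is a minimal conceptual sketch in plain Python. It is not pipedag's implementation: `TinyTableStore`, `store_lazy`, `lazy_task`, and `fake_execute` are hypothetical names invented for this example, and the fingerprint is assumed to be a simple SHA-256 of the query text. The point is only that the task body runs every time (it is cheap), while the expensive materialization is skipped when the same query has been seen before.

```python
import hashlib


class TinyTableStore:
    """Toy stand-in for a table store (hypothetical, not pipedag's code)."""

    def __init__(self):
        self._cache = {}  # fingerprint of the query -> materialized rows
        self.materialize_count = 0  # how often the expensive step actually ran

    def store_lazy(self, query, execute):
        # Assumption: the query text alone identifies the resulting table.
        fingerprint = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if fingerprint in self._cache:
            # Cache-valid: copy the table from the cache, do not execute.
            return list(self._cache[fingerprint])
        # Cache-invalid: run the expensive materialization and remember it.
        self.materialize_count += 1
        rows = execute(query)
        self._cache[fingerprint] = list(rows)
        return list(rows)


def lazy_task():
    # A lazy task always runs, but it only *generates* the query.
    return "SELECT 1 AS x"


def fake_execute(query):
    # Placeholder for actually running the query against a database.
    return [(1,)]


store = TinyTableStore()
first = store.store_lazy(lazy_task(), fake_execute)   # materializes
second = store.store_lazy(lazy_task(), fake_execute)  # served from the cache
```

In this sketch, changing the query text yields a new fingerprint and triggers a fresh materialization, which mirrors why the ``version`` of a lazy task does not need to be bumped by hand.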