@@ -109,14 +109,40 @@ def materialize(
:param lazy:
Whether this task is lazy or not.

- Unlike a normal task, lazy tasks always get executed. However, if a lazy
- task produces a lazy table (e.g. a SQL query), the table store checks if
- the same query has been executed before. If this is the case, then the
- query doesn't get executed, and instead, the table gets copied from the cache.
+ Unlike a normal task, lazy tasks always get executed. However, before the table
+ returned by a lazy task gets materialized, the table store checks if
+ the same table has been materialized before. If this is the case, then the
+ table doesn't get materialized again; instead, it gets copied from the cache.
+
+ This is efficient for tasks that return SQL queries, because the query
+ is only generated and is not executed again if the resulting table is cache-valid.
+
+ This also works for :py:class:`ExternalTableReference <pydiverse.pipedag.container.ExternalTableReference>`,
+ where the "query" is just the identifier of the table in the store.
+
+ .. Note:: For tasks returning an ``ExternalTableReference``, pipedag cannot automatically
+ know whether the external table has changed. This should be controlled via a cache function
+ passed via the ``cache`` argument of ``materialize``.
+ See :py:class:`ExternalTableReference <pydiverse.pipedag.container.ExternalTableReference>`
+ for an example.
+
+
+ For tasks returning a Polars DataFrame, the output is deemed cache-valid
+ if the hash of the resulting DataFrame is the same as the hash of the previous run.
+ So, even though the task always gets executed, downstream tasks can remain cache-valid
+ if the DataFrame is the same as before. This is useful for small tasks that are hard to
+ implement using only LazyFrames, but where the DataFrame generation is cheap.
+
+
+
+ In both cases, you don't need to manually bump the ``version`` of a lazy task.
+
+ .. Warning:: A task returning a Polars LazyFrame should `not` be marked as lazy.
+ Use ``version=AUTO_VERSION`` instead. See :py:class:`AUTO_VERSION`.
+ .. Warning:: A task returning a Pandas DataFrame should `not` be marked as lazy.
+ No hashing is implemented for Pandas DataFrames, so the task will always
+ be deemed cache-invalid and will thus cache-invalidate all downstream tasks.

- This behaviour is very useful, because you don't need to manually bump
- the `version` of a lazy task. This only works because for lazy tables
- generating the query is very cheap compared to executing it.
:param group_node_tag:
Set a tag that may add this task to a configuration based group node.
:param nout:
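
The caching behaviour described in the new docstring can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: `ToyTableStore`, `lazy_task`, and `fake_execute` are hypothetical names and not part of pipedag, and keying the cache on a hash of the query text is an assumption, since the diff does not show how the real table store decides that a table "has been materialized before".

```python
import hashlib


class ToyTableStore:
    """Hypothetical stand-in for a table store; NOT pipedag's real API."""

    def __init__(self):
        self.cache = {}      # query hash -> rows of the materialized table
        self.executions = 0  # how often a query was actually executed

    def materialize(self, query, execute):
        # The lazy task has already run and produced `query`;
        # only the expensive execution step can be skipped here.
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key in self.cache:
            # Cache hit: copy the stored table instead of executing again.
            return list(self.cache[key])
        self.executions += 1
        rows = execute(query)
        self.cache[key] = rows
        return list(rows)


def lazy_task(threshold):
    # Cheap step: builds the query text and touches no data.
    return f"SELECT id FROM orders WHERE amount > {threshold}"


def fake_execute(query):
    # Placeholder for the expensive database call (assumed, for illustration).
    return [(1,), (2,)]


store = ToyTableStore()
first = store.materialize(lazy_task(100), fake_execute)   # executed
second = store.materialize(lazy_task(100), fake_execute)  # same query: copied from cache
third = store.materialize(lazy_task(200), fake_execute)   # changed query: executed
```

In this model the task body runs on every call, which mirrors "lazy tasks always get executed", while the execution counter only increases when the generated query changes. That is why no manual `version` bump is needed in the sketch: a changed query produces a different cache key.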