* xref:redpanda-cloud:get-started:cloud-overview.adoc#redpanda-cloud-vs-self-managed-feature-compatibility[Redpanda Cloud vs Self-Managed feature compatibility]

== JSON Schema support for Iceberg topics
Redpanda now supports JSON Schema for Iceberg topics. This means you can use any of the supported schema types (Avro, Protobuf, and JSON Schema) with Iceberg-enabled topics. For more information, see xref:manage:iceberg/specify-iceberg-schema.adoc[].
== Manage SASL users with Kafka APIs
Redpanda now supports the following Kafka APIs for managing SASL user credentials as described in https://cwiki.apache.org/confluence/display/KAFKA/KIP-554%3A+Add+Broker-side+SCRAM+Config+API[KIP-554^]:
In xref:manage:iceberg/about-iceberg-topics.adoc#enable-iceberg-integration[Iceberg-enabled clusters], the `redpanda.iceberg.mode` topic property determines how Redpanda maps topic data to the Iceberg table structure. You can have the generated Iceberg table match the structure of a schema in the Schema Registry, or you can use the `key_value` mode where Redpanda stores the record values as-is in the table.
== Supported Iceberg modes
=== value_schema_latest
Creates an Iceberg table whose structure matches the latest schema registered for the subject in the Schema Registry. You must register a schema in the Schema Registry.
Producers cannot use the wire format in `value_schema_latest` mode. Redpanda expects the serialized message as-is without the magic byte or schema ID prefix in the record value.
NOTE: The `value_schema_latest` mode is not compatible with the xref:reference:rpk/rpk-topic/rpk-topic-produce[`rpk topic produce`] command, which embeds the wire format header. You must use your own producer code to produce to topics in `value_schema_latest` mode.
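To make the framing difference concrete, the following Python sketch builds the same serialized value both ways. The helper name and schema ID are illustrative; only the wire-format layout (magic byte `0x00` followed by a 4-byte big-endian schema ID) reflects the framing described above.

[,python]
----
import struct

def wire_format_payload(schema_id: int, serialized: bytes) -> bytes:
    # Schema Registry wire format: magic byte 0x00, a 4-byte big-endian
    # schema ID, then the serialized record. This is the framing that
    # `value_schema_id_prefix` mode expects.
    return b"\x00" + struct.pack(">I", schema_id) + serialized

serialized = b'{"user_id": 2324}'  # example serialized record value

# In `value_schema_latest` mode, produce `serialized` as-is:
# no magic byte, no schema ID prefix.
framed = wire_format_payload(42, serialized)

assert framed[:5] == b"\x00" + struct.pack(">I", 42)
assert framed[5:] == serialized
----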
The latest schema is cached periodically. The cache period is defined by the cluster property `iceberg_latest_schema_cache_ttl_ms` (default: 5 minutes).

You can override the default behavior of `value_schema_latest` mode:

* For both Avro and Protobuf, specify a different subject name by using the key-value pair `subject=<subject-name>`, for example `value_schema_latest:subject=sensor-data`.
* For Protobuf only:
+
--
** Specify a different message definition by using a key-value pair `protobuf_name=<message-full-name>`. You must use the fully qualified name, which includes the package name, for example, `value_schema_latest:protobuf_name=com.example.manufacturing.SensorData`.
** To specify both a different subject and message definition, separate the key-value pairs with a comma, for example: `value_schema_latest:subject=my_protobuf_schema,protobuf_name=com.example.manufacturing.SensorData`.
--
+
NOTE: If you don't specify the fully qualified Protobuf message name, Redpanda pauses the data translation to the Iceberg table until you fix the topic misconfiguration.
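As an illustration of the override syntax only, the hypothetical Python helper below assembles a `redpanda.iceberg.mode` value from a base mode plus comma-separated key-value overrides. Redpanda itself parses this string; the helper exists only to show the shape of the value.

[,python]
----
def iceberg_mode(base: str, **overrides: str) -> str:
    # Hypothetical helper: builds "<mode>:key=value,key=value" as in the
    # examples above. Not part of any Redpanda client library.
    if not overrides:
        return base
    pairs = ",".join(f"{key}={value}" for key, value in overrides.items())
    return f"{base}:{pairs}"

mode = iceberg_mode(
    "value_schema_latest",
    subject="my_protobuf_schema",
    # Use the fully qualified message name, including the package.
    protobuf_name="com.example.manufacturing.SensorData",
)
assert mode == ("value_schema_latest:subject=my_protobuf_schema,"
                "protobuf_name=com.example.manufacturing.SensorData")
----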
== How Iceberg modes translate to table format
Use `key_value` mode if you want to use the Iceberg data in its semi-structured format.
The `value_schema_id_prefix` and `value_schema_latest` modes can use the schema to translate to the following table format:
As you produce records to the topic, the data also becomes available in object storage for Iceberg-compatible clients to consume. You can use the same analytical tools to xref:manage:iceberg/query-iceberg-topics.adoc[read the Iceberg topic data] in a data lake as you would for a relational database.
If Redpanda fails to translate the record to the columnar format as defined by the schema, it writes the record to a dead-letter queue (DLQ) table. See xref:manage:iceberg/about-iceberg-topics.adoc#troubleshoot-errors[Troubleshoot errors] for more information.
=== Schema types translation

| string | string
| record | struct
| array | list
| map | map
| fixed | fixed*
| decimal | decimal
| uuid | uuid*
There are some cases where the Protobuf type does not map directly to an Iceberg type and Redpanda applies the following transformations:
* Repeated values are translated into Iceberg `list` types.
* Enums are translated into Iceberg `int` types based on the integer value of the enumerated type.
* `uint32` and `fixed32` are translated into Iceberg `long` types as that is the existing semantic for unsigned 32-bit values in Iceberg.
* `uint64` and `fixed64` values are translated into their Base-10 string representation.
* `google.protobuf.Timestamp` is translated into `timestamp` in Iceberg.
Recursive types are not supported.
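The value-level effect of these transformations can be sketched in Python. This is a rough illustration, not Redpanda's internal implementation; the `SPECIAL_CASES` lookup simply restates the rules above.

[,python]
----
# Sketch of the special-case mappings listed above; the real translation
# happens inside Redpanda, not in client code.
SPECIAL_CASES = {
    "uint32": "long",      # unsigned 32-bit widens to Iceberg long
    "fixed32": "long",
    "uint64": "string",    # stored as a base-10 string
    "fixed64": "string",
    "enum": "int",         # integer value of the enumerated type
    "google.protobuf.Timestamp": "timestamp",
}

def uint64_to_iceberg(value: int) -> str:
    # uint64/fixed64 values become their base-10 string representation.
    if not 0 <= value < 2**64:
        raise ValueError("out of uint64 range")
    return str(value)

assert SPECIAL_CASES["uint32"] == "long"
assert uint64_to_iceberg(2**64 - 1) == "18446744073709551615"
----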
--
JSON Schema::
+
--
Requirements:
- Only JSON Schema Draft-07 is currently supported.
- You must declare the JSON Schema dialect using the `$schema` keyword, for example `"$schema": "http://json-schema.org/draft-07/schema#"`.
- You must use a JSON Schema that constrains JSON documents to a strict type in order for Redpanda to translate to Iceberg; that is, each subschema must use the `type` keyword.
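For example, the following Draft-07 schema satisfies these requirements: the dialect is declared with `$schema`, and every subschema carries a `type` keyword. The field names and the `subschemas_typed` check are illustrative only, not part of Redpanda's validation.

[,python]
----
# Illustrative Draft-07 schema: the dialect is declared with "$schema"
# and every subschema constrains its type with the "type" keyword.
schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "user_id": {"type": "integer"},
        "event_type": {"type": "string"},
        "ts": {"type": "string", "format": "date-time"},
    },
    "additionalProperties": False,
}

def subschemas_typed(s: dict) -> bool:
    # Sketch of the strict-typing requirement: each subschema must carry
    # a "type" keyword for Redpanda to translate it to Iceberg.
    if "type" not in s:
        return False
    return all(subschemas_typed(p) for p in s.get("properties", {}).values())

assert schema["$schema"] == "http://json-schema.org/draft-07/schema#"
assert subschemas_typed(schema)
----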

| The keywords `items` and `additionalItems` must be used to constrain element types.

| boolean
| boolean
|

| null
|
| The `null` type is not supported except when it is paired with another type to indicate nullability.

| number
| double
|

| integer
| long
|

| string
| string
| The `format` keyword can be used for custom Iceberg types. See <<format-translation,`format` annotation translation>> for details.

| object
| struct
| The `properties` keyword must be used to define `struct` fields and constrain their types. The `additionalProperties` keyword is accepted only when it is set to `false`.

|===

[[format-translation]]
.`format` annotation translation
|===
| `format` value | Iceberg type

| date-time | timestamptz
| date | date
| time | time

|===

The following are not supported for JSON Schema:
* Relative and absolute (including external) references using `$ref` and `$dynamicRef` keywords

// Source: modules/manage/partials/iceberg/about-iceberg-topics.adoc

* It is not possible to append topic data to an existing Iceberg table that is not created by Redpanda.
* If you enable the Iceberg integration on an existing Redpanda topic, Redpanda does not backfill the generated Iceberg table with topic data.
* JSON schemas are supported starting with Redpanda version 25.2.
== Enable Iceberg integration
== Schema evolution
Redpanda supports schema evolution in accordance with the https://iceberg.apache.org/spec/#schema-evolution[Iceberg specification^]. Permitted schema evolutions include reordering fields and promoting field types. When you update the schema in Schema Registry, Redpanda automatically updates the Iceberg table schema to match the new schema.
For example, if you produce records to a topic `demo-topic` with the following Avro schema:

== Troubleshoot errors

If Redpanda encounters an error while writing a record to the Iceberg table, Redpanda by default writes the record to a separate dead-letter queue (DLQ) Iceberg table named `<topic-name>~dlq`. The following can cause errors when translating records in the `value_schema_id_prefix` and `value_schema_latest` modes to the Iceberg table format:

- Redpanda cannot find the embedded schema ID in the Schema Registry.
- Redpanda fails to translate one or more schema data types to an Iceberg type.
- In `value_schema_id_prefix` mode, you do not use the Schema Registry wire format with the magic byte.

The DLQ table itself uses the `key_value` schema, consisting of two columns: the record metadata including the key, and a binary column for the record's value.
NOTE: Topic property misconfiguration, such as xref:manage:iceberg/specify-iceberg-schema.adoc#override-value-schema-latest-default[overriding the default behavior of `value_schema_latest` mode] but not specifying the fully qualified Protobuf message name, does not cause records to be written to the DLQ table. Instead, Redpanda pauses the topic data translation to the Iceberg table until you fix the misconfiguration.
=== Inspect DLQ table
You can inspect the DLQ table for records that failed to write to the Iceberg table, and you can take further action on these records, such as transforming and reprocessing them, or debugging issues that occurred upstream.
The following example produces a record to a topic named `ClickEvent` and does not use the Schema Registry wire format that includes the magic byte and schema ID:
The data is in binary format, and the first byte is not `0x00`, indicating that it was not produced with a schema.
=== Reprocess DLQ records
You can apply a transformation and reprocess the record in your data lakehouse to the original Iceberg table. In this case, you have a JSON value represented as UTF-8 binary. Depending on your query engine, you might need to decode the binary value before extracting the JSON fields; some engines decode the binary value for you automatically.
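As a language-neutral sketch of that decode-then-extract step, assuming a JSON payload like the `ClickEvent` example and a simplified view of the DLQ row:

[,python]
----
import json

# A DLQ row's binary `value` column, holding the raw record bytes.
# This payload mirrors the example record used elsewhere on this page;
# the single-column view of the DLQ row is simplified.
dlq_value = b'{"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}'

# Decode the UTF-8 binary, then extract the JSON fields.
record = json.loads(dlq_value.decode("utf-8"))
transformed = {
    "user_id": record["user_id"],
    "event_type": record["event_type"],
    "ts": record["ts"],
}

assert transformed["user_id"] == 2324
assert transformed["event_type"] == "BUTTON_CLICK"
----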
You can now insert the transformed record back into the main Iceberg table. Redpanda recommends employing a strategy for exactly-once processing to avoid duplicates when reprocessing records.

=== Drop invalid records

ifndef::env-cloud[]
To disable the default behavior and drop an invalid record, set the xref:reference:properties/topic-properties.adoc#redpanda-iceberg-invalid-record-action[`redpanda.iceberg.invalid.record.action`] topic property to `drop`. You can also configure the default cluster-wide behavior for invalid records by setting the `iceberg_invalid_record_action` property.
endif::[]

ifdef::env-cloud[]
To disable the default behavior and drop an invalid record, set the `redpanda.iceberg.invalid.record.action` topic property to `drop`. You can also configure the default cluster-wide behavior for invalid records by setting the `iceberg_invalid_record_action` property.
endif::[]

== Performance considerations
When you enable Iceberg for any substantial workload and start translating topic data to the Iceberg format, you may see most of your cluster's CPU utilization increase. If this additional workload overwhelms the brokers and causes the Iceberg table lag to exceed the configured target lag, Redpanda automatically applies backpressure to producers to prevent Iceberg tables from lagging further. This ensures that Iceberg tables keep up with the volume of incoming data, but sacrifices ingress throughput of the cluster.

// Source: modules/manage/partials/iceberg/query-iceberg-topics.adoc

=== Topic with schema (`value_schema_id_prefix` mode)
NOTE: The steps in this section also apply to the `value_schema_latest` mode, except the produce step. The `value_schema_latest` mode is not compatible with the Schema Registry wire format. The xref:reference:rpk/rpk-topic/rpk-topic-produce[`rpk topic produce`] command embeds the wire format header, so you must use your own producer code with `value_schema_latest`.
Assume that you have created the `ClickEvent` topic, set `redpanda.iceberg.mode` to `value_schema_id_prefix`, and are connecting to a REST-based Iceberg catalog. The following is an Avro schema for `ClickEvent`:

In this example, assume that you have created the `ClickEvent_key_value` topic.

[,bash]
----
echo '"key1" {"user_id":2324,"event_type":"BUTTON_CLICK","ts":"2024-11-25T20:23:59.380Z"}' | rpk topic produce ClickEvent_key_value --format='%k %v\n'
----
. The following Spark SQL query returns the semi-structured data in the `ClickEvent_key_value` table. The table consists of two columns: one named `redpanda`, containing the record key and other metadata, and another binary column named `value` for the record's value: