[25.2] Iceberg - JSON Schema support #1207
| Creates an Iceberg table whose structure matches the latest schema registered for the subject in the Schema Registry. You must register a schema in the Schema Registry. Unlike the `value_schema_id_prefix` mode, `value_schema_latest` does not require that producers use the wire format. | ||
| Creates an Iceberg table whose structure matches the latest schema registered for the subject in the Schema Registry. You must register a schema in the Schema Registry. For Protobuf, you must use the fully qualified schema name, which includes the package name, for example `com.example.manufacturing.SensorData`. | ||
| Producers cannot use the wire format in `value_schema_latest` mode. |
Is there a certain format that is expected instead? What can producers use?
The previous wording was perfect here. Producers can use anything they like. You can link the "wire format" to https://docs.redpanda.com/current/manage/schema-reg/schema-reg-overview/#wire-format
"Producers cannot use the wire format in value_schema_latest mode. Instead the serialized message is expected as-is in the record value." maybe
Updated
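For readers following this thread, here is a minimal sketch of what "the wire format" means for a record value: a magic byte (0) followed by a 4-byte big-endian schema ID, then the serialized payload. The function names are hypothetical and only illustrate the framing. In `value_schema_latest` mode the record value would be just the serialized payload, without this header.

```python
import struct

MAGIC_BYTE = 0  # first byte of a Schema Registry wire-format value


def add_wire_header(schema_id: int, payload: bytes) -> bytes:
    """Prefix a serialized payload with the magic byte and a 4-byte big-endian schema ID."""
    return struct.pack(">BI", MAGIC_BYTE, schema_id) + payload


def split_wire_header(value: bytes):
    """Return (schema_id, payload), or None when the value is too short or lacks the magic byte.

    Note: a raw payload can coincidentally start with a zero byte, so this
    check is a heuristic, not proof that a producer used the wire format.
    """
    if len(value) < 5 or value[0] != MAGIC_BYTE:
        return None
    (schema_id,) = struct.unpack(">I", value[1:5])
    return schema_id, value[5:]
```

A value produced for `value_schema_id_prefix` mode would look like `add_wire_header(42, payload)`, while a value for `value_schema_latest` mode would be `payload` as-is.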
nvartolomei
left a comment
lgtm for the iceberg text
| You can inspect the DLQ table for records that failed to write to the Iceberg table, and you can take further action on these records, such as transforming and reprocessing them, or debugging issues that occurred upstream. | ||
| NOTE: Topic property misconfiguration, such as setting `redpanda.iceberg.mode` to `value_schema_latest` but not specifying the fully qualified schema name, does not cause records to be written to the DLQ table. Instead, Redpanda pauses the topic data translation to the Iceberg table until you fix the misconfiguration. |
You don't need to specify the schema name if your schema uses the TopicNamingStrategy, though: https://docs.redpanda.com/current/manage/iceberg/choose-iceberg-mode/#override-value-schema-latest-default
IIUC you have a choice of doing that or specifying the fully qualified name, so this is misleading.
At least the docs say that. cc @rockwotj
Updated to specify that it's for overriding the default, and added a cross reference to the doc that explains this override.
Correct
| === Topic with schema (`value_schema_id_prefix` mode) | ||
| NOTE: The steps in this section also apply to the `value_schema_latest` mode, except for step 2. The `value_schema_latest` mode doesn't require the Schema Registry wire format, so you'll use your own producer code instead of xref:reference:rpk/rpk-topic/rpk-topic-produce[`rpk topic produce`]. | ||
| NOTE: The steps in this section also apply to the `value_schema_latest` mode, except the produce step. The `value_schema_latest` mode is not compatible with the Schema Registry wire format. The xref:reference:rpk/rpk-topic/rpk-topic-produce[`rpk topic produce`] command embeds the wire format header, so you must use your own producer code with `value_schema_latest`. |
@r-vasquez I think it would be really great if you could always produce without the wire format. A lot of our customers don't use it, so having a mode in your profile where you can choose how RPK produces (and maybe consumes) would help customers a lot who don't use the wire format.
JakeSCahill
left a comment
Left some comments for your consideration.
| +---------+--------------+--------------------------+ | ||
| ---- | ||
| == Manage dead-letter queue |
I think users are more likely to be scanning for 'how do I debug errors with Iceberg writes'. This heading feels a little system-centric rather than goal focused.
| == Manage dead-letter queue | ||
| Errors may occur when translating records in the `value_schema_id_prefix` mode to the Iceberg table format; for example, if you do not use the Schema Registry wire format with the magic byte, if the schema ID in the record is not found in the Schema Registry, or if an Avro or Protobuf data type cannot be translated to an Iceberg type. | ||
| Errors may occur when translating records in the `value_schema_id_prefix` or `value_schema_latest` modes to the Iceberg table format; for example, if you do not use the Schema Registry wire format with the magic byte, if the schema ID in the record is not found in the Schema Registry, or if a schema data type cannot be translated to an Iceberg type. |
I would restructure this section so that users learn in this order:
- what happens when RP encounters an error
- examples of errors (bulleted list)
- what a DLQ is
- structure of DLQ
- how to inspect it (example SQL statement)
- how to drop invalid records instead
Thanks for the suggestion, @JakeSCahill. I've restructured this section. I think it makes even more sense now to break this out into its own standalone doc. I'll do that in a later PR.
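To make the error cases discussed in this thread concrete, here is a rough sketch of how records in `value_schema_id_prefix` mode might be separated into translatable records and dead-letter records. It is not Redpanda's implementation: `route_records`, the reason strings, and the in-memory set of known schema IDs are all hypothetical, and the third error case from the diff (a data type that cannot be translated to an Iceberg type) is not modeled.

```python
import struct
from typing import Iterable, List, Set, Tuple


def route_records(
    values: Iterable[bytes], known_schema_ids: Set[int]
) -> Tuple[List[bytes], List[Tuple[bytes, str]]]:
    """Split record values into (payloads to translate, DLQ entries with a reason)."""
    translated: List[bytes] = []
    dlq: List[Tuple[bytes, str]] = []
    for value in values:
        # Error case 1: the value does not start with the magic byte + schema ID header.
        if len(value) < 5 or value[0] != 0:
            dlq.append((value, "missing wire format header"))
            continue
        (schema_id,) = struct.unpack(">I", value[1:5])
        # Error case 2: the schema ID is not registered (simulated with a local set).
        if schema_id not in known_schema_ids:
            dlq.append((value, f"schema ID {schema_id} not found"))
            continue
        translated.append(value[5:])
    return translated, dlq
```

Per the note in the diff, topic property misconfiguration is a separate case: it pauses translation rather than sending records to the DLQ table, so it is deliberately left out of this sketch.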
Co-authored-by: Jake Cahill <[email protected]>
Description
This pull request introduces updates to documentation and configuration related to Iceberg integration in Redpanda. The changes include support for JSON schemas, updates to Iceberg schema modes, and adjustments to topic configuration and query examples. Below is a summary of the most important changes:
Iceberg Schema and JSON Schema Support
- Added support for JSON Schema Draft-07 in Iceberg integration, including requirements for schema dialect declaration and constraints for type definitions. Unsupported features like `$ref`, `default`, and conditional typing are documented.
- Updated the `value_schema_latest` mode to require fully qualified schema names for Protobuf and clarified that it is incompatible with the Schema Registry wire format.
Documentation Updates
- Renamed `choose-iceberg-mode.adoc` to `specify-iceberg-schema.adoc` and updated its content to reflect the new schema support and integration details.
- Updated navigation links in `nav.adoc` to point to the newly renamed and updated Iceberg schema documentation.
Configuration and Examples
- Modified topic configuration examples to align with the new schema modes and clarified usage of the `redpanda.iceberg.mode` property.
- Corrected examples for producing data to topics, including adjustments to JSON formatting and commands.
Additional Enhancements
- Updated Iceberg type mappings for Protobuf and JSON Schema, including corrections to mapping terminology (e.g., "repeated values" to "list types").
- Clarified error handling and dead-letter queue behavior for misconfigured topics, emphasizing the pause in data translation instead of writing to the DLQ.
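The JSON Schema constraints summarized above could be checked before registering a schema with something like the following sketch. It rests on several assumptions: `find_issues` is a hypothetical helper, the dialect check compares against the canonical Draft-07 URI only, and "conditional typing" is interpreted as the `if`/`then`/`else` keywords. The authoritative list of supported and unsupported features is in the updated documentation, not here.

```python
from typing import Any, List

DRAFT_07 = "http://json-schema.org/draft-07/schema#"  # canonical Draft-07 dialect URI
# Assumption: keywords treated as unsupported, based on the PR description.
UNSUPPORTED = {"$ref", "default", "if", "then", "else"}


def find_issues(schema: dict) -> List[str]:
    """Return human-readable problems found in a JSON Schema document."""
    issues: List[str] = []
    if schema.get("$schema") != DRAFT_07:
        issues.append("$schema must declare Draft-07")

    def walk(node: Any, path: str) -> None:
        if isinstance(node, dict):
            for key, child in node.items():
                if key in UNSUPPORTED:
                    issues.append(f"{path}: unsupported keyword {key!r}")
                if key == "properties" and isinstance(child, dict):
                    # Keys under "properties" are field names, not keywords,
                    # so a field called "default" is not flagged.
                    for name, sub in child.items():
                        walk(sub, f"{path}/properties/{name}")
                else:
                    # Simplification: literal values (such as enum members) are
                    # walked too, which could produce false positives.
                    walk(child, f"{path}/{key}")
        elif isinstance(node, list):
            for index, item in enumerate(node):
                walk(item, f"{path}/{index}")

    walk(schema, "#")
    return issues
```

An empty list from `find_issues(schema)` only means none of the assumed problems were detected; it does not guarantee that the schema translates to an Iceberg table.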
Resolves https://redpandadata.atlassian.net/browse/
Review deadline: 23 July
Page previews
Specify Iceberg Schema
Query Iceberg Topics > Query examples > Topic with schema
What's New
Checks