A Pythonic interface for the Debezium Engine, allowing you to consume database Change Data Capture (CDC) events directly in your Python applications.
- Pure Python Interface: Interact with the powerful Debezium Engine using simple Python classes and methods.
- Pluggable Event Handlers: Easily create custom handlers to process CDC events according to your specific needs.
- Built-in Iceberg Handler: Stream change events directly into Apache Iceberg tables with zero boilerplate.
- Seamless Integration: Designed to work with popular Python data tools like dlt (data load tool).
- All Debezium Connectors: Supports all standard Debezium connectors (PostgreSQL, MySQL, SQL Server, Oracle, etc.).
This library acts as a bridge between the Python world and the Java-based Debezium Engine. It uses Pyjnius to manage the JVM and interact with Debezium's Java classes, exposing a clean, Pythonic API so you can focus on your data logic without writing Java code.
pydbzengine
comes with several built-in handlers to stream CDC events directly into common data destinations. You can also easily create your own by extending the BasePythonChangeHandler
.
The Iceberg handlers are designed to stream CDC events directly into Apache Iceberg tables, enabling you to build a robust data lakehouse architecture. They automatically handle data serialization into the appropriate Arrow/Parquet format and manage Iceberg transactions.
IcebergChangeHandler
: A straightforward handler that appends change data to a source-equivalent Iceberg table using a predefined schema.- Use Case: Best for creating a "bronze" layer where you want to capture the raw Debezium event. The
before
andafter
payloads are stored as complete JSON strings. - Schema: Uses a fixed schema where complex nested fields (
source
,before
,after
) are stored asStringType
.- With consuming data as json, all source system schema changes will be absorbed automatically.
- Automatic Table Creation & Partitioning: It automatically creates a new Iceberg table for each source table and partitions it by day on the
_consumed_at
timestamp for efficient time-series queries. - Enriched Metadata: It also adds
_consumed_at
,_dbz_event_key
, and_dbz_event_key_hash
columns for enhanced traceability.
- Use Case: Best for creating a "bronze" layer where you want to capture the raw Debezium event. The
DltChangeHandler
: This handler integrates seamlessly with thedlt
library. It passes Debezium events to adlt
pipeline, allowing you to load the data into any destinationdlt
supports (e.g., DuckDB, BigQuery, Snowflake, Redshift, and more).- Use Case: Perfect for users who want to leverage
dlt
's powerful features for schema inference, data normalization, and loading data into a data warehouse or database.
- Use Case: Perfect for users who want to leverage
BasePythonChangeHandler
: This is the abstract base class for creating your own custom handlers. By extending this class and implementing thehandle_batch
method, you can process change events with your own Python logic.- Use Case: Use this for any custom destination or logic not covered by the built-in handlers. Examples include:
- Sending events to a message queue like Kafka or RabbitMQ.
- Calling a custom API endpoint for each event.
- Performing real-time aggregations or analytics.
- Use Case: Use this for any custom destination or logic not covered by the built-in handlers. Examples include:
You must have a Java Development Kit (JDK) version 11 or newer installed and available in your system's PATH
.
# For core functionality
pip install pydbzengine
# To include dependencies for the Apache Iceberg handler
pip install 'pydbzengine[iceberg]'
# To include dependencies for the dlt handler example
pip install 'pydbzengine[dlt]'
- First install the packages,
pip install pydbzengine[dev]
- Second extend the BasePythonChangeHandler and implement your python consuming logic. see the example below
from typing import List
from pydbzengine import ChangeEvent, BasePythonChangeHandler
from pydbzengine import Properties, DebeziumJsonEngine
class PrintChangeHandler(BasePythonChangeHandler):
"""
A custom change event handler class.
This class processes batches of Debezium change events received from the engine.
The `handleJsonBatch` method is where you implement your logic for consuming
and processing these events. Currently, it prints basic information about
each event to the console.
"""
def handleJsonBatch(self, records: List[ChangeEvent]):
"""
Handles a batch of Debezium change events.
This method is called by the Debezium engine with a list of ChangeEvent objects.
Change this method to implement your desired processing logic. For example,
you might parse the event data, transform it, and load it into a database or
other destination.
Args:
records: A list of ChangeEvent objects representing the changes captured by Debezium.
"""
print(f"Received {len(records)} records")
for record in records:
print(f"destination: {record.destination()}")
print(f"key: {record.key()}")
print(f"value: {record.value()}")
print("--------------------------------------")
if __name__ == '__main__':
props = Properties()
props.setProperty("name", "engine")
props.setProperty("snapshot.mode", "initial_only")
# Add further Debezium connector configuration properties here. For example:
# props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector")
# props.setProperty("database.hostname", "your_database_host")
# props.setProperty("database.port", "3306")
# Create a DebeziumJsonEngine instance, passing the configuration properties and the custom change event handler.
engine = DebeziumJsonEngine(properties=props, handler=PrintChangeHandler())
# Start the Debezium engine to begin consuming and processing change events.
engine.run()
from pyiceberg.catalog import load_catalog
from pydbzengine import DebeziumJsonEngine
from pydbzengine.handlers.iceberg import IcebergChangeHandler
from pydbzengine import Properties
conf = {
"uri": "http://localhost:8181",
# "s3.path-style.access": "true",
"warehouse": "warehouse",
"s3.endpoint": "http://localhost:9000",
"s3.access-key-id": "minioadmin",
"s3.secret-access-key": "minioadmin",
}
catalog = load_catalog(name="rest", **conf)
handler = IcebergChangeHandler(catalog=catalog, destination_namespace=("iceberg", "debezium_cdc_data",))
dbz_props = Properties()
dbz_props.setProperty("name", "engine")
dbz_props.setProperty("snapshot.mode", "always")
# ....
# Add further Debezium connector configuration properties here. For example:
# dbz_props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector")
engine = DebeziumJsonEngine(properties=dbz_props, handler=handler)
engine.run()
For the full code please see dlt_consuming.py
from pydbzengine import DebeziumJsonEngine
from pydbzengine.helper import Utils
from pydbzengine.handlers.dlt import DltChangeHandler
from pydbzengine import Properties
import dlt
# Create a dlt pipeline and set destination. in this case DuckDb.
dlt_pipeline = dlt.pipeline(
pipeline_name="dbz_cdc_events_example",
destination="duckdb",
dataset_name="dbz_data"
)
handler = DltChangeHandler(dlt_pipeline=dlt_pipeline)
dbz_props = Properties()
dbz_props.setProperty("name", "engine")
dbz_props.setProperty("snapshot.mode", "always")
# ....
engine = DebeziumJsonEngine(properties=dbz_props, handler=handler)
# Run the Debezium engine asynchronously with a timeout.
# This runs for a limited time and then terminates automatically.
Utils.run_engine_async(engine=engine, timeout_sec=60)