
Design session with ChatGPT-5

· 10 min read
Paweł Mantur
Author | Architect | Engineer
ChatGPT
Assistant
Claude.ai
Assistant

Introduction

This article documents the outcomes of an AI-assisted architectural design session focused on building a real-time data unification platform. The challenge: handling heterogeneous data sources (MongoDB documents with varying schemas and Kafka event streams) and transforming them into a consistent canonical model for operational applications, while also providing analytical insights.

info

The full session transcript can be found here: shared link on chatgpt.com.
I was not originally intending to publish it, but the results are quite insightful, which is why I asked Claude Code to have a look and create a blog article about it.
I also wanted to use ChatGPT for the blog summary, but I ran out of ChatGPT-5 credits 😅.

So here we are: Claude has prepared a summary of ChatGPT's work, a good cooperation under my supervision 😎

The Problem Statement

Primary Challenge: Build a system that can:

  • Discover schemas from heterogeneous MongoDB documents and Kafka events
  • Map disparate data formats to a canonical model in real-time
  • Serve both operational applications (latest state + events) and analytical workloads
  • Support both cloud-scale deployment and local development environments

Key Requirements:

  • Real-time canonicalization for operational apps (primary use case)
  • Visual schema mapping for simple transformations
  • Code-based approach for complex logic and historical backfills
  • Support for analytical metrics with near-real-time updates
  • Enterprise-ready architecture with pet-project-friendly local development

Architecture Evolution Through Design Session

Phase 1: AWS Managed Services Approach

Initial exploration focused on AWS managed services:

  • AWS Glue Data Catalog + Glue Crawlers for schema discovery
  • AWS Glue Streaming Jobs for real-time canonicalization
  • Amazon DMS for MongoDB CDC
  • Amazon DocumentDB for materialized canonical state

Key Insight: While powerful, this approach created vendor lock-in and complex local development setup.

Phase 2: Open Source + Kubernetes Native

Pivoted to open-source alternatives running on Kubernetes:

  • Apache Kafka (via Strimzi operator) for event streaming
  • Debezium for MongoDB change data capture
  • Apache Flink for stream processing
  • Confluent Schema Registry or Apicurio Registry for schema governance

Key Insight: Provides portability across cloud providers while maintaining enterprise capabilities.

Phase 3: Visual + Code-First Hybrid

Added visual tooling for business users while maintaining developer control:

  • Apache Hop for drag-and-drop schema mapping (simple cases)
  • DataHub for schema discovery and cataloging
  • Apache Flink for complex transformations and stateful processing

Key Insight: Hybrid approach balances accessibility for analysts with full control for engineers.

Final Architecture Design

Core Technology Stack

Component | Purpose | Local (Docker) | Cloud (Managed)
Data Sources | Source systems | MongoDB Docker | MongoDB Atlas
Change Data Capture | Stream changes | Debezium Docker | Debezium on EKS
Event Streaming | Message backbone | Kafka Docker | Amazon MSK / Confluent Cloud
Schema Management | Schema governance | Schema Registry Docker | Managed Schema Registry
Stream Processing | Real-time transforms | Flink Docker | Flink on EKS
Canonical State | Latest entity state | MongoDB Docker | MongoDB Atlas
Metrics Store | Analytics | Postgres/TimescaleDB | ClickHouse/Druid Managed
Schema Discovery | Data cataloging | DataHub Docker | DataHub on EKS
Visual Mapping | Business user tools | Apache Hop Docker | Apache Hop on EKS

Architecture Diagrams

Local Development Setup

[diagram: local Docker-based stack]

Cloud Production Setup

[diagram: cloud production stack]

Green components in the diagrams indicate identical code/logic between environments

Key Architectural Decisions

1. Unified Stream and Batch Processing with PyFlink

Decision: Use Apache Flink for both streaming and batch processing, with PyFlink as the primary language.

Rationale:

  • Single codebase for streaming metrics and historical backfills
  • Same team can maintain both pipeline types
  • Flink's unified API supports both bounded (batch) and unbounded (streaming) sources
  • Python expertise more common than Java/Scala in data teams
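
A minimal PyFlink sketch of this decision, assuming the Table API (the function and variable names are illustrative): the same pipeline code can be created in streaming mode for live canonicalization or in batch mode for historical backfills, with only the environment settings differing.

# Minimal PyFlink sketch: one codebase, two execution modes.
from pyflink.table import EnvironmentSettings, TableEnvironment

def create_table_env(for_backfill: bool) -> TableEnvironment:
    """Batch mode for backfills, streaming mode for live processing."""
    settings = (EnvironmentSettings.in_batch_mode()
                if for_backfill
                else EnvironmentSettings.in_streaming_mode())
    return TableEnvironment.create(settings)

# The same SQL / Table API transformations are then registered against either
# environment; only the source and sink connector definitions differ.
streaming_env = create_table_env(for_backfill=False)
backfill_env = create_table_env(for_backfill=True)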

2. Schema-First Approach

Decision: Canonical schemas live in Schema Registry; raw schemas discovered dynamically.

Architecture: raw schemas are registered automatically under discovery subjects as they are inferred, while canonical schemas are curated as stable, versioned subjects in the Schema Registry; mapping specifications connect the two.

Benefits:

  • Separates discovery (raw, evolving) from governance (canonical, stable)
  • Visual tools can operate on discovered schemas
  • Code generation possible from mapping specifications
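
To illustrate the split, here is a rough Python sketch (subject names follow this article's convention; the registry URL is a placeholder) that registers a discovered raw schema and a curated canonical schema under separate subjects using Confluent Schema Registry's REST API.

# Sketch: raw (discovered) vs canonical (curated) schemas live under separate subjects.
import json
import requests

REGISTRY_URL = "http://localhost:8081"  # placeholder

def register_schema(subject: str, schema: dict, schema_type: str = "JSON") -> int:
    """Register a new schema version for a subject and return its global schema id."""
    resp = requests.post(
        f"{REGISTRY_URL}/subjects/{subject}/versions",
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        json={"schemaType": schema_type, "schema": json.dumps(schema)},
    )
    resp.raise_for_status()
    return resp.json()["id"]

# Discovered raw schema: versioned, exploratory, updated by the profiler.
raw_id = register_schema("raw.mongo.orders.inferred", {"type": "object"})

# Canonical schema: a governed, stable contract maintained by engineers.
canonical_id = register_schema("canonical.order.v1", {"type": "object"})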

3. Tiered Metrics Store Strategy

Decision: Start with Postgres/TimescaleDB locally, migrate to ClickHouse/Druid for production.

Local Setup:

  • Postgres/TimescaleDB for metrics storage
  • Simple SQL queries for analytics
  • Minimal infrastructure cost

Production Setup:

  • ClickHouse or Druid managed services
  • Columnar storage for high-cardinality dimensions
  • Horizontal scaling capabilities

Migration Path: only the Flink sink configuration changes; no application code changes are required (see the sketch below).
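
A sketch of what that swap could look like when the metrics sink is defined with Flink SQL DDL rendered from configuration (connector options are placeholders; a managed ClickHouse or Druid deployment would use whichever sink connector that service recommends):

# Sketch: sink options are injected per environment; the aggregation logic never changes.
LOCAL_SINK_OPTIONS = {
    "connector": "jdbc",                                  # Flink JDBC connector
    "url": "jdbc:postgresql://localhost:5432/metrics",    # TimescaleDB is Postgres-compatible
    "table-name": "order_metrics",
}

def metrics_sink_ddl(options: dict) -> str:
    """Render the sink table DDL with environment-specific connector options."""
    with_clause = ",\n        ".join(f"'{k}' = '{v}'" for k, v in options.items())
    return f"""
    CREATE TABLE order_metrics (
        window_start TIMESTAMP(3),
        region       STRING,
        product      STRING,
        order_count  BIGINT,
        total_amount DOUBLE
    ) WITH (
        {with_clause}
    )"""

# t_env.execute_sql(metrics_sink_ddl(LOCAL_SINK_OPTIONS))  # production swaps the options dict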

4. Hybrid Mapping Approach

Decision: Visual tools (Apache Hop) for simple mappings, code (Flink) for complex logic.

Implementation:

  • Apache Hop generates mapping specifications (JSON)
  • Flink jobs can import and execute these specifications
  • Complex stateful logic remains in hand-written Flink code
  • Same mapping definitions drive both simple and advanced pipelines

Technical Implementation Details

Schema Discovery Algorithm

Streaming Schema Inference:

# Simplified PyFlink schema profiler logic (helper functions are placeholders)
def infer_schema_from_stream(raw_topic):
    field_stats = {}          # how often each field path occurs
    type_distributions = {}   # which types were observed per field path

    for message in sample_messages(raw_topic, window="1h"):
        for field_path, value in flatten_json(message):
            field_stats[field_path] = field_stats.get(field_path, 0) + 1
            detected_type = infer_type(value)
            type_distributions.setdefault(field_path, set()).add(detected_type)

    # Generate a JSON Schema with profiling metadata and publish it
    inferred_schema = build_json_schema(field_stats, type_distributions)
    register_schema(schema_registry, f"raw.{raw_topic}.inferred", inferred_schema)
    push_profiling_stats(datahub, field_stats)

Unified Metrics Computation

Shared Aggregation Logic:

# PyFlink-style aggregation logic shared by streaming and batch jobs
# (illustrative pseudocode, not the exact Flink API)
class MetricAggregator:
    def compute_metrics(self, events, dimensions, time_window):
        return events \
            .key_by(lambda e: extract_dimensions(e, dimensions)) \
            .window(time_window) \
            .aggregate(
                count=lambda evs: len(evs),
                sum=lambda evs: sum(e.amount for e in evs),
                avg=lambda evs: sum(e.amount for e in evs) / len(evs),
                min=lambda evs: min(e.amount for e in evs),
                max=lambda evs: max(e.amount for e in evs)
            )

# Streaming version: canonical events from Kafka, hourly tumbling windows
streaming_metrics = MetricAggregator().compute_metrics(
    kafka_source.map(deserialize_canonical_event),
    dimensions=['region', 'product'],
    time_window=TumblingWindow.hours(1))

# Batch version: historical backfill from MongoDB, single global window
batch_metrics = MetricAggregator().compute_metrics(
    mongodb_source.map(convert_to_canonical_event),
    dimensions=['region', 'product'],
    time_window=GlobalWindow())

Mapping Specification Format

Generated by Apache Hop, Consumed by Flink:

{
  "mappingId": "order-canonicalization-v1",
  "source": {
    "type": "kafka",
    "topic": "raw.mongo.orders",
    "schemaId": 1234
  },
  "target": {
    "subject": "canonical.order.v1",
    "schemaId": 5678
  },
  "fieldMappings": [
    {
      "src": "$.orderId",
      "dst": "$.order_id",
      "transform": "string"
    },
    {
      "src": "$.customer.email",
      "dst": "$.customer.email",
      "transform": "lowercase"
    },
    {
      "src": "$.items[*].sku",
      "dst": "$.line_items[*].sku",
      "transform": "identity"
    }
  ],
  "defaults": {
    "currency": "USD",
    "created_by": "system"
  }
}
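
To make "same mapping definitions drive both pipelines" concrete, here is a deliberately simplified Python sketch of executing such a specification inside a Flink map function. It handles only plain dot-paths and the transforms used above (array paths like $.items[*].sku would need real JSONPath support), and all helper names are illustrative.

# Simplified sketch: applying a mapping specification to a raw record.
TRANSFORMS = {
    "identity": lambda v: v,
    "string": lambda v: str(v),
    "lowercase": lambda v: str(v).lower(),
}

def get_path(record, path):
    """Resolve a plain '$.a.b' path; returns None if any segment is missing."""
    value = record
    for key in path.lstrip("$.").split("."):
        value = value.get(key) if isinstance(value, dict) else None
    return value

def set_path(record, path, value):
    """Set a plain '$.a.b' path, creating intermediate objects as needed."""
    keys = path.lstrip("$.").split(".")
    for key in keys[:-1]:
        record = record.setdefault(key, {})
    record[keys[-1]] = value

def apply_mapping(spec, raw):
    """Produce a canonical record from a raw record according to the spec."""
    canonical = dict(spec.get("defaults", {}))
    for rule in spec["fieldMappings"]:
        value = get_path(raw, rule["src"])
        if value is not None:
            set_path(canonical, rule["dst"], TRANSFORMS[rule["transform"]](value))
    return canonical

In a real pipeline this function would run inside a Flink map operator, with specifications loaded from version control or the registry and validated against the target canonical schema.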

Key Design Outcomes

1. Technology Consolidation

Achievement: Reduced from 5+ different technologies to 3 core components:

  • Apache Flink (PyFlink): Unified streaming + batch processing
  • Kafka + Schema Registry: Event backbone + schema governance
  • DataHub + Apache Hop: Visual discovery + mapping

2. Development Experience Optimization

Achievement: Single team can maintain entire pipeline:

  • Same Python codebase for streaming and batch jobs
  • Visual tools for analysts, code for engineers
  • Consistent deployment patterns (Kubernetes operators)
  • Shared testing and monitoring approaches

3. Scalability Path

Achievement: Clear migration from pet project to enterprise:

  • Local: Docker containers, minimal cost
  • Production: Managed services (MSK, MongoDB Atlas, ClickHouse Cloud)
  • Code Reuse: 95% of application logic unchanged between environments

4. Schema Governance Model

Achievement: Separation of concerns for schema management:

  • Raw schemas: Discovered automatically, versioned, exploratory
  • Canonical schemas: Curated, governed, stable contracts
  • Mapping specifications: Version controlled, testable, auditable

Lessons Learned

1. AI-Assisted Architecture Design

Observation: The iterative conversation with AI helped explore trade-offs systematically:

  • Started broad (AWS managed services)
  • Refined constraints (open source, Kubernetes)
  • Optimized for specific requirements (single language, visual + code)

2. Importance of Local Development Parity

Insight: Having identical architecture locally and in production dramatically improves:

  • Developer productivity
  • Testing confidence
  • Debugging capabilities
  • Team onboarding

3. Visual Tools + Code Hybrid

Learning: Pure visual ETL tools don't scale for complex logic, but they excel for:

  • Schema exploration
  • Simple field mappings
  • Business user empowerment
  • Rapid prototyping

Complex stateful processing, joins, and business logic still require code.

4. Schema Registry as Integration Hub

Realization: Schema Registry becomes the contract layer between:

  • Discovery tools and mapping tools
  • Mapping tools and execution engines
  • Different team roles (analysts vs engineers)
  • Development and production environments

Implementation Recommendations

Phase 1: Foundation

  1. Set up local Kubernetes environment with Kafka + Schema Registry + MongoDB
  2. Implement basic Debezium CDC pipeline
  3. Create minimal PyFlink job for schema discovery
  4. Deploy DataHub for schema cataloging
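
As a starting point for step 2, a Debezium MongoDB source connector can be registered through the Kafka Connect REST API. A sketch (hosts, names and the collection list are placeholders, and exact property names vary between Debezium versions):

# Sketch: registering a Debezium MongoDB source connector via Kafka Connect's REST API.
import requests

CONNECT_URL = "http://localhost:8083"  # placeholder Kafka Connect endpoint

connector = {
    "name": "mongo-orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongo:27017/?replicaSet=rs0",  # placeholder
        "topic.prefix": "raw.mongo",
        "collection.include.list": "shop.orders",  # placeholder collection
    },
}

resp = requests.post(f"{CONNECT_URL}/connectors", json=connector)
resp.raise_for_status()
print("Connector created:", resp.json()["name"])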

Phase 2: Visual Mapping

  1. Deploy Apache Hop locally
  2. Integrate Hop with Schema Registry and DataHub
  3. Create sample mapping specifications
  4. Implement Flink job that consumes mapping specs

Phase 3: Metrics Pipeline

  1. Extend Flink jobs for time-windowed aggregations
  2. Set up Postgres/TimescaleDB metrics store
  3. Implement batch backfill logic using same PyFlink codebase

Phase 4 (hypothetical): Production Readiness

  1. Provision infrastructure on AWS and MongoDB Atlas
  2. Provision and secure DataHub, Apache Hop and Schema Registry
  3. Add monitoring and alerting
  4. Implement CI/CD for Flink jobs and mapping specs
  5. Performance testing and optimization

Conclusion

This AI-assisted design session demonstrated how iterative architectural exploration can lead to well-defined solutions that balance multiple constraints. The final architecture achieves:

  • Unified development experience through PyFlink for all processing
  • Visual accessibility via Apache Hop for business users
  • Enterprise scalability through managed cloud services
  • Cost-effective development via local Docker environment
  • Schema governance through systematic discovery and canonicalization

The hybrid approach of visual tools for simple cases and code for complex logic, combined with a clear local-to-cloud migration path, creates a sustainable architecture that can grow from pet project to enterprise-scale deployment.


Running an LLM on a local computer

· 2 min read
Paweł Mantur
Author | Architect | Engineer

Can you have a ChatGPT experience hosted fully on your laptop? Yes! And it is straightforward to set up. There are a number of open-source LLM models available that are alternatives to the GPT-4 model created by OpenAI (the model behind the famous ChatGPT).

This article uses the Llama LLM models provided by Meta as an example.

How to get the model?

Let's follow this tutorial (or the Windows or Linux version, depending on what you use): https://www.llama.com/docs/llama-everywhere/running-meta-llama-on-mac/

In a nutshell, there are only 2 steps you need to do to turn your computer into a conversational AI friend:

  1. Download ollama
  2. Run the model (for example, ollama run llama3):

[screenshot: running the model with ollama in the terminal]

And that's it!

How to use the model?

Now you can ask the model questions in the terminal:

[screenshot: asking the model a question in the terminal]

It is amazing that a fully functional LLM can fit into 4 GB of disk space and generate answers on any topic. I turned the Wi-Fi off to prove that it is not a trick ;-)

The model can also be consumed programmatically via the API or with SDKs.
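
For example, ollama exposes a local REST API (on port 11434 by default); a minimal Python sketch, assuming a model such as llama3 has already been pulled:

# Minimal sketch: calling the local ollama REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Explain CDC in one sentence.", "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])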

Downloading the models via llama-stack

If, like me, you started the journey by downloading the models as explained here, it turns out this is not needed when using ollama.

ollama downloads its own copy of the model to ~/.ollama/models. If you have also downloaded the model with the llama-stack CLI, you can delete it from ~/.llama/checkpoints/ to save disk space.

ADR - public S3 bucket access

· 4 min read
Paweł Mantur
Author | Architect | Engineer
info

This post is an example of an ADR - Architecture Decision Record. ADRs document the reasons behind architectural decisions and promote a more transparent, fact-based decision-making culture. ADRs are also a useful artefact from a technical documentation perspective.

Context

When implementing this website (see the related article for an architecture overview), among other decisions, I needed to decide how to set up the AWS S3 bucket that hosts the website files.

Decision

Although it is considered potentially unsafe, I have decided on Option 2: allowing public access 🙀. But let's not panic; I will explain why it is safe in this case.

Considered Options

❌ Option 1: Blocking public access to S3 bucket

  • Pros:
    • Safer option - zero trust is a holy rule of security, especially for public access. The AWS console blocks public access by default when creating S3 buckets, and it is also possible to block public access for the whole AWS account. Moreover, the AWS IAM Access Analyzer tool reports public access to buckets as a security finding. The AWS console and documentation really force you to think about this decision.
  • Cons - the solution gets more complicated:
    • The static website hosting feature of Amazon S3 cannot be used, as it requires public access
    • Since the S3 website endpoint cannot be used, CloudFront needs to use the S3 REST API as its origin
    • The Docusaurus build creates a directory for each page/article and puts a single index.html file inside it. For better SEO and nicer URLs, we do not include index.html in links. Web servers know that when a URL points to a directory, index.html should be served, and the S3 website endpoint handles this properly. But the S3 REST API serves the request as-is: for https://pawelmantur.pl/blog/s3-public-bucket it finds the blog/s3-public-bucket directory in S3, but since public access has no directory listing permission, it returns 403. To return index.html, as a web server like nginx would, we would need to add a CloudFront Function that rewrites the request URL to append index.html.
    • This is possible and relatively simple to try, but it requires introducing new components that can be avoided by leveraging the existing S3 website endpoint capability

✅ Option 2: Allowing public access to perform s3:GetObject action

  • Pros:
    • The bucket's only purpose is to host website files, and this website is public by design: I want people to access my blog, I want it to be public.
    • S3 website endpoint can be used
    • Static website built with Docusaurus works properly out of the box with S3 website endpoint, no function needed
    • No additional cost related to CloudFront Functions (although that cost would be negligible for this website)
    • A simpler setup with fewer components means less room for errors
  • Cons:
    • There is a risk that I will put some confidential files into this bucket, but the risk is mitigated by automation: the only way this bucket is updated is by a GitHub Action, which syncs only the build directory of the Docusaurus website, which is public by design. A well-defined, automated and tested process.

Consequences

Since a public S3 bucket was introduced to the architecture, I will have to be cautious about what content I put there and what policies are granted for public access. But since we are talking about a single person running a blog, we can agree that the related risks can be accepted.

If that decision were made in the context of a large organization, the use cases for public sharing would need special governance. If the organization has no business workflows that require public file sharing and different solutions are available for static website hosting, then public access to S3 should not be allowed, to avoid a risky setup.


Watch out for Kafka costs

· 3 min read
Paweł Mantur
Author | Architect | Engineer

Non-obvious Confluent Cloud costs

When using AWS or any other cloud, we need to be aware of network traffic charges, especially cross-zone and cross-region data transfer fees.

Example deployment:

  • A cluster in Confluent Cloud, hosted on AWS (1 CKU is limited to a single AZ; from 2 CKUs we have a multi-AZ setup)
  • AWS PrivateLink for private connectivity with AWS
  • Kafka clients running in AWS EKS (multi-AZ)

Cross-AZ data transfer costs

Be aware that if a Kafka broker node happens to be running in a different AZ than its Kafka clients, additional data transfer charges will apply for cross-AZ traffic.

Kafka has the concept of racks, which allows co-locating Kafka clients and broker nodes. More details about this setting in the context of AWS and Confluent can be found here: https://docs.confluent.io/cloud/current/networking/fetch-from-follower.html

Data transfer costs within AZ

But even if we manage to keep connections within the same AZ, is consuming data from Kafka free?

Imagine an architecture in which a single topic contains data dedicated to multiple consumers. Every consumer reads only the relevant data and filters out (ignores) other messages. Sounds straightforward, but we need to be aware that in order to filter data, each consumer first needs to read the message. So even irrelevant data creates traffic from the broker to the clients.

Kafka does not support filtering on the broker side. There is an open feature request for that.

If we have a lot of consumers, we will have a lot of outbound traffic (topic throughput × number of consumers). Having additional infrastructure like AWS PrivateLink in the path of such traffic will generate extra costs.

Extreme scenario - generating costs for nothing

Another interesting scenario is implementing a retry policy for failed message processing, for example when every message needs to be delivered to an endpoint which is down. If the Kafka consumer retries very aggressively (for example every second, or even worse in an infinite loop), and every retry is a new read from the topic, we can easily generate a lot of reads.

We may be fooled by the fact that most documentation states that reading from Kafka is very efficient, as it is basically sequential reading from a log. From the broker-cost perspective, multiple consumers are not a significant cost factor compared to things like written data volume, but we still need to be mindful of the data transfer costs that may apply to reads. Confluent charges $0.05/GB for egress traffic. Total costs may grow quickly in a busy cluster with active producers and multiple reads of every message.
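
For illustration only (hypothetical numbers): a topic with 1 MB/s of throughput read in full by 10 consumers generates about 10 MB/s of egress, roughly 26 TB per month; at $0.05/GB that is on the order of $1,300 per month, before any PrivateLink data processing charges.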

Schema Definition Formats

· 4 min read
Paweł Mantur
Author | Architect | Engineer
AI Friend
Assistant

Schema Definition Formats: JSON Schema, Avro, and Protocol Buffers

In data management, maintaining a specific structure is key for consistency and interoperability. Three popular schema formats are JSON Schema, Avro, and Protocol Buffers. Each has unique features and use cases. Let's explore their strengths and applications.

JSON Schema

Overview: JSON Schema is a powerful tool for validating the structure of JSON data. It allows you to define the expected format, type, and constraints of JSON documents, ensuring that the data adheres to a predefined schema.

Key Features:

  • Validation: JSON Schema provides a robust mechanism for validating JSON data against a schema. This helps in catching errors early and ensuring data integrity.
  • Documentation: The schema itself serves as a form of documentation, making it easier for developers to understand the expected structure of the data.
  • Interoperability: JSON Schema is widely supported across various programming languages and platforms, making it a versatile choice for many applications.

Use Cases:

  • API Validation: Ensuring that the data exchanged between client and server adheres to a specific format.
  • Configuration Files: Validating configuration files to ensure they meet the required structure and constraints.
  • Data Exchange: Facilitating data exchange between different systems by providing a clear contract for the data format.

Example:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Product",
  "type": "object",
  "properties": {
    "id": {
      "type": "integer"
    },
    "name": {
      "type": "string"
    },
    "price": {
      "type": "number"
    }
  },
  "required": ["id", "name", "price"]
}
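
To show validation in practice, here is a small sketch using the Python jsonschema package against the Product schema above:

# Sketch: validating a document against the Product schema with the "jsonschema" package.
from jsonschema import ValidationError, validate

product_schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Product",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["id", "name", "price"],
}

try:
    validate(instance={"id": 1, "name": "Laptop", "price": 999.99}, schema=product_schema)
    print("Document is valid")
except ValidationError as err:
    print("Validation failed:", err.message)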

Avro

Overview: Avro is a data serialization system that provides a compact, fast, and efficient format for data exchange. It is particularly well-suited for big data applications and is a key component of the Apache Hadoop ecosystem.

Key Features:

  • Compact Serialization: Avro uses a binary format for data serialization, which is more compact and efficient compared to text-based formats like JSON.
  • Schema Evolution: Avro supports schema evolution, allowing you to update the schema without breaking compatibility with existing data.
  • Interoperability: Avro schemas are defined using JSON, making them easy to read and understand. The binary format ensures efficient data storage and transmission.

Use Cases:

  • Big Data: Avro is widely used in big data applications, particularly within the Hadoop ecosystem, for efficient data storage and processing.
  • Data Streaming: Avro is commonly used in data streaming platforms like Apache Kafka for efficient data serialization and deserialization.
  • Inter-Service Communication: Facilitating communication between microservices by providing a compact and efficient data format.

Example:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
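
A quick binary round trip with the Python fastavro library (one of several Avro implementations), using the User schema above:

# Sketch: Avro binary round trip with "fastavro"; no schema is embedded in the payload.
import io
from fastavro import parse_schema, schemaless_reader, schemaless_writer

user_schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
    ],
})

buffer = io.BytesIO()
schemaless_writer(buffer, user_schema, {"id": 1, "name": "Alice", "email": "alice@example.com"})
buffer.seek(0)
print(schemaless_reader(buffer, user_schema))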

Protocol Buffers (Protobuf)

Overview: Protocol Buffers, developed by Google, is a language-neutral, platform-neutral, extensible mechanism for serializing structured data. It is known for its efficiency and performance.

Key Features:

  • Compact and Efficient: Protobuf uses a binary format that is both compact and efficient, making it suitable for high-performance applications.
  • Language Support: Protobuf supports multiple programming languages, including Java, C++, and Python.
  • Schema Evolution: Protobuf supports backward and forward compatibility, allowing for schema evolution without breaking existing data.

Use Cases:

  • Inter-Service Communication: Commonly used in microservices architectures for efficient data exchange.
  • Data Storage: Suitable for storing structured data in a compact format.
  • RPC Systems: Often used in Remote Procedure Call (RPC) systems like gRPC.

Example:

syntax = "proto3";

message Person {
  int32 id = 1;
  string name = 2;
  string email = 3;
}
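
Assuming the message above lives in a file called person.proto, protoc generates language bindings (for Python, a person_pb2 module) that can be used like this:

# Sketch: using Python bindings generated by protoc from the message above.
# Assumes: protoc --python_out=. person.proto   (produces person_pb2.py)
import person_pb2

person = person_pb2.Person(id=1, name="Alice", email="alice@example.com")
data = person.SerializeToString()   # compact binary encoding

decoded = person_pb2.Person()
decoded.ParseFromString(data)
print(decoded.name, decoded.email)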

Conclusion

JSON Schema, Avro, and Protocol Buffers are all powerful tools for managing data schemas, each with its own strengths. JSON Schema excels in validation and documentation, making it ideal for APIs and configuration files. Avro provides efficient serialization and schema evolution, making it a preferred choice for big data and streaming applications. Protocol Buffers offer compact and efficient serialization, making them suitable for high-performance applications and inter-service communication. Understanding the strengths and use cases of each format can help you choose the right tool for your specific needs.

Change Data Capture and Debezium

· 4 min read
Paweł Mantur
Author | Architect | Engineer
AI Friend
Assistant

Understanding Change Data Capture (CDC) and Debezium

In today's data-driven world, keeping track of changes in data is crucial for maintaining data integrity, enabling real-time analytics, and ensuring seamless data integration across systems. Change Data Capture (CDC) is a powerful technique that addresses this need by capturing and tracking changes in data as they occur. One of the most popular tools for implementing CDC is Debezium. This blog article delves into the concept of CDC, its importance, and how Debezium can be used to implement it effectively.

What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a process that identifies and captures changes made to data in a database. These changes can include inserts, updates, and deletes. Once captured, the changes can be propagated to other systems or used for various purposes such as data warehousing, real-time analytics, and data synchronization.

Why is CDC Important?

  1. Real-Time Data Integration:

    • CDC enables real-time data integration by capturing changes as they happen and propagating them to other systems. This ensures that all systems have the most up-to-date information.
  2. Efficient Data Processing:

    • By capturing only the changes rather than the entire dataset, CDC reduces the amount of data that needs to be processed and transferred. This leads to more efficient data processing and reduced latency.
  3. Data Consistency:

    • CDC helps maintain data consistency across different systems by ensuring that changes made in one system are reflected in others. This is particularly important in distributed systems and microservices architectures.
  4. Historical Data Analysis:

    • CDC allows for the capture of historical changes, enabling organizations to perform trend analysis and understand how data has evolved over time.

Introducing Debezium

Debezium is an open-source CDC tool that supports various databases such as MySQL, PostgreSQL, MongoDB, and more. It reads changes from transaction logs and streams them to other systems, making it a powerful tool for implementing CDC.

Key Features of Debezium:

  • Wide Database Support: Debezium supports multiple databases, making it versatile and suitable for various environments.
  • Kafka Integration: Debezium integrates seamlessly with Apache Kafka, allowing for efficient streaming of changes.
  • Schema Evolution: Debezium handles schema changes gracefully, ensuring that changes in the database schema do not disrupt data capture.
  • Real-Time Processing: Debezium captures and streams changes in real-time, enabling real-time data integration and analytics.

How Debezium Works

Debezium works by reading the transaction logs of the source database. These logs record all changes made to the data, including inserts, updates, and deletes. Debezium connectors capture these changes and stream them to a Kafka topic. From there, the changes can be consumed by various applications or systems.

Steps to Implement CDC with Debezium:

  1. Set Up Kafka:

    • Install and configure Apache Kafka, which will be used to stream the changes captured by Debezium.
  2. Deploy Debezium Connectors:

    • Deploy Debezium connectors for the source databases. Each connector is responsible for capturing changes from a specific database.
  3. Configure Connectors:

    • Configure the connectors with the necessary settings, such as the database connection details and the Kafka topic to which the changes should be streamed.
  4. Consume Changes:

    • Set up consumers to read the changes from the Kafka topics and process them as needed. This could involve updating a data warehouse, triggering real-time analytics, or synchronizing data across systems.

Example Configuration

Here is a basic example of configuring a Debezium connector for a MySQL database:

{
  "name": "mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "fullfillment",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

In this configuration:

  • connector.class specifies the Debezium connector class for MySQL.
  • database.hostname, database.port, database.user, and database.password provide the connection details for the MySQL database.
  • database.server.name is a logical name for the database server.
  • database.include.list specifies the databases to capture changes from.
  • database.history.kafka.bootstrap.servers and database.history.kafka.topic configure the Kafka settings for storing schema history.
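
To complete the picture (step 4 above), here is a minimal consumer sketch using the Python confluent-kafka client; the topic name follows Debezium's <server.name>.<database>.<table> convention and is illustrative for the connector configured above.

# Sketch: consuming Debezium change events with the confluent-kafka Python client.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "inventory-sync",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["fullfillment.inventory.customers"])  # illustrative topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Debezium events carry "before"/"after" row images and an operation code ("op").
        payload = event.get("payload", event)
        print(payload.get("op"), payload.get("after"))
finally:
    consumer.close()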

Conclusion

Change Data Capture (CDC) is a vital technique for modern data management, enabling real-time data integration, efficient data processing, and maintaining data consistency across systems. Debezium is a powerful open-source tool for implementing CDC, offering wide database support, seamless Kafka integration, and real-time processing capabilities. By leveraging Debezium, organizations can capture and propagate data changes effectively, ensuring that their systems are always up-to-date and ready for real-time analytics and decision-making.