Design session with ChatGPT-5
Introduction
This article documents the outcomes of an AI-assisted architectural design session focused on building a real-time data unification platform. The challenge: handling heterogeneous data sources (MongoDB documents with varying schemas and Kafka event streams) and transforming them into a consistent canonical model for operational applications, while also providing analytical insights.
The full session transcript can be found here: shared link on chatgpt.com.
I wasn't originally planning to publish it, but the results turned out to be quite insightful, so I asked Claude Code to review the transcript and turn it into a blog article.
I wanted to use ChatGPT for the blog summary as well, but I had run out of ChatGPT-5 credits 😅.
So here we are: Claude has prepared a summary of ChatGPT's work, a nice piece of cooperation under my supervision 😎
The Problem Statement
Primary Challenge: Build a system that can:
- Discover schemas from heterogeneous MongoDB documents and Kafka events
- Map disparate data formats to a canonical model in real-time
- Serve both operational applications (latest state + events) and analytical workloads
- Support both cloud-scale deployment and local development environments
Key Requirements:
- Real-time canonicalization for operational apps (primary use case)
- Visual schema mapping for simple transformations
- Code-based approach for complex logic and historical backfills
- Support for analytical metrics with near-real-time updates
- Enterprise-ready architecture with pet-project-friendly local development
Architecture Evolution Through Design Session
Phase 1: AWS Managed Services Approach
Initial exploration focused on AWS managed services:
- AWS Glue Data Catalog + Glue Crawlers for schema discovery
- AWS Glue Streaming Jobs for real-time canonicalization
- Amazon DMS for MongoDB CDC
- Amazon DocumentDB for materialized canonical state
Key Insight: While powerful, this approach created vendor lock-in and a complex local development setup.
Phase 2: Open Source + Kubernetes Native
Pivoted to open-source alternatives running on Kubernetes:
- Apache Kafka (via Strimzi operator) for event streaming
- Debezium for MongoDB change data capture
- Apache Flink for stream processing
- Confluent Schema Registry or Apicurio Registry for schema governance
Key Insight: Provides portability across cloud providers while maintaining enterprise capabilities.
Phase 3: Visual + Code-First Hybrid
Added visual tooling for business users while maintaining developer control:
- Apache Hop for drag-and-drop schema mapping (simple cases)
- DataHub for schema discovery and cataloging
- Apache Flink for complex transformations and stateful processing
Key Insight: Hybrid approach balances accessibility for analysts with full control for engineers.
Final Architecture Design
Core Technology Stack
Component | Purpose | Local (Docker) | Cloud (Managed) |
---|---|---|---|
Data Sources | Source systems | MongoDB Docker | MongoDB Atlas |
Change Data Capture | Stream changes | Debezium Docker | Debezium on EKS |
Event Streaming | Message backbone | Kafka Docker | Amazon MSK / Confluent Cloud |
Schema Management | Schema governance | Schema Registry Docker | Managed Schema Registry |
Stream Processing | Real-time transforms | Flink Docker | Flink on EKS |
Canonical State | Latest entity state | MongoDB Docker | MongoDB Atlas |
Metrics Store | Analytics | Postgres/TimescaleDB | ClickHouse/Druid Managed |
Schema Discovery | Data cataloging | DataHub Docker | DataHub on EKS |
Visual Mapping | Business user tools | Apache Hop Docker | Apache Hop on EKS |
Architecture Diagrams
Local Development Setup
Cloud Production Setup
Green components indicate identical code/logic between environments
Key Architectural Decisions
1. Unified PyFlink Runtime
Decision: Use Apache Flink for both streaming and batch processing with PyFlink as the primary language.
Rationale:
- Single codebase for streaming metrics and historical backfills
- Same team can maintain both pipeline types
- Flink's unified API supports both bounded (batch) and unbounded (streaming) sources (see the sketch below)
- Python expertise more common than Java/Scala in data teams
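To make the unified runtime concrete, here is a minimal sketch of the mode switch using Flink's Table API, assuming the Kafka SQL connector is on the job's classpath; the table definition, topic name, and field names are illustrative and not part of the original session:

from pyflink.table import EnvironmentSettings, TableEnvironment

def build_table_env(batch_mode: bool) -> TableEnvironment:
    # The backfill and the real-time job differ only in execution mode;
    # all transformations downstream of the sources are shared.
    settings = (EnvironmentSettings.in_batch_mode() if batch_mode
                else EnvironmentSettings.in_streaming_mode())
    return TableEnvironment.create(settings)

t_env = build_table_env(batch_mode=False)  # flip to True for a historical backfill

# Streaming source over the canonical topic (illustrative schema and topic name)
t_env.execute_sql("""
    CREATE TABLE canonical_orders (
        order_id STRING,
        region   STRING,
        amount   DOUBLE,
        event_ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'canonical.order.v1',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'metrics-job',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

The backfill job would flip batch_mode=True and point the DDL at a bounded source (for example the canonical MongoDB store), while every downstream transformation stays untouched.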
2. Schema-First Approach
Decision: Canonical schemas live in Schema Registry; raw schemas discovered dynamically.
Architecture:
Benefits:
- Separates discovery (raw, evolving) from governance (canonical, stable)
- Visual tools can operate on discovered schemas
- Code generation possible from mapping specifications
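To illustrate the governed half of this split, a canonical schema could be registered explicitly against the registry. This is a sketch using the confluent-kafka Python client; the registry URL, subject name, and schema body are assumptions for illustration only:

from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Curated canonical contract, versioned in the registry
# (JSON Schema here; Avro or Protobuf work the same way)
canonical_order_schema = """{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer": {"type": "object"},
        "line_items": {"type": "array"}
    },
    "required": ["order_id"]
}"""

schema_id = client.register_schema("canonical.order.v1-value",
                                   Schema(canonical_order_schema, "JSON"))
print(f"registered canonical.order.v1 as schema id {schema_id}")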
3. Tiered Metrics Store Strategy
Decision: Start with Postgres/TimescaleDB locally, migrate to ClickHouse/Druid for production.
Local Setup:
- Postgres/TimescaleDB for metrics storage
- Simple SQL queries for analytics
- Minimal infrastructure cost
Production Setup:
- ClickHouse or Druid managed services
- Columnar storage for high-cardinality dimensions
- Horizontal scaling capabilities
Migration Path: swap only the Flink sink configuration; no application code changes are required (see the sketch below).
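A sketch of what that swap could look like, with the sink DDL generated from per-environment connector options; the ClickHouse entry assumes a ClickHouse JDBC driver is available to Flink, and table and column names are illustrative:

# Per-environment sink options; the aggregation logic never changes.
# Requires the flink-connector-jdbc jar plus the matching JDBC driver.
SINK_OPTIONS = {
    "local": {
        "connector": "jdbc",
        "url": "jdbc:postgresql://localhost:5432/metrics",
        "table-name": "order_metrics",
    },
    "prod": {
        "connector": "jdbc",  # assumes a ClickHouse JDBC driver on the classpath
        "url": "jdbc:clickhouse://clickhouse.internal:8123/metrics",
        "table-name": "order_metrics",
    },
}

def create_metrics_sink(t_env, env_name: str):
    # Render the WITH clause from the selected environment's options
    options = ",\n".join(f"'{k}' = '{v}'" for k, v in SINK_OPTIONS[env_name].items())
    t_env.execute_sql(f"""
        CREATE TABLE metrics_sink (
            region       STRING,
            product      STRING,
            window_end   TIMESTAMP(3),
            order_count  BIGINT,
            total_amount DOUBLE
        ) WITH (
            {options}
        )
    """)

Switching environments is then a matter of passing env_name='prod' to the same job.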
4. Hybrid Mapping Approach
Decision: Visual tools (Apache Hop) for simple mappings, code (Flink) for complex logic.
Implementation:
- Apache Hop generates mapping specifications (JSON)
- Flink jobs can import and execute these specifications
- Complex stateful logic remains in hand-written Flink code
- Same mapping definitions drive both simple and advanced pipelines
Technical Implementation Details
Schema Discovery Algorithm
Streaming Schema Inference:
# Simplified PyFlink schema profiler logic (helper functions are illustrative)
def infer_schema_from_stream(raw_topic):
    field_stats = {}          # field path -> occurrence count
    type_distributions = {}   # field path -> set of observed types

    # Sample recent messages from the raw topic and profile every field path
    for message in sample_messages(raw_topic, window="1h"):
        for field_path, value in flatten_json(message):
            field_stats[field_path] = field_stats.get(field_path, 0) + 1
            detected_type = infer_type(value)
            type_distributions.setdefault(field_path, set()).add(detected_type)

    # Generate a JSON Schema with profiling metadata and publish it
    inferred_schema = build_json_schema(field_stats, type_distributions)
    register_schema(schema_registry, f"raw.{raw_topic}.inferred", inferred_schema)
    push_profiling_stats(datahub, field_stats)
Unified Metrics Computation
Shared Aggregation Logic:
# PyFlink-style aggregation shared by streaming and batch jobs
# (the window/aggregate calls are schematic, not the literal DataStream API)
class MetricAggregator:
    def compute_metrics(self, events, dimensions, time_window):
        return events \
            .key_by(lambda e: extract_dimensions(e, dimensions)) \
            .window(time_window) \
            .aggregate(
                count=lambda events: len(events),
                sum=lambda events: sum(e.amount for e in events),
                avg=lambda events: sum(e.amount for e in events) / len(events),
                min=lambda events: min(e.amount for e in events),
                max=lambda events: max(e.amount for e in events)
            )

# Streaming version: unbounded Kafka source, hourly tumbling windows
streaming_metrics = MetricAggregator().compute_metrics(
    events=kafka_source.map(deserialize_canonical_event),
    dimensions=['region', 'product'],
    time_window=TumblingWindow.hours(1))

# Batch version: bounded MongoDB source, single global window
batch_metrics = MetricAggregator().compute_metrics(
    events=mongodb_source.map(convert_to_canonical_event),
    dimensions=['region', 'product'],
    time_window=GlobalWindow())
Mapping Specification Format
Generated by Apache Hop, Consumed by Flink:
{
  "mappingId": "order-canonicalization-v1",
  "source": {
    "type": "kafka",
    "topic": "raw.mongo.orders",
    "schemaId": 1234
  },
  "target": {
    "subject": "canonical.order.v1",
    "schemaId": 5678
  },
  "fieldMappings": [
    {
      "src": "$.orderId",
      "dst": "$.order_id",
      "transform": "string"
    },
    {
      "src": "$.customer.email",
      "dst": "$.customer.email",
      "transform": "lowercase"
    },
    {
      "src": "$.items[*].sku",
      "dst": "$.line_items[*].sku",
      "transform": "identity"
    }
  ],
  "defaults": {
    "currency": "USD",
    "created_by": "system"
  }
}
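A specification like the one above could be executed generically inside a Flink map function. The following is a minimal interpreter sketch in plain Python; it handles only simple dotted paths, so array wildcards (such as $.items[*].sku) and schema validation against the registry are deliberately left out:

# Minimal interpreter for the mapping spec above (simple "$.a.b" paths only)
TRANSFORMS = {
    "identity": lambda v: v,
    "string": lambda v: str(v),
    "lowercase": lambda v: str(v).lower(),
}

def get_path(doc, path):
    # Resolve a dotted source path like "$.customer.email"
    value = doc
    for key in path.lstrip("$.").split("."):
        value = value.get(key) if isinstance(value, dict) else None
    return value

def set_path(doc, path, value):
    # Create intermediate objects as needed for the destination path
    keys = path.lstrip("$.").split(".")
    for key in keys[:-1]:
        doc = doc.setdefault(key, {})
    doc[keys[-1]] = value

def apply_mapping(spec, raw_event):
    canonical = dict(spec.get("defaults", {}))
    for rule in spec["fieldMappings"]:
        value = get_path(raw_event, rule["src"])
        if value is not None:
            transform = TRANSFORMS.get(rule["transform"], TRANSFORMS["identity"])
            set_path(canonical, rule["dst"], transform(value))
    return canonical

In practice this would run per record in the canonicalization job, with the spec loaded at job start or refreshed from a control topic.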
Key Design Outcomes
1. Technology Consolidation
Achievement: Reduced from 5+ different technologies to 3 core components:
- Apache Flink (PyFlink): Unified streaming + batch processing
- Kafka + Schema Registry: Event backbone + schema governance
- DataHub + Apache Hop: Visual discovery + mapping
2. Development Experience Optimization
Achievement: Single team can maintain entire pipeline:
- Same Python codebase for streaming and batch jobs
- Visual tools for analysts, code for engineers
- Consistent deployment patterns (Kubernetes operators)
- Shared testing and monitoring approaches
3. Scalability Path
Achievement: Clear migration from pet project to enterprise:
- Local: Docker containers, minimal cost
- Production: Managed services (MSK, MongoDB Atlas, ClickHouse Cloud)
- Code Reuse: 95% of application logic unchanged between environments
4. Schema Governance Model
Achievement: Separation of concerns for schema management:
- Raw schemas: Discovered automatically, versioned, exploratory
- Canonical schemas: Curated, governed, stable contracts
- Mapping specifications: Version controlled, testable, auditable
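One way this split could be enforced at the registry level is via per-subject compatibility settings. This is a sketch with the confluent-kafka Python client (exact calls may differ by client version and registry flavour); the subject names mirror the examples above and are illustrative:

from confluent_kafka.schema_registry import SchemaRegistryClient

client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Canonical contracts must evolve without breaking consumers
client.set_compatibility(subject_name="canonical.order.v1-value", level="BACKWARD")

# Automatically inferred raw schemas are exploratory; don't block discovery
client.set_compatibility(subject_name="raw.mongo.orders.inferred-value", level="NONE")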
Lessons Learned
1. AI-Assisted Architecture Design
Observation: The iterative conversation with AI helped explore trade-offs systematically:
- Started broad (AWS managed services)
- Refined constraints (open source, Kubernetes)
- Optimized for specific requirements (single language, visual + code)
2. Importance of Local Development Parity
Insight: Having identical architecture locally and in production dramatically improves:
- Developer productivity
- Testing confidence
- Debugging capabilities
- Team onboarding
3. Visual Tools + Code Hybrid
Learning: Pure visual ETL tools don't scale for complex logic, but they excel for:
- Schema exploration
- Simple field mappings
- Business user empowerment
- Rapid prototyping
Complex stateful processing, joins, and business logic still require code.
4. Schema Registry as Integration Hub
Realization: Schema Registry becomes the contract layer between:
- Discovery tools and mapping tools
- Mapping tools and execution engines
- Different team roles (analysts vs engineers)
- Development and production environments
Implementation Recommendations
Phase 1: Foundation
- Set up local Kubernetes environment with Kafka + Schema Registry + MongoDB
- Implement basic Debezium CDC pipeline (see the sketch after this list)
- Create minimal PyFlink job for schema discovery
- Deploy DataHub for schema cataloging
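For the CDC step, the Debezium MongoDB connector could be registered against the local Kafka Connect instance over its REST API. This is a sketch assuming Debezium 2.x property names; the connection string, topic prefix, and collection filter are placeholders:

import requests  # assumes the requests library is installed

connector_config = {
    "name": "mongo-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongodb:27017/?replicaSet=rs0",
        "topic.prefix": "raw.mongo",
        "collection.include.list": "shop.orders",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    },
}

# Register the connector with the local Kafka Connect REST endpoint
resp = requests.post("http://localhost:8083/connectors",
                     json=connector_config, timeout=10)
resp.raise_for_status()
print("connector registered:", resp.json()["name"])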
Phase 2: Visual Mapping
- Deploy Apache Hop locally
- Integrate Hop with Schema Registry and DataHub
- Create sample mapping specifications
- Implement Flink job that consumes mapping specs
Phase 3: Metrics Pipeline
- Extend Flink jobs for time-windowed aggregations
- Set up Postgres/TimescaleDB metrics store
- Implement batch backfill logic using same PyFlink codebase
Phase 4 (hypothetical): Production Readiness
- Provision infrastructure on AWS and MongoDB Atlas
- Provision and secure DataHub, Apache Hop, and Schema Registry
- Add monitoring and alerting
- Implement CI/CD for Flink jobs and mapping specs
- Performance testing and optimization
Conclusion
This AI-assisted design session demonstrated how iterative architectural exploration can lead to well-defined solutions that balance multiple constraints. The final architecture achieves:
- Unified development experience through PyFlink for all processing
- Visual accessibility via Apache Hop for business users
- Enterprise scalability through managed cloud services
- Cost-effective development via local Docker environment
- Schema governance through systematic discovery and canonicalization
The hybrid approach of visual tools for simple cases and code for complex logic, combined with a clear local-to-cloud migration path, creates a sustainable architecture that can grow from pet project to enterprise-scale deployment.