
Design session with ChatGPT-5

Paweł Mantur · Author | Architect | Engineer · 10 min read
Assistants: ChatGPT, Claude.ai

Introduction

This article documents the outcomes of an AI-assisted architectural design session focused on building a real-time data unification platform. The challenge: handling heterogeneous data sources (MongoDB documents with varying schemas and Kafka event streams) and transforming them into a consistent canonical model for operational applications, while also providing analytical insights.

info

The full session transcript can be found here: shared link on chatgpt.com.
I was not originally planning to publish it, but the results are quite insightful, so I asked Claude Code to take a look and turn the session into a blog article.
I wanted to use ChatGPT for the blog summary as well, but I had run out of ChatGPT-5 credits 😅.

So here we are: Claude has prepared a summary of ChatGPT's work, a good cooperation under my supervision 😎

The Problem Statement

Primary Challenge: Build a system that can:

  • Discover schemas from heterogeneous MongoDB documents and Kafka events
  • Map disparate data formats to a canonical model in real-time
  • Serve both operational applications (latest state + events) and analytical workloads
  • Support both cloud-scale deployment and local development environments

Key Requirements:

  • Real-time canonicalization for operational apps (primary use case)
  • Visual schema mapping for simple transformations
  • Code-based approach for complex logic and historical backfills
  • Support for analytical metrics with near-real-time updates
  • Enterprise-ready architecture with pet-project-friendly local development

Architecture Evolution Through Design Session

Phase 1: AWS Managed Services Approach

Initial exploration focused on AWS managed services:

  • AWS Glue Data Catalog + Glue Crawlers for schema discovery
  • AWS Glue Streaming Jobs for real-time canonicalization
  • Amazon DMS for MongoDB CDC
  • Amazon DocumentDB for materialized canonical state

Key Insight: While powerful, this approach created vendor lock-in and a complex local development setup.

Phase 2: Open Source + Kubernetes Native

Pivoted to open-source alternatives running on Kubernetes:

  • Apache Kafka (via Strimzi operator) for event streaming
  • Debezium for MongoDB change data capture
  • Apache Flink for stream processing
  • Confluent Schema Registry or Apicurio Registry for schema governance

Key Insight: Provides portability across cloud providers while maintaining enterprise capabilities.

Phase 3: Visual + Code-First Hybrid

Added visual tooling for business users while maintaining developer control:

  • Apache Hop for drag-and-drop schema mapping (simple cases)
  • DataHub for schema discovery and cataloging
  • Apache Flink for complex transformations and stateful processing

Key Insight: Hybrid approach balances accessibility for analysts with full control for engineers.

Final Architecture Design

Core Technology Stack

| Component | Purpose | Local (Docker) | Cloud (Managed) |
|---|---|---|---|
| Data Sources | Source systems | MongoDB | MongoDB Atlas |
| Change Data Capture | Stream changes | Debezium | Debezium on EKS |
| Event Streaming | Message backbone | Kafka | Amazon MSK / Confluent Cloud |
| Schema Management | Schema governance | Schema Registry | Managed Schema Registry |
| Stream Processing | Real-time transforms | Flink | Flink on EKS |
| Canonical State | Latest entity state | MongoDB | MongoDB Atlas |
| Metrics Store | Analytics | Postgres/TimescaleDB | ClickHouse/Druid (managed) |
| Schema Discovery | Data cataloging | DataHub | DataHub on EKS |
| Visual Mapping | Business user tools | Apache Hop | Apache Hop on EKS |

Architecture Diagrams

Local Development Setup (diagram)

Cloud Production Setup (diagram)

Green components in the diagrams indicate identical code/logic between environments.

Key Architectural Decisions

1. Unified Stream and Batch Processing with PyFlink

Decision: Use Apache Flink for both streaming and batch processing, with PyFlink as the primary language.

Rationale:

  • Single codebase for streaming metrics and historical backfills
  • Same team can maintain both pipeline types
  • Flink's unified API supports both bounded (batch) and unbounded (streaming) sources
  • Python expertise more common than Java/Scala in data teams
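
To make the rationale concrete, here is a minimal sketch of how a single PyFlink Table API job definition can run in either streaming or batch execution mode. The table names, query, and source details are illustrative assumptions, not artifacts of the design session.

# Minimal sketch: one PyFlink job definition, two execution modes.
# Source DDL, table names, and the query are illustrative assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

def build_table_env(batch: bool) -> TableEnvironment:
    # Only the execution mode differs; the pipeline definition is shared.
    settings = (EnvironmentSettings.in_batch_mode() if batch
                else EnvironmentSettings.in_streaming_mode())
    return TableEnvironment.create(settings)

def define_metrics_query(t_env: TableEnvironment, source_ddl: str):
    # source_ddl points at Kafka (unbounded) for streaming,
    # or at a bounded source (e.g. filesystem) for backfills.
    t_env.execute_sql(source_ddl)
    return t_env.sql_query("""
        SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM orders
        GROUP BY region
    """)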

2. Schema-First Approach

Decision: Canonical schemas live in Schema Registry; raw schemas discovered dynamically.

Architecture: (diagram)

Benefits:

  • Separates discovery (raw, evolving) from governance (canonical, stable)
  • Visual tools can operate on discovered schemas
  • Code generation possible from mapping specifications
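
As one possible illustration of the schema-first flow, the sketch below registers a curated canonical JSON Schema in a Confluent-compatible Schema Registry using the confluent-kafka Python client. The subject name, registry URL, and schema body are assumptions for a local setup, not values taken from the session.

# Sketch: registering a curated canonical schema (local registry URL assumed).
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

CANONICAL_ORDER_SCHEMA = """
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "CanonicalOrder",
  "type": "object",
  "properties": {
    "order_id": {"type": "string"},
    "customer": {"type": "object", "properties": {"email": {"type": "string"}}},
    "line_items": {"type": "array", "items": {"type": "object"}}
  },
  "required": ["order_id"]
}
"""

client = SchemaRegistryClient({"url": "http://localhost:8081"})
schema_id = client.register_schema(
    subject_name="canonical.order.v1-value",  # governed, stable contract
    schema=Schema(CANONICAL_ORDER_SCHEMA, schema_type="JSON"),
)
print(f"Registered canonical schema with id {schema_id}")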

3. Tiered Metrics Store Strategy

Decision: Start with Postgres/TimescaleDB locally, migrate to ClickHouse/Druid for production.

Local Setup:

  • Postgres/TimescaleDB for metrics storage
  • Simple SQL queries for analytics
  • Minimal infrastructure cost

Production Setup:

  • ClickHouse or Druid managed services
  • Columnar storage for high-cardinality dimensions
  • Horizontal scaling capabilities

Migration Path: Flink sink configuration swap only - no application code changes required.
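
The sketch below shows what such a sink swap could look like when the sink table is defined through Flink SQL DDL generated from Python. The local variant targets Postgres/TimescaleDB via Flink's JDBC connector; the production variant points the same table definition at ClickHouse (through a JDBC driver or a dedicated ClickHouse connector, depending on what is deployed). All connector options and names are illustrative assumptions.

# Sketch: the metrics sink is a configuration detail, not application logic.
# Connector options and hostnames are assumptions for illustration.

def metrics_sink_ddl(env: str) -> str:
    if env == "local":
        # Postgres/TimescaleDB via Flink's JDBC connector
        options = {
            "connector": "jdbc",
            "url": "jdbc:postgresql://localhost:5432/metrics",
            "table-name": "order_metrics",
        }
    else:
        # Production: same table, different endpoint (ClickHouse JDBC driver
        # or a dedicated ClickHouse connector, whichever is available).
        options = {
            "connector": "jdbc",
            "url": "jdbc:clickhouse://clickhouse.internal:8123/metrics",
            "table-name": "order_metrics",
        }
    with_clause = ", ".join(f"'{k}' = '{v}'" for k, v in options.items())
    return f"""
        CREATE TABLE order_metrics (
            region STRING,
            window_start TIMESTAMP(3),
            order_count BIGINT,
            total_amount DOUBLE
        ) WITH ({with_clause})
    """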

4. Hybrid Mapping Approach

Decision: Visual tools (Apache Hop) for simple mappings, code (Flink) for complex logic.

Implementation:

  • Apache Hop generates mapping specifications (JSON)
  • Flink jobs can import and execute these specifications
  • Complex stateful logic remains in hand-written Flink code
  • Same mapping definitions drive both simple and advanced pipelines

Technical Implementation Details

Schema Discovery Algorithm

Streaming Schema Inference:

# Simplified schema profiler logic (sample_messages, flatten_json, infer_type,
# build_json_schema and the schema_registry / datahub clients are placeholders)
def infer_schema_from_stream(raw_topic):
    field_stats = {}          # how often each field path appears
    type_distributions = {}   # which JSON types were observed per field

    for message in sample_messages(raw_topic, window="1h"):
        for field_path, value in flatten_json(message):
            field_stats[field_path] = field_stats.get(field_path, 0) + 1
            detected_type = infer_type(value)
            type_distributions.setdefault(field_path, set()).add(detected_type)

    # Generate a JSON Schema with profiling metadata
    inferred_schema = build_json_schema(field_stats, type_distributions)
    register_schema(schema_registry, f"raw.{raw_topic}.inferred", inferred_schema)
    push_profiling_stats(datahub, field_stats)

Unified Metrics Computation

Shared Aggregation Logic:

# Aggregation logic shared by streaming and batch jobs
# (simplified pseudocode over a PyFlink-style API)
class MetricAggregator:
    def compute_metrics(self, events, dimensions, time_window):
        return events \
            .key_by(lambda e: extract_dimensions(e, dimensions)) \
            .window(time_window) \
            .aggregate(
                count=lambda evts: len(evts),
                sum=lambda evts: sum(e.amount for e in evts),
                avg=lambda evts: sum(e.amount for e in evts) / len(evts),
                min=lambda evts: min(e.amount for e in evts),
                max=lambda evts: max(e.amount for e in evts),
            )

# Streaming version: unbounded Kafka source, tumbling windows
streaming_metrics = MetricAggregator().compute_metrics(
    events=kafka_source.map(deserialize_canonical_event),
    dimensions=["region", "product"],
    time_window=TumblingWindow.hours(1),
)

# Batch version: bounded MongoDB source, single global window
batch_metrics = MetricAggregator().compute_metrics(
    events=mongodb_source.map(convert_to_canonical_event),
    dimensions=["region", "product"],
    time_window=GlobalWindow(),
)

Mapping Specification Format

Generated by Apache Hop, Consumed by Flink:

{
  "mappingId": "order-canonicalization-v1",
  "source": {
    "type": "kafka",
    "topic": "raw.mongo.orders",
    "schemaId": 1234
  },
  "target": {
    "subject": "canonical.order.v1",
    "schemaId": 5678
  },
  "fieldMappings": [
    {
      "src": "$.orderId",
      "dst": "$.order_id",
      "transform": "string"
    },
    {
      "src": "$.customer.email",
      "dst": "$.customer.email",
      "transform": "lowercase"
    },
    {
      "src": "$.items[*].sku",
      "dst": "$.line_items[*].sku",
      "transform": "identity"
    }
  ],
  "defaults": {
    "currency": "USD",
    "created_by": "system"
  }
}
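
To show how such a specification could be consumed on the Flink side, here is a minimal, hedged sketch of a plain-Python mapper for the simple (non-wildcard) field mappings. The helper names and the restriction to scalar "$.a.b" paths are assumptions made for illustration.

# Sketch: applying the simple field mappings from a Hop-generated spec.
# Only scalar "$.a.b" paths are handled; wildcard paths like "$.items[*].sku"
# would need a JSONPath library or dedicated Flink code.

TRANSFORMS = {
    "identity": lambda v: v,
    "string": lambda v: str(v),
    "lowercase": lambda v: str(v).lower(),
}

def get_path(doc, path):
    # "$.customer.email" -> doc["customer"]["email"]
    value = doc
    for part in path.lstrip("$.").split("."):
        value = value.get(part) if isinstance(value, dict) else None
    return value

def set_path(doc, path, value):
    parts = path.lstrip("$.").split(".")
    for part in parts[:-1]:
        doc = doc.setdefault(part, {})
    doc[parts[-1]] = value

def apply_mapping(spec, raw_record):
    canonical = dict(spec.get("defaults", {}))
    for m in spec["fieldMappings"]:
        if "[*]" in m["src"]:
            continue  # wildcard mappings left to dedicated Flink code
        value = get_path(raw_record, m["src"])
        if value is not None:
            set_path(canonical, m["dst"], TRANSFORMS[m["transform"]](value))
    return canonical

# Example:
#   canonical = apply_mapping(spec_dict, {"orderId": 42,
#                                         "customer": {"email": "A@B.COM"}})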

Key Design Outcomes

1. Technology Consolidation

Achievement: Reduced from 5+ different technologies to 3 core components:

  • Apache Flink (PyFlink): Unified streaming + batch processing
  • Kafka + Schema Registry: Event backbone + schema governance
  • DataHub + Apache Hop: Visual discovery + mapping

2. Development Experience Optimization

Achievement: Single team can maintain entire pipeline:

  • Same Python codebase for streaming and batch jobs
  • Visual tools for analysts, code for engineers
  • Consistent deployment patterns (Kubernetes operators)
  • Shared testing and monitoring approaches

3. Scalability Path

Achievement: Clear migration from pet project to enterprise:

  • Local: Docker containers, minimal cost
  • Production: Managed services (MSK, MongoDB Atlas, ClickHouse Cloud)
  • Code Reuse: 95% of application logic unchanged between environments

4. Schema Governance Model

Achievement: Separation of concerns for schema management:

  • Raw schemas: Discovered automatically, versioned, exploratory
  • Canonical schemas: Curated, governed, stable contracts
  • Mapping specifications: Version controlled, testable, auditable

Lessons Learned

1. AI-Assisted Architecture Design

Observation: The iterative conversation with AI helped explore trade-offs systematically:

  • Started broad (AWS managed services)
  • Refined constraints (open source, Kubernetes)
  • Optimized for specific requirements (single language, visual + code)

2. Importance of Local Development Parity

Insight: Having identical architecture locally and in production dramatically improves:

  • Developer productivity
  • Testing confidence
  • Debugging capabilities
  • Team onboarding

3. Visual Tools + Code Hybrid

Learning: Pure visual ETL tools don't scale for complex logic, but they excel for:

  • Schema exploration
  • Simple field mappings
  • Business user empowerment
  • Rapid prototyping

Complex stateful processing, joins, and business logic still require code.

4. Schema Registry as Integration Hub

Realization: Schema Registry becomes the contract layer between:

  • Discovery tools and mapping tools
  • Mapping tools and execution engines
  • Different team roles (analysts vs engineers)
  • Development and production environments

Implementation Recommendations

Phase 1: Foundation

  1. Set up local Kubernetes environment with Kafka + Schema Registry + MongoDB
  2. Implement a basic Debezium CDC pipeline (see the sketch after this list)
  3. Create minimal PyFlink job for schema discovery
  4. Deploy DataHub for schema cataloging
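
As a starting point for the Debezium step above, the sketch below registers a MongoDB source connector through the Kafka Connect REST API. The connector property names follow recent Debezium releases, and the hostnames, database, and collection are local-development assumptions, so check them against the Debezium version actually deployed.

# Sketch: registering a Debezium MongoDB connector via the Kafka Connect REST API.
# Property names follow recent Debezium releases; hostnames are local assumptions.
import requests

connector = {
    "name": "mongo-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongodb:27017/?replicaSet=rs0",
        "topic.prefix": "raw.mongo",              # topics: raw.mongo.<db>.<collection>
        "collection.include.list": "shop.orders",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print("Connector created:", resp.json()["name"])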

Phase 2: Visual Mapping

  1. Deploy Apache Hop locally
  2. Integrate Hop with Schema Registry and DataHub
  3. Create sample mapping specifications
  4. Implement Flink job that consumes mapping specs

Phase 3: Metrics Pipeline

  1. Extend Flink jobs for time-windowed aggregations
  2. Set up Postgres/TimescaleDB metrics store (see the sketch after this list)
  3. Implement batch backfill logic using same PyFlink codebase
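
For the metrics store, a minimal TimescaleDB setup could look like the sketch below. The table and column names are assumptions chosen to line up with the earlier aggregation examples, and the connection string assumes a local Postgres instance with the TimescaleDB extension available.

# Sketch: creating a TimescaleDB hypertable for windowed metrics.
# Table, columns, and connection details are local-development assumptions.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE IF NOT EXISTS order_metrics (
    window_start TIMESTAMPTZ NOT NULL,
    region       TEXT        NOT NULL,
    product      TEXT        NOT NULL,
    order_count  BIGINT,
    total_amount DOUBLE PRECISION
);

SELECT create_hypertable('order_metrics', 'window_start', if_not_exists => TRUE);
"""

with psycopg2.connect("dbname=metrics user=postgres host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)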

Phase 4 (hypothetical): Production Readiness

  1. Provision infrastructure on AWS and MongoDB Atlas
  2. Provision and secure DataHub, Apache Hop, and Schema Registry
  3. Add monitoring and alerting
  4. Implement CI/CD for Flink jobs and mapping specs
  5. Performance testing and optimization

Conclusion

This AI-assisted design session demonstrated how iterative architectural exploration can lead to a well-defined solution that balances multiple constraints. The final architecture achieves:

  • Unified development experience through PyFlink for all processing
  • Visual accessibility via Apache Hop for business users
  • Enterprise scalability through managed cloud services
  • Cost-effective development via local Docker environment
  • Schema governance through systematic discovery and canonicalization

The hybrid approach of visual tools for simple cases and code for complex logic, combined with a clear local-to-cloud migration path, creates a sustainable architecture that can grow from pet project to enterprise-scale deployment.
