
Design session with ChatGPT-5

Paweł Mantur · Author | Architect | Engineer · 10 min read
Assistants: ChatGPT, Claude.ai

Introduction

This article documents the outcomes of an AI-assisted architectural design session focused on building a real-time data unification platform. The challenge: handling heterogeneous data sources (MongoDB documents with varying schemas and Kafka event streams) and transforming them into a consistent canonical model for operational applications, while also providing analytical insights.

info

The full session transcript can be found here: shared link on chatgpt.com.
I was not originally planning to publish it, but the results are quite insightful, so I asked Claude Code to take a look and turn the session into a blog article.
I wanted to use ChatGPT for the blog summary as well, but I had run out of ChatGPT-5 credits 😅.

So here we are: Claude has prepared a summary of ChatGPT's work, a good cooperation under my supervision 😎

The Problem Statement

Primary Challenge: Build a system that can:

  • Discover schemas from heterogeneous MongoDB documents and Kafka events
  • Map disparate data formats to a canonical model in real-time
  • Serve both operational applications (latest state + events) and analytical workloads
  • Support both cloud-scale deployment and local development environments

Key Requirements:

  • Real-time canonicalization for operational apps (primary use case)
  • Visual schema mapping for simple transformations
  • Code-based approach for complex logic and historical backfills
  • Support for analytical metrics with near-real-time updates
  • Enterprise-ready architecture with pet-project-friendly local development

Architecture Evolution Through Design Session

Phase 1: AWS Managed Services Approach

Initial exploration focused on AWS managed services:

  • AWS Glue Data Catalog + Glue Crawlers for schema discovery
  • AWS Glue Streaming Jobs for real-time canonicalization
  • Amazon DMS for MongoDB CDC
  • Amazon DocumentDB for materialized canonical state

Key Insight: While powerful, this approach created vendor lock-in and a complex local development setup.

Phase 2: Open Source + Kubernetes Native

Pivoted to open-source alternatives running on Kubernetes:

  • Apache Kafka (via Strimzi operator) for event streaming
  • Debezium for MongoDB change data capture
  • Apache Flink for stream processing
  • Confluent Schema Registry or Apicurio Registry for schema governance

Key Insight: Provides portability across cloud providers while maintaining enterprise capabilities.

Phase 3: Visual + Code-First Hybrid

Added visual tooling for business users while maintaining developer control:

  • Apache Hop for drag-and-drop schema mapping (simple cases)
  • DataHub for schema discovery and cataloging
  • Apache Flink for complex transformations and stateful processing

Key Insight: Hybrid approach balances accessibility for analysts with full control for engineers.

Final Architecture Design

Core Technology Stack

| Component | Purpose | Local (Docker) | Cloud (Managed) |
|---|---|---|---|
| Data Sources | Source systems | MongoDB | MongoDB Atlas |
| Change Data Capture | Stream changes | Debezium | Debezium on EKS |
| Event Streaming | Message backbone | Kafka | Amazon MSK / Confluent Cloud |
| Schema Management | Schema governance | Schema Registry | Managed Schema Registry |
| Stream Processing | Real-time transforms | Flink | Flink on EKS |
| Canonical State | Latest entity state | MongoDB | MongoDB Atlas |
| Metrics Store | Analytics | Postgres/TimescaleDB | ClickHouse/Druid (managed) |
| Schema Discovery | Data cataloging | DataHub | DataHub on EKS |
| Visual Mapping | Business user tools | Apache Hop | Apache Hop on EKS |

Architecture Diagrams

Local Development Setup (diagram)

Cloud Production Setup (diagram)

Green components in the diagrams indicate identical code/logic between environments.

Key Architectural Decisions

1. Unified Stream and Batch Processing with PyFlink

Decision: Use Apache Flink for both streaming and batch processing, with PyFlink as the primary language.

Rationale:

  • Single codebase for streaming metrics and historical backfills
  • Same team can maintain both pipeline types
  • Flink's unified API supports both bounded (batch) and unbounded (streaming) sources
  • Python expertise more common than Java/Scala in data teams
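
To make the rationale concrete, here is a minimal sketch of how a single PyFlink Table API job definition can run in either streaming or batch execution mode. The table names, query, and source details are illustrative assumptions, not artifacts of the design session.

# Minimal sketch: one PyFlink job definition, two execution modes.
# Source DDL, table names, and the query are illustrative assumptions.
from pyflink.table import EnvironmentSettings, TableEnvironment

def build_table_env(batch: bool) -> TableEnvironment:
    # Only the execution mode differs; the pipeline definition is shared.
    settings = (EnvironmentSettings.in_batch_mode() if batch
                else EnvironmentSettings.in_streaming_mode())
    return TableEnvironment.create(settings)

def define_metrics_query(t_env: TableEnvironment, source_ddl: str):
    # source_ddl points at Kafka (unbounded) for streaming,
    # or at a bounded source (e.g. filesystem) for backfills.
    t_env.execute_sql(source_ddl)
    return t_env.sql_query("""
        SELECT region, COUNT(*) AS order_count, SUM(amount) AS total_amount
        FROM orders
        GROUP BY region
    """)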

2. Schema-First Approach

Decision: Canonical schemas live in Schema Registry; raw schemas discovered dynamically.

Architecture: (diagram)

Benefits:

  • Separates discovery (raw, evolving) from governance (canonical, stable)
  • Visual tools can operate on discovered schemas
  • Code generation possible from mapping specifications
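
As one possible illustration of the schema-first flow, the sketch below registers a curated canonical JSON Schema in a Confluent-compatible Schema Registry using the confluent-kafka Python client. The subject name, registry URL, and schema body are assumptions for a local setup, not values taken from the session.

# Sketch: registering a curated canonical schema (local registry URL assumed).
from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

CANONICAL_ORDER_SCHEMA = """
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "CanonicalOrder",
  "type": "object",
  "properties": {
    "order_id": {"type": "string"},
    "customer": {"type": "object", "properties": {"email": {"type": "string"}}},
    "line_items": {"type": "array", "items": {"type": "object"}}
  },
  "required": ["order_id"]
}
"""

client = SchemaRegistryClient({"url": "http://localhost:8081"})
schema_id = client.register_schema(
    subject_name="canonical.order.v1-value",  # governed, stable contract
    schema=Schema(CANONICAL_ORDER_SCHEMA, schema_type="JSON"),
)
print(f"Registered canonical schema with id {schema_id}")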

3. Tiered Metrics Store Strategy

Decision: Start with Postgres/TimescaleDB locally, migrate to ClickHouse/Druid for production.

Local Setup:

  • Postgres/TimescaleDB for metrics storage
  • Simple SQL queries for analytics
  • Minimal infrastructure cost

Production Setup:

  • ClickHouse or Druid managed services
  • Columnar storage for high-cardinality dimensions
  • Horizontal scaling capabilities

Migration Path: Flink sink configuration swap only - no application code changes required.
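
The sketch below shows what such a sink swap could look like when the sink table is defined through Flink SQL DDL generated from Python. The local variant targets Postgres/TimescaleDB via Flink's JDBC connector; the production variant points the same table definition at ClickHouse (through a JDBC driver or a dedicated ClickHouse connector, depending on what is deployed). All connector options and names are illustrative assumptions.

# Sketch: the metrics sink is a configuration detail, not application logic.
# Connector options and hostnames are assumptions for illustration.

def metrics_sink_ddl(env: str) -> str:
    if env == "local":
        # Postgres/TimescaleDB via Flink's JDBC connector
        options = {
            "connector": "jdbc",
            "url": "jdbc:postgresql://localhost:5432/metrics",
            "table-name": "order_metrics",
        }
    else:
        # Production: same table, different endpoint (ClickHouse JDBC driver
        # or a dedicated ClickHouse connector, whichever is available).
        options = {
            "connector": "jdbc",
            "url": "jdbc:clickhouse://clickhouse.internal:8123/metrics",
            "table-name": "order_metrics",
        }
    with_clause = ", ".join(f"'{k}' = '{v}'" for k, v in options.items())
    return f"""
        CREATE TABLE order_metrics (
            region STRING,
            window_start TIMESTAMP(3),
            order_count BIGINT,
            total_amount DOUBLE
        ) WITH ({with_clause})
    """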

4. Hybrid Mapping Approach

Decision: Visual tools (Apache Hop) for simple mappings, code (Flink) for complex logic.

Implementation:

  • Apache Hop generates mapping specifications (JSON)
  • Flink jobs can import and execute these specifications
  • Complex stateful logic remains in hand-written Flink code
  • Same mapping definitions drive both simple and advanced pipelines

Technical Implementation Details

Schema Discovery Algorithm

Streaming Schema Inference:

# Simplified schema profiler logic (sample_messages, flatten_json, infer_type,
# build_json_schema and the schema_registry / datahub clients are placeholders)
def infer_schema_from_stream(raw_topic):
    field_stats = {}          # how often each field path appears
    type_distributions = {}   # which JSON types were observed per field

    for message in sample_messages(raw_topic, window="1h"):
        for field_path, value in flatten_json(message):
            field_stats[field_path] = field_stats.get(field_path, 0) + 1
            detected_type = infer_type(value)
            type_distributions.setdefault(field_path, set()).add(detected_type)

    # Generate a JSON Schema with profiling metadata
    inferred_schema = build_json_schema(field_stats, type_distributions)
    register_schema(schema_registry, f"raw.{raw_topic}.inferred", inferred_schema)
    push_profiling_stats(datahub, field_stats)

Unified Metrics Computation

Shared Aggregation Logic:

# Aggregation logic shared by streaming and batch jobs
# (simplified pseudocode over a PyFlink-style API)
class MetricAggregator:
    def compute_metrics(self, events, dimensions, time_window):
        return events \
            .key_by(lambda e: extract_dimensions(e, dimensions)) \
            .window(time_window) \
            .aggregate(
                count=lambda evts: len(evts),
                sum=lambda evts: sum(e.amount for e in evts),
                avg=lambda evts: sum(e.amount for e in evts) / len(evts),
                min=lambda evts: min(e.amount for e in evts),
                max=lambda evts: max(e.amount for e in evts),
            )

# Streaming version: unbounded Kafka source, tumbling windows
streaming_metrics = MetricAggregator().compute_metrics(
    events=kafka_source.map(deserialize_canonical_event),
    dimensions=["region", "product"],
    time_window=TumblingWindow.hours(1),
)

# Batch version: bounded MongoDB source, single global window
batch_metrics = MetricAggregator().compute_metrics(
    events=mongodb_source.map(convert_to_canonical_event),
    dimensions=["region", "product"],
    time_window=GlobalWindow(),
)

Mapping Specification Format

Generated by Apache Hop, Consumed by Flink:

{
  "mappingId": "order-canonicalization-v1",
  "source": {
    "type": "kafka",
    "topic": "raw.mongo.orders",
    "schemaId": 1234
  },
  "target": {
    "subject": "canonical.order.v1",
    "schemaId": 5678
  },
  "fieldMappings": [
    {
      "src": "$.orderId",
      "dst": "$.order_id",
      "transform": "string"
    },
    {
      "src": "$.customer.email",
      "dst": "$.customer.email",
      "transform": "lowercase"
    },
    {
      "src": "$.items[*].sku",
      "dst": "$.line_items[*].sku",
      "transform": "identity"
    }
  ],
  "defaults": {
    "currency": "USD",
    "created_by": "system"
  }
}
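
To show how such a specification could be consumed on the Flink side, here is a minimal, hedged sketch of a plain-Python mapper for the simple (non-wildcard) field mappings. The helper names and the restriction to scalar "$.a.b" paths are assumptions made for illustration.

# Sketch: applying the simple field mappings from a Hop-generated spec.
# Only scalar "$.a.b" paths are handled; wildcard paths like "$.items[*].sku"
# would need a JSONPath library or dedicated Flink code.

TRANSFORMS = {
    "identity": lambda v: v,
    "string": lambda v: str(v),
    "lowercase": lambda v: str(v).lower(),
}

def get_path(doc, path):
    # "$.customer.email" -> doc["customer"]["email"]
    value = doc
    for part in path.lstrip("$.").split("."):
        value = value.get(part) if isinstance(value, dict) else None
    return value

def set_path(doc, path, value):
    parts = path.lstrip("$.").split(".")
    for part in parts[:-1]:
        doc = doc.setdefault(part, {})
    doc[parts[-1]] = value

def apply_mapping(spec, raw_record):
    canonical = dict(spec.get("defaults", {}))
    for m in spec["fieldMappings"]:
        if "[*]" in m["src"]:
            continue  # wildcard mappings left to dedicated Flink code
        value = get_path(raw_record, m["src"])
        if value is not None:
            set_path(canonical, m["dst"], TRANSFORMS[m["transform"]](value))
    return canonical

# Example:
#   canonical = apply_mapping(spec_dict, {"orderId": 42,
#                                         "customer": {"email": "A@B.COM"}})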

Key Design Outcomes

1. Technology Consolidation

Achievement: Reduced from 5+ different technologies to 3 core components:

  • Apache Flink (PyFlink): Unified streaming + batch processing
  • Kafka + Schema Registry: Event backbone + schema governance
  • DataHub + Apache Hop: Visual discovery + mapping

2. Development Experience Optimization

Achievement: Single team can maintain entire pipeline:

  • Same Python codebase for streaming and batch jobs
  • Visual tools for analysts, code for engineers
  • Consistent deployment patterns (Kubernetes operators)
  • Shared testing and monitoring approaches

3. Scalability Path

Achievement: Clear migration from pet project to enterprise:

  • Local: Docker containers, minimal cost
  • Production: Managed services (MSK, MongoDB Atlas, ClickHouse Cloud)
  • Code Reuse: 95% of application logic unchanged between environments

4. Schema Governance Model

Achievement: Separation of concerns for schema management:

  • Raw schemas: Discovered automatically, versioned, exploratory
  • Canonical schemas: Curated, governed, stable contracts
  • Mapping specifications: Version controlled, testable, auditable

Lessons Learned

1. AI-Assisted Architecture Design

Observation: The iterative conversation with AI helped explore trade-offs systematically:

  • Started broad (AWS managed services)
  • Refined constraints (open source, Kubernetes)
  • Optimized for specific requirements (single language, visual + code)

2. Importance of Local Development Parity

Insight: Having identical architecture locally and in production dramatically improves:

  • Developer productivity
  • Testing confidence
  • Debugging capabilities
  • Team onboarding

3. Visual Tools + Code Hybrid

Learning: Pure visual ETL tools don't scale for complex logic, but they excel for:

  • Schema exploration
  • Simple field mappings
  • Business user empowerment
  • Rapid prototyping

Complex stateful processing, joins, and business logic still require code.

4. Schema Registry as Integration Hub

Realization: Schema Registry becomes the contract layer between:

  • Discovery tools and mapping tools
  • Mapping tools and execution engines
  • Different team roles (analysts vs engineers)
  • Development and production environments

Implementation Recommendations

Phase 1: Foundation

  1. Set up local Kubernetes environment with Kafka + Schema Registry + MongoDB
  2. Implement a basic Debezium CDC pipeline (see the sketch after this list)
  3. Create minimal PyFlink job for schema discovery
  4. Deploy DataHub for schema cataloging
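
As a starting point for the Debezium step above, the sketch below registers a MongoDB source connector through the Kafka Connect REST API. The connector property names follow recent Debezium releases, and the hostnames, database, and collection are local-development assumptions, so check them against the Debezium version actually deployed.

# Sketch: registering a Debezium MongoDB connector via the Kafka Connect REST API.
# Property names follow recent Debezium releases; hostnames are local assumptions.
import requests

connector = {
    "name": "mongo-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongodb:27017/?replicaSet=rs0",
        "topic.prefix": "raw.mongo",              # topics: raw.mongo.<db>.<collection>
        "collection.include.list": "shop.orders",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print("Connector created:", resp.json()["name"])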

Phase 2: Visual Mapping

  1. Deploy Apache Hop locally
  2. Integrate Hop with Schema Registry and DataHub
  3. Create sample mapping specifications
  4. Implement Flink job that consumes mapping specs

Phase 3: Metrics Pipeline

  1. Extend Flink jobs for time-windowed aggregations
  2. Set up Postgres/TimescaleDB metrics store (see the sketch after this list)
  3. Implement batch backfill logic using same PyFlink codebase
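
For the metrics store, a minimal TimescaleDB setup could look like the sketch below. The table and column names are assumptions chosen to line up with the earlier aggregation examples, and the connection string assumes a local Postgres instance with the TimescaleDB extension available.

# Sketch: creating a TimescaleDB hypertable for windowed metrics.
# Table, columns, and connection details are local-development assumptions.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS timescaledb;

CREATE TABLE IF NOT EXISTS order_metrics (
    window_start TIMESTAMPTZ NOT NULL,
    region       TEXT        NOT NULL,
    product      TEXT        NOT NULL,
    order_count  BIGINT,
    total_amount DOUBLE PRECISION
);

SELECT create_hypertable('order_metrics', 'window_start', if_not_exists => TRUE);
"""

with psycopg2.connect("dbname=metrics user=postgres host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)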

Phase 4 (hypothetical): Production Readiness

  1. Provision infrastructure on AWS and MongoDB Atlas
  2. Provision and secure DataHub, Apache Hop, and Schema Registry
  3. Add monitoring and alerting
  4. Implement CI/CD for Flink jobs and mapping specs
  5. Performance testing and optimization

Conclusion

This AI-assisted design session demonstrated how iterative architectural exploration can lead to a well-defined solution that balances multiple constraints. The final architecture achieves:

  • Unified development experience through PyFlink for all processing
  • Visual accessibility via Apache Hop for business users
  • Enterprise scalability through managed cloud services
  • Cost-effective development via local Docker environment
  • Schema governance through systematic discovery and canonicalization

The hybrid approach of visual tools for simple cases and code for complex logic, combined with a clear local-to-cloud migration path, creates a sustainable architecture that can grow from pet project to enterprise-scale deployment.
