Design session with ChatGPT-5
Introduction
This article documents the outcomes of an AI-assisted architectural design session focused on building a real-time data unification platform. The challenge: handling heterogeneous data sources (MongoDB documents with varying schemas and Kafka event streams) and transforming them into a consistent canonical model for operational applications, while also providing analytical insights.
The full session transcript can be found here: shared link on chatgpt.com.
I wasn't originally planning to publish it, but the results turned out to be quite insightful, so I asked Claude Code to review the transcript and turn it into a blog article.
I wanted to use ChatGPT for the blog summary as well, but I had run out of ChatGPT-5 credits 😅.
So here we are: Claude has prepared a summary of ChatGPT's work, a nice piece of cooperation under my supervision 😎
The Problem Statement
Primary Challenge: Build a system that can:
- Discover schemas from heterogeneous MongoDB documents and Kafka events
- Map disparate data formats to a canonical model in real-time
- Serve both operational applications (latest state + events) and analytical workloads
- Support both cloud-scale deployment and local development environments
Key Requirements:
- Real-time canonicalization for operational apps (primary use case)
- Visual schema mapping for simple transformations
- Code-based approach for complex logic and historical backfills
- Support for analytical metrics with near-real-time updates
- Enterprise-ready architecture with pet-project-friendly local development
Architecture Evolution Through Design Session
Phase 1: AWS Managed Services Approach
Initial exploration focused on AWS managed services:
- AWS Glue Data Catalog + Glue Crawlers for schema discovery
- AWS Glue Streaming Jobs for real-time canonicalization
- Amazon DMS for MongoDB CDC
- Amazon DocumentDB for materialized canonical state
Key Insight: While powerful, this approach created vendor lock-in and a complex local development setup.
Phase 2: Open Source + Kubernetes Native
Pivoted to open-source alternatives running on Kubernetes:
- Apache Kafka (via Strimzi operator) for event streaming
- Debezium for MongoDB change data capture
- Apache Flink for stream processing
- Confluent Schema Registry or Apicurio Registry for schema governance
Key Insight: Provides portability across cloud providers while maintaining enterprise capabilities.
Phase 3: Visual + Code-First Hybrid
Added visual tooling for business users while maintaining developer control:
- Apache Hop for drag-and-drop schema mapping (simple cases)
- DataHub for schema discovery and cataloging
- Apache Flink for complex transformations and stateful processing
Key Insight: Hybrid approach balances accessibility for analysts with full control for engineers.
Final Architecture Design
Core Technology Stack
Component | Purpose | Local (Docker) | Cloud (Managed) |
---|---|---|---|
Data Sources | Source systems | MongoDB Docker | MongoDB Atlas |
Change Data Capture | Stream changes | Debezium Docker | Debezium on EKS |
Event Streaming | Message backbone | Kafka Docker | Amazon MSK / Confluent Cloud |
Schema Management | Schema governance | Schema Registry Docker | Managed Schema Registry |
Stream Processing | Real-time transforms | Flink Docker | Flink on EKS |
Canonical State | Latest entity state | MongoDB Docker | MongoDB Atlas |
Metrics Store | Analytics | Postgres/TimescaleDB | ClickHouse/Druid Managed |
Schema Discovery | Data cataloging | DataHub Docker | DataHub on EKS |
Visual Mapping | Business user tools | Apache Hop Docker | Apache Hop on EKS |
Architecture Diagrams
Local Development Setup
Cloud Production Setup
Green components indicate identical code/logic between environments
Key Architectural Decisions
1. Unified PyFlink Runtime
Decision: Use Apache Flink for both streaming and batch processing with PyFlink as the primary language.
Rationale:
- Single codebase for streaming metrics and historical backfills
- Same team can maintain both pipeline types
- Flink's unified API supports both bounded (batch) and unbounded (streaming) sources (see the sketch below)
- Python expertise more common than Java/Scala in data teams
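To make the unified runtime concrete, here is a minimal sketch of the mode switch using Flink's Table API, assuming the Kafka SQL connector is on the job's classpath; the table definition, topic name, and field names are illustrative and not part of the original session:

from pyflink.table import EnvironmentSettings, TableEnvironment

def build_table_env(batch_mode: bool) -> TableEnvironment:
    # The backfill and the real-time job differ only in execution mode;
    # all transformations downstream of the sources are shared.
    settings = (EnvironmentSettings.in_batch_mode() if batch_mode
                else EnvironmentSettings.in_streaming_mode())
    return TableEnvironment.create(settings)

t_env = build_table_env(batch_mode=False)  # flip to True for a historical backfill

# Streaming source over the canonical topic (illustrative schema and topic name)
t_env.execute_sql("""
    CREATE TABLE canonical_orders (
        order_id STRING,
        region   STRING,
        amount   DOUBLE,
        event_ts TIMESTAMP(3)
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'canonical.order.v1',
        'properties.bootstrap.servers' = 'localhost:9092',
        'properties.group.id' = 'metrics-job',
        'format' = 'json',
        'scan.startup.mode' = 'earliest-offset'
    )
""")

The backfill job would flip batch_mode=True and point the DDL at a bounded source (for example the canonical MongoDB store), while every downstream transformation stays untouched.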
2. Schema-First Approach
Decision: Canonical schemas live in Schema Registry; raw schemas discovered dynamically.
Architecture:
Benefits:
- Separates discovery (raw, evolving) from governance (canonical, stable)
- Visual tools can operate on discovered schemas
- Code generation possible from mapping specifications
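To illustrate the governed half of this split, a canonical schema could be registered explicitly against the registry. This is a sketch using the confluent-kafka Python client; the registry URL, subject name, and schema body are assumptions for illustration only:

from confluent_kafka.schema_registry import SchemaRegistryClient, Schema

client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Curated canonical contract, versioned in the registry
# (JSON Schema here; Avro or Protobuf work the same way)
canonical_order_schema = """{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "customer": {"type": "object"},
        "line_items": {"type": "array"}
    },
    "required": ["order_id"]
}"""

schema_id = client.register_schema("canonical.order.v1-value",
                                   Schema(canonical_order_schema, "JSON"))
print(f"registered canonical.order.v1 as schema id {schema_id}")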
3. Tiered Metrics Store Strategy
Decision: Start with Postgres/TimescaleDB locally, migrate to ClickHouse/Druid for production.
Local Setup:
- Postgres/TimescaleDB for metrics storage
- Simple SQL queries for analytics
- Minimal infrastructure cost
Production Setup:
- ClickHouse or Druid managed services
- Columnar storage for high-cardinality dimensions
- Horizontal scaling capabilities
Migration Path: swap only the Flink sink configuration; no application code changes are required (see the sketch below).
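A sketch of what that swap could look like, with the sink DDL generated from per-environment connector options; the ClickHouse entry assumes a ClickHouse JDBC driver is available to Flink, and table and column names are illustrative:

# Per-environment sink options; the aggregation logic never changes.
# Requires the flink-connector-jdbc jar plus the matching JDBC driver.
SINK_OPTIONS = {
    "local": {
        "connector": "jdbc",
        "url": "jdbc:postgresql://localhost:5432/metrics",
        "table-name": "order_metrics",
    },
    "prod": {
        "connector": "jdbc",  # assumes a ClickHouse JDBC driver on the classpath
        "url": "jdbc:clickhouse://clickhouse.internal:8123/metrics",
        "table-name": "order_metrics",
    },
}

def create_metrics_sink(t_env, env_name: str):
    # Render the WITH clause from the selected environment's options
    options = ",\n".join(f"'{k}' = '{v}'" for k, v in SINK_OPTIONS[env_name].items())
    t_env.execute_sql(f"""
        CREATE TABLE metrics_sink (
            region       STRING,
            product      STRING,
            window_end   TIMESTAMP(3),
            order_count  BIGINT,
            total_amount DOUBLE
        ) WITH (
            {options}
        )
    """)

Switching environments is then a matter of passing env_name='prod' to the same job.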
4. Hybrid Mapping Approach
Decision: Visual tools (Apache Hop) for simple mappings, code (Flink) for complex logic.
Implementation:
- Apache Hop generates mapping specifications (JSON)
- Flink jobs can import and execute these specifications
- Complex stateful logic remains in hand-written Flink code
- Same mapping definitions drive both simple and advanced pipelines
Technical Implementation Details
Schema Discovery Algorithm
Streaming Schema Inference:
# Simplified PyFlink schema profiler logic (helper functions are illustrative)
def infer_schema_from_stream(raw_topic):
    field_stats = {}          # field path -> occurrence count
    type_distributions = {}   # field path -> set of observed types

    # Sample recent messages from the raw topic and profile every field path
    for message in sample_messages(raw_topic, window="1h"):
        for field_path, value in flatten_json(message):
            field_stats[field_path] = field_stats.get(field_path, 0) + 1
            detected_type = infer_type(value)
            type_distributions.setdefault(field_path, set()).add(detected_type)

    # Generate a JSON Schema with profiling metadata and publish it
    inferred_schema = build_json_schema(field_stats, type_distributions)
    register_schema(schema_registry, f"raw.{raw_topic}.inferred", inferred_schema)
    push_profiling_stats(datahub, field_stats)
Unified Metrics Computation
Shared Aggregation Logic:
# PyFlink-style aggregation shared by streaming and batch jobs
# (the window/aggregate calls are schematic, not the literal DataStream API)
class MetricAggregator:
    def compute_metrics(self, events, dimensions, time_window):
        return events \
            .key_by(lambda e: extract_dimensions(e, dimensions)) \
            .window(time_window) \
            .aggregate(
                count=lambda events: len(events),
                sum=lambda events: sum(e.amount for e in events),
                avg=lambda events: sum(e.amount for e in events) / len(events),
                min=lambda events: min(e.amount for e in events),
                max=lambda events: max(e.amount for e in events)
            )

# Streaming version: unbounded Kafka source, hourly tumbling windows
streaming_metrics = MetricAggregator().compute_metrics(
    events=kafka_source.map(deserialize_canonical_event),
    dimensions=['region', 'product'],
    time_window=TumblingWindow.hours(1))

# Batch version: bounded MongoDB source, single global window
batch_metrics = MetricAggregator().compute_metrics(
    events=mongodb_source.map(convert_to_canonical_event),
    dimensions=['region', 'product'],
    time_window=GlobalWindow())
Mapping Specification Format
Generated by Apache Hop, Consumed by Flink:
{
  "mappingId": "order-canonicalization-v1",
  "source": {
    "type": "kafka",
    "topic": "raw.mongo.orders",
    "schemaId": 1234
  },
  "target": {
    "subject": "canonical.order.v1",
    "schemaId": 5678
  },
  "fieldMappings": [
    {
      "src": "$.orderId",
      "dst": "$.order_id",
      "transform": "string"
    },
    {
      "src": "$.customer.email",
      "dst": "$.customer.email",
      "transform": "lowercase"
    },
    {
      "src": "$.items[*].sku",
      "dst": "$.line_items[*].sku",
      "transform": "identity"
    }
  ],
  "defaults": {
    "currency": "USD",
    "created_by": "system"
  }
}
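A specification like the one above could be executed generically inside a Flink map function. The following is a minimal interpreter sketch in plain Python; it handles only simple dotted paths, so array wildcards (such as $.items[*].sku) and schema validation against the registry are deliberately left out:

# Minimal interpreter for the mapping spec above (simple "$.a.b" paths only)
TRANSFORMS = {
    "identity": lambda v: v,
    "string": lambda v: str(v),
    "lowercase": lambda v: str(v).lower(),
}

def get_path(doc, path):
    # Resolve a dotted source path like "$.customer.email"
    value = doc
    for key in path.lstrip("$.").split("."):
        value = value.get(key) if isinstance(value, dict) else None
    return value

def set_path(doc, path, value):
    # Create intermediate objects as needed for the destination path
    keys = path.lstrip("$.").split(".")
    for key in keys[:-1]:
        doc = doc.setdefault(key, {})
    doc[keys[-1]] = value

def apply_mapping(spec, raw_event):
    canonical = dict(spec.get("defaults", {}))
    for rule in spec["fieldMappings"]:
        value = get_path(raw_event, rule["src"])
        if value is not None:
            transform = TRANSFORMS.get(rule["transform"], TRANSFORMS["identity"])
            set_path(canonical, rule["dst"], transform(value))
    return canonical

In practice this would run per record in the canonicalization job, with the spec loaded at job start or refreshed from a control topic.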
Key Design Outcomes
1. Technology Consolidation
Achievement: Reduced from 5+ different technologies to 3 core components:
- Apache Flink (PyFlink): Unified streaming + batch processing
- Kafka + Schema Registry: Event backbone + schema governance
- DataHub + Apache Hop: Visual discovery + mapping
2. Development Experience Optimization
Achievement: Single team can maintain entire pipeline:
- Same Python codebase for streaming and batch jobs
- Visual tools for analysts, code for engineers
- Consistent deployment patterns (Kubernetes operators)
- Shared testing and monitoring approaches
3. Scalability Path
Achievement: Clear migration from pet project to enterprise:
- Local: Docker containers, minimal cost
- Production: Managed services (MSK, MongoDB Atlas, ClickHouse Cloud)
- Code Reuse: 95% of application logic unchanged between environments
4. Schema Governance Model
Achievement: Separation of concerns for schema management:
- Raw schemas: Discovered automatically, versioned, exploratory
- Canonical schemas: Curated, governed, stable contracts
- Mapping specifications: Version controlled, testable, auditable
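One way this split could be enforced at the registry level is via per-subject compatibility settings. This is a sketch with the confluent-kafka Python client (exact calls may differ by client version and registry flavour); the subject names mirror the examples above and are illustrative:

from confluent_kafka.schema_registry import SchemaRegistryClient

client = SchemaRegistryClient({"url": "http://localhost:8081"})

# Canonical contracts must evolve without breaking consumers
client.set_compatibility(subject_name="canonical.order.v1-value", level="BACKWARD")

# Automatically inferred raw schemas are exploratory; don't block discovery
client.set_compatibility(subject_name="raw.mongo.orders.inferred-value", level="NONE")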
Lessons Learned
1. AI-Assisted Architecture Design
Observation: The iterative conversation with AI helped explore trade-offs systematically:
- Started broad (AWS managed services)
- Refined constraints (open source, Kubernetes)
- Optimized for specific requirements (single language, visual + code)
2. Importance of Local Development Parity
Insight: Having identical architecture locally and in production dramatically improves:
- Developer productivity
- Testing confidence
- Debugging capabilities
- Team onboarding
3. Visual Tools + Code Hybrid
Learning: Pure visual ETL tools don't scale for complex logic, but they excel for:
- Schema exploration
- Simple field mappings
- Business user empowerment
- Rapid prototyping
Complex stateful processing, joins, and business logic still require code.
4. Schema Registry as Integration Hub
Realization: Schema Registry becomes the contract layer between:
- Discovery tools and mapping tools
- Mapping tools and execution engines
- Different team roles (analysts vs engineers)
- Development and production environments
Implementation Recommendations
Phase 1: Foundation
- Set up local Kubernetes environment with Kafka + Schema Registry + MongoDB
- Implement basic Debezium CDC pipeline (see the sketch after this list)
- Create minimal PyFlink job for schema discovery
- Deploy DataHub for schema cataloging
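For the CDC step, the Debezium MongoDB connector could be registered against the local Kafka Connect instance over its REST API. This is a sketch assuming Debezium 2.x property names; the connection string, topic prefix, and collection filter are placeholders:

import requests  # assumes the requests library is installed

connector_config = {
    "name": "mongo-orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
        "mongodb.connection.string": "mongodb://mongodb:27017/?replicaSet=rs0",
        "topic.prefix": "raw.mongo",
        "collection.include.list": "shop.orders",
        "key.converter": "org.apache.kafka.connect.json.JsonConverter",
        "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    },
}

# Register the connector with the local Kafka Connect REST endpoint
resp = requests.post("http://localhost:8083/connectors",
                     json=connector_config, timeout=10)
resp.raise_for_status()
print("connector registered:", resp.json()["name"])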
Phase 2: Visual Mapping
- Deploy Apache Hop locally
- Integrate Hop with Schema Registry and DataHub
- Create sample mapping specifications
- Implement Flink job that consumes mapping specs
Phase 3: Metrics Pipeline
- Extend Flink jobs for time-windowed aggregations
- Set up Postgres/TimescaleDB metrics store
- Implement batch backfill logic using same PyFlink codebase
Phase 4 (hypothetical): Production Readiness
- Provision infrastructure on AWS and MongoDB Atlas
- Provision and secure DataHub, Apache Hop, and Schema Registry
- Add monitoring and alerting
- Implement CI/CD for Flink jobs and mapping specs
- Performance testing and optimization
Conclusion
This AI-assisted design session demonstrated how iterative architectural exploration can lead to well-defined solutions that balance multiple constraints. The final architecture achieves:
- Unified development experience through PyFlink for all processing
- Visual accessibility via Apache Hop for business users
- Enterprise scalability through managed cloud services
- Cost-effective development via local Docker environment
- Schema governance through systematic discovery and canonicalization
The hybrid approach of visual tools for simple cases and code for complex logic, combined with a clear local-to-cloud migration path, creates a sustainable architecture that can grow from pet project to enterprise-scale deployment.