
ADR - public S3 bucket access

· 4 min read
Paweł Mantur
Solutions Architect
info

This post is an example of an ADR - Architecture Decision Record. ADRs document the reasons behind architectural decisions, promote a more transparent and fact-based decision-making culture, and are also a useful artefact from a technical documentation perspective.

Context

When implementing this website (see related article for an architecture overview), among other decisions, I needed to decide how to set up the AWS S3 bucket that hosts the website files.

Decision

Although it is considered potentially unsafe, I have decided on Option 2: allowing public access 🙀. But let's not panic: I will explain why it is safe in this case.

Considered Options

❌ Option 1: Blocking public access to S3 bucket

  • Pros:
    • Safer option - zero trust is a core rule of security, especially for public access. The AWS console blocks public access by default when creating S3 buckets, and it is also possible to block public access for the whole AWS account. Moreover, the AWS IAM Access Analyzer tool reports public access to buckets as a security finding. The AWS console and documentation really force you to think about this decision.
  • Cons - the solution gets more complicated:
    • The static website hosting feature of Amazon S3 cannot be used, as it requires public access
    • Since the S3 website endpoint cannot be created, CloudFront needs to use the S3 REST API endpoint as its origin
    • A Docusaurus build creates a directory for each page/article and puts a single index.html file inside that directory. For better SEO and nicer URLs, we do not include index.html in links. Web servers know that when a URL points to a directory, index.html should be served, and the S3 website endpoint also handles this properly. But since we are using the S3 REST API under the hood, it follows the request as-is: if the request is for https://pawelmantur.pl/blog/s3-public-bucket, it finds that there is a blog/s3-public-bucket directory in my bucket, but since public access has no directory listing permission granted, it returns 403. To return index.html, as a web server like nginx would do, we need to add a CloudFront Function that rewrites the request URL by appending index.html (see the sketch after this list).
    • It is possible and relatively simple to implement, but it requires introducing new components to the solution that can be avoided by leveraging the existing S3 website endpoint capability
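
For reference, the URL-rewrite logic that such a CloudFront Function would implement is simple. CloudFront Functions themselves are authored in JavaScript; the sketch below expresses the same logic in Python purely for illustration:

def rewrite_uri(uri: str) -> str:
    # A request ending with "/" maps to the index.html inside that directory
    if uri.endswith("/"):
        return uri + "index.html"
    # A request whose last segment has no file extension (e.g. /blog/s3-public-bucket)
    # is treated as a directory as well
    if "." not in uri.split("/")[-1]:
        return uri + "/index.html"
    # Requests for real files (e.g. /img/logo.png) pass through unchanged
    return uri

print(rewrite_uri("/blog/s3-public-bucket"))  # /blog/s3-public-bucket/index.html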

✅ Option 2: Allowing public access to perform s3:GetObject action

  • Pros:
    • The bucket's only purpose is to host website files, and this website is public by design: I want people to access my blog, I want it to be public (an example bucket policy is sketched after this list)
    • The S3 website endpoint can be used
    • A static website built with Docusaurus works out of the box with the S3 website endpoint, no CloudFront Function needed
    • No additional cost related to CloudFront Functions (although that cost would be negligible for this website anyway)
    • A simpler setup with fewer components means less room for errors
  • Cons:
    • There is a risk that I will put confidential files into this bucket, but it is mitigated by automation: the only way this bucket is updated is via a GitHub Action that syncs only the build directory of the Docusaurus website, which is public by design. It is a well-defined, automated and tested process.
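
To make the decision concrete, below is a minimal sketch of the kind of bucket policy Option 2 implies, applied with boto3. The bucket name is a placeholder, not the real one, and only s3:GetObject on objects is granted (no listing, no writes):

import json

import boto3

# Placeholder bucket name, for illustration only
BUCKET = "example-website-bucket"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicReadGetObject",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        }
    ],
}

# Note: the bucket's Block Public Access settings must also allow public bucket policies
s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))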

Consequences

Since a public S3 bucket was introduced to the architecture, I will have to be cautious about what content I put there and what policies are granted for public access. But since we are talking about a single person running a blog, the related risks can be accepted.

If this decision were made in the context of a large organization, use cases for public sharing would need special governance. If the organization has no business workflows that require public file sharing and other solutions are available for static website hosting, then public access to S3 should not be allowed, to avoid a risky setup.

Watch out for Kafka costs

· 3 min read
Paweł Mantur
Solutions Architect

Non-obvious Confluent Cloud costs

When using AWS or any other cloud, we need to be aware of network traffic charges, especially cross-AZ and cross-region data transfer fees.

Example deployment:

  • A cluster in Confluent Cloud, hosted in AWS (1 CKU is limited to a single AZ; from 2 CKUs up we have a multi-AZ setup)
  • AWS PrivateLink for private connectivity with AWS
  • Kafka clients running in AWS EKS (multi-AZ)

Cross-AZ data transfer costs

Be aware that if a Kafka broker node happens to be running in a different AZ than the Kafka clients, additional data transfer charges will apply for cross-AZ traffic.

Kafka has the concept of racks, which allows co-locating Kafka clients with broker nodes. More details about this setting in the context of AWS and Confluent can be found here: https://docs.confluent.io/cloud/current/networking/fetch-from-follower.html
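
As a rough sketch (using the Confluent Python client and a made-up AZ id), the relevant setting is client.rack on the consumer, so that it fetches from a follower replica located in its own AZ:

from confluent_kafka import Consumer

# Placeholder connection details; client.rack should match the AZ the client runs in
consumer = Consumer({
    "bootstrap.servers": "<your-cluster>.confluent.cloud:9092",
    "group.id": "my-consumer-group",
    "client.rack": "use1-az1",  # enables fetch-from-follower from a co-located replica
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
})
consumer.subscribe(["my-topic"])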

Data transfer costs within AZ

But even if we manage to keep connections within the same AZ, is consuming data from Kafka free?

Imagine an architecture in which a single topic contains data dedicated to multiple consumers. Every consumer processes only the relevant data and filters out (ignores) the other messages. Sounds straightforward, but we need to be aware that in order to filter a message, each consumer first needs to read it. So even irrelevant data creates traffic from the broker to the clients.

Kafka does not support filtering on the broker side. There is an open feature request for that.

If we have a lot of consumers, we will have a lot of outbound traffic (topic throughput × number of consumers). Having additional infrastructure like AWS PrivateLink in the path of such traffic will generate extra costs.
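
A minimal sketch of such a client-side filter with the Confluent Python client (topic, header and group names are made up); note that the filtered-out messages have already crossed the network by the time they are dropped:

from confluent_kafka import Consumer

def process(value: bytes) -> None:
    ...  # hypothetical business logic for the relevant messages

consumer = Consumer({
    "bootstrap.servers": "<your-cluster>.confluent.cloud:9092",
    "group.id": "orders-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["shared-topic"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # The message bytes have already been transferred (and billed) at this point
    headers = dict(msg.headers() or [])
    if headers.get("consumer") != b"orders":
        continue  # filtered out client-side, but the egress cost was already paid
    process(msg.value())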

Extreme scenario - generating costs for nothing

Another interesting scenario is implementing a retry policy for when message processing fails, for example when every message needs to be delivered to an endpoint that is down. If the Kafka consumer retries delivery very aggressively (for example every second, or even worse in an infinite loop), and every retry is a new read from the topic, then we can easily generate a lot of reads.
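
One way to avoid paying for the same bytes repeatedly is to retry in-process on the already-fetched message, with backoff and a retry limit, instead of re-reading it from the topic. A rough sketch with assumed names:

import time

def deliver_with_backoff(message_value: bytes, send, max_attempts: int = 5) -> bool:
    """Retry delivery of an already-fetched message; the bytes are not re-read from Kafka."""
    delay = 1.0
    for _ in range(max_attempts):
        try:
            send(message_value)  # 'send' is a hypothetical delivery callable
            return True
        except Exception:
            time.sleep(delay)
            delay *= 2  # exponential backoff instead of hammering the endpoint every second
    return False  # e.g. park the message on a dead-letter topic instead of looping forever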

We may be misled by documentation stating that reading from Kafka is very efficient because it is basically sequential reading from a log. From the broker cost perspective, having multiple consumers is not a significant cost factor compared to things like written data volumes, but we still need to be mindful of data transfer costs that may apply to reads. Confluent charges $0.05/GB for egress traffic. Total costs may grow quickly in a busy cluster with active producers and multiple reads of every message.
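
For a rough illustration with assumed numbers: a topic ingesting 100 GB/day that is fully read by 10 filtering consumers generates about 1,000 GB/day of egress, i.e. roughly $50/day or around $1,500/month at $0.05/GB, before any PrivateLink data processing charges.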

Schema Definition Formats

· 4 min read
Paweł Mantur
Solutions Architect
AI Friend
Assistant

Schema Definition Formats: JSON Schema, Avro, and Protocol Buffers

In data management, maintaining a specific structure is key for consistency and interoperability. Three popular schema formats are JSON Schema, Avro, and Protocol Buffers. Each has unique features and use cases. Let's explore their strengths and applications.

JSON Schema

Overview: JSON Schema is a powerful tool for validating the structure of JSON data. It allows you to define the expected format, type, and constraints of JSON documents, ensuring that the data adheres to a predefined schema.

Key Features:

  • Validation: JSON Schema provides a robust mechanism for validating JSON data against a schema. This helps in catching errors early and ensuring data integrity.
  • Documentation: The schema itself serves as a form of documentation, making it easier for developers to understand the expected structure of the data.
  • Interoperability: JSON Schema is widely supported across various programming languages and platforms, making it a versatile choice for many applications.

Use Cases:

  • API Validation: Ensuring that the data exchanged between client and server adheres to a specific format.
  • Configuration Files: Validating configuration files to ensure they meet the required structure and constraints.
  • Data Exchange: Facilitating data exchange between different systems by providing a clear contract for the data format.

Example:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Product",
  "type": "object",
  "properties": {
    "id": {
      "type": "integer"
    },
    "name": {
      "type": "string"
    },
    "price": {
      "type": "number"
    }
  },
  "required": ["id", "name", "price"]
}
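
As a quick sketch of how validation looks in practice, here is the schema above used with the Python jsonschema package:

from jsonschema import ValidationError, validate

schema = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "Product",
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["id", "name", "price"],
}

valid_product = {"id": 1, "name": "Keyboard", "price": 49.99}
invalid_product = {"id": 2, "name": "Mouse"}  # missing "price"

validate(instance=valid_product, schema=schema)  # passes silently

try:
    validate(instance=invalid_product, schema=schema)
except ValidationError as err:
    print(err.message)  # "'price' is a required property"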

Avro

Overview: Avro is a data serialization system that provides a compact, fast, and efficient format for data exchange. It is particularly well-suited for big data applications and is a key component of the Apache Hadoop ecosystem.

Key Features:

  • Compact Serialization: Avro uses a binary format for data serialization, which is more compact and efficient compared to text-based formats like JSON.
  • Schema Evolution: Avro supports schema evolution, allowing you to update the schema without breaking compatibility with existing data.
  • Interoperability: Avro schemas are defined using JSON, making them easy to read and understand. The binary format ensures efficient data storage and transmission.

Use Cases:

  • Big Data: Avro is widely used in big data applications, particularly within the Hadoop ecosystem, for efficient data storage and processing.
  • Data Streaming: Avro is commonly used in data streaming platforms like Apache Kafka for efficient data serialization and deserialization.
  • Inter-Service Communication: Facilitating communication between microservices by providing a compact and efficient data format.

Example:

{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string"}
  ]
}
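
A minimal sketch of serializing and reading back a record with this schema, using the Python fastavro package:

import io

from fastavro import parse_schema, schemaless_reader, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string"},
    ],
})

user = {"id": 1, "name": "Alice", "email": "alice@example.com"}

# Serialize to the compact Avro binary format (no schema embedded, as used with a schema registry)
buffer = io.BytesIO()
schemaless_writer(buffer, schema, user)

# Deserialize using the same (writer's) schema
buffer.seek(0)
print(schemaless_reader(buffer, schema))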

Protocol Buffers (Protobuf)

Overview: Protocol Buffers, developed by Google, is a language-neutral, platform-neutral, extensible mechanism for serializing structured data. It is known for its efficiency and performance.

Key Features:

  • Compact and Efficient: Protobuf uses a binary format that is both compact and efficient, making it suitable for high-performance applications.
  • Language Support: Protobuf supports multiple programming languages, including Java, C++, and Python.
  • Schema Evolution: Protobuf supports backward and forward compatibility, allowing for schema evolution without breaking existing data.

Use Cases:

  • Inter-Service Communication: Commonly used in microservices architectures for efficient data exchange.
  • Data Storage: Suitable for storing structured data in a compact format.
  • RPC Systems: Often used in Remote Procedure Call (RPC) systems like gRPC.

Example:

syntax = "proto3";

message Person {
  int32 id = 1;
  string name = 2;
  string email = 3;
}
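
A sketch of how the generated code is typically used in Python, assuming the .proto file above has been compiled with protoc --python_out=. person.proto (which produces person_pb2.py):

from person_pb2 import Person  # generated by protoc from the .proto above

person = Person(id=1, name="Alice", email="alice@example.com")

# Serialize to the compact binary wire format
data = person.SerializeToString()

# Deserialize on the receiving side
decoded = Person()
decoded.ParseFromString(data)
print(decoded.name)  # "Alice"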

Conclusion

JSON Schema, Avro, and Protocol Buffers are all powerful tools for managing data schemas, each with its own strengths. JSON Schema excels in validation and documentation, making it ideal for APIs and configuration files. Avro provides efficient serialization and schema evolution, making it a preferred choice for big data and streaming applications. Protocol Buffers offer compact and efficient serialization, making them suitable for high-performance applications and inter-service communication. Understanding the strengths and use cases of each format can help you choose the right tool for your specific needs.

Change Data Capture and Debezium

· 4 min read
Paweł Mantur
Solutions Architect
AI Friend
Assistant

Understanding Change Data Capture (CDC) and Debezium

In today's data-driven world, keeping track of changes in data is crucial for maintaining data integrity, enabling real-time analytics, and ensuring seamless data integration across systems. Change Data Capture (CDC) is a powerful technique that addresses this need by capturing and tracking changes in data as they occur. One of the most popular tools for implementing CDC is Debezium. This blog article delves into the concept of CDC, its importance, and how Debezium can be used to implement it effectively.

What is Change Data Capture (CDC)?

Change Data Capture (CDC) is a process that identifies and captures changes made to data in a database. These changes can include inserts, updates, and deletes. Once captured, the changes can be propagated to other systems or used for various purposes such as data warehousing, real-time analytics, and data synchronization.

Why is CDC Important?

  1. Real-Time Data Integration:

    • CDC enables real-time data integration by capturing changes as they happen and propagating them to other systems. This ensures that all systems have the most up-to-date information.
  2. Efficient Data Processing:

    • By capturing only the changes rather than the entire dataset, CDC reduces the amount of data that needs to be processed and transferred. This leads to more efficient data processing and reduced latency.
  3. Data Consistency:

    • CDC helps maintain data consistency across different systems by ensuring that changes made in one system are reflected in others. This is particularly important in distributed systems and microservices architectures.
  4. Historical Data Analysis:

    • CDC allows for the capture of historical changes, enabling organizations to perform trend analysis and understand how data has evolved over time.

Introducing Debezium

Debezium is an open-source CDC tool that supports various databases such as MySQL, PostgreSQL, MongoDB, and more. It reads changes from transaction logs and streams them to other systems, making it a powerful tool for implementing CDC.

Key Features of Debezium:

  • Wide Database Support: Debezium supports multiple databases, making it versatile and suitable for various environments.
  • Kafka Integration: Debezium integrates seamlessly with Apache Kafka, allowing for efficient streaming of changes.
  • Schema Evolution: Debezium handles schema changes gracefully, ensuring that changes in the database schema do not disrupt data capture.
  • Real-Time Processing: Debezium captures and streams changes in real-time, enabling real-time data integration and analytics.

How Debezium Works

Debezium works by reading the transaction logs of the source database. These logs record all changes made to the data, including inserts, updates, and deletes. Debezium connectors capture these changes and stream them to a Kafka topic. From there, the changes can be consumed by various applications or systems.

Steps to Implement CDC with Debezium:

  1. Set Up Kafka:

    • Install and configure Apache Kafka, which will be used to stream the changes captured by Debezium.
  2. Deploy Debezium Connectors:

    • Deploy Debezium connectors for the source databases. Each connector is responsible for capturing changes from a specific database.
  3. Configure Connectors:

    • Configure the connectors with the necessary settings, such as the database connection details and the Kafka topic to which the changes should be streamed.
  4. Consume Changes:

    • Set up consumers to read the changes from the Kafka topics and process them as needed. This could involve updating a data warehouse, triggering real-time analytics, or synchronizing data across systems.

Example Configuration

Here is a basic example of configuring a Debezium connector for a MySQL database:

{
  "name": "mysql-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "localhost",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "fullfillment",
    "database.include.list": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}

In this configuration:

  • connector.class specifies the Debezium connector class for MySQL.
  • database.hostname, database.port, database.user, and database.password provide the connection details for the MySQL database.
  • database.server.name is a logical name for the database server.
  • database.include.list specifies the databases to capture changes from.
  • database.history.kafka.bootstrap.servers and database.history.kafka.topic configure the Kafka settings for storing schema history.
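
For step 4 (consuming the changes), here is a minimal sketch of a consumer reading the change events that the connector above would publish. The topic name follows Debezium's <server>.<database>.<table> convention; the customers table is an assumed example:

import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "inventory-sync",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["fullfillment.inventory.customers"])  # assumed table name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() or msg.value() is None:
        continue
    event = json.loads(msg.value())
    payload = event.get("payload", event)  # envelope shape depends on converter settings
    op = payload.get("op")                 # "c" = create, "u" = update, "d" = delete
    before, after = payload.get("before"), payload.get("after")
    print(op, before, after)               # e.g. forward to a data warehouse or cache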

Conclusion

Change Data Capture (CDC) is a vital technique for modern data management, enabling real-time data integration, efficient data processing, and maintaining data consistency across systems. Debezium is a powerful open-source tool for implementing CDC, offering wide database support, seamless Kafka integration, and real-time processing capabilities. By leveraging Debezium, organizations can capture and propagate data changes effectively, ensuring that their systems are always up-to-date and ready for real-time analytics and decision-making.