Change Data Capture (CDC)

Modern applications need to keep their data in multiple places, in a redundant and denormalised manner.

  • One data store in the system has to be the source of truth (aka the system of record)

  • Other stores receive this data in a transformed form (derived data)

  • The data in all these stores has to be kept in sync

Change Data Capture (CDC) is the process of detecting changes made to the source of truth and propagating them to the derived data stores. The CDC process has three stages.

Change detection

There are a few options for detecting changes made to the source database.

  1. Polling the LAST_UPDATED column of tables periodically to detect changes.

  2. Database triggers to capture row-level operations.

  3. Watching the database transaction log for changes.

Polling is slow and puts extra load on the source database; the implementation also requires you to add dedicated columns to the source tables. Triggers deliver results in real time, but they are resource-intensive and can slow the database down.
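
As a rough sketch of the polling approach, a scheduled job can repeatedly query for rows whose LAST_UPDATED value is newer than the previous poll. The table and column names below are hypothetical, and plain JDBC is assumed:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Minimal sketch of polling-based change detection (option 1).
// Table and column names (orders, last_updated) are hypothetical.
public class PollingChangeDetector {

    private Instant lastPolledAt = Instant.EPOCH;

    public void pollOnce(Connection connection) throws SQLException {
        String query = "SELECT id, status, last_updated FROM orders "
                + "WHERE last_updated > ? ORDER BY last_updated";
        try (PreparedStatement statement = connection.prepareStatement(query)) {
            statement.setTimestamp(1, Timestamp.from(lastPolledAt));
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    // Hand the changed row over to downstream processing.
                    System.out.printf("Changed row id=%d status=%s%n",
                            rows.getLong("id"), rows.getString("status"));
                    lastPolledAt = rows.getTimestamp("last_updated").toInstant();
                }
            }
        }
        // Note: this approach misses DELETEs and adds query load on every poll,
        // which is why log-based CDC is usually preferred.
    }
}
```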

Watching the database's transaction log is fast and imposes practically no performance impact on the source. Many CDC systems today have adopted this approach for their implementations.

How does a log-based CDC system work?

Users and applications make changes to the source database in the form of inserts, updates, and deletes. The database records these changes in its transaction log in the order of their occurrence.

The CDC system watches the transaction log for changes and propagates them to the target system while preserving the change order. The target system then replays the changes to update its internal state.
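
To make this concrete, a change event produced by a log-based CDC system usually carries the operation type, the before/after row images, and the transaction log position that lets consumers preserve ordering. The exact envelope differs per tool; the record below is only an illustrative assumption:

```java
// A minimal sketch of a change event as a log-based CDC system might emit it.
// Field names are illustrative; real tools (e.g. Debezium) define their own envelope.
public record ChangeEvent(
        String table,          // source table the change belongs to
        Operation op,          // INSERT, UPDATE or DELETE
        String keyJson,        // primary key of the affected row
        String beforeJson,     // row image before the change (null for INSERT)
        String afterJson,      // row image after the change (null for DELETE)
        long logPosition) {    // position in the transaction log, used to preserve order

    public enum Operation { INSERT, UPDATE, DELETE }
}
```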

Requirements for a production-grade CDC system

A production-grade CDC system should satisfy the following needs.

  • Message ordering guarantee — The order of changes MUST be preserved so that they reach the target systems in the order they occurred.

  • Pub/sub — Should support asynchronous, pub/sub style change propagation to consumers.

  • Reliable and resilient delivery — At-least-once delivery of changes; message loss cannot be tolerated.

  • Message transformation support — Should support light-weight message transformations, as the event payload needs to match the target system's input format.

Make it event driven

Event-Driven Architecture (EDA) seems an ideal fit to achieve the above system requirements. EDA brings asynchrony, loose coupling, and scalability to the table. By combining these EDA features, we can rethink the CDC architecture as follows.

The transaction log mining component captures the changes from the source database. It converts them into events and publishes them to the message bus. That happens in real-time while changes are made to the source database.

Change capture

Change events are then written to the message bus. It provides highly scalable and reliable change event storage while preserving the order of received events. Also, depending on the implementation, the message bus can provide at-least-once or exactly-once delivery guarantees.

A message broker like RabbitMQ or ActiveMQ can provide transient event storage, while a streaming platform such as Kafka or Kinesis provides durable event streaming. Choosing between them should be use-case driven; you can refer to my earlier post about their difference.

Change events are usually written to a topic so that any interested consumers can subscribe to receive updates.
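
A minimal sketch of this step, assuming Apache Kafka as the message bus: keying each change event by the affected row's primary key keeps all changes to that row ordered within a single partition. The topic name, key, and payload are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: publishing change events to a Kafka topic.
// Keying by the row's primary key keeps changes to the same row ordered within a partition.
public class ChangeEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");                               // favour reliable delivery
        props.put("enable.idempotence", "true");                // avoid duplicates on retry

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String key = "orders:42";                           // hypothetical primary key
            String value = "{\"op\":\"UPDATE\",\"after\":{\"id\":42,\"status\":\"SHIPPED\"}}";
            producer.send(new ProducerRecord<>("orders.changes", key, value));
        }
    }
}
```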

Change propagation

Any downstream application can subscribe to the above topic to receive change events. Usually, an intermediate component consumes the events, applies a light-weight transformation to the event payload, and publishes it to the target system. For example, a connector reads events from the topic, applies transformation, and updates a search index.
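
A sketch of such an intermediate component, again assuming Kafka as the bus. The topic name, consumer group, and the SearchIndexClient interface are hypothetical stand-ins for the real target system:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch: a connector-style consumer that transforms change events
// and pushes them to a target system (here, a hypothetical search index client).
public class SearchIndexUpdater {

    interface SearchIndexClient {                       // placeholder for the real target system
        void upsert(String documentId, String documentJson);
    }

    public static void run(SearchIndexClient searchIndex) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "search-index-updater");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders.changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Light-weight transformation: map the change event to the index's document format.
                    String documentJson = record.value();   // real code would reshape the payload here
                    searchIndex.upsert(record.key(), documentJson);
                }
            }
        }
    }
}
```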

Depending on the use case, consumers can do event-driven consumption or streaming consumption.

Advantages

The event-driven CDC approach adds the following benefits over traditional ETL and polling-based solutions.

  • Changes are detected, captured, and propagated in real-time as they happen. That enables downstream consumers to act upon changes quickly. Compared to traditional batch-oriented systems, that is a huge gain.

  • The loosely coupled nature allows components to be added to or removed from the architecture with minimal impact. Source and target systems can be upgraded or replaced without affecting each other.

  • The message bus in the middle provides reliable delivery of change events. Also, it can buffer incoming events if the rate of event production is higher than consumption. That will be beneficial for slow consumers.

  • Unlike polling and trigger-based methods, there is no performance impact on the source system.

Use cases

  • Cache invalidation

  • Search index building

  • Database migration (version upgrade, migrate to a different vendor, on-premise to the cloud)

  • Offline analytical processing

  • Data synchronisation in Microservices

Tools in the market

There are both open-source and commercial tools available in the CDC market. Many open-source tools are flexible enough to co-exist with popular messaging systems and target systems, whereas commercial tools sometimes ask you to buy their entire platform.

Mentioned below are some renowned open-source tools in the market.

Debezium

Debezium is an open-source CDC platform built on top of Apache Kafka. It has connectors that pull a change stream from databases like PostgreSQL, MySQL, MongoDB, and Cassandra and send it to Kafka. Kafka Connect hosts these connectors and handles change detection and propagation.

Even though Debezium utilises Kafka in its architecture, it offers other deployment options to cater to different infrastructure needs. Debezium can run as a standalone server (the Debezium server), or you can embed it into your application code as a library. Visit here for more information on that.
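
To illustrate the embedded option, here is a rough sketch based on Debezium's embedded engine API. The connector choice, connection settings, and file paths are assumptions, and exact configuration property names vary by connector and Debezium version:

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

// Sketch: embedding Debezium in application code instead of running it on Kafka Connect.
public class EmbeddedCdc {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "embedded-cdc");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        // Offsets must be persisted so the engine can resume from the last log position.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/cdc-offsets.dat");
        props.setProperty("database.hostname", "localhost");       // assumption: local PostgreSQL
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "cdc_user");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "orders_db");
        props.setProperty("topic.prefix", "orders");               // logical name for change topics

        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> {
                    // Each event is a JSON change record; forward it to a bus or target here.
                    System.out.println(event.value());
                })
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);   // DebeziumEngine implements Runnable
    }
}
```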

Maxwell

Maxwell reads MySQL binlogs and writes row updates as JSON to Kafka, Kinesis, or other streaming platforms. Maxwell has low operational overhead, requiring nothing but MySQL and a place to write.

Based on the article