Change Data Capture (CDC)

Modern applications need to keep their data in multiple places, in a redundant and denormalised manner.

  • One data store in the system has to be the source of truth (aka the system of record)

  • Other stores receive this data in a transformed form (derived data)

  • The data in all these stores has to be kept in sync

Change Data Capture (CDC) is the process of detecting changes made to the source of truth and propagating them to the derived data stores. The CDC process has three stages.

Change detection

There are a few options for detecting changes made to the source database.

  1. Polling the LAST_UPDATED column of tables periodically to detect changes.

  2. Database triggers to capture row-level operations.

  3. Watching the database transaction log for changes.

Polling is slow and puts extra load on the source database; the implementation also requires you to add dedicated columns to the source tables. Triggers deliver results in real time, but they are resource-intensive and can slow the database down.
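
As a rough sketch of the polling approach, a scheduled job can repeatedly query for rows whose LAST_UPDATED value is newer than the previous poll. The table and column names below are hypothetical, and plain JDBC is assumed:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.Instant;

// Minimal sketch of polling-based change detection (option 1).
// Table and column names (orders, last_updated) are hypothetical.
public class PollingChangeDetector {

    private Instant lastPolledAt = Instant.EPOCH;

    public void pollOnce(Connection connection) throws SQLException {
        String query = "SELECT id, status, last_updated FROM orders "
                + "WHERE last_updated > ? ORDER BY last_updated";
        try (PreparedStatement statement = connection.prepareStatement(query)) {
            statement.setTimestamp(1, Timestamp.from(lastPolledAt));
            try (ResultSet rows = statement.executeQuery()) {
                while (rows.next()) {
                    // Hand the changed row over to downstream processing.
                    System.out.printf("Changed row id=%d status=%s%n",
                            rows.getLong("id"), rows.getString("status"));
                    lastPolledAt = rows.getTimestamp("last_updated").toInstant();
                }
            }
        }
        // Note: this approach misses DELETEs and adds query load on every poll,
        // which is why log-based CDC is usually preferred.
    }
}
```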

Watching the database's transaction log is fast and imposes practically no performance impact on the source. Many CDC systems today have adopted this approach for their implementations.

How does a log-based CDC system work?

Users and applications make changes to the source database in the form of inserts, updates, and deletes. The database records these changes in its transaction log in the order of their occurrence.

The CDC system watches the transaction log for changes and propagates them to the target system while preserving the change order. The target system then replays the changes to update its internal state.
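
To make this concrete, a change event produced by a log-based CDC system usually carries the operation type, the before/after row images, and the transaction log position that lets consumers preserve ordering. The exact envelope differs per tool; the record below is only an illustrative assumption:

```java
// A minimal sketch of a change event as a log-based CDC system might emit it.
// Field names are illustrative; real tools (e.g. Debezium) define their own envelope.
public record ChangeEvent(
        String table,          // source table the change belongs to
        Operation op,          // INSERT, UPDATE or DELETE
        String keyJson,        // primary key of the affected row
        String beforeJson,     // row image before the change (null for INSERT)
        String afterJson,      // row image after the change (null for DELETE)
        long logPosition) {    // position in the transaction log, used to preserve order

    public enum Operation { INSERT, UPDATE, DELETE }
}
```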

Requirements for a production-grade CDC system

A production-grade CDC system should satisfy the following needs.

  • Message ordering guarantee — The order of changes MUST be preserved so that they reach the target systems in the order they occurred.

  • Pub/sub — Should support asynchronous, pub/sub style change propagation to consumers.

  • Reliable and resilient delivery — At-least-once delivery of changes; message loss cannot be tolerated.

  • Message transformation support — Should support light-weight message transformations, as the event payload needs to match the target system's input format.

Make it event driven

Event-Driven Architecture (EDA) seems an ideal fit to achieve the above system requirements. EDA brings asynchrony, loose coupling, and scalability to the table. By combining these EDA features, we can rethink the CDC architecture as follows.

The transaction log mining component captures the changes from the source database. It converts them into events and publishes them to the message bus. That happens in real-time while changes are made to the source database.

Change capture

Change events are then written to the message bus. It provides highly scalable and reliable change event storage while preserving the order of received events. Also, depending on the implementation, the message bus can provide at-least-once or exactly-once delivery guarantees.

A message broker like RabbitMQ or ActiveMQ can provide transient event storage, while a streaming platform such as Kafka or Kinesis provides durable event streaming. Choosing between them should be use-case driven; you can refer to my earlier post about their difference.

Change events are usually written to a topic so that any interested consumers can subscribe to receive updates.
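
A minimal sketch of this step, assuming Apache Kafka as the message bus: keying each change event by the affected row's primary key keeps all changes to that row ordered within a single partition. The topic name, key, and payload are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: publishing change events to a Kafka topic.
// Keying by the row's primary key keeps changes to the same row ordered within a partition.
public class ChangeEventPublisher {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");                               // favour reliable delivery
        props.put("enable.idempotence", "true");                // avoid duplicates on retry

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String key = "orders:42";                           // hypothetical primary key
            String value = "{\"op\":\"UPDATE\",\"after\":{\"id\":42,\"status\":\"SHIPPED\"}}";
            producer.send(new ProducerRecord<>("orders.changes", key, value));
        }
    }
}
```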

Change propagation

Any downstream application can subscribe to the above topic to receive change events. Usually, an intermediate component consumes the events, applies a light-weight transformation to the event payload, and publishes it to the target system. For example, a connector reads events from the topic, applies transformation, and updates a search index.
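
A sketch of such an intermediate component, again assuming Kafka as the bus. The topic name, consumer group, and the SearchIndexClient interface are hypothetical stand-ins for the real target system:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

// Sketch: a connector-style consumer that transforms change events
// and pushes them to a target system (here, a hypothetical search index client).
public class SearchIndexUpdater {

    interface SearchIndexClient {                       // placeholder for the real target system
        void upsert(String documentId, String documentJson);
    }

    public static void run(SearchIndexClient searchIndex) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "search-index-updater");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders.changes"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Light-weight transformation: map the change event to the index's document format.
                    String documentJson = record.value();   // real code would reshape the payload here
                    searchIndex.upsert(record.key(), documentJson);
                }
            }
        }
    }
}
```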

Depending on the use case, consumers can do event-driven consumption or streaming consumption.

Advantages

The event-driven CDC approach adds the following benefits over traditional ETL and polling-based solutions.

  • Changes are detected, captured, and propagated in real-time as they happen. That enables downstream consumers to act upon changes quickly. Compared to traditional batch-oriented systems, that is a huge gain.

  • The loosely coupled nature allows components to be added to or removed from the architecture with minimal impact. Source and target systems can be upgraded or replaced without affecting each other.

  • The message bus in the middle provides reliable delivery of change events. Also, it can buffer incoming events if the rate of event production is higher than consumption. That will be beneficial for slow consumers.

  • Unlike polling and trigger-based methods, there is no performance impact on the source system.

Use cases

  • Cache invalidation

  • Search index building

  • Database migration (version upgrade, migrate to a different vendor, on-premise to the cloud)

  • Offline analytical processing

  • Data synchronisation in Microservices

Tools in the market

There are both open-source and commercial tools available in the CDC market. Many open-source tools are flexible enough to co-exist with popular messaging systems and target systems, whereas commercial tools sometimes ask you to buy their entire platform.

Mentioned below are some renowned open-source tools in the market.

Debezium

Debezium is an open-source CDC platform built on top of Apache Kafka. It has connectors that pull a change stream from databases like PostgreSQL, MySQL, MongoDB, and Cassandra and send it to Kafka. Kafka Connect hosts these connectors and handles change detection and propagation.

Even though Debezium utilises Kafka in its architecture, it offers other deployment options to cater to different infrastructure needs. Debezium can run as a standalone server (the Debezium server), or you can embed it into your application code as a library. Visit here for more information on that.
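
To illustrate the embedded option, here is a rough sketch based on Debezium's embedded engine API. The connector choice, connection settings, and file paths are assumptions, and exact configuration property names vary by connector and Debezium version:

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

// Sketch: embedding Debezium in application code instead of running it on Kafka Connect.
public class EmbeddedCdc {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "embedded-cdc");
        props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
        // Offsets must be persisted so the engine can resume from the last log position.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/cdc-offsets.dat");
        props.setProperty("database.hostname", "localhost");       // assumption: local PostgreSQL
        props.setProperty("database.port", "5432");
        props.setProperty("database.user", "cdc_user");
        props.setProperty("database.password", "secret");
        props.setProperty("database.dbname", "orders_db");
        props.setProperty("topic.prefix", "orders");               // logical name for change topics

        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(event -> {
                    // Each event is a JSON change record; forward it to a bus or target here.
                    System.out.println(event.value());
                })
                .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine);   // DebeziumEngine implements Runnable
    }
}
```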

Maxwell

Maxwell reads MySQL binlogs and writes row updates as JSON to Kafka, Kinesis, or other streaming platforms. Maxwell has low operational overhead, requiring nothing but MySQL and a place to write.

Based on the article