Unraveling Change Data Capture (CDC) - A Guide to Different Methods and Techniques

July 26, 2023 · 7 min read

OLake Maintainer

Unraveling Change Data Capture (CDC) - A Guide to Different Methods and Techniques cover image

Check out OLake, our latest open source offering for replicating databases to lakehouses.

As data sets continue to expand and applications demand real-time responsiveness, traditional ETL processes face challenges in meeting these requirements. So, how can databases adapt to the demands of real-time data processing?

In our previous blog we talked about the meaning and advantages of CDC. As a quick refresher, CDC serves as the foundation for synchronizing another system with incremental changes or creating an audit trail of data modifications.

This audit trail can be utilized for various purposes, such as updating a data warehouse or conducting analyses to identify patterns of changes.

By leveraging CDC, businesses can enhance data integration, analysis, and decision-making processes effectively.

Methods of Change data Capture

The implementation of Change Data Capture (CDC) can vary depending on specific requirements and technologies employed. Here are several common methods used to implement CDC

Log-based change data capture (CDC)

In databases designed for online transaction processing (OLTP), a transaction log is utilized to record changes. This log serves a crucial role in ensuring loss-less database recovery in the event of a system or database crash.

All committed and recoverable changes can be traced back through the database transaction logs.

To capture changes from the transaction log, a log-based mechanism is employed. This mechanism parses the changes asynchronously from the transactions that submit them. While some database technologies offer an API for log-based CDC, others necessitate a deep understanding of the system to extract changes effectively.

Certain vendors, like Datazip, offer log-based CDC using binary log readers. These readers directly parse the transaction log without intermediate API layers, ensuring efficient and unrestricted change data capture.

Binary log readers are also designed to have minimal impact on database processing and provide greater flexibility in deployment options. For instance, binary log reading can be performed on a standby system or based on log backups.

Log-based CDC is particularly beneficial for data replication use cases that require transactional consistency in line with the data source tables.

As transaction boundaries and commit order are preserved in the transaction log, log-based CDC inherently meets this requirement.

Moreover, log-based CDC offers the capability to access historical changes, enabling you to retrieve data from earlier time points.

Ideal for : Log-based CDC is ideal for your critical and high-traffic databases that cannot tolerate any slowdown in database processing.

These systems host your most vital datasets, essential for analytical use cases, and log-based CDC efficiently manages real-time high-volume change data.

When to use other methods : However, in certain situations, other types of CDC might be more suitable. For instance, log-based CDC might not be feasible for certain database technologies or deployments, such as cloud-based databases or applications delivered as hosted solutions or appliances.

In such cases, alternative CDC methods need to be considered to ensure seamless data integration and synchronization.

Trigger-based change data capture (CDC)

One of the traditional ways to implement CDC is by using database triggers. Triggers are custom code snippets that are executed automatically when a specific event (like an insert, update, or delete) occurs in a database table.

By creating triggers on the source tables, you can capture the changes and record them in a separate CDC table.

Here are some common implementations for trigger-based capture:

Database Triggers: As the name suggests, triggers are custom code snippets or stored procedures that are defined to execute automatically in response to specific data modification events in the source database. These triggers are configured on the tables of interest to capture inserts, updates, or deletes.
Row-Level Triggers: Row-level triggers capture changes at the individual row level. They are designed to trigger and record changes for each affected row when an insert, update, or delete operation occurs.
After Triggers: After triggers are executed after the data modification event has taken place. These triggers are often used to capture changes once the data has been successfully inserted, updated, or deleted in the source table.
Instead-Of Triggers: Instead-of triggers allow for custom logic to be executed instead of the standard insert, update, or delete operations. This allows for more control over data capture and manipulation before the actual modification occurs.
Shadow Tables: In some cases, shadow tables or audit tables are created to store the changes captured by triggers. These tables mirror the structure of the source tables and serve as a historical record of the data changes.
Batch Processing: Trigger-based capture can be combined with batch processing to optimize data transfer. Rather than immediately replicating each change, the triggers capture the changes, and the replication occurs in batches at specific intervals, reducing the load on the target systems.

Ideal for : Scenarios where transaction and database performance are not paramount, and the system possesses sufficient processing capacity.

Consider other types of CDC when : Critical data processing demands or high data volumes make it impractical to use database triggers. Additionally, certain off-the-shelf applications may not permit the addition of triggers.

If your use case necessitates transactional consistency, trigger-based CDC may not be the most suitable option.

Difference or Snapshot based Change Data Capture (Diff CDC)

Difference-based Change Data Capture (CDC) is a specific method of capturing and replicating data changes that focuses on capturing only the differences or changes made to the source data since the last capture process.

To conduct a comprehensive difference analysis, it is necessary to retrieve all the data for comparison. The difference is established by comparing two snapshots of the data.

This approach is particularly efficient for data replication as it significantly reduces the amount of data that needs to be transferred and processed, leading to reduced network and storage costs.

Ideal for : Relatively small data sets and scenarios where real-time or near-real-time data synchronization is crucial, enabling timely updates and ensuring that target systems remain up-to-date with the latest changes from the source.

Consider other types of CDC when : your application necessitates retaining the complete history of data changes or if your use case demands precise transactional consistency.

Updated at / Modified at / Timestamp / Watermark based Change Data Capture

Certain applications keep track of every row change by recording the timestamp of each modification in a separate column, such as "LAST_MODIFIED." Extracting changes becomes a straightforward process by maintaining a record of the most recently extracted changes and filtering based on this timestamp column.

However, a significant drawback of this approach is its inability to identify deleted rows. To address this limitation, some technologies store a low watermark in every row, like PostgreSQL's use of XMIN.

In cases where the low watermark value increases monotonically, Change Data Capture (CDC) can utilize this alternative low watermark approach to capture and track data changes effectively.

Extracting the changes uses a lot of resources. Of course watermark may be indexed to lower the impact of the select statement at the cost of storing (and continuously updating) the additional index.

Ideal for : traditional data warehouse applications that rely on Extract, Transform, and Load (ETL) jobs, especially when the source tables do not process deletions and when applications have a dependable low watermark column across all tables.

Consider other types of CDC when : you need to process deletions and prioritize exact transactional consistency, timestamp or low watermark-based CDC may not be the most suitable option.

Change Data Capture with Datazip

At Datazip, we understand the importance of data integration and its impact on business decision-making. That's why our super data app is designed to provide an end-to-end data platform, offering everything from seamless data integration to interactive dashboards.

Datazip offers a comprehensive platform equipped with over 150+ pre-built connectors, empowering data teams to seamlessly centralize and transform data from numerous SaaS and on-premises data sources into cloud destinations.

Our connectors are thoughtfully enhanced with right methods of CDC technology, guaranteeing efficient and high-volume data movement, thereby catering to a wide range of deployment options.

With Datazip's cutting-edge technology, data integration becomes a hassle-free and efficient process, enabling organizations to unlock the true potential of their data.

Ready to experience the benefits of CDC? Sign up today to start your free trial in your own private cloud!

Methods of Change data Capture​

Log-based change data capture (CDC)​

Trigger-based change data capture (CDC)​

Difference or Snapshot based Change Data Capture (Diff CDC)​

Updated at / Modified at / Timestamp / Watermark based Change Data Capture​

Change Data Capture with Datazip​