Real-time Change Data Capture

Change Data Capture (CDC) is a modern data architecture that provides a highly efficient way to move data, even across a wide area network. Because it moves data in real time, it also supports real-time analytics and data science.

By capturing every transactional change on the source database and transferring it to the target in real time, CDC keeps source and target systems synchronised. This provides reliable data replication and enables cloud migrations with no downtime.

Benefits:

- Modern data architecture that supports real-time data delivery.
- Zero-footprint architecture via log-based configuration.
- Cloud optimised.
- Multiple approaches to CDC to support all sources, regardless of whether they are on-premises, cloud-based or SaaS applications.

Benefits of Change Data Capture

CDC offers numerous advantages in your overall data integration strategy. Whether you're transferring data to a data warehouse or data lake, establishing an operational data store or a real-time replica of the source data, or even implementing a cutting-edge data fabric architecture, CDC empowers your organisation to extract maximum value from your data and cater for all use-cases. It enables seamless integration and accelerated data analysis, all while optimising system resources. Here are some key benefits:

  • Eliminates the need for bulk load updating and inconvenient batch windows by enabling incremental loading or real-time streaming of data changes into your target repository.
  • Log-based CDC is a highly efficient approach that limits the impact on the source system when extracting new data.
  • Since CDC moves data in real time, it facilitates zero-downtime database migrations and supports real-time analytics, fraud protection, and synchronising data across geographically distributed systems.
  • CDC is a very efficient way to move data across a wide area network, so it’s perfect for the cloud.
  • Change data capture is also well suited to moving data into a stream processing solution like Apache Kafka (a brief sketch follows this list).
  • CDC ensures that data in multiple systems stays in sync. This is especially important if you're making time-sensitive decisions in a high-velocity data environment.
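
To make the Kafka point above concrete, here is a minimal sketch of publishing a CDC-style change event to a stream. It assumes the kafka-python client, a broker on localhost:9092, and an illustrative customer.changes topic and event shape; it is not tied to any particular CDC tool.

```python
import json
from datetime import datetime, timezone

# kafka-python client; topic name and event shape are illustrative assumptions
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A CDC-style change event: operation type, before/after images, and a timestamp
change_event = {
    "op": "u",                       # c = insert, u = update, d = delete
    "table": "customers",
    "before": {"id": 42, "email": "old@example.com"},
    "after":  {"id": 42, "email": "new@example.com"},
    "ts": datetime.now(timezone.utc).isoformat(),
}

# Publish the change so downstream consumers (analytics, replicas) stay in sync
producer.send("customer.changes", change_event)
producer.flush()
```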

Change Data Capture Approaches

There are a few ways to implement change data capture in your data pipelines. Historically, developers and DBAs utilised techniques such as table differencing, change-value selection, and database triggers to capture changes made to a database. These methods, however, can be inefficient or intrusive and tend to place substantial overhead on source servers. This is why DBAs quickly embraced embedded CDC features that are log-based. These features utilise a background process to scan database transaction logs in order to capture changed data. Transactions are therefore unaffected, and the performance impact on source servers is minimised.

The most popular method is to use the transaction log, which records changes made to the database's data and metadata. Here we discuss the three primary approaches.

  1. Log-based CDC: This is the most efficient way to implement CDC and is widely recognised as the gold standard. When a new transaction is committed, the database records it in its transaction log with no additional impact on the source system; the CDC process then picks those changes up from the log and moves them to the target (see the first sketch after this list).

  2. Query-based CDC: Here you query the data in the source to pick up changes, typically filtering on something like a last-updated timestamp column. This approach is more invasive to the source system because it relies on that extra column and places repeated query load on the source (see the second sketch after this list).

  3. Trigger-based CDC: In this approach, you modify the source database so that a trigger writes each change to a change table, which is then moved to the target. This approach reduces database performance because it requires multiple writes each time a row is updated, inserted, or deleted (see the third sketch after this list).
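
For log-based CDC, the sketch below shows one way to read changes from a database transaction log. It is a minimal sketch assuming a PostgreSQL source with logical decoding enabled (wal_level = logical), the psycopg2 driver, and an illustrative replication slot name; dedicated tools such as Debezium wrap this same idea in a more robust way.

```python
# A minimal log-based CDC sketch: stream decoded changes from PostgreSQL's
# write-ahead log using a logical replication slot.
import psycopg2
import psycopg2.extras

# Connect with a replication connection so we can stream the transaction log
conn = psycopg2.connect(
    "dbname=appdb user=replicator",
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# Create the slot once; the built-in test_decoding plugin emits human-readable
# change records (production setups typically use pgoutput or a CDC tool).
cur.create_replication_slot("cdc_demo", output_plugin="test_decoding")
cur.start_replication(slot_name="cdc_demo", decode=True)

def handle_change(msg):
    # Each message is one decoded change from the transaction log;
    # forward it to the target system, then acknowledge it.
    print(msg.payload)
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

# Blocks and invokes handle_change for every new change written to the log
cur.consume_stream(handle_change)
```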
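
For query-based CDC, the sketch below polls the source for rows changed since the last run, using a last_updated column as a high-water mark. Table and column names are illustrative assumptions, and sqlite3 stands in for the source database.

```python
import sqlite3

# sqlite3 stands in for the source system; schema is illustrative
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, last_updated TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", "2024-01-01T10:00:00"),
     (2, "b@example.com", "2024-01-02T09:30:00")],
)

def fetch_changes(conn, last_seen):
    """Return rows modified since last_seen plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, email, last_updated FROM customers "
        "WHERE last_updated > ? ORDER BY last_updated",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark

# Each polling cycle re-queries the source table, which is what makes this
# approach more invasive than reading the transaction log.
high_water_mark = "1970-01-01T00:00:00"
changes, high_water_mark = fetch_changes(conn, high_water_mark)
for row in changes:
    print("changed row:", row)
```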
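
For trigger-based CDC, the sketch below defines a database trigger that copies every update into a change table, which the pipeline would then ship to the target. Again, table names are illustrative and sqlite3 stands in for the source database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE customers_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, id INTEGER, email TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );

    -- Every update now costs a second write: the overhead mentioned above
    CREATE TRIGGER customers_update AFTER UPDATE ON customers
    BEGIN
        INSERT INTO customers_changes (op, id, email)
        VALUES ('u', NEW.id, NEW.email);
    END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'old@example.com')")
conn.execute("UPDATE customers SET email = 'new@example.com' WHERE id = 1")

# The change table is what gets read and shipped to the target
for row in conn.execute("SELECT * FROM customers_changes"):
    print(row)
```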

ETL vs ELT

A growing topic of discussion when planning a data pipeline, particularly one coupled with change data capture, is which architecture to use: ETL (Extract, Transform and then Load) or ELT (Extract, Load and then Transform).

ETL is the traditional architecture that's still most commonly implemented today. It's great for small to medium data sets, and the data is immediately usable once loaded into the data warehouse because it has already been transformed. It does have one increasingly significant flaw: transforming the data can be time consuming, causing a prolonged load window. As a result, ELT has gained traction more recently as data volumes grow, data pipelines move to the cloud and real-time use-cases arise.

ELT (Extract, Load and then Transform) is the newer, modern approach for data pipelines. You extract the data and land it in the data warehouse as-is, which makes it quicker and more efficient to make the data available, with transformations performed on an as-needed basis. This makes it well suited to real-time use-cases, large data sets, and moving the analytical load away from source systems.
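
As a rough illustration of the ELT flow, the sketch below lands raw extracted rows in the warehouse unchanged and only applies the transformation inside the warehouse, on demand. sqlite3 stands in for the warehouse and the table names are illustrative assumptions.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Extract + Load: copy source rows straight into a raw staging table, untouched
warehouse.execute("CREATE TABLE raw_orders (id INTEGER, amount_pence INTEGER, status TEXT)")
extracted = [(1, 1250, "SHIPPED"), (2, 400, "cancelled"), (3, 9900, "shipped")]
warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", extracted)

# Transform: run on an as-needed basis, inside the warehouse, close to the data
warehouse.execute("""
    CREATE VIEW orders_clean AS
    SELECT id, amount_pence / 100.0 AS amount_gbp, LOWER(status) AS status
    FROM raw_orders
    WHERE LOWER(status) != 'cancelled'
""")

for row in warehouse.execute("SELECT * FROM orders_clean"):
    print(row)
```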

ETL

(Extract Transform Load)

  • Suitable for small to medium data sets.
  • Data available immediately after loading.
  • Best for traditional analytical use-cases.

ELT

(Extract Load Transform)

  • Can scale for large data sets.
  • Transformations performed on an as-needed basis.
  • Can be used for real-time use-cases.