Change Capture
A synthetic change data capture (CDC) pattern compares incoming records against previously captured records and filters out those that have not changed.
When the source system returns a very large amount of data, this pattern is the least efficient option when used by itself,
since the data must be read in full before being evaluated row by row for possible changes.
This pattern differs from the CDC Stream pattern, in which the source system has already performed the necessary operations to
provide a change stream. Other variations include the CDC Watermark and the Window Capture patterns.
Pattern Overview
This pattern describes a read operation from the source system that is compared to a previous read in order to
determine which records are new, updated, or optionally deleted. This pattern relies on reading the same set of data
from the source every time; in other words, if nothing changes, the exact same set of records should be returned.
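The comparison described above can be sketched in a few lines of Python. This is only an illustration of the general technique, assuming each row is a dictionary and that the previous read is kept as a map from key to row hash; the names `row_hash`, `capture_changes`, and `previous_state` are hypothetical and do not reflect DataZen's internal implementation.

```python
import hashlib
import json

def row_hash(row, key_columns):
    """Hash the non-key columns of a row so that changes can be detected cheaply."""
    payload = {k: v for k, v in row.items() if k not in key_columns}
    encoded = json.dumps(payload, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()

def capture_changes(rows, previous_state, key_columns, detect_deletes=False):
    """Compare a full read against the state saved by the previous read.

    previous_state maps a key tuple to the row hash seen on the prior run.
    Returns (changes, new_state); new_state is saved for the next run.
    """
    changes = {"inserted": [], "updated": [], "deleted": []}
    new_state = {}
    for row in rows:
        key = tuple(row[c] for c in key_columns)
        digest = row_hash(row, key_columns)
        new_state[key] = digest
        if key not in previous_state:
            changes["inserted"].append(row)   # key never seen before
        elif previous_state[key] != digest:
            changes["updated"].append(row)    # same key, different content
    if detect_deletes:
        # Keys present in the previous read but missing from this one.
        changes["deleted"] = [k for k in previous_state if k not in new_state]
    return changes, new_state
```

Note that every row must be read and hashed on each run, even when nothing has changed, which is why the size of the source data set drives the cost of this pattern.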
Understanding the performance aspect of using this pattern is key in determining whether it is fit for the scenario at hand,
both in terms of resource consumption and frequency of the capture. For example, this pattern may work well if the
source system returns 10,000 records and the CDC operation needs to run every hour, but may be a performance concern
if 1,000,000 records need to be read every 5 seconds.
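A quick back-of-the-envelope calculation makes the difference between these two scenarios concrete. The helper `required_throughput` below is a hypothetical name used only for this illustration; it computes how many rows per second must be read and compared just to keep up with the schedule.

```python
def required_throughput(record_count, interval_seconds):
    """Rows per second that must be read and compared to keep up with the schedule."""
    return record_count / interval_seconds

hourly = required_throughput(10_000, 3600)     # 10,000 records every hour
frequent = required_throughput(1_000_000, 5)   # 1,000,000 records every 5 seconds

print(f"{hourly:.1f} rows/s vs {frequent:,.0f} rows/s")
# prints "2.8 rows/s vs 200,000 rows/s"
```

The second scenario requires a sustained read rate roughly 72,000 times higher than the first, before accounting for the comparison work itself.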
In some cases, you may be able to use an immutable filter to limit the universe of meaningful records in scope for
the CDC operation. For example, you may decide that records older than a specific date can always be ignored.
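As a minimal sketch of such a filter, assuming rows carry a `created_on` date (a hypothetical column name) and a fixed cutoff chosen by the administrator:

```python
from datetime import date

# Hypothetical fixed cutoff: it does not move between runs.
CUTOFF = date(2020, 1, 1)

def in_scope(row, cutoff=CUTOFF):
    """Keep only rows created on or after the fixed cutoff date."""
    return row["created_on"] >= cutoff

rows = [
    {"id": 1, "created_on": date(2018, 6, 1)},
    {"id": 2, "created_on": date(2021, 3, 15)},
]
scoped = [r for r in rows if in_scope(r)]  # only the row with id 2 remains
```

The cutoff is a constant rather than a rolling window such as "the last 30 days". Because the pattern relies on the same set of records being returned on every read, a moving boundary would presumably cause records to drop out of scope over time and be mistaken for deletions.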
DataZen Implementation
The implementation of the Synthetic CDC pattern is the same regardless of the source system. Two implementations are available, as summarized in the table below. For more information, see the DataZen Synthetic CDC section.
Implementation | DataZen-Managed | Available For | Supported Change Capture | Comments |
---|---|---|---|---|
Job CDC | Yes | Source Data | Inserted, Updated, Deleted | The default Synthetic CDC engine implemented by DataZen takes place as the last step, after reading source records and applying the source Data Pipeline |
Inline CDC | No | Source or Target Data | Inserted, Updated | The inline Synthetic CDC operation within a Data Pipeline can be applied to the source stream and/or the target stream independently; some built-in features (such as CDC reinitialization) may not be available and are managed manually by an administrator |
Job CDC
To use the DataZen-managed Synthetic CDC feature, select the field(s) that identify a unique record in the CDC Key Columns textbox.
By default, the CDC operation identifies new and updated records. To also capture deleted records, check the
Identify and propagate deleted records option.
How deleted records are identified depends on the source system; when the source system is a relational database, additional options are available.
When a job is created and runs for the first time, DataZen by default creates a Change Log in which all source records are identified
as new records. If the target system already has data and you would like to simply initialize the CDC internal table (and discard the Change Log),
check the Initialize the CDC table only on first run option. This option may be useful if you are recreating a CDC job and you know the target system already
has all the necessary records.
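The two first-run behaviors can be contrasted with a small sketch. The function `first_run` and its `initialize_only` flag are hypothetical names used to illustrate the idea and are not part of DataZen; the sketch assumes rows are dictionaries and tracks only their keys for brevity.

```python
def first_run(rows, key_columns, initialize_only=False):
    """Return (change_log, tracked_keys) for the first execution of a job."""
    # In both cases the tracking state is populated with every source record.
    tracked_keys = {tuple(row[c] for c in key_columns) for row in rows}
    # Default: every source record is reported as new.
    # Initialize only: the tracking state is built but nothing is reported.
    change_log = [] if initialize_only else list(rows)
    return change_log, tracked_keys

source = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
default_log, _ = first_run(source, ["id"])                      # 2 new records
silent_log, keys = first_run(source, ["id"], initialize_only=True)  # empty log
```

In both cases the tracking state ends up identical, so the next run only reports records that changed after initialization.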
The Job Pipeline lifecycle is such that the Synthetic CDC operation happens after the completion of the Data Pipeline defined on the source system. This ensures that the necessary data translations, masking, or other operations take place before the change capture process.
Inline CDC
The Synthetic CDC operation can be added to a Data Pipeline directly, as part of the data transformation handling logic.
Unlike the DataZen-Managed CDC, this operation uses a customer-supplied database table, allowing administrators to
monitor which records are changed, and optionally force changes by modifying the CDC tracking table directly.
A few limitations apply.
See the Inline CDC pipeline component for more information.