Enzo DataZen Architecture

August 2021 DataZen


DataZen is a Change Data Capture (CDC) technology that allows organizations to copy, replicate, and exchange data from any source system into any target platform, even if those systems are hosted by different companies. This document provides an overview of key design principles used by DataZen that enable a flexible and fully decoupled replication architecture, including:

  • Universal CDC Log: A generic change data format enables heterogeneous cross-system replication (messaging, flat files, APIs, HTTP endpoints, and databases)

  • Synthetic Change Data Capture: Automatic detection of added, updated, and deleted records in any source system over time

  • File-based replication: Portable CDC Log enables B2B data replication, full initial synchronization, replay functionality, and multicasting

Universal CDC Log

DataZen is designed to replicate data from any source to any target system, regardless of the inherent compatibility of these systems, or whether these systems are located in the same data center. For example, DataZen can replicate data from an Oracle database to a SQL Server database, or from a SharePoint List to Parquet files, or from Parquet files to a messaging platform such as an Azure Event Hub. This capability is possible thanks to the universal change format used by DataZen when creating the CDC Log (the CDC Log is sometimes referred to as a Data Sync File).


[Figure: Change Log Architecture. The generic CDC Log created by DataZen enables any-to-any replication.]




[Figure: Replicate MySQL tables to any target system. CDC Log files can be played back on any target system.]

The CDC Log is built by extracting data from the source system and turning every record into an internal data row. Data is read through DataZen's built-in drivers, Enzo Server's Data Virtualization, or ODBC drivers, which turns any source system into a virtual database with an optional list of primary keys for unique record identification. Whenever possible, DataZen also extracts the detailed schema of the source data being retrieved. The changed data is then stored in a compressed internal format directly within the CDC Log, which may be broken up into chunks of individual change tables so that each chunk can fit in memory. Each CDC Log can hold both Upserted (inserted and updated) and Deleted records.
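As a rough illustration of the structure described above, the following Python sketch models a change log that holds a schema, upserted rows, and deleted keys, and that hands out its upserts in memory-sized chunks. The class name, field names, and chunking method are assumptions made for this example; DataZen's actual compressed internal format is not documented here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List

Row = Dict[str, Any]


@dataclass
class ChangeLog:
    """Hypothetical in-memory stand-in for a CDC Log (not DataZen's real format)."""
    key_columns: List[str]
    # Column name -> type name, filled in when the source exposes a schema.
    schema: Dict[str, str] = field(default_factory=dict)
    # Inserted and updated rows share one list ("upserts").
    upserts: List[Row] = field(default_factory=list)
    # Deleted records only need their key columns.
    deletes: List[Row] = field(default_factory=list)

    def upsert_chunks(self, chunk_size: int) -> Iterator[List[Row]]:
        """Yield the upserts in slices small enough to process one at a time."""
        for start in range(0, len(self.upserts), chunk_size):
            yield self.upserts[start:start + chunk_size]


log = ChangeLog(key_columns=["id"], schema={"id": "int", "name": "str"})
log.upserts = [{"id": i, "name": f"customer-{i}"} for i in range(5)]
log.deletes = [{"id": 99}]
chunks = list(log.upsert_chunks(chunk_size=2))
```

Here five upserted rows are split into chunks of at most two rows each, so a writer could apply one chunk at a time without loading the full log.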

Each CDC Log is assigned a unique Execution Id, which represents the timestamp when the log was created. The Execution Id is used to determine the order of execution of the log files, and is both part of the naming convention of the CDC Log File, and stored within the log file itself for auditing purposes. The CDC Log also contains additional metadata information, such as the source Job Name, summary change log information and other properties that are used when applying the log to target systems.
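The ordering role of the Execution Id can be sketched as follows. The file naming pattern (job name, underscore, Execution Id, ".log") and the timestamp layout are assumptions chosen for illustration only; the source does not specify the real naming convention.

```python
from datetime import datetime, timezone


def make_execution_id(created: datetime) -> str:
    """Render a creation time as a fixed-width UTC digit string (assumed layout)."""
    return created.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S%f")


def log_file_name(job_name: str, execution_id: str) -> str:
    """Hypothetical naming convention: <job name>_<execution id>.log"""
    return f"{job_name}_{execution_id}.log"


def execution_id_of(file_name: str) -> str:
    """Recover the execution id from a file name built by log_file_name."""
    stem = file_name.rsplit(".", 1)[0]
    return stem.rsplit("_", 1)[1]


def in_playback_order(file_names):
    """Fixed-width digit strings sort lexicographically in chronological order."""
    return sorted(file_names, key=execution_id_of)


first = make_execution_id(datetime(2021, 8, 1, 12, 0, tzinfo=timezone.utc))
later = make_execution_id(datetime(2021, 8, 1, 12, 5, tzinfo=timezone.utc))
ordered = in_playback_order([
    log_file_name("daily_orders", later),
    log_file_name("daily_orders", first),
])
```

Because the id is embedded in the file name, a set of log files copied to another environment can be ordered without opening them.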



DataZen allows administrators to secure the CDC Log using PGP encryption; this ensures that the log file can be safely copied across public networks and can only be played back by parties that hold the associated decryption key.

Synthetic Change Data Capture

DataZen creates the CDC Log by comparing data retrieved from the source system to the last known values in each row by using an efficient hashing mechanism. The hashes of the last known values are stored in an internal Hash Table along with the hash values of the primary keys. As a result, DataZen can detect changes in any given row from a source system while, for security reasons, the actual values of each row are never stored in DataZen's internal tables.


[Figure: CDC Engine]


Using its internal Hash Table, DataZen's CDC Engine can quickly identify new, updated, and deleted records. These changes are stored in the CDC Log described previously. The ability to compare the state of each record using Hash Tables allows DataZen to generate CDC Logs against any source system. Creating CDC Logs by comparing records from their last known state is referred to as a Synthetic Change Data Capture; the CDC Log is constructed by inspecting the state of each record at specific intervals.

Because a Synthetic CDC inspects changes made to the data at a given point in time, not all changes made to the source system may be detected. For example, if a record has been added then deleted almost immediately, the CDC Engine may not know that the record was ever created in the first place because only the net changes will be identified.
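The net-change behavior can be illustrated with a small simulation. The event tuples and helper names below are hypothetical; the point is only that comparing two snapshots hides any activity that cancels out between them.

```python
def apply_events(state, events):
    """Apply (operation, key, value) events to a copy of the source state."""
    state = dict(state)
    for op, key, value in events:
        if op == "delete":
            state.pop(key, None)
        else:  # "insert" and "update" both set the value
            state[key] = value
    return state


def net_changes(before, after):
    """Compare two snapshots; intermediate events are invisible here."""
    upserts = {k: v for k, v in after.items() if before.get(k) != v}
    deletes = [k for k in before if k not in after]
    return upserts, deletes


before = {1: "Ann"}
events = [
    ("insert", 2, "Bob"),     # created between two polls...
    ("update", 1, "Anna"),
    ("update", 1, "Ann B."),
    ("delete", 2, None),      # ...and removed before the next poll
]
after = apply_events(before, events)
upserts, deletes = net_changes(before, after)
```

Four events occurred at the source, but the comparison reports a single upsert for record 1 with its final value, and nothing at all for record 2.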

Certain systems, such as SharePoint Online or SQL Server, provide their own CDC tables; when available, DataZen can be configured to query the source system's own CDC table to capture all changes.

Advanced options are available to fetch only the records that were added or updated in the source system by leveraging date/time fields when available. This is particularly important for slower or remote source systems such as SharePoint Online; when doing so, the CDC Engine may need to make a second call into the source system to identify deleted records. DataZen offers advanced options to query source systems for deleted records.
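The two-call pattern described above might look like the following sketch. The column name modified_at, the helper functions, and the in-memory list standing in for a remote source are all assumptions for illustration; they do not represent DataZen configuration options.

```python
from datetime import datetime


def fetch_modified_since(source_rows, watermark, modified_column="modified_at"):
    """Stand-in for a filtered source query: rows changed after the watermark."""
    return [r for r in source_rows if r[modified_column] > watermark]


def fetch_current_keys(source_rows, key_column="id"):
    """Stand-in for a second, lightweight call that returns only the keys."""
    return {r[key_column] for r in source_rows}


def incremental_pass(source_rows, watermark, known_keys):
    """Return (changed_rows, deleted_keys, new_watermark) for one polling pass."""
    changed = fetch_modified_since(source_rows, watermark)
    # Deleted rows have no timestamp left to filter on, hence the second call.
    current_keys = fetch_current_keys(source_rows)
    deleted = sorted(known_keys - current_keys)
    new_watermark = max((r["modified_at"] for r in changed), default=watermark)
    return changed, deleted, new_watermark


rows = [
    {"id": 1, "name": "Ann", "modified_at": datetime(2021, 8, 1)},
    {"id": 3, "name": "Cy", "modified_at": datetime(2021, 8, 3)},
]
changed, deleted, new_watermark = incremental_pass(
    rows, watermark=datetime(2021, 8, 2), known_keys={1, 2}
)
```

Only record 3 is transferred in full, while the key-only call reveals that record 2 no longer exists at the source.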

File-based Replication

DataZen uses the CDC Log files described previously as the basis for replicating data. Since these files encapsulate data, schema, and general configuration settings, they are fully self-describing and can be copied anywhere and played back in another DataZen environment, against any target system, at any time. DataZen uses Reader Agents to create the CDC Logs, and Writer Agents to play them back.


[Figure: Readers and Writers]


DataZen's file-based replication model enables the following capabilities:

B2B Data Replication

The ability to copy CDC Logs anywhere (including cloud folders) allows two or more companies to exchange or replicate data regardless of the source and target systems. For example, the CDC Logs can be stored in Azure Blobs, AWS S3 Buckets, or an FTP site. The CDC Logs can also be PGP encrypted for additional security.

Log Replay & Multicasting

DataZen offers the option to replay a single CDC Log file, or replay all available CDC Log files in sequence. Since CDC Log files are available for replay, they can be processed multiple times against multiple target systems independently.
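Replay and multicasting can be sketched as applying the same ordered set of logs to several independent targets. In this example the logs are plain dictionaries keyed by a sortable Execution Id and the targets are in-memory dictionaries; both are simplifying assumptions, not DataZen's Writer Agent interface.

```python
def replay(logs, target):
    """Apply every log to the target in Execution Id order and return the target."""
    for execution_id in sorted(logs):
        log = logs[execution_id]
        for row in log["upserts"]:
            target[row["id"]] = row      # insert or overwrite
        for key in log["deletes"]:
            target.pop(key, None)        # ignore keys the target never had
    return target


logs = {
    "20210801130000": {"upserts": [{"id": 1, "name": "Anna"}], "deletes": [2]},
    "20210801120000": {
        "upserts": [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}],
        "deletes": [],
    },
}

# Multicasting: the same files feed two targets that know nothing of each other.
warehouse = replay(logs, {})
cache = replay(logs, {})

# Replaying the full sequence again leaves an up-to-date target unchanged.
again = replay(logs, dict(warehouse))
```

Because upserts overwrite by key and deletes tolerate missing keys, replaying the full sequence is idempotent in this sketch, which is what makes repeated processing against multiple targets safe.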

Full Initialization

By its very nature, the CDC Engine creates an initial log file that contains all the identified source records the first time it runs. This enables DataZen to create an initialization log that can be played against any target system, just like any other CDC Log.

Shared-Nothing Architecture

The replication model used by DataZen leverages the benefits of a shared-nothing architecture, a distributed computing model in which the target systems are unaware of the existence of the source systems, and vice versa. Source and target systems can operate independently, and any part of the replication topology can be upgraded without a system-wide shutdown.

Schema Independence

Since each CDC Log contains schema information, each target can select which data elements to replicate. This allows each target to have a different schema if desired. One target could be an HTTP Endpoint, a second target could be a relational database, and a third one could be Parquet files for example.
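One way to picture schema independence is as a per-target column projection over the same log. The log layout and the project helper below are hypothetical and only illustrate the idea of each target selecting its own data elements.

```python
def project(log, columns):
    """Return the log's upserted rows restricted to the columns a target wants."""
    missing = [c for c in columns if c not in log["schema"]]
    if missing:
        # The schema carried by the log lets a target validate its selection.
        raise KeyError(f"columns not in log schema: {missing}")
    return [{c: row[c] for c in columns} for row in log["upserts"]]


log = {
    "schema": {"id": "int", "name": "str", "email": "str"},
    "upserts": [{"id": 1, "name": "Ann", "email": "ann@example.com"}],
}

# An HTTP endpoint might receive a reduced payload...
http_payload = project(log, ["id", "name"])
# ...while a relational target keeps every column.
database_rows = project(log, ["id", "name", "email"])
```

The source is read once, and each target derives its own shape from the same self-describing log.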

Micro Batch Processing

Because CDC Logs are created on a schedule, capturing changes in bulk at specific intervals, DataZen implements a micro batch processing pattern that reduces network chattiness and as a result improves replication performance.

Conclusion



This article introduces you to DataZen, a flexible any-to-any replication technology that uses universal Synthetic CDC Log files and a shared-nothing architecture for maximum flexibility. This in turn allows companies to leverage many capabilities, such as secured Business-to-Business replication, replay capabilities, full initialization, and multicasting.



To try DataZen at no charge, download the latest edition of DataZen on our website.









