SQL CDC Introduction

Overview

This feature is in preview

SQL CDC is a scripting language built on a lightweight dialect of SQL. It provides access to data from a large number of endpoints, sources, and targets, including HTTP/S platforms, and offers data engineering functions that enable advanced ETL processing scenarios. Once created, an SQL CDC pipeline can be managed like any other pipeline in DataZen, and can be scheduled for execution or triggered from other pipelines. The implementation follows the prescribed architecture of the DataZen implementation approach and is centered on a Change Log integration paradigm. As a result, the SQL CDC language has specific constraints that ensure your pipelines are built for ease of maintenance and flexibility.

Three fundamental implementation patterns are supported with SQL CDC:

  • Read Pipeline: Allows you to read source data, with optional change capture and transformation of the data
  • Write Pipeline: Allows you to load previously created change logs and forward them to a target of your choice, with optional further target-specific transformations
  • Read + Write Pipeline (a.k.a. Direct Pipeline): Allows you to read source data, with optional change capture and transformation of the data, and forward the change logs to a target of your choice, with optional further target-specific transformations

SQL CDC allows you to transform data inline, in most instances without the need to stage your data first. Transformations are performed through a Component, which can be used in three different places depending on your needs: as part of the SELECT operation, after the SELECT operation but before the CAPTURE command, and after either the CAPTURE or LOAD operation:

  • Inside SELECT: When included inside a SELECT operation, the transformation block within APPLY PIPELINE executes one or more components, one after the other, for every page of data being read from the source.
  • After SELECT, before CAPTURE: When added after a SELECT operation but before a CAPTURE operation, the transformation block within APPLY PIPELINE, or any individual component specified, executes one after the other for the entire data set collected after the completion of the SELECT command.
  • After CAPTURE or LOAD: When added after a CAPTURE or LOAD operation but before a PUSH operation, the transformation block within APPLY PIPELINE, or any individual component specified, executes one after the other for the entire data set available at that time, before the PUSH operation starts. Note that if both UPSERT and DELETE streams are available, the pipeline components execute on both streams; if no data leaves the transformation components, the PUSH operation is not executed.
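
To make the three placements concrete, here is a sketch of a direct pipeline that uses all three; the component name [Component A] and the EXEC-based transformations are illustrative placeholders, not prescribed names:

SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY PIPELINE ([Component A]);
EXEC ON DB [sqldb] (SELECT * FROM @pipelinedata());
CAPTURE INSERT, UPDATE ON KEYS 'guid' LOG 'pipeline001' PATH 'c:\tmp\logs\';
EXEC ON DB [sqldb] (SELECT * FROM @pipelinedata());
PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE 'target_@executionid.parquet';

Here, [Component A] runs once per page read from the source, the first EXEC runs once over the complete data set before CAPTURE, and the second EXEC runs on the captured streams (both UPSERT and DELETE, when present) before PUSH.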

Read Pipeline

SQL CDC is built as a data pipeline engine: a read pipeline is made of a read operation, one or more data engineering functions, and an optional change capture operation. The output of a read pipeline is a change log that can then be played against one or more targets using write pipelines.

Syntax

The SQL CDC batch syntax for a read pipeline is as follows:

{ USING ... }
SELECT ... APPLY PIPELINE (...) ;
{ APPLY ... };
{ CAPTURE ... };
{ TRIGGER ... };

The USING operation is only supported for SELECT HTTP operations.

Applying a pipeline inside the SELECT operation for HTTP, Databases, and Big Data sources implies that all its transformations will be executed for every page of data retrieved from the source; by contrast, any transformation declared after the SELECT command will execute once for the entire data set retrieved. Both can be used in the same batch.

The CAPTURE operation is optional; if omitted, the change log will not be persisted upon batch completion.

Comments

For example, the following read pipeline gets the latest RSS feed available from a previously defined HTTP/S connection called RSS Feed, and automatically transforms the XML output into rows and columns using the item nodes returned.

SELECT TOP 100 * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item);

It is then possible to apply further transformations on the fly and/or dump the data into a relational database as needed. The following script retrieves all available RSS feeds and executes an inline SQL operation using the @pipelinedata() function to further transform the data.

SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item);
EXEC ON DB [sqldb] (SELECT TOP 50 CAST([guid] as NVARCHAR(36)) [guid], CAST([title] as NVARCHAR(500)) [title] FROM @pipelinedata());
                    
If no CDC operation is applied on the source data, all the records identified are saved into a change log. However, a CDC operation can optionally be applied to identify changes since the last execution of the pipeline; if changes are detected, a change log is created containing only the changes. In this example, a CDC operation is performed using the guid column, and the change log is stored in the path specified (this can also be a cloud connection to an AWS S3 bucket or an Azure ADLS 2.0 container).

SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item);
EXEC ON DB [sqldb] (SELECT TOP 50 CAST([guid] as NVARCHAR(36)) [guid], CAST([title] as NVARCHAR(500)) [title] FROM @pipelinedata());
CAPTURE INSERT, UPDATE ON KEYS 'guid' LOG 'pipeline001' PATH 'c:\tmp\logs\';

Write Pipeline

Write pipelines start by loading the next available change log previously created by a read pipeline, apply optional ETL processing, and push the data to a specific target.

Syntax

The SQL CDC batch syntax for a write pipeline is as follows:

LOAD ... ;
{ APPLY ... };
PUSH ... ;
{ TRIGGER ... };

The LOAD operation must be the first operation in a write pipeline, and only a single LOAD operation is allowed.

Any number of transformations is allowed following the LOAD operation.

A single PUSH operation is allowed in a Write pipeline and only TRIGGERS may be declared afterwards.

Comments

The first step involves loading a change log using the LOAD command. By default, this operation loads the next available change log; this information is stored as metadata in the pipeline definition.

LOAD 'pipeline001' PATH 'c:\tmp\logs\';

Once loaded, you can apply ETL functions similarly to the read pipeline, and finally call the PUSH operation. In this example, a different file name is created every time the pipeline executes, and the data is further distributed across folders in ADLS 2.0:

LOAD 'pipeline001' PATH 'c:\tmp\logs\';
PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE '[yyyy]/[mm]/[dd]/target_@executionid.parquet';
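
Since any number of transformations may follow the LOAD operation, a change log can also be reshaped before it is pushed. The following sketch reuses the @pipelinedata() pattern from the read pipeline; the column projection is purely illustrative:

LOAD 'pipeline001' PATH 'c:\tmp\logs\';
EXEC ON DB [sqldb] (SELECT [guid], [title] FROM @pipelinedata());
PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE '[yyyy]/[mm]/[dd]/target_@executionid.parquet';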

Direct Pipeline

Direct pipelines combine a read pipeline and a write pipeline into a single SQL CDC batch operation. Although the read pipeline still creates change logs when data is available, direct pipelines have the option to discard the change log upon successful execution of the write operation.

Syntax

The SQL CDC batch syntax for a direct pipeline is as follows:

{ USING ... }
SELECT ... APPLY PIPELINE (...) ;
{ APPLY ... };
{ CAPTURE ... };
{ APPLY ... };
PUSH ... ;
{ TRIGGER ... };

The USING operation is only supported for SELECT HTTP operations.

Any number of transformations is allowed following the SELECT operation. If a CAPTURE operation is specified, any number of additional transformations is allowed following the CAPTURE operation; these transformations will not be captured in the change log.

A single PUSH operation is allowed and only TRIGGERS may be declared afterwards.

Comments

Here is an example of a SQL CDC batch operation that reads and writes captured changes:

SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item);
CAPTURE INSERT, UPDATE ON KEYS 'guid' LOG 'pipeline001' PATH 'c:\tmp\logs\';
PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE '[yyyy]/[mm]/[dd]/target_@executionid.parquet';
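
As noted above, transformations placed after the CAPTURE operation are not captured: they shape the data being pushed without altering the persisted change log. A hypothetical variant of the previous batch illustrates this; the column projection is a placeholder for any target-specific transformation:

SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item);
CAPTURE INSERT, UPDATE ON KEYS 'guid' LOG 'pipeline001' PATH 'c:\tmp\logs\';
EXEC ON DB [sqldb] (SELECT [guid], [title] FROM @pipelinedata());
PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE '[yyyy]/[mm]/[dd]/target_@executionid.parquet';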