SQL CDC Introduction
Overview
This feature is in preview.
SQL CDC is a scripting language based on a lightweight version of the SQL language. It facilitates access to data from a large number of endpoints, sources, and targets, including HTTP/S platforms, and provides data engineering functions that enable advanced ETL processing scenarios. Once an SQL CDC pipeline has been created, it can be managed like any other pipeline in DataZen, and can be scheduled for execution or triggered from other pipelines. The implementation of this design follows the prescribed architecture established by our DataZen implementation approach and is centered on a Change Log integration paradigm. As a result, the SQL CDC language has specific constraints that ensure your pipelines are built for ease of maintenance and flexibility.
Three fundamental implementation patterns are supported with SQL CDC:
- Read Pipeline: Allows you to read source data, with optional change capture and transformation of the data
- Write Pipeline: Allows you to load previously created change logs and forward them to a target of your choice, with optional further target-specific transformations
- Read + Write Pipeline (a.k.a. Direct Pipeline): Allows you to read source data, with optional change capture and transformation of the data, and forward the change logs to a target of your choice, with optional further target-specific transformations
SQL CDC allows you to transform data inline, without the need to stage your data in most instances. Transforming data is done through a Component, which can be used in three different places depending on your needs: as part of the SELECT operation, after the SELECT operation but before the CAPTURE command, and after either the CAPTURE or LOAD operation (a combined sketch follows the table below):
Placement | Behavior |
Inside SELECT | When included inside a SELECT operation, the transformation block within APPLY PIPELINE executes one or more components, one after the other, for every page of data read from the source. |
After SELECT, before CAPTURE | When added after a SELECT operation but before a CAPTURE operation, the transformation block within APPLY PIPELINE, or any individual component specified, executes one after the other on the entire data set collected after the completion of the SELECT command. |
After CAPTURE or LOAD | When added after a CAPTURE or LOAD operation but before a PUSH operation, the transformation block within APPLY PIPELINE, or any individual component specified, executes one after the other on the entire data set available at that time, before the PUSH operation starts. Note that if both UPSERT and DELETE streams are available, the pipeline components execute on both streams. If no data leaves the transformation components, the PUSH operation is not executed. |
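For illustration, here is a minimal sketch of a direct pipeline that places a transformation at each of the three points. It reuses the [RSS Feed], [sqldb], and [adls2.0] connections from the examples later in this document; the inline APPLY TX stands in for a per-page transformation on the SELECT, and the EXEC ON DB statements are illustrative whole-set transformations, not prescribed steps:
SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item);
EXEC ON DB [sqldb] (SELECT CAST([guid] as NVARCHAR(36)) [guid], CAST([title] as NVARCHAR(500)) [title] FROM @pipelinedata());
CAPTURE INSERT, UPDATE ON KEYS 'guid' LOG 'pipeline001' PATH 'c:\tmp\logs\';
EXEC ON DB [sqldb] (SELECT TOP 50 * FROM @pipelinedata());
PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE 'target_@executionid.parquet';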
Read Pipeline
SQL CDC is built as a data pipeline engine. A read pipeline is made of a read operation and one or more data engineering functions, optionally followed by a change capture operation. The output of a read pipeline is a change log that can then be played against one or more targets using write pipelines.
Syntax
The SQL CDC batch syntax for a read pipeline is as follows:
{ USING ... } SELECT ... APPLY PIPELINE (...) ; { APPLY ... }; { CAPTURE ... }; { TRIGGER ... };
The USING operation is only supported for SELECT HTTP operations.
Applying a pipeline inside the SELECT operation for HTTP, Databases, and Big Data sources implies that all its transformations will be executed for every page of data retrieved from the source; by contrast, any transformation declared after the SELECT command will execute once for the entire data set retrieved. Both can be used in the same batch.
The CAPTURE operation is optional; if omitted, the change log will not be persisted upon batch completion.
Comments
For example, the following read pipeline gets the latest RSS feed available from an HTTP/S connection previously defined, called RSS Feed, and automatically transforms the XML output into rows and columns using the item nodes returned.
SELECT TOP 100 * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item);
It is then possible to apply further transformations on the fly and/or dump the data in a relational database as needed. The following script retrieves all available RSS feeds and executes an inline SQL operation using the @pipelinedata() function to further transform the data.
SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item); EXEC ON DB [sqldb] (SELECT TOP 50 CAST([guid] as NVARCHAR(36)) [guid], CAST([title] as NVARCHAR(500)) [title] FROM @pipelinedata());
If no CDC operation is applied on the source data, all the records identified will be saved into a change log. However, a CDC operation can optionally be applied to identify changes since the last execution of the pipeline; if changes are detected, a change log is created containing only the changes. In the following example, a CDC operation is performed using the guid column, and the change log is stored in the path specified (this can also be a cloud connection to an AWS S3 bucket or an Azure ADLS 2.0 container).
SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item); EXEC ON DB [sqldb] (SELECT TOP 50 CAST([guid] as NVARCHAR(36)) [guid], CAST([title] as NVARCHAR(500)) [title] FROM @pipelinedata()); CAPTURE INSERT, UPDATE ON KEYS 'guid' LOG 'pipeline001' PATH 'c:\tmp\logs\';
Write Pipeline
A write pipeline starts by loading the last available change log previously created by a read pipeline, applies optional ETL processing, and pushes the data to a specific target.
Syntax
The SQL CDC batch syntax for a write pipeline is as follows:
LOAD ... { APPLY ... }; PUSH ... { TRIGGER ... }
The LOAD operation must be the first operation in a write pipeline, and only a single LOAD operation is allowed.
Any number of transformations is allowed following the LOAD operation (see the sketch at the end of this section).
A single PUSH operation is allowed in a write pipeline, and only TRIGGERS may be declared afterwards.
Comments
The first step involves loading a change log using the LOAD
command. By default, this operation loads the next available change log; this information
is stored as metadata in the pipeline definition.
LOAD 'pipeline001' PATH 'c:\tmp\logs\';
Once loaded, you can apply ETL functions similarly to the read pipeline, and finally call the PUSH operation. In this example, a different file name is created every time the pipeline executes, and the data is further distributed across folders in ADLS 2.0:
LOAD 'pipeline001' PATH 'c:\tmp\logs\'; PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE '[yyyy]/[mm]/[dd]/target_@executionid.parquet';
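As noted above, transformations may also be applied between the LOAD and PUSH operations. The following is a minimal sketch, assuming the EXEC ON DB transformation shown for read pipelines behaves the same way in a write pipeline; it trims the loaded change log before pushing it:
LOAD 'pipeline001' PATH 'c:\tmp\logs\'; EXEC ON DB [sqldb] (SELECT TOP 50 * FROM @pipelinedata()); PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE '[yyyy]/[mm]/[dd]/target_@executionid.parquet';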
Direct Pipeline
Direct pipelines combine a read and a write pipeline into a single SQL CDC batch operation. Although the read pipeline still creates change logs when data is available, direct pipelines have the option to discard the change log upon successful execution of the write operation.
Syntax
The SQL CDC batch syntax for a direct pipeline is as follows:
{ USING ... } SELECT ... APPLY PIPELINE (...) ; { APPLY ... }; { CAPTURE ... }; { APPLY ... }; PUSH ... { TRIGGER ... }
The USING operation is only supported for SELECT HTTP operations.
Any number of transformations is allowed following the SELECT operation. If a CAPTURE operation is specified, any number of additional transformations is allowed following the CAPTURE operation; these transformations will not be captured in the change log (see the sketch after the example below).
A single PUSH operation is allowed and only TRIGGERS may be declared afterwards.
Comments
Here is an example of a SQL CDC batch operation that reads and writes captured changes:
SELECT * FROM HTTP [RSS Feed] (GET /econi.xml) APPLY TX (//item); CAPTURE INSERT, UPDATE ON KEYS 'guid' LOG 'pipeline001' PATH 'c:\tmp\logs\'; PUSH INTO DRIVE [adls2.0] FORMAT 'parquet' COMPRESSION 'snappy' CONTAINER 'scada' FILE '[yyyy]/[mm]/[dd]/target_@executionid.parquet';