Data Pipeline
Jobs can define a Data Pipeline that translates, transforms, and enhances a data set on the fly before it is saved to the change log and/or sent to the target system. When applied at the source, the Data Pipeline executes before the Synthetic CDC operation (if enabled), but after the high watermark is applied against the source.
The Data Pipeline executes its components in the order in which they are listed. You can reorder components using the up/down arrows.
To preview the data pipeline, click the Run Data Pipeline button. Because the Target System uses the output of the data pipeline (if one is defined), it is important to execute the data pipeline before configuring the target system options.
When defining a Direct Job, two separate Data Pipelines can be defined: one on the source and another on the target. Separating the reader and writer operations allows multiple data pipelines to be executed, providing the flexibility needed to apply different processing logic per target. For example, if one of the targets is a TEST environment, the target data pipeline for the TEST system could mask the data before pushing it.
Building a Data Pipeline
The data pipeline interface provides a canvas for adding components that can translate or transform the data using pre-built data engineering functions. When the data pipeline is empty, you can run the pipeline to preview the source data; clicking on the Run Data Pipeline button executes the full pipeline. You can also inspect the schema of the data set at any time.
Running the data pipeline requires the reader (or writer) to have some sample data available; a warning will be displayed if no data is currently available.
To add pipeline components, click on the Add button and choose the component to add.
Using a Component
When building or modifying a job, you can inspect and test your data pipeline. To test the data pipeline, you need some data available from the source system; this is normally done by running a preview operation. Once data is available, you can run the data pipeline in full, up to a specific component, or even run a single component. In addition, the data set schema as it stands at that point can be inspected by clicking the Inspect Data Schema button. Keep in mind, however, that when the data pipeline is run partially, the schema may be incomplete and certain options in the Target screen may not show the final schema.
Each component comes with a menu allowing you to perform the following operations:
- Run to here: Run all previous components including the selected component, then stop processing the data pipeline
- Run me: Run the selected component only; for this option to be available, the previous component must have been executed at least once
- Run from here: Run all remaining components including the selected component
- Inspect Data: Open a debugging window showing the data entering and leaving the component
You can disable and re-enable a component by clicking its top-left icon; this allows you to test your pipeline without the component if needed.
When you click Inspect Data, a window appears for the selected component. You can open this window for multiple components. As data flows through the component, the input and output data sets will appear. If nothing is displayed, the component has not yet executed. To keep this window in front, click the pin icon.
Local vs. Agent Execution
Data Pipelines run on the machine where the job executes. If a job runs in the cloud, for example, the Data Pipeline components also execute in the cloud. However, running a Data Pipeline from DataZen Manager while creating or editing a job executes the components on your local machine. Due to security differences between your local machine and the agent's environment, some components may fail on the DataZen Agent even though they succeed locally, for example if firewall rules prevent connections from cloud locations.
Pipeline Components
The available pipeline components provide advanced data engineering functions that accelerate integration projects.
You can also add your own custom .NET components for more advanced scenarios or high-performance needs.
You can extend Data Pipelines in two ways: with an HTTP/S Endpoint component or a custom .NET Extension. Extending your data pipelines gives you the ability to enhance your company's data engineering and data management capabilities. For example, if your company has published a Python AWS Lambda Function or an Azure Function that exposes a proprietary AI/ML model, you can call that HTTP/S endpoint inline as part of your data pipeline execution.
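As a concrete illustration, the sketch below shows the kind of lightweight scoring service such an endpoint might expose. It is only a sketch: the JSON-array payload, the port, and the risk_score field are assumptions for illustration, not part of DataZen or its component configuration; the actual request and response contract is whatever your service and the HTTP/S Endpoint Function component are set up to exchange.

```python
# A minimal sketch of an HTTP scoring service that a data pipeline's HTTP/S
# Endpoint Function component could call inline. The JSON-array payload, the
# port, and the "risk_score" field are illustrative assumptions only.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the incoming rows (assumed to be a JSON array of objects).
        length = int(self.headers.get("Content-Length", 0))
        rows = json.loads(self.rfile.read(length) or b"[]")

        # Hypothetical enrichment: this is where a proprietary AI/ML model
        # (for example, behind an AWS Lambda or Azure Function) would score
        # each record. A constant stands in for the model here.
        for row in rows:
            row["risk_score"] = 0.42

        body = json.dumps(rows).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Listens on port 8080; POST the rows to enrich and the enriched rows
    # are returned as the response body.
    HTTPServer(("0.0.0.0", 8080), ScoreHandler).serve_forever()
```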
The Custom .NET Extension option is not available for shared Cloud Agents. However, dedicated Cloud Agents and self-hosted agents can use this option.
| Category | Component | Comments |
|---|---|---|
| Data Transformation | Data Filter | Applies a client-side filter to the data using a SQL WHERE clause, a JSON/XML filter, or a regular expression. |
| Data Transformation | Data Masking | Applies masking logic to a selected data column, such as a credit card number or a phone number. Supports generating random numbers, free-form masking, and generic/full masking. |
| Data Transformation | Data Hashing | Applies a hash algorithm to a selected data column (must be a string data type); supported hashing algorithms are MD5, SHA1, SHA256, SHA384, and SHA512. See the sketch after this table for an illustration. |
| Data Transformation | HTTP/S Endpoint Function | Calls an external HTTP/S function or endpoint and adds the results to the output or merges them with the input data. |
| Schema Management | Apply Schema | Transforms the data set schema as specified, with optional default values and DataZen function calls. |
| Schema Management | Dynamic Data Column | Adds a column dynamically using a simple SQL formula or a DataZen function. |
| Schema Management | Keep/Remove Columns | Quickly removes undesired columns from the data set. |
| Transformation | JSON/XML to Table | Converts an XML or JSON document into a data set of rows and columns. |
| Transformation | CSV to Table | Converts a flat file document into a data set of rows and columns. |
| Change Capture | Apply Synthetic CDC | Applies an inline Synthetic Change Capture. |
| Staging | Sink Data to SQL | Sinks the current data set to a SQL Server table, optionally appending, truncating, or recreating it with automatic schema management. |
| Staging | Run SQL Command | Runs a SQL batch on the fly and optionally uses its output as the new data set in the pipeline. Can use @pipelinedata() to access the current pipeline data set. |
| Other | Data Quality | Inspects and applies data quality rules to the current data set. |
| Other | Custom .NET Extensions | Calls an external .NET DLL, passing the current data set, and optionally replaces the pipeline data set with the returned output. |
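To illustrate the kind of transformation the Data Hashing component applies, here is a minimal Python sketch that replaces a string column with its SHA256 hash. This is not DataZen's implementation; the sample rows, the phone column name, and the hex-digest output format are assumptions used only to show the effect of hashing a column in place.

```python
# Minimal illustration of hashing a string column, similar in spirit to the
# Data Hashing pipeline component. The sample rows, the "phone" column, and
# the hex-digest output format are assumptions, not DataZen behavior.
import hashlib

rows = [
    {"customer_id": 1, "phone": "555-0100"},
    {"customer_id": 2, "phone": "555-0199"},
]

for row in rows:
    # Replace the original value with its SHA256 hash (hex-encoded).
    row["phone"] = hashlib.sha256(row["phone"].encode("utf-8")).hexdigest()

print(rows)
```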