Data Masking

This component allows you to apply or enforce a specific schema as part of the data pipeline, with the optional ability to add and remove columns, modify data types, and force calculated values.

Overview

This component implements a declarative syntax, with similarities to the C# syntax for data types (with a few exceptions). Each line, unless empty or starting with two slashes (//) or within a comment block (/* */) represent a field. You can choose to list a few columns, in which case missing fields will be added, and existing ones will be modified. However, checking the Remove columns not specified below will discard all the columns not specifically listed.

HINT: A quick way to inject all available fields at once is to click on the Fill all known columns link

Attempting to remove fields that are used as part of calculated default values by other fields will throw an error

If you would like to make sure the schema is applied successfully, change the Must success option to Yes. Otherwise, warnings will be logged, but data pipeline processing will continue.

A field uses the following syntax:

<type> [name]|<description>|<expression>

Elements in angle brackets are optional, and vertical pipes are optional if information to the right is not provided.
The default data type, if missing, is string.

Here are a few simple examples:

string firstName Declares a field of type string called firstName
string firstName||John Declares a field of type string called firstName with a value of John
//string firstName||John Ignore this line
int [total length] Declares a field called Total Length and converts any existing field of this name to an integer data type if possible

Supported Data Types

The following data types are supported:

bool, boolean byte byte[] char datetime decimal double
float, single guid int long short string
In addition, the following three data types support custom lengths: string, decimal, byte[] and char. To specify a maximum length, use () or []. Decimals accept two digits. When specifying a length, 4000 is the maximum value.
string[255] firstName Forces the field to only hold up to 255 characters
byte[1024] data Declares a field with up to 1024 bytes
decimal(18, 3) amount Declares a decimal field with a precision of 18 and a scale of 3

DataZen Functions

If the Apply DataZen Functions by Row is unchecked, DataZen functions are evaluated once for the entire schema script provided. For example, if #now() is on two separate fields, the current date/time will be calculated and replaced in the script before processing the schema logic. If, however, the Apply DataZen Functions by Row is checked, no pre-processing takes place and the functions will be evaluated for each row in the data pipeline.

Expression Values

When specifying an expression as part of the field definition, the behavior changes based on the Apply DataZen Functions by Row. Regardless of this setting, providing an expression overrides any current data from the current data pipeline dataset if the field already exists with the expression result.

DataZen Functions

While processing DataZen functions "per row" may incur additional processing time, you can use this option to modify certain values if needed, such as providing a default if the field is NULL on any given row. Applying a DataZen function like this does not turn the field into a read-only field; it just modifies its value.

string(128) [name]||#isnullorempty({{name}}, NA)

@sql()

The @sql() operator can be used to indicate that a simple SQL operation needs to be performed on the field, such as calculating the difference between to fields on the same row. This makes this field read only and can no longer be modified for the duration of the pipeline; in addition, any other fields that are used as part of the calculation are required. Attempting to remove a dependent field will throw an error. The expression inside the @sql() operator must follow the .NET DataColumn Expression rules.

For example, you can add a new calculated value, total_amnount, to an existing data set that has two columns called unit_price and quantity:

float unit_price||@sql(unit_price * quantity)

Using Runtime Values

You can also use pipeline runtime variables directly as values. For example, to add a new column called string executionId, you can use this declaration:

string executionId||@executionId

See the Runtime Variables section for more information.

Description

The middle section of a field specification may be used to add some inline documentation about this field. While optional, it is highly recommended in order to keep track of intent. Adding a description has no functional implication to the data set.

string executionId|This is the unique execution id of this job that we use later to group record counts|@executionId