Data Masking
This component allows you to apply or enforce a specific schema as part of the data pipeline, with the optional ability to add and remove columns, modify data types, and force calculated values.
Overview
This component implements a declarative syntax, with similarities to the C# syntax for data types (with a few exceptions). Each line, unless empty or starting with two slashes (//) or within a comment block (/* */) represent a field. You can choose to list a few columns, in which case missing fields will be added, and existing ones will be modified. However, checking the Remove columns not specified below will discard all the columns not specifically listed.
HINT: A quick way to inject all available fields at once is to click on the Fill all known columns link
Attempting to remove fields that are used as part of calculated default values by other fields will throw an error
If you would like to make sure the schema is applied successfully, change the Must success option to Yes. Otherwise, warnings will be logged, but data pipeline processing will continue.
A field uses the following syntax:
<type> [name]|<description>|<expression>
Elements in angle brackets are optional, and vertical pipes are optional if information to the right is not provided.
The default data type, if missing, is string
.
Here are a few simple examples:
string firstName |
Declares a field of type string called firstName |
string firstName||John |
Declares a field of type string called firstName with a value of John |
//string firstName||John |
Ignore this line |
int [total length] |
Declares a field called Total Length and converts any existing field of this name to an integer data type if possible |
Supported Data Types
The following data types are supported:
In addition, the following three data types support custom lengths: string, decimal, byte[] and char.
To specify a maximum length, use () or []. Decimals accept two digits. When specifying a length, 4000 is the maximum value.bool, boolean
byte
byte[]
char
datetime
decimal
double float, single
guid
int
long
short
string
string[255] firstName |
Forces the field to only hold up to 255 characters |
byte[1024] data |
Declares a field with up to 1024 bytes |
decimal(18, 3) amount |
Declares a decimal field with a precision of 18 and a scale of 3 |
DataZen Functions
If the Apply DataZen Functions by Row is unchecked, DataZen functions are evaluated once for the entire schema script provided.
For example, if #now()
is on two separate fields, the current date/time will be calculated and replaced in the script
before processing the schema logic. If, however, the Apply DataZen Functions by Row is checked, no pre-processing takes place and
the functions will be evaluated for each row in the data pipeline.
Expression Values
When specifying an expression as part of the field definition, the behavior changes based on the Apply DataZen Functions by Row. Regardless of this setting, providing an expression overrides any current data from the current data pipeline dataset if the field already exists with the expression result.
DataZen Functions
While processing DataZen functions "per row" may incur additional processing time, you can use this option to modify certain
values if needed, such as providing a default if the field is NULL on any given row. Applying a DataZen function like this
does not turn the field into a read-only field; it just modifies its value.
string(128) [name]||#isnullorempty({{name}}, NA)
@sql()
The @sql() operator can be used to indicate that a simple SQL operation needs to be performed on the field, such as calculating the difference between to fields on the same row. This makes this field read only and can no longer be modified for the duration of the pipeline; in addition, any other fields that are used as part of the calculation are required. Attempting to remove a dependent field will throw an error. The expression inside the @sql() operator must follow the .NET DataColumn Expression rules.
For example, you can add a new calculated value, total_amnount
, to an existing data set that has
two columns called unit_price
and quantity
:
float unit_price||@sql(unit_price * quantity)
Using Runtime Values
You can also use pipeline runtime variables directly as values. For example, to add a new column called string executionId
,
you can use this declaration:
string executionId||@executionId
See the Runtime Variables section for more information.
Description
The middle section of a field specification may be used to add some inline documentation about this field. While optional, it
is highly recommended in order to keep track of intent. Adding a description has no functional implication to the data set.
string executionId|This is the unique execution id of this job that we use later to group record counts|@executionId