APPLY SCHEMA
Overview
The Apply Schema operation allows you to enforce a schema on the data, coerce data types, add or remove multiple columns, and set default values to new columns.
The data types used by this component leverage the C# or SQL Server notation.
This component expects a new line for each field; specifying multiple field names on a single row is not supported.
Field names are case-sensitive.
To assign a default value to a new field, you can use either notation (notice the difference of the default value itself):
string name = 'john'
string name||johnUsing DataZen functions is supported; however, this component supports late-bound functions, which are evaluated per row instead of per script. This example shows how to assign a default value using both methods; uid1 will be assigned the same guid value for all rows, while uid2 will have a different guid for each row of data. If you specify the FX_BY_ROW option, script-level functions will execute per row like late-bound functions.
string [uid1] = '#rndguid()' string [uid2] = '@rndguid()'
You can perform calculations per row on new fields. For example, the following will create a new field, ID2, made up of
two field values and performing a late-bound function on them:
string [fullName] = '@toupper({{lastname}}), @toupper({{firstname}})'
You can perform a limited number of SQL calculations when using the @sql() operator. This will use a .NET
Expression calculation that executes in memory and creates a calculated field; as a result, this operation
has certain limitations. Because this operation creates a calculated field, all fields specified in the
calculation must also be present in the list of fields returned.
See
Microsoft's documentation on how to build a valid expression.
long id bigint id2 = '@sql(CONVERT([id], 'System.Int32') * 2)'
The STRICT_COLUMNS
option allows you to control what happens to the fields not specified in this
component block. When not specified, this component modifies all the fields specified and adds fields that are not
found in the current data set. For example, this operation adds a new field called fullName
to the
list of existing fields in the data pipeline; none of the other fields will be modified. If fullName does not exist
in the source data set, it will be added.
APPLY SCHEMA( string [fullName] = '@toupper({{lastname}}), @toupper({{firstname}})' );However, this operation will remove all fields not explicitely listed; the outcome of this operation will leave two columns in the data pipeline after its execution.
APPLY SCHEMA( long id string [fullName] = '@toupper({{lastname}}), @toupper({{firstname}})' ) WITH STRICT_COLUMNS;
Syntax
Applies and enforces a schema definition and optionally limits which fields are retained from the data set. Each line represents a column definition in the following format: {type} fieldName{|description{|< default | @sql(...) | #func(...) >}}
APPLY SCHEMA ( -- one line per column definition {type} fieldName{|description{|< default | @sql(...) | #func(...) >}} ) WITH { CONTINUE_ON_ERROR } { STRICT_COLUMNS } { FX_BY_ROW } ;
CONTINUE_ON_ERROR |
Continues processing even if errors are detected |
STRICT_COLUMNS |
Remove any fields not listed in the schema definition |
FX_BY_ROW |
Processes any DataZen function for each row instead of once for the entire schema definition |
Example 1
-- Get data from an RSS Feed SELECT * FROM HTTP [Rss Feed] (GET /econi.xml) APPLY TX (//item); -- Enforce a schema -- The schema follows a c#-like notation for data types; valid data types: -- int, short, long, datetime,guid,bool,float,double,byte,byte[],char,string,decimal -- Certain data types accept a length - byte, char and string (ex: string, or string[500]) -- The decimal data type supports length and precision (ex: decimal(4,2)) -- The default value can be a hard-coded value, a pipeline variable (ex: @executionId), -- a datazen function (ex: #rndguid()) or operate on the current row as a -- SQL operation (ex: @sql([field1] + [field1])) -- APPLY SCHEMA ( string(36) guid string(255) link string(4000) desc||#trim({{description}}) ) FX_BY_ROW -- this is necessary because the #trim() function is "per row" STRICT_COLUMNS -- remove any other column not listed above ; -- Now, add a GUID value, with the same GUID applied to all rows (no more FX_BY_ROW) -- The default data type is string if no data type is provided APPLY SCHEMA ( _controlId|new column with a default value|#rndguid() ) ;