Document Path

JSON and XML Document transformation into a data set of rows and columns is a key capability of DataZen and is exposed on multiple configuration screens. When the JSON / XML Path option is exposed, a document can be completely transformed as needed.

In addition, the JSON and XML Path field supports a unique feature: the ability to join the document with itself, as many times as needed. This allows you to perform complex transformations in one pass. This feature is described below.

When source documents are very large, this operation may require a lot of memory.

Using the XML/JSON Path feature

In its simplest form, this field allows you to enter a well-formatted XML XPath or JSON path query. As the document is parsed, new columns are added automatically to accommodate additional data elements. When possible, the best data type will be selected for numeric values. Date/Time values are never converted automatically; they are loaded as string values. The Data Pipeline can then be used to further modify data types.

Due to the complex specifications of both path notation, this documentation provides a quick summary. For more details, search for JSON Path and XML XPath notation online.

Both path types support the extended join operator ' || ' notation. See below for more information.

This field is case-sensitive

JSON Path

The JSON Path specification allows you to select the full document, a subset from a node path, or a filtered set of the document. In addition, when the JSON is an array, multiple rows will be returned by DataZen.

Example	Comments
$	Selects the root of the document as the starting point; all direct child nodes under the root will be turned into columns in the data set returned
elements $.elements	Selects the elements nodes of the document directly under the root; all direct child nodes will be turned into columns in the data set returned
..docs	Selects all the docs nodes of the document regardless of their positions
$[*].products[?(@.category == 'display')]	Selects all the products nodes, within an array of elements, containing a category node with a value of display

XML XPath

The XML Path specification follows the XPath specification, in addition to $, and allows you to select the full document, a subset from a node path, or a filtered subset of the document. When the XPath returns multiple elements, multiple rows will be returned by DataZen.

Example	Comments
$	Selects the first node of the document as the starting point; all direct child elements under the first node will be turned into columns in the data set returned
elements /elements	Selects all the elements nodes of the document directly under the root; all direct child elements will be turned into columns in the data set returned
//docs	Selects all the docs elements of the document regardless of their positions
.//*[@domain]	Selects all the nodes containing a domain child node

Using the Join operator

This section describes an advanced JSON/XML transformation feature and should be considered a preview capability that may change significantly by the time it is released.

The Join operator ' || ' is used to join nodes with their parent or child nodes to further extend the number of columns with related data. The introduction of this operator as part of the Path makes it easier to turn nested documents into multiple rows. There are two primary approaches when joining data within the same document:

Same Row: extract additional child nodes into additional columns, on the same row of the parent; to use this approach, each additional path should select a single element from child nodes
Join: extract additional data from parent or child nodes as a Join operation into additional rows and columns; to use this approach, each additional path should select the desired nodes relative to the current one

Because of key differences in the underlying join operations, the behavior for XPath and JSON Path is different. The additional XPath operators provided for an XML document operate on Parent objects, acting as a Right Join operator. The additional JSON Path operators provided for a JSON document operate on Child objects, acting as a Left Join operator.

Multiple join operators can be specified, and an optional decorator can be added to a path specification which will prepend column names. The decorator can help when the additional nodes have the same name as previously captured node. The decorator can be any number of characters; when the decorator is used, the = sign is required and marks the start of the path specification.

Each additional path is relative to the previous path position within the document. In the specification below, path2 is relative to path1, and path3 is relative to path2.


        path1 || [x=]path2 || [y=]path3 ...

The first node inspected determines the order of the first columns returned in the data set. As additional nodes are inspected, missing columns are automatically at the end of the data set.

Example 1: JSON

If we are interested in returning the user id and the name fields, with the first phone number:
$ || ./phones[0]

        {
            "id": 1,
            "name": "james",
            "phones": [
                {
                    "phone": "123456",
                    "type": "home"
                },
                {
                    "phone": "456789",
                    "type": "cell"
                }
            ]
        }