Document Path

JSON and XML Document transformation into a data set of rows and columns is a key capability of DataZen and is exposed on multiple configuration screens. When the JSON / XML Path option is exposed, a document can be completely transformed as needed.

In addition, the JSON and XML Path field supports a unique feature: the ability to join the document with itself, as many times as needed. This allows you to perform complex transformations in one pass. This feature is described below.

When source documents are very large, this operation may require a lot of memory.

Using the XML/JSON Path feature

In its simplest form, this field allows you to enter a well-formatted XML XPath or JSON path query. As the document is parsed, new columns are added automatically to accommodate additional data elements. When possible, the best data type will be selected for numeric values. Date/Time values are never converted automatically; they are loaded as string values. The Data Pipeline can then be used to further modify data types.

Due to the complex specifications of both path notation, this documentation provides a quick summary. For more details, search for JSON Path and XML XPath notation online.

Both path types support the extended join operator ' || ' notation. See below for more information.

This field is case-sensitive

JSON Path

The JSON Path specification allows you to select the full document, a subset from a node path, or a filtered set of the document. In addition, when the JSON is an array, multiple rows will be returned by DataZen.

Example Comments
$ Selects the root of the document as the starting point; all direct child nodes under the root will be turned into columns in the data set returned
elements
$.elements
Selects the elements nodes of the document directly under the root; all direct child nodes will be turned into columns in the data set returned
..docs Selects all the docs nodes of the document regardless of their positions
$[*].products[?(@.category == 'display')] Selects all the products nodes, within an array of elements, containing a category node with a value of display

XML XPath

The XML Path specification follows the XPath specification, in addition to $, and allows you to select the full document, a subset from a node path, or a filtered subset of the document. When the XPath returns multiple elements, multiple rows will be returned by DataZen.

Example Comments
$ Selects the first node of the document as the starting point; all direct child elements under the first node will be turned into columns in the data set returned
elements
/elements
Selects all the elements nodes of the document directly under the root; all direct child elements will be turned into columns in the data set returned
//docs Selects all the docs elements of the document regardless of their positions
.//*[@domain] Selects all the nodes containing a domain child node

Using the Join operator

This section describes an advanced JSON/XML transformation feature and should be considered a preview capability that may change significantly by the time it is released.

The Join operator ' || ' is used to join nodes with their parent or child nodes to further extend the number of columns with related data. The introduction of this operator as part of the Path makes it easier to turn nested documents into multiple rows. There are two primary approaches when joining data within the same document:

  • Same Row: extract additional child nodes into additional columns, on the same row of the parent; to use this approach, each additional path should select a single element from child nodes
  • Join: extract additional data from parent or child nodes as a Join operation into additional rows and columns; to use this approach, each additional path should select the desired nodes relative to the current one

Because of key differences in the underlying join operations, the behavior for XPath and JSON Path is different. The additional XPath operators provided for an XML document operate on Parent objects, acting as a Right Join operator. The additional JSON Path operators provided for a JSON document operate on Child objects, acting as a Left Join operator.

Multiple join operators can be specified, and an optional decorator can be added to a path specification which will prepend column names. The decorator can help when the additional nodes have the same name as previously captured node. The decorator can be any number of characters; when the decorator is used, the = sign is required and marks the start of the path specification.

Each additional path is relative to the previous path position within the document. In the specification below, path2 is relative to path1, and path3 is relative to path2.

path1 || [x=]path2 || [y=]path3 ...

The first node inspected determines the order of the first columns returned in the data set. As additional nodes are inspected, missing columns are automatically at the end of the data set.

Example 1: JSON

If we are interested in returning the user id and the name fields, with the first phone number:
$ || ./phones[0]

        {
            "id": 1,
            "name": "james",
            "phones": [
                {
                    "phone": "123456",
                    "type": "home"
                },
                {
                    "phone": "456789",
                    "type": "cell"
                }
            ]
        }
    

The resulting data set would look like this:

id name phones phone type
1 james ... 123456 home

Example 2: JSON

If we are interested in returning all phone numbers along with the user id and the name fields, we would use this path:
$ || ..phones

        {
            "id": 1,
            "name": "james",
            "phones": [
                {
                    "phone": "123456",
                    "type": "home"
                },
                {
                    "phone": "456789",
                    "type": "cell"
                }
            ]
        }
    

The resulting data set would look like this:

id name phones phone type
1 james ... 123456 home
1 james ... 456789 cell

Example 3: JSON

If we are interested in returning all phone numbers from an array of clients, along with the user id and the name fields, but a field name is found in multiple nodes (the type field), we would use a decoration attribute:
$ || c_=..phones or $[*] || c_=..phones

        [
        {
            "id": 1,
            "name": "james",
            "type": "customer",
            "phones": [
                {
                    "phone": "123456",
                    "type": "home"
                },
                {
                    "phone": "456789",
                    "type": "cell"
                }
            ]
        },
        {
            "id": 2,
            "name": "jane",
            "type": "owner",
            "phones": [
                {
                    "phone": "654321",
                    "type": "home"
                },
                {
                    "phone": "987654"
                }
            ]
        }
        ]
    

The resulting data set would look like this:

id name type phones c_phone c_type
1 james customer ... 123456 home
1 james customer ... 456789 cell
2 jane owner ... 654321 home
2 jane owner ... 987654

Example 4: XML

In this example, we are interested in returning the tags along with the parent items properties. Since this is an XML specification, we need to select the tags node first, and join on its parent node. Note that in this scenario, the tags node has a list of child nodes with XmlText data; these nodes will be treated as properties instead of objects, which will return a total of two records. Also, the last item node has no tags, so it will not be returned.
//tags || p_=./..

        <root>
          <item>
            <id>111111</id>
            <unique_email_id>a@mail.com</email_address>
            <tags>
                <tag_id>1</tag_id>
                <tag_flag>2</tag_flag>
            </tags>
          </item>
          <item>
            <id>222222</id>
            <email_address>b@email.com</email_address>
            <tags>
                <tag_id>3/<tag_id>
                <tag_post>4/<tag_post>
          </tags>
          </item>
          <item>
            <id>333333</id>
            <email_address>c@email.com</email_address>
          </item>
        </root>
    

The resulting data set would look like this:

tag_id tag_flag id email_address tags tag_post
1 2 111111 a@email.com ...
3 222222 b@email.com ... 4

Example 5: XML

In this example, we are interested in returning the tag_record nodes along with the grand-parent items properties. Since this is an XML specification, we need to select the tag_record node first, and join on its grand-parent node. Note that in this scenario, the tag_record node has a list of child nodes with XmlElement data; these nodes will be treated as objects (the data set will then further expand each object property), and will return a total of four records.
//tag_record || p_=./../..

        <root>
          <item>
            <id>111111</id>
            <unique_email_id>a@mail.com</email_address>
            <tags>
                <tag_record>
                    <tag_id>1</tag_id>
                    <tag_flag>2</tag_flag>
                </tag_record>
                <tag_record>
                    <tag_id>3</tag_id>
                    <tag_flag>4</tag_flag>
                </tag_record>
            </tags>
          </item>
          <item>
            <id>222222</id>
            <email_address>b@email.com</email_address>
            <tags>
                <tag_record>
                    <tag_id>5</tag_id>
                    <tag_flag>6</tag_flag>
                </tag_record>
                <tag_record>
                    <tag_id>7</tag_id>
                    <tag_flag>8</tag_flag>
                </tag_record>
          </tags>
          </item>
        </root>
    

The resulting data set would look like this:

tag_id tag_flag id email_address tags tag_post
1 2 111111 a@email.com ...
3 4 111111 a@email.com ...
5 6 222222 b@email.com ...
7 222222 b@email.com ... 8