High Watermark
A high watermark pattern allows you to request data from a source system from a specific point in time, or last known high value, so that only a subset of records are returned. This pattern provides the best possible performance and minimizes resource consumption; the method to implement a high watermark depends somewhat on the source system. Using a high watermark without Change Capture is normally used for detecting both inserted and updated records.
Pattern Overview
This pattern describes a forward-only read from the source system using a high watermark
without the need for Synthetic CDC or a sliding window. When the source systems provides a
way to select a subset of data, this pattern can be useful to capture changes from a point in time,
on a go-forward basis.
Various fields can be used for tracking a high watermark, including datetime values, numeric fields,
or timestamp bytes.
In some cases, you may need to track two high watermark values: one for capturing new or updated records,
and another for capturing deleted records; this may be necessary when the source systems offers different
feeds for deleted records, and is usually limited to HTTP/S systems.
Column Name | Data Type | Description |
---|---|---|
ID | long | A unique identifier |
CustomerName | string | The customer name |
LastUpdatedOn | datetime | The date of the last change |
Example of a table that supports the use of a high watermark using the LastUpdatedDate field
Not all source systems provide filtering capabilities. If a source system does not provide such mechanism, this pattern cannot be used.
DataZen Implementation
Using high watermark values is implemented differently in DataZen depending on the source system. Once a pipeline job is configured for high watermark tracking, DataZen automatically stores and uses the watermark on subsequent calls.
Source System | High Watermark Supported | Comments |
---|---|---|
Relational Database | Yes | Support is provided against all relational engines, either applied automatically for simpler SELECT commands
or using the @highwatermark token as part of the SQL query
|
HTTP | Yes | Support is provided for query parameters and payloads as long as the API supports filering |
Drive | Yes | Support is provided against all drives using the internal system date of the file (last updated) |
Big Data / NoSQL | No | Support is provided using the @highwatermark token as part of the SQL query |
Messaging Consumers | No | Not applicable |
An administrator can modify the high watermark value if needed.See the High Watermark help section for more information.
Database Source
When the source system is a relational database, DataZen simplifies high watermark tracking by simply selecting the column of the source data that will be used for this purpose. To do this, from the Source tab, select a database connection and enter an SQL command; then click Refresh to load the data.
Once you have retrieved some data, go to the Replication Settings tab, and select the desired field for the High Watermark. In the example below, the last_update field will be used as the high watermark.
When you save this pipeline, DataZen will automatically modify the SQL command dynamically to inject the latest
high watermark value. You can also specify an initial high watermark value by clicking on Set initial pointers.
When using this automated heuristic for high watermark values, the source SQL command must be a simple SELECT operation.
Using very complex SELECT operations, or functions/procedures will not work and may throw an error. When dealing
with complex commands, use a @highwatermark
token instead.
Using the @highwatermark token
When more control over the high watermark is needed, you can add the @highwatermark
token directly as part of the SQL command. Ex: EXEC GetDataFromLastDate '@highwatermark'
. When using the
token directly in the SQL query, it is initially set to an empty string the first time the command executes, unless you
override the default value manually by clicking on the Set initial pointer... link, when available.
HTTP/S Source
Using a high watermark with HTTP sources depends on the ability of the API to handle HTTP requests with input parameters, either as part of the URI (as a query parameter) or the payload (usually in the form of a POST or PUT request) such as a GQL (Graph Query Language) request.
As a Query Parameter
When specifying a high watermark as part of the URL, using a query parameter, you need to specify two settings: the High Watermark field from the data being returned by the API, and the URL Param Name field to indicate the name of the parameter to modify dynamically when sending the request.
In the above screenshot, the data returned is first turned into a dataset (using the JSON/XML transformation); the last_changed field is then used to extract the high watermark. Then, or subsequent requests, the since_last_changed URI query parameter will be added automatically to the HTTP request.
You can also customize the value passed as part of the URI parameter, such as applying a date format, or specifying a
default value if the high watermark is not yet defined (for example, the first time you run the pipeline). To do so,
leave the URL Param Name field empty, and use the @highwatermark variable instead as part of the URI.
In the example below, the query specifies the filter parameter as required by the API for Business Central:
filter=lastModifiedDate gt #isnullorempty(@highwatermark, 2000-01-01T00:00:00.000Z)
As a Payload
When specifying a high watermark as part of the HTTP payload, you need to provide the High Watermark (the field from the data being returned by the API that contains the value), and leave the URL Param Name field empty. Then, use the @highwatermark runtime parameter in the body as needed.
In the above screenshot, the data returned is first turned into a dataset (using the JSON/XML transformation); the last_changed field is then used to extract the high watermark. Use the @highwatermark variable in the payload to inject the last known high watermark value.
Drive Source
Drive sources include the option to only use the most recently created or updated files, based on their system dates.
Alternatively, drive sources provide the ability to move files that were successfully processed in a sub-folder,
enabling a messaging-style forward-only pipeline that does not rely on system dates.
The option to filter files based on their system date is available across all drives, including: AWS S3 Buckets, Azure Containers,
Google Cloud Drives, FTP sites, and folders.
To enable high watermark tracking on a drive source, select Load newer/updated files only
Big Data / NoSQL Source
High watermark handling is supported for Big Data / NoSQL sources using the @highwatermark
replacement token, as explained in the Database Source section above.
Messaging Source
High watermark handling is not supported for Messaging sources.