Aggregation

An aggregation pattern allows you to centralize data from one or more source systems in order to build data lakes, data hubs, or data warehouses, or simply to monitor remote sites. This pattern does not assume that the source data has the same schema across source systems; this concern is shifted to downstream consumers. Usually this pattern involves the use of both high watermarks and change capture, in addition to adding metadata to the data set in order to identify its origin.
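
The combination of a high watermark and origin metadata can be illustrated with a minimal Python sketch. It is not DataZen code: the function name aggregate, the updated_at watermark column, and the _source metadata field are all hypothetical names chosen for illustration, and the sources are plain in-memory lists rather than real systems.

```python
def aggregate(sources, watermarks, key="updated_at"):
    """Collect rows newer than each source's high watermark and tag their origin.

    sources:    dict mapping a source id to a list of row dicts (schemas may differ)
    watermarks: dict mapping a source id to the highest key value already captured;
                it is updated in place so the next call only sees newer rows
    """
    collected = []
    for source_id, rows in sources.items():
        last_seen = watermarks.get(source_id)
        for row in rows:
            # Forward-only read: skip anything at or below the watermark.
            if last_seen is None or row[key] > last_seen:
                # Origin metadata lets downstream consumers tell sources apart.
                collected.append({**row, "_source": source_id})
        newest = max((r[key] for r in rows), default=last_seen)
        if newest is not None and (last_seen is None or newest > last_seen):
            watermarks[source_id] = newest
    return collected
```

Note that the rows from each source are passed through unchanged apart from the added origin field, so sources with different schemas can be centralized side by side and reconciled later by downstream consumers.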

Although the aggregation pattern is usually implemented to feed a single target system, many implementations leverage data distribution techniques (a.k.a. 'fanning' or 'sharding') when loading data lakes.

Pattern Overview

This pattern describes a forward-only read from multiple source systems, in one or more physical locations, and usually leverages one or more of the stateless patterns to eliminate already-captured data sets. It can be used, for example, in remote site monitoring scenarios.

In this pattern, data pipelines centralize all available data and captured changes into a target system. Once the data is centralized, downstream processes (such as other data pipelines) can decide how to ingest or consume it.

Finally, partitioning the data in the target system may be necessary for performance reasons. For example, if Parquet files are created on an ADLS 2.0 cloud endpoint, it may be important to batch and/or partition the data across multiple containers and files.
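
As a rough illustration of batching and partitioning, the sketch below plans which rows would go into which file, without writing any Parquet data or calling any cloud API. The source=/date= folder layout, the part-NNNNN.parquet file names, and the _source and event_time fields are assumptions made for this example only; an actual target may partition by container, by source, or by any other key that suits its query patterns.

```python
from collections import defaultdict
from datetime import datetime


def plan_files(rows, batch_size=2):
    """Group rows by (source, day), then split each group into fixed-size batches.

    Returns a dict mapping a hypothetical target file path to the rows
    that would be written to that file.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["_source"], row["event_time"].date())].append(row)

    plan = {}
    for (source, day), members in groups.items():
        # One file per batch keeps individual files at a bounded size.
        for n, start in enumerate(range(0, len(members), batch_size)):
            path = f"source={source}/date={day.isoformat()}/part-{n:05d}.parquet"
            plan[path] = members[start:start + batch_size]
    return plan
```

Partitioning on the origin and the date keeps each reader's scan limited to the files it actually needs, while the batch size bounds how large any single file can grow.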

DataZen Implementation

Implementing this pattern is fairly straightforward with DataZen. If all the data sources are located on the same network, a single DataZen Agent may be used (unless performance requirements dictate installing multiple agents). When dealing with remote sites, a DataZen Agent is usually needed at every location in order to ensure maximum read performance; in this scenario, it may be necessary to partition the target system by source location to avoid data collisions.
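
To see why partitioning by source location avoids collisions, consider two sites that each use the same local record identifier. The sketch below is a generic Python illustration, not DataZen configuration: the target is a plain dictionary, and the upsert function, the site field, and the id key are hypothetical names.

```python
def upsert(target, site, rows, key="id"):
    """Write rows into the target keyed by (site, local key).

    Including the site in the key means records from different locations
    that share the same local identifier never overwrite each other.
    """
    for row in rows:
        target[(site, row[key])] = {**row, "site": site}
    return target
```

If the target were keyed on the local identifier alone, the record from the second site would silently replace the record from the first.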

Refer to the other patterns to determine which one should be used to best read from your systems and whether CDC, High Watermark, Window, or a combination is needed.

For an implementation example, refer to this blog post: Guide - Central Monitoring