PUSH DRIVE
Overview
The Push Drive operation sends the Upsert and Delete streams to one or more files in any supported drive. Unlike other targets, Push Drive does not update the content of existing files; it treats target files as immutable.
If a file already exists in the target drive, it is replaced automatically. In addition, if the container/folder does not exist, it is created automatically.
Syntax
Saves the pipeline data set into a file (or files) as a push operation.
PUSH INTO DRIVE [CONNECTION]
    FORMAT < PARQUET | CSV | JSON | XML | RAW >
    { DATE_FIELD '<field>' }
    { CONTAINER '...' }
    { FILE '...' }
    -- Parquet options
    { COMPRESSION < 'NONE' | 'SNAPPY' | 'GZIP' | 'LZ4RAW' | 'ZSTD' | 'BROTLI' | 'LZO' | 'LZ4' > }
    { ROWS_PER_GROUP <n> }
    -- CSV options
    { DELIMITER '...' }
    { COMMENT_TOKENS '...' }
    { FIXED_LENGTH '...' }
    { DATE_FORMAT '...' }
    { ADD_HEADER }
    { FORCE_QUOTED }
    { FLATTEN }
    { TRIM }
    -- JSON + XML options
    { FILE_PER_ROW }
    -- XML options
    { ROOT_NODE '...' }
    { ELEMENT '...' }
    -- Raw option
    { COLUMN '<field>' }
    { WITH
        { DISCARD_ON_SUCCESS }
        { DISCARD_ON_ERROR }
        { DLQ_ON_ERROR '...' }
        { RETRY < LINEAR | EXPONENTIAL > (<n>,<n>) }
    }
    ;
Option | Description
FORMAT | The file format to use: PARQUET, CSV, JSON, XML, or RAW
DATE_FIELD | The field from the pipeline data set to use as the date/time value when date tokens appear in the FILE or CONTAINER name; if not specified, the current run date of the job pipeline is used
CONTAINER | Overrides the target folder, container, or bucket to use; can use pipeline variables, field names, and DataZen functions; when field names are used, the data is automatically partitioned accordingly
FILE | The target file name or file name pattern to use; can use pipeline variables, field names, and DataZen functions; when field names are used, the data is automatically partitioned accordingly
COMPRESSION | When using the PARQUET format, specifies the compression to use (default: SNAPPY): 'NONE', 'SNAPPY', 'GZIP', 'LZ4RAW', 'ZSTD', 'BROTLI', 'LZO', 'LZ4'
ROWS_PER_GROUP | When using the PARQUET format, determines the number of rows per group in the Parquet file (default: 5000)
DELIMITER | Column delimiter to use (for CSV only)
FIXED_LENGTH | Comma-separated list of fixed lengths when creating a fixed-length file; if specified, the DELIMITER option is ignored (for CSV only)
DATE_FORMAT | The date format to use when a column is a date/time data type; any valid format string is accepted (ex: 'o' or 'yyyyMMdd HH:mm:ss.fff') (for CSV only)
ADD_HEADER | Adds a header row to the file (for CSV only)
FORCE_QUOTED | Adds double quotes around field names and values (for CSV only)
FLATTEN | Removes line feeds from strings (for CSV only)
TRIM | Trims string values (for CSV only)
FILE_PER_ROW | Creates a single file per row of data (for JSON/XML only)
ROOT_NODE | Uses the root node name specified; default: root (for XML only)
ELEMENT | Uses the element name specified for each record; default: item (for XML only)
COLUMN | Uses the column name specified as the content to save (for RAW only; see the sketch at the end of this section)
DISCARD_ON_SUCCESS | Deletes the change log after successful completion of the push operation
DISCARD_ON_ERROR | Deletes the change log if the push operation fails
DLQ_ON_ERROR | Moves the change log to the specified subfolder or directory if the push operation fails (see the sketch after this table)
RETRY EXPONENTIAL | Retries the operation on failure up to N times, waiting exponentially longer each time, starting at P seconds (N,P)
RETRY LINEAR | Retries the operation on failure up to N times, waiting P seconds between attempts (N,P)
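The XML options and the WITH clause can be combined in a single statement. The sketch below reuses the [adls2.0] connection from the examples that follow; the 'feed' and 'entry' node names and the 'dlq' folder are illustrative assumptions, not values required by the syntax.

-- Illustrative sketch: write the change log as a single XML file,
-- retry up to 3 times (waiting 10 seconds between attempts), and move
-- the change log to a 'dlq' folder if the push still fails
PUSH INTO DRIVE [adls2.0]
    FORMAT 'XML'
    CONTAINER 'logs'
    FILE 'rss_@executionid.xml'
    ROOT_NODE 'feed'
    ELEMENT 'entry'
    WITH
        DLQ_ON_ERROR 'dlq'
        RETRY LINEAR (3,10)
    ;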
Example 1
-- Load the next inserts/updates previously captured in a cloud folder using
-- CAPTURE 'mycdc' INSERT UPDATE DELETE ON KEYS 'guid' WITH PATH [adls2.0] 'container'
LOAD UPSERTS FROM [adls2.0] 'mycdc' PATH '/container' KEY_COLUMNS 'guid';

-- Send all changes into a single Parquet file
-- and override the default rows per group used in the file
PUSH INTO DRIVE [adls2.0]
    FORMAT 'Parquet'
    CONTAINER 'logs'
    FILE 'rss_@executionid.parquet'
    ROWS_PER_GROUP 10000
    ;
Example 2
-- Load the next deletes previously captured in a cloud folder using
-- CAPTURE 'mycdc' INSERT UPDATE DELETE ON KEYS 'guid' WITH PATH [adls2.0] 'container'
LOAD DELETES FROM [adls2.0] 'mycdc' PATH '/container' KEY_COLUMNS 'guid';

-- Send all deleted records into a single CSV file with header
PUSH INTO DRIVE [adls2.0]
    FORMAT 'CSV'
    CONTAINER 'logs'
    FILE 'rss_deleted_@executionid.csv'
    DELIMITER ','
    ADD_HEADER
    ;
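The RAW format writes the value of a single column as the file content. The following minimal sketch assumes a field named 'payload' (hypothetical) holds the raw content and reuses the [adls2.0] connection from the examples above.

-- Illustrative sketch: save the content of the hypothetical 'payload'
-- field into a file in the 'logs' container
PUSH INTO DRIVE [adls2.0]
    FORMAT 'RAW'
    CONTAINER 'logs'
    FILE 'payload_@executionid.txt'
    COLUMN 'payload'
    ;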