SELECT DRIVE

Overview

The SELECT DRIVE operation is designed to read from one or more files from a cloud drive, local path, FTP site, or network drive. Reading one file at a time is usually the preferred implementation; however, you can use this operation to read all available files and combine them into a single data set.

The file format support the * wildcard regardless of the drive selected. Depending on the drive selected, file names may be case-sensitive.

The APPLY PIPELINE operation declared within the SELECT operation allows you to add one or more transformations before capturing the final data set. When using this approach, consider that the pipeline will execute for every paged request. For example, if you use the PAGING option, and 5 pages of data are retrieved, this pipeline will execute five times. If you would like to execute a pipeline transformation after all pages have been retrieved, but before the capture operation, use the APPLY PIPELINE operation after the SELECT operation.

Using both MOVE_TO and INCLUDE_SUBFOLDERS can create an unpredictable outcome if the archive folder is itself a subdirectory.

Syntax

Reads data from one or more files located on a local folder or on cloud or FTP drives.

SELECT  
  {TOP <n>}
  < ... | * >
FROM DRIVE [CONNECTION] 
  // File name or pattern
  ( ... )
  WITH 
      FORMAT < 'PARQUET' | 'CSV' | 'XML' | 'JSON' | 'RAW' | 'BYTES' >
      { CONTAINER '...' }
      { UNZIP }

      -- CSV Options
      { HAS_HEADERS }
      { TRIM_FIELDS }
      { SKIP N }
      { QUOTED_IDENTIFIER }
      { DELIMITERS '...' }
      { FIXED_LENGTHS '...' }
      { COMMENT_TOKENS '...' }
      
      -- RAW or BYTES Option
      { LIST }

      { SINGLE_FILE }
      { NEW_FILES_ONLY }
      { MOVE_TO '...' }
      { INCLUDE_SUBFOLDERS }
      { FILENAME_COL '...' }
  
  { TRACE INTO [CONNECTION] 
    TABLE '...'
    { WITH_PAYLOAD }
  }
  { APPLY TX '...' 
    { WITH RAWCOL '...' } 
  }  

;

FORMAT

One of the file formats supported: PARQUET, CSV, XML, JSON, RAW, BYTES

CONTAINER

Overrides the path to the file when using a local folder or an FTP site, the AWS S3 bucket name, the Azure Container name, or the name of the Google Drive folder

UNZIP

Unzip the file first and processes each file with the FORMAT specified

HAS_HEADERS

Assumes the file has a header line (CSV only)

TRIM_FIELDS

Applies a trim operation to each field (CSV only)

SKIP

The number of records to skip before reading the content (CSV only)

QUOTED_IDENTIFIER

The character to use identifying data in quotes (ex: '"' ) (CSV only)

DELIMITERS

The list of characters to use as field delimiters (ex: ',' or ';|') - leave blank for automatic detection of the delimiter used by the file (CSV only)

FIXED_LENGTHS

Assumes the file does not use delimiters between fields and instead has fixed positions as provided and separated by a comma (ex: 0,15,55,200) (CSV only)

COMMENT_TOKENS

Comma-separated list of leading characters used to exclude lines (CSV only)

LIST

Returns the list of files without their content (RAW and BYTES only)

SINGLE_FILE

Indicates that each file should be processed individually as a separate execution (a new executionid will be created)

NEW_FILES_ONLY

Keep a high watermark of the last modified date of a file and only return newer files

MOVE_TO

The folder or subfolder to move the file into after successfully extracting its content

INCLUDE_SUBFOLDERS

Inspects subfolders to search for files (warning: use carefully when MOVE_TO points to a subfolder)

FILENAME_COL

Adds a column with the name provided containing the file name each record was extracted from

TRACE INTO

Saves the content of each file into a database for further inspection and troubleshooting purposes

APPLY PIPELINE

Executes an inline pipeline consisting of one or more components for each file retrieved

Example 1


-- Retrieves all parquet files in a cloud drive 
-- and add a column with the name of the file

SELECT * FROM DRIVE [adls2.0] (*.parquet)
  WITH 
   FORMAT 'Parquet'
   CONTAINER 'logs'
   FILENAME_COL '_filename'
;

Example 2


-- Retrieves available files from an SFTP site in a 
-- given directory, expecting CSV files, and moving
-- them to an archive folder upon completion

-- FTP files are case-sensitive - return all files under the 'data'
-- directory containing Weather in their names
SELECT * FROM DRIVE [mysftp] (data/*Weather*)
  WITH 
   FORMAT 'CSV'
   SINGLE_FILE		-- process one file at a time per execution
   MOVE_TO '/archive'
;