SELECT DRIVE

Overview

The SELECT DRIVE operation is designed to read from one or more files from a cloud drive, local path, FTP site, or network drive. Reading one file at a time is usually the preferred implementation; however, you can use this operation to read all available files and combine them into a single data set.

The file format support the * wildcard regardless of the drive selected. Depending on the drive selected, file names may be case-sensitive.

The APPLY PIPELINE operation declared within the SELECT operation allows you to add one or more transformations before capturing the final data set. When using this approach, consider that the pipeline will execute for every paged request. For example, if you use the PAGING option, and 5 pages of data are retrieved, this pipeline will execute five times. If you would like to execute a pipeline transformation after all pages have been retrieved, but before the capture operation, use the APPLY PIPELINE operation after the SELECT operation.

Using both MOVE_TO and INCLUDE_SUBFOLDERS can create an unpredictable outcome if the archive folder is itself a subdirectory.

Syntax

Reads data from one or more files located on a local folder or on cloud or FTP drives.

SELECT  
  {TOP <n>}
  < ... | * >
FROM DRIVE [CONNECTION] 
  // File name or pattern
  ( ... )
  WITH 
      FORMAT < 'PARQUET' | 'CSV' | 'XML' | 'JSON' | 'RAW' | 'BYTES' >
      { CONTAINER '...' }
      { UNZIP }

      -- CSV Options
      { HAS_HEADERS }
      { TRIM_FIELDS }
      { SKIP N }
      { QUOTED_IDENTIFIER }
      { DELIMITERS '...' }
      { FIXED_LENGTHS '...' }
      { COMMENT_TOKENS '...' }
      
      -- RAW or BYTES Option
      { LIST }

      { SINGLE_FILE }
      { NEW_FILES_ONLY }
      { MOVE_TO '...' }
      { INCLUDE_SUBFOLDERS }
      { FILENAME_COL '...' }
  
  { TRACE INTO [CONNECTION] 
    TABLE '...'
    { WITH_PAYLOAD }
  }
  { APPLY TX '...' 
    { WITH RAWCOL '...' } 
  }  

;

TOP	Client-side operator that limits the number of records after the inner pipeline has executed
FORMAT	One of the file formats supported: PARQUET, CSV, XML, JSON, RAW, BYTES
CONTAINER	Overrides the path to the file when using a local folder or an FTP site, the AWS S3 bucket name, the Azure Container name, or the name of the Google Drive folder
UNZIP	Unzip the file first and processes each file with the FORMAT specified
HAS_HEADERS	Assumes the file has a header line (CSV only)
TRIM_FIELDS	Applies a trim operation to each field (CSV only)
SKIP	The number of records to skip before reading the content (CSV only)
QUOTED_IDENTIFIER	The character to use identifying data in quotes (ex: '"' ) (CSV only)
DELIMITERS	The list of characters to use as field delimiters (ex: ',' or ';\|') - leave blank for automatic detection of the delimiter used by the file (CSV only)
FIXED_LENGTHS	Assumes the file does not use delimiters between fields and instead has fixed positions as provided and separated by a comma (ex: 0,15,55,200) (CSV only)
COMMENT_TOKENS	Comma-separated list of leading characters used to exclude lines (CSV only)
LIST	Returns the list of files without their content (RAW and BYTES only)
SINGLE_FILE	Indicates that each file should be processed individually as a separate execution (a new executionid will be created)
NEW_FILES_ONLY	Keep a high watermark of the last modified date of a file and only return newer files
MOVE_TO	The folder or subfolder to move the file into after successfully extracting its content
INCLUDE_SUBFOLDERS	Inspects subfolders to search for files (warning: use carefully when MOVE_TO points to a subfolder)
FILENAME_COL	Adds a column with the name provided containing the file name each record was extracted from
TRACE INTO	Saves the content of each file into a database for further inspection and troubleshooting purposes
APPLY PIPELINE	Executes an inline pipeline consisting of one or more components for each file retrieved

Example 1


-- Retrieves all parquet files in a cloud drive 
-- and add a column with the name of the file

SELECT * FROM DRIVE [adls2.0] (*.parquet)
  WITH 
   FORMAT 'Parquet'
   CONTAINER 'logs'
   FILENAME_COL '_filename'
;

Example 2


-- Retrieves available files from an SFTP site in a 
-- given directory, expecting CSV files, and moving
-- them to an archive folder upon completion

-- FTP files are case-sensitive - return all files under the 'data'
-- directory containing Weather in their names
SELECT * FROM DRIVE [mysftp] (data/*Weather*)
  WITH 
   FORMAT 'CSV'
   SINGLE_FILE		-- process one file at a time per execution
   MOVE_TO '/archive'
;