ADLS Gen1

The ADLS Gen1 destination writes files to Microsoft Azure Data Lake Storage Gen1 using Azure Active Directory service principal authentication, also known as service-to-service authentication. The destination writes data based on the specified data format and creates a separate file for every partition.

To write to Azure Data Lake Storage Gen2, use the ADLS Gen2 destination.

Before you use the ADLS Gen1 destination, you must perform some prerequisite tasks.

When you configure the ADLS Gen1 destination, you specify the service name and Azure authentication information, such as the client ID and credential. Alternatively, you can have the destination use Azure authentication information configured in the cluster where the pipeline runs.

You specify the output directory to use and whether to remove all existing files in the directory. You select the data format to write and configure related properties.

Prerequisites

Complete the following prerequisites, as needed, before you configure the ADLS Gen1 destination:
  1. If necessary, create a new Azure Active Directory application for StreamSets Transformer.

    For information about creating a new application, see the Azure documentation.

  2. Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.

    To write to Azure, the Transformer application requires Write and Execute permissions. If also reading from Azure, the application requires Read permission as well.

    For information about configuring Gen1 access control, see the Azure documentation.

  3. Install the Azure Data Lake Storage Gen1 driver on the cluster where the pipeline runs.

    Most recent cluster versions include the ADLS Gen1 driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Hadoop support for Azure Data Lake Storage Gen1, see the Hadoop documentation.

  4. Retrieve Azure Data Lake Storage Gen1 authentication information from the Azure portal for configuring the destination.

    You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.

  5. Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.

Retrieve Authentication Information

The ADLS Gen1 destination connects to Azure using Azure Active Directory service principal authentication, also known as service-to-service authentication.

The destination requires several Azure authentication details to connect to Azure. If the cluster where the pipeline runs has the necessary Azure authentication information configured, then the destination uses that information by default. However, data preview is not available when using Azure authentication information configured in the cluster.

You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.

The destination requires the following Azure authentication information:

  • Client ID - Client ID for the Azure Active Directory Transformer application. Also known as the application ID.

    For information on accessing this from the Azure portal, see "Get values for signing in" in the Azure documentation.

  • Credential - Authentication key for the Azure Active Directory Transformer application.

    For information on accessing this from the Azure portal, see "Get values for signing in" in the Azure documentation.

  • OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory Transformer application.

    For information on accessing this from the Azure portal, see "Step 4: Get the OAuth 2.0 token endpoint" in the Azure documentation.
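
When a cluster supplies this authentication information, it typically does so through Hadoop configuration properties. The following sketch maps the three values above onto the property names used by the `hadoop-azure-datalake` module; the values shown are placeholders, not working credentials:

```python
# Hadoop configuration properties that carry the same service principal
# details when authentication is configured at the cluster level.
# Property names come from the hadoop-azure-datalake module; the values
# are placeholders.
adls_gen1_auth = {
    "fs.adl.oauth2.access.token.provider.type": "ClientCredential",
    "fs.adl.oauth2.client.id": "<client-id>",            # Client ID
    "fs.adl.oauth2.credential": "<authentication-key>",  # Credential
    # OAuth Token Endpoint
    "fs.adl.oauth2.refresh.url": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}
```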

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.

Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When writing data to Azure Data Lake Storage Gen1, Spark creates one output file for each partition. To change the number of partitions that write to a file system, add the Repartition processor before the destination.

For example, let's say that Spark splits the pipeline data into 20 partitions. The pipeline writes to Parquet files. For data scientists to efficiently analyze the Parquet data, the pipeline should write the data to a small number of Parquet files. So you use the Repartition processor before the destination to change the number of partitions to three.
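
Because the destination writes one file per partition, the effect of repartitioning on output-file count can be sketched without Spark. The naming here follows the part-file conventions described under Data Formats; real Spark file names may carry additional task identifiers:

```python
def output_files(num_partitions, extension="snappy.parquet"):
    # One output file per partition: part-00000, part-00001, ...
    return ["part-{:05d}.{}".format(i, extension) for i in range(num_partitions)]

# 20 initial partitions would produce 20 Parquet files; after a
# Repartition processor sets 3 partitions, only 3 files are written.
print(len(output_files(20)))  # 20
print(output_files(3))
```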

Data Formats

The ADLS Gen1 destination writes records based on the specified data format.

The destination can write using the following data formats:
Avro
The destination writes an Avro file for each partition and includes the Avro schema in each file.
Output files use the following naming convention:
part-<multipart partition number>.avro
Delimited
The destination writes a delimited file for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
Output files use the following naming convention:
part-<multipart partition number>.csv
JSON
The destination writes a file for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
Output files use the following naming convention:
part-<multipart partition number>.json
ORC
The destination writes an ORC file for each partition.
Output files use the following naming convention:
part-<multipart partition number>.snappy.orc
Parquet
The destination writes a Parquet file for each partition and includes the Parquet schema in every file.
Output files use the following naming convention:
part-<multipart partition number>.snappy.parquet
Text
The destination writes a text file for every partition and uses \n as the newline character.
The destination writes data from a single String field to output files. You can specify the field in the record to use.
Output files use the following naming convention:
part-<multipart partition number>.txt
XML
The destination writes an XML file for every partition. You specify the root and row tags to use in output files.
Output files use the following naming convention:
part-<5 digit number>
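
The naming conventions above differ only in the suffix, so the expected file name for a given format and partition number can be sketched in plain Python, reflecting only the conventions listed here:

```python
# Suffix that each data format appends to the part-number prefix.
# XML output files carry no extension, only the five-digit part number.
FORMAT_SUFFIX = {
    "Avro": ".avro",
    "Delimited": ".csv",
    "JSON": ".json",
    "ORC": ".snappy.orc",
    "Parquet": ".snappy.parquet",
    "Text": ".txt",
    "XML": "",
}

def output_file_name(data_format, partition):
    return "part-{:05d}{}".format(partition, FORMAT_SUFFIX[data_format])

print(output_file_name("Parquet", 0))  # part-00000.snappy.parquet
```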

Configuring an ADLS Gen1 Destination

Configure an ADLS Gen1 destination to write files to Azure Data Lake Storage Gen1. Before you use this destination, complete the required prerequisites.

To write to Azure Data Lake Storage Gen2, use the ADLS Gen2 destination.

  1. On the Properties panel, on the General tab, configure the following properties:
    Name - Stage name.
    Description - Optional description.
    Stage Library - Stage library to use:
    • Azure for Hadoop 3.2.0 - Use when the cluster where the pipeline runs includes Apache Hadoop Azure Data Lake Support version 3.2.0.

      When selected, Transformer assumes that the cluster where the pipeline runs has Apache Hadoop Azure Data Lake Support 3.2.0 installed, and therefore has all of the necessary libraries to run the pipeline.

    • Azure with No Dependency - Use when running the pipeline locally or when the cluster where the pipeline runs does not include Apache Hadoop Azure Data Lake Support version 3.2.0.

      When selected, Transformer passes all of the necessary libraries to the cluster to enable running the pipeline.

  2. On the Azure tab, configure the following properties:
    Service Name - Azure Data Lake Storage Gen1 service name.
    Client ID - Client ID for the Azure Active Directory Transformer application. Also known as the application ID.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing this from the Azure portal, see "Get values for signing in" in the Azure documentation.

    Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.
    Credential - Authentication key for the Azure Active Directory Transformer application.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing this from the Azure portal, see "Get values for signing in" in the Azure documentation.
    Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.
    OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory Transformer application.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing this from the Azure portal, see "Step 4: Get the OAuth 2.0 token endpoint" in the Azure documentation.

    Directory Path - Path to the directory for the output files. Use the following format:

    /<path to files>/

    Write Mode - Mode to write files:
    • Overwrite existing files - Deletes all files in the directory before creating new files.
    • Write data to new files - Creates new files without affecting existing files in the directory.
    Additional Configuration - Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files.

    To add properties, click the Add icon and define the HDFS property name and value. Use the property names and values as expected by your version of Hadoop.

  3. On the Data Format tab, configure the following property:
    Data Format - Format of the data. Select one of the following formats:
    • Avro
    • Delimited
    • JSON
    • ORC
    • Parquet
    • Text
    • XML
  4. For delimited data, on the Data Format tab, configure the following property:
    Delimiter Character - Delimiter character to use in the data. Select one of the available options or use Other to enter a custom character.

    You can enter a Unicode control character using the format \uNNNN, where each N is a hexadecimal digit from the numbers 0-9 or the letters A-F. For example, enter \u0000 to use the null character as the delimiter or \u2028 to use a line separator as the delimiter.

    Quote Character - Quote character to use in the data.
    Escape Character - Escape character to use in the data.
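
The \uNNNN form denotes a Unicode code point directly. The two examples from the delimiter description can be checked in plain Python:

```python
# "\u0000" is the null character and "\u2028" is the Unicode line
# separator; either can serve as a custom delimiter character.
null_char = "\u0000"
line_separator = "\u2028"

print(ord(null_char))            # 0
print(hex(ord(line_separator)))  # 0x2028
```
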
  5. For text data, on the Data Format tab, configure the following property:
    Text Field - String field in the record that contains the data to be written. All data must be incorporated into the specified field.
  6. For XML data, on the Data Format tab, configure the following properties:
    Root Tag - Tag to use as the root element.

    Default is ROWS, which results in a <ROWS> root element.

    Row Tag - Tag to use as a record delineator.

    Default is ROW, which results in a <ROW> record delineator element.
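
With the default tags, each output file wraps its records as shown in this sketch. This is a hypothetical serializer written to illustrate the tag layout, not Transformer's actual XML writer; the assumption that field names become child elements is ours:

```python
def to_xml(records, root_tag="ROWS", row_tag="ROW"):
    # Wrap each record in the row tag and the whole file in the root tag.
    rows = "".join(
        "<{t}>{body}</{t}>".format(
            t=row_tag,
            body="".join("<{k}>{v}</{k}>".format(k=k, v=v) for k, v in r.items()),
        )
        for r in records
    )
    return "<{t}>{rows}</{t}>".format(t=root_tag, rows=rows)

print(to_xml([{"id": 1, "name": "a"}]))
# <ROWS><ROW><id>1</id><name>a</name></ROW></ROWS>
```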