ADLS Gen1
The ADLS Gen1 destination writes files to Microsoft Azure Data Lake Storage Gen1 using Azure Active Directory service principal authentication, also known as service-to-service authentication. The destination writes data based on the specified data format and creates a separate file for every partition.
To write to Azure Data Lake Storage Gen2, use the ADLS Gen2 destination.
Before you use the ADLS Gen1 destination, you must perform some prerequisite tasks.
When you configure the ADLS Gen1 destination, you specify the service name and Azure authentication information such as the client ID and credential. Or, you can have the destination use Azure authentication information configured in the cluster where the pipeline runs.
You specify the output directory to use and whether to remove all existing files in the directory. You select the data format to write and configure related properties.
Prerequisites
- If necessary, create a new Azure Active Directory application for StreamSets Transformer.
For information about creating a new application, see the Azure documentation.
- Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.
To write to Azure, the Transformer application requires Write and Execute permissions. If also reading from Azure, the application requires Read permission as well.
For information about configuring Gen1 access control, see the Azure documentation.
- Install the Azure Data Lake Storage Gen1 driver on the cluster where the pipeline runs.
Most recent cluster versions include the ADLS Gen1 driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Hadoop support for Azure Data Lake Storage Gen1, see the Hadoop documentation.
- Retrieve Azure Data Lake Storage Gen1 authentication information from the Azure portal for configuring the destination.
You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.
- Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.
Retrieve Authentication Information
The ADLS Gen1 destination connects to Azure using Azure Active Directory service principal authentication, also known as service-to-service authentication.
The destination requires several Azure authentication details to connect to Azure. If the cluster where the pipeline runs has the necessary Azure authentication information configured, then the destination uses that information by default. However, data preview is not available when using Azure authentication information configured in the cluster.
You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.
The destination requires the following Azure authentication information:
- Client ID - Client ID for the Azure Active Directory Transformer application. Also known as the application ID.
For information on accessing this from the Azure portal, see "Get values for signing in" in the Azure documentation.
- Credential - Authentication key for the Azure Active Directory Transformer application.
For information on accessing this from the Azure portal, see "Get values for signing in" in the Azure documentation.
- OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory Transformer application.
For information on accessing this from the Azure portal, see "Step 4: Get the OAuth 2.0 token endpoint" in the Azure documentation.
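When relying on authentication information configured in the cluster instead of stage properties, these same details are typically defined in the cluster's Hadoop configuration. As a hedged sketch, the Hadoop ADLS Gen1 connector reads service principal credentials from core-site.xml properties like the following (the placeholder values are assumptions to replace with your own):

```xml
<!-- core-site.xml: service principal (ClientCredential) auth for adl:// paths -->
<property>
  <name>fs.adl.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>fs.adl.oauth2.client.id</name>
  <value>YOUR_CLIENT_ID</value>
</property>
<property>
  <name>fs.adl.oauth2.credential</name>
  <value>YOUR_AUTHENTICATION_KEY</value>
</property>
<property>
  <name>fs.adl.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
</property>
```

The refresh URL is the OAuth 2.0 token endpoint described above; the credential is the application's authentication key.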
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When writing data to Azure Data Lake Storage Gen1, Spark creates one output file for each partition. To change the number of partitions that write to a file system, add the Repartition processor before the destination.
For example, let's say that Spark splits the pipeline data into 20 partitions. The pipeline writes to Parquet files. For data scientists to efficiently analyze the Parquet data, the pipeline should write the data to a small number of Parquet files. So you use the Repartition processor before the destination to change the number of partitions to three.
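To make the one-file-per-partition behavior concrete, here is a minimal pure-Python stand-in (illustration only, not Transformer or Spark code) showing how repartitioning 20 records into three partitions before writing yields exactly three output files:

```python
import json
import os
import tempfile

def write_partitioned(records, num_partitions, out_dir):
    """Mimic a file-system destination: hash records into partitions,
    then write one file per partition, as Spark does."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        # Simple hash partitioner on a key field (hypothetical "id" field)
        partitions[hash(rec["id"]) % num_partitions].append(rec)
    paths = []
    for i, part in enumerate(partitions):
        path = os.path.join(out_dir, f"part-{i:05d}.json")
        with open(path, "w") as f:
            for rec in part:
                f.write(json.dumps(rec) + "\n")
        paths.append(path)
    return paths

records = [{"id": n, "value": n * n} for n in range(20)]
out_dir = tempfile.mkdtemp()
# "Repartition" to 3 before writing: exactly 3 output files result
files = write_partitioned(records, 3, out_dir)
print(len(files))  # 3
```

In a real pipeline, the Repartition processor performs the equivalent step: it sets the partition count, and the destination then writes one file per partition.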
Data Formats
The ADLS Gen1 destination writes records based on the specified data format.
- Avro
- The destination writes an Avro file for each partition and includes the Avro schema in each file.
- Delimited
- The destination writes a delimited file for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
- JSON
- The destination writes a file for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
- ORC
- The destination writes an ORC file for each partition.
- Parquet
- The destination writes a Parquet file for each partition and includes the Parquet schema in every file.
- Text
- The destination writes a text file for every partition and uses \n as the newline character.
- XML
- The destination writes an XML file for every partition. You specify the root and row tags to use in output files.
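For the JSON data format above, each record lands on its own line, following the JSON Lines convention. A brief sketch of what one output file's contents look like, using Python's json module (the records are hypothetical):

```python
import json

records = [
    {"id": 1, "name": "orders"},
    {"id": 2, "name": "returns"},
]

# One JSON object per line, newline-delimited: the JSON Lines layout
lines = "\n".join(json.dumps(r) for r in records)
print(lines)
# {"id": 1, "name": "orders"}
# {"id": 2, "name": "returns"}
```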
Configuring an ADLS Gen1 Destination