ADLS Gen1
The ADLS Gen1 destination writes files to Microsoft Azure Data Lake Storage Gen1 using Azure Active Directory service principal authentication, also known as service-to-service authentication. The destination writes data based on the specified data format and creates a separate file for every partition.
To write to Azure Data Lake Storage Gen2, use the ADLS Gen2 destination.
Before you use the ADLS Gen1 destination, you must perform some prerequisite tasks.
When you configure the ADLS Gen1 destination, you specify the service name and Azure authentication information such as the client ID and credential. Or, you can have the destination use Azure authentication information configured in the cluster where the pipeline runs.
You specify the output directory to use and whether to remove all existing files in the directory. You select the data format to write and configure related properties.
Prerequisites
- If necessary, create a new Azure Active Directory application for StreamSets Transformer.
For information about creating a new application, see the Azure documentation.
- Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.
To write to Azure, the Transformer application requires Write and Execute permissions. If also reading from Azure, the application requires Read permission as well.
For information about configuring Gen1 access control, see the Azure documentation.
- Install the Azure Data Lake Storage Gen1 driver on the cluster where the pipeline runs.
Most recent cluster versions include the ADLS Gen1 driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Hadoop support for Azure Data Lake Storage Gen1, see the Hadoop documentation.
- Retrieve Azure Data Lake Storage Gen1 authentication information from the Azure portal for configuring the destination.
You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.
- Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.
Retrieve Authentication Information
The ADLS Gen1 destination connects to Azure using Azure Active Directory service principal authentication, also known as service-to-service authentication.
The destination requires several Azure authentication details to connect to Azure. If the cluster where the pipeline runs has the necessary Azure authentication information configured, then the destination uses that information by default. However, data preview is not available when using Azure authentication information configured in the cluster.
You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.
The destination requires the following Azure authentication information:
- Client ID - Client ID for the Azure Active Directory Transformer application. Also known as the application ID.
For information on accessing this from the Azure portal, see "Get values for signing in" in the Azure documentation.
- Credential - Authentication key for the Azure Active Directory Transformer application.
For information on accessing this from the Azure portal, see "Get values for signing in" in the Azure documentation.
- OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory Transformer application.
For information on accessing this from the Azure portal, see "Step 4: Get the OAuth 2.0 token endpoint" in the Azure documentation.
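When relying on authentication information configured in the cluster instead of stage properties, these same details are typically defined in the cluster's Hadoop configuration. As a hedged sketch, the Hadoop ADLS Gen1 connector reads service principal credentials from core-site.xml properties like the following (the placeholder values are assumptions to replace with your own):

```xml
<!-- core-site.xml: service principal (ClientCredential) auth for adl:// paths -->
<property>
  <name>fs.adl.oauth2.access.token.provider.type</name>
  <value>ClientCredential</value>
</property>
<property>
  <name>fs.adl.oauth2.client.id</name>
  <value>YOUR_CLIENT_ID</value>
</property>
<property>
  <name>fs.adl.oauth2.credential</name>
  <value>YOUR_AUTHENTICATION_KEY</value>
</property>
<property>
  <name>fs.adl.oauth2.refresh.url</name>
  <value>https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token</value>
</property>
```

The refresh URL is the OAuth 2.0 token endpoint described above; the credential is the application's authentication key.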
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When writing data to Azure Data Lake Storage Gen1, Spark creates one output file for each partition. To change the number of partitions that write to a file system, add the Repartition processor before the destination.
For example, let's say that Spark splits the pipeline data into 20 partitions. The pipeline writes to Parquet files. For data scientists to efficiently analyze the Parquet data, the pipeline should write the data to a small number of Parquet files. So you use the Repartition processor before the destination to change the number of partitions to three.
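To make the one-file-per-partition behavior concrete, here is a minimal pure-Python stand-in (illustration only, not Transformer or Spark code) showing how repartitioning 20 records into three partitions before writing yields exactly three output files:

```python
import json
import os
import tempfile

def write_partitioned(records, num_partitions, out_dir):
    """Mimic a file-system destination: hash records into partitions,
    then write one file per partition, as Spark does."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        # Simple hash partitioner on a key field (hypothetical "id" field)
        partitions[hash(rec["id"]) % num_partitions].append(rec)
    paths = []
    for i, part in enumerate(partitions):
        path = os.path.join(out_dir, f"part-{i:05d}.json")
        with open(path, "w") as f:
            for rec in part:
                f.write(json.dumps(rec) + "\n")
        paths.append(path)
    return paths

records = [{"id": n, "value": n * n} for n in range(20)]
out_dir = tempfile.mkdtemp()
# "Repartition" to 3 before writing: exactly 3 output files result
files = write_partitioned(records, 3, out_dir)
print(len(files))  # 3
```

In a real pipeline, the Repartition processor performs the equivalent step: it sets the partition count, and the destination then writes one file per partition.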
Data Formats
The ADLS Gen1 destination writes records based on the specified data format.
- Avro
- The destination writes an Avro file for each partition and includes the Avro schema in each file.
- Delimited
- The destination writes a delimited file for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
- JSON
- The destination writes a file for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
- ORC
- The destination writes an ORC file for each partition.
- Parquet
- The destination writes a Parquet file for each partition and includes the Parquet schema in every file.
- Text
- The destination writes a text file for every partition and uses \n as the newline character.
- XML
- The destination writes an XML file for every partition. You specify the root and row tags to use in output files.
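For the JSON data format above, each record lands on its own line, following the JSON Lines convention. A brief sketch of what one output file's contents look like, using Python's json module (the records are hypothetical):

```python
import json

records = [
    {"id": 1, "name": "orders"},
    {"id": 2, "name": "returns"},
]

# One JSON object per line, newline-delimited: the JSON Lines layout
lines = "\n".join(json.dumps(r) for r in records)
print(lines)
# {"id": 1, "name": "orders"}
# {"id": 2, "name": "returns"}
```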
Configuring an ADLS Gen1 Destination