ADLS Gen2
The ADLS Gen2 destination writes files to Microsoft Azure Data Lake Storage Gen2. To write to Azure Data Lake Storage Gen1, use the ADLS Gen1 destination.
The destination writes data based on the specified data format and creates a separate file for every partition. Before you use the ADLS Gen2 destination, you must perform some prerequisite tasks.
When you configure the ADLS Gen2 destination, you specify the Azure authentication method to use and related properties. Or, you can have the destination use Azure authentication information configured in the cluster where the pipeline runs.
You specify the output directory and write mode to use. When overwriting related partitions, first complete the overwrite partition requirement.
You select the data format to write and configure related properties. You can specify fields to use for partitioning files. You can also drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.
Prerequisites
- If necessary, create a new Azure Active Directory application for StreamSets Transformer. For information about creating a new application, see the Azure documentation.
- Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks. To write to Azure, the Transformer application requires Write and Execute permissions. If also reading from Azure, the application requires Read permission as well. For information about configuring Gen2 access control, see the Azure documentation.
- Install the Azure Blob File System driver on the cluster where the pipeline runs. Most recent cluster versions include the Azure Blob File System driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Azure Data Lake Storage Gen2 support for Hadoop, see the Azure documentation.
- Retrieve Azure Data Lake Storage Gen2 authentication information from the Azure portal for configuring the destination.
You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.
- Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.
Retrieve Authentication Information
The ADLS Gen2 destination provides several ways to authenticate connections to Azure. Depending on the authentication method that you use, the destination requires different authentication details.
If the cluster where the pipeline runs has the necessary Azure authentication information configured, then the destination uses that information by default. However, data preview is not available when using Azure authentication information configured in the cluster.
You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.
- OAuth
- When connecting using OAuth authentication, the destination requires the following information:
  - Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID. For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key. For information on accessing the application key from the Azure portal, see the Azure documentation.
  - OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
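Outside of Transformer stage properties, the same OAuth details are typically supplied to Spark as Hadoop ABFS configuration properties. A minimal PySpark sketch, assuming a hypothetical storage account named myaccount and placeholder credential values:

```python
from pyspark.sql import SparkSession

# Hypothetical account name and placeholder credentials; the fs.azure.* keys
# are the standard Hadoop ABFS OAuth (client credentials) settings.
spark = (SparkSession.builder
    .appName("adls-gen2-oauth")
    .config("spark.hadoop.fs.azure.account.auth.type.myaccount.dfs.core.windows.net",
            "OAuth")
    .config("spark.hadoop.fs.azure.account.oauth.provider.type.myaccount.dfs.core.windows.net",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    .config("spark.hadoop.fs.azure.account.oauth2.client.id.myaccount.dfs.core.windows.net",
            "<application-id>")
    .config("spark.hadoop.fs.azure.account.oauth2.client.secret.myaccount.dfs.core.windows.net",
            "<application-key>")
    .config("spark.hadoop.fs.azure.account.oauth2.client.endpoint.myaccount.dfs.core.windows.net",
            "https://login.microsoftonline.com/<tenant-id>/oauth2/token")
    .getOrCreate())
```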
- Managed Service Identity
- When connecting using Managed Service Identity authentication, the destination requires the following information:
  - Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID. For information on accessing the application ID from the Azure portal, see the Azure documentation.
  - Tenant ID - Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID. For information on accessing the tenant ID from the Azure portal, see the Azure documentation.
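The Managed Service Identity variant maps onto analogous Hadoop ABFS settings. A sketch of just the property values, again using the hypothetical myaccount account, applied the same way as in the OAuth example above:

```python
# Hadoop ABFS settings for Managed Service Identity; the account name,
# tenant ID, and application ID are placeholders.
msi_conf = {
    "fs.azure.account.auth.type.myaccount.dfs.core.windows.net": "OAuth",
    "fs.azure.account.oauth.provider.type.myaccount.dfs.core.windows.net":
        "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider",
    "fs.azure.account.oauth2.msi.tenant.myaccount.dfs.core.windows.net": "<tenant-id>",
    "fs.azure.account.oauth2.client.id.myaccount.dfs.core.windows.net": "<application-id>",
}
```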
- Shared Key
- When connecting using Shared Key authentication, the destination requires the following information:
  - Account Shared Key - Shared access key that Azure generated for the storage account. For more information on accessing the shared access key from the Azure portal, see the Azure documentation.
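Shared Key authentication has the simplest Hadoop ABFS equivalent, a sketch with placeholder values:

```python
# Hadoop ABFS settings for Shared Key authentication; account name and key
# are placeholders.
shared_key_conf = {
    "fs.azure.account.auth.type.myaccount.dfs.core.windows.net": "SharedKey",
    "fs.azure.account.key.myaccount.dfs.core.windows.net": "<account-shared-key>",
}
```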
Write Mode
The write mode determines how the ADLS Gen2 destination writes files to the destination system. When writing files, the resulting file names are based on the data format of the files.
- Overwrite files
- Removes all files in the directory before creating new files.
- Overwrite related partitions
- Removes all files in a partition before creating new files for the partition. Partitions with no data to be written are left intact.
- Write new files to new directory
- Creates a new directory and writes new files to the directory. Generates an error if the specified directory exists when you start the pipeline.
- Write new or append to existing files
- Creates new files in an existing directory. If a file of the same name exists in the directory, the destination appends data to the file.
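As a rough analogy for readers who know Spark directly, three of these modes correspond to Spark's standard save modes, while Overwrite Related Partitions additionally depends on partitioning and a Spark setting covered below. A sketch with placeholder data and output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-modes").getOrCreate()
df = spark.createDataFrame([("BE", 1), ("US", 2)], ["countrycode", "value"])
path = "abfss://<filesystem>@<account>.dfs.core.windows.net/output"  # placeholder URI

df.write.mode("overwrite").parquet(path)        # Overwrite files
# df.write.mode("errorifexists").parquet(path)  # Write new files to new directory
# df.write.mode("append").parquet(path)         # Write new or append to existing files
```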
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When writing data to Azure Data Lake Storage Gen2, Spark creates one output file per partition. When you configure the destination, you can specify fields to partition by.
You can alternatively use a Repartition processor earlier in the pipeline to partition by fields or to specify the number of partitions that you want to use.
When partitioning, you can use the Overwrite Related Partitions write option to overwrite only the partitions where data must be written, leaving other partitions intact. Note that this results in replacing the existing files in those partitions with new files.
For example, say you want the destination to write data to different partitions based on country codes. In the destination, you specify the countrycode field in the Partition by Field property and set the Write Mode property to Overwrite Related Partitions. When writing only Belgian data, the destination overwrites existing files in the BE partition with a single file of the latest data and leaves all other partitions untouched.
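In plain Spark terms, the countrycode example corresponds roughly to a partitioned write like the following sketch. The path is a placeholder, and leaving other partitions intact also requires the dynamic overwrite setting described in the next section:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()
df = spark.createDataFrame([("BE", "Brussels")], ["countrycode", "city"])
path = "abfss://<filesystem>@<account>.dfs.core.windows.net/output"  # placeholder

# Writes one subdirectory per countrycode value; with dynamic partition
# overwrite enabled (see the next section), only countrycode=BE is replaced.
df.write.mode("overwrite").partitionBy("countrycode").parquet(path)
```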
Overwrite Partition Requirement
When writing to partitioned files, the ADLS Gen2 destination can overwrite files within affected partitions rather than overwriting the entire data set. For example, if output data includes only data within a 03-2019 partition, then the destination can overwrite the files in the 03-2019 partition and leave all other partitions untouched.
To overwrite partitioned files, Spark must be configured to allow overwriting data within a partition. When writing to unpartitioned files, no action is needed.
To enable overwriting partitions, set the spark.sql.sources.partitionOverwriteMode Spark configuration property to dynamic.
You can configure the property in Spark, or you can configure the property in individual pipelines. Configure the property in Spark when you want to enable overwriting partitions for all Transformer pipelines.
To enable overwriting partitions for an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
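For example, enabling the setting for a single Spark session looks like the following sketch; in a pipeline, you would instead enter the same key and value as an extra Spark configuration property on the Cluster tab:

```python
from pyspark.sql import SparkSession

# Key:   spark.sql.sources.partitionOverwriteMode
# Value: dynamic
spark = (SparkSession.builder
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate())
```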
Data Formats
The ADLS Gen2 destination writes records based on the specified data format.
- Avro
- The destination writes an Avro file for each partition and includes the Avro schema in each file.
- Delimited
- The destination writes a delimited file for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
- JSON
- The destination writes a file for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
- ORC
- The destination writes an ORC file for each partition.
- Parquet
- The destination writes a Parquet file for each partition and includes the Parquet schema in every file.
- Text
- The destination writes a text file for every partition and uses \n as the newline character.
- XML
- The destination writes an XML file for every partition. You specify the root and row tags to use in output files.
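As one illustration of how these options surface in plain Spark, a delimited write with a header line, \n line endings, and custom delimiter, quote, and escape characters might look like this sketch (placeholder output path):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delimited-write").getOrCreate()
df = spark.createDataFrame([("BE", "Brussels")], ["countrycode", "city"])

(df.write
    .option("header", True)   # header line in each file
    .option("sep", "|")       # custom delimiter
    .option("quote", '"')     # custom quote character
    .option("escape", "\\")   # custom escape character
    .csv("abfss://<filesystem>@<account>.dfs.core.windows.net/output"))  # placeholder
```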
Configuring an ADLS Gen2 Destination
Configure an ADLS Gen2 destination to write files to Azure Data Lake Storage Gen2. Before you use the destination in a pipeline, complete the required prerequisites.
To write to Azure Data Lake Storage Gen1, use the ADLS Gen1 destination.