Amazon S3
The Amazon S3 destination writes objects to Amazon S3. The destination writes data based on the specified data format and creates a separate object for every partition.
Before you run a pipeline that uses the Amazon S3 destination, make sure to complete the prerequisite tasks.
When you configure the Amazon S3 destination, you specify the authentication method to use. You can specify Amazon S3 server-side encryption for the data. You can also use a connection to configure the destination.
You specify the output location and write mode to use. When overwriting related partitions, first complete the overwrite partition requirement.
You select the data format to write and configure related properties. You can specify fields to use for partitioning files. You can also drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.
You can also configure advanced properties such as performance-related properties and proxy server properties.
Prerequisites
- Verify permissions
- The user associated with the authentication credentials in effect must have WRITE permission on the S3 bucket.
- Perform prerequisite tasks for local pipelines
- To connect to Amazon S3, Transformer uses connection information stored in a Hadoop configuration file. Before you run a local pipeline that connects to Amazon S3, complete the prerequisite tasks.
URI Scheme
You can use the s3 or s3a URI scheme when you specify the bucket to write to. The URI scheme determines the underlying client that the destination uses to write to Amazon S3.
While both URI schemes are supported for EMR clusters, Amazon recommends using the s3 URI scheme with EMR clusters for better performance, security, and reliability. For all other clusters, use the s3a URI scheme.
For more information, see the Amazon documentation.
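For example, with a hypothetical bucket named my-bucket, the same output location can be specified with either scheme:

```
s3://my-bucket/output/
s3a://my-bucket/output/
```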
Authentication Method
You can configure the Amazon S3 destination to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.
For more information about the authentication methods and details on how to configure each method, see Amazon Security.
Server-Side Encryption
You can configure the destination to use Amazon Web Services server-side encryption (SSE) to protect data written to Amazon S3. When configured for server-side encryption, the destination passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3.
- Amazon S3-Managed Encryption Keys (SSE-S3)
- When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
- AWS KMS-Managed Encryption Keys (SSE-KMS)
- When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use.
- Customer-Provided Encryption Keys (SSE-C)
- When you use server-side encryption with customer-provided keys, you specify the Base64 encoded 256-bit encryption key.
For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.
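For example, when using SSE-KMS, the master key is identified by its ARN, which follows the standard AWS format. The region, account ID, and key ID below are placeholders:

```
arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab
```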
Write Mode
The write mode determines how the Amazon S3 destination writes objects to Amazon S3. The names of the resulting objects are based on the selected data format.
- Overwrite files
- Removes all objects in the location before creating new objects.
- Overwrite related partitions
- Removes all objects in a partition before creating new objects for the partition. Partitions with no data to be written are left intact.
- Write new files to new directory
- Creates a new directory and writes new objects to the directory. Generates an error if the specified directory exists when you start the pipeline.
- Write new or append to existing files
- Creates new objects in an existing location. If an object of the same name exists in the location, the destination appends data to the object.
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When writing data to Amazon S3, Spark creates one object for each partition. When you configure the destination, you can specify fields to partition by. You can alternatively use a Repartition processor earlier in the pipeline to partition by fields or to specify the number of partitions that you want to use.
When partitioning, you can use the Overwrite Related Partitions write option to overwrite only the partitions where data must be written, leaving other partitions intact. Note that this results in replacing the existing objects in those partitions with new objects.
For example, say you want the destination to write data to different partitions based on country codes. In the destination, you specify the countrycode field in the Partition by Field property and set the Write Mode property to Overwrite Related Partitions. When writing only Belgian data, the destination overwrites existing objects in the BE partition with a single object of the latest data and leaves all other partitions untouched.
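Outside of Transformer, the equivalent behavior in plain Spark looks roughly like the following PySpark sketch. The field name, sample data, and bucket path are hypothetical; in a Transformer pipeline, the destination generates this logic from the Partition by Field and Write Mode properties.

```python
from pyspark.sql import SparkSession

# Hypothetical sketch of a partitioned write with dynamic partition overwrite.
spark = SparkSession.builder.appName("partitioned-s3-write").getOrCreate()

# Required so that "overwrite" replaces only the partitions present in the new data
# (see the Overwrite Partition Requirement section below).
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Sample data containing only Belgian records.
df = spark.createDataFrame([("BE", 1250), ("BE", 980)], ["countrycode", "total"])

(df.write
   .mode("overwrite")                   # with dynamic mode, only the BE partition is replaced
   .partitionBy("countrycode")          # one partition, and one output object, per country code
   .parquet("s3a://my-bucket/sales/"))  # hypothetical output location
```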
Overwrite Partition Requirement
When writing to partitioned data, the Amazon S3 destination can overwrite objects within affected partitions rather than overwriting the entire data set. For example, if output data includes only data within a 03-2019 partition, then the destination can overwrite the objects in the 03-2019 partition and leave all other partitions untouched.
To overwrite partitioned data, Spark must be configured to allow overwriting data within a partition. When writing to unpartitioned data, no action is needed.
To enable overwriting partitions, set the spark.sql.sources.partitionOverwriteMode Spark configuration property to dynamic.
You can configure the property in Spark, or you can configure the property in individual pipelines. Configure the property in Spark when you want to enable overwriting partitions for all Transformer pipelines.
To enable overwriting partitions for an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
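For example, on the Cluster tab you would add an extra Spark configuration property with the following key and value. The same pair can be set cluster-wide in the Spark configuration to enable the behavior for all pipelines:

```
spark.sql.sources.partitionOverwriteMode    dynamic
```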
Data Formats
The Amazon S3 destination writes records based on the specified data format.
- Avro
- The destination writes an object for each partition and includes the Avro schema in each object.
- Delimited
- The destination writes an object for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
- JSON
- The destination writes an object for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
- ORC
- The destination writes an object for each partition.
- Parquet
- The destination writes an object for each partition and includes the Parquet schema in every object.
- Text
- The destination writes an object for every partition and uses \n as the newline character.
- XML
- The destination writes an object for every partition. You specify the root and row tags to use in output files.
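For example, with a hypothetical root tag of ROWSET and row tag of ROW, each output object would contain one ROW element per record, along the lines of the following. The field names and values are invented for illustration, and the exact formatting may differ:

```xml
<ROWSET>
  <ROW>
    <countrycode>BE</countrycode>
    <total>1250</total>
  </ROW>
  <ROW>
    <countrycode>BE</countrycode>
    <total>980</total>
  </ROW>
</ROWSET>
```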