Amazon S3

The Amazon S3 destination writes objects to Amazon S3. It writes data based on the specified data format and creates a separate object for every partition.

Before you run a pipeline that uses the Amazon S3 destination, make sure to complete the prerequisite tasks.

When you configure the Amazon S3 destination, you specify the authentication method to use. You can specify Amazon S3 server-side encryption for the data. You can also use a connection to configure the destination.

You specify the output location and write mode to use. When overwriting related partitions, first complete the overwrite partition requirement.

You select the data format to write and configure related properties. You can specify fields to use for partitioning files. You can also drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.

You can also configure advanced properties such as performance-related properties and proxy server properties.

Prerequisites

Before writing to Amazon S3 with the Amazon S3 destination, complete the following prerequisite tasks:
Verify permissions
The user associated with the authentication credentials in effect must have WRITE permission on the S3 bucket.
Perform prerequisite tasks for local pipelines

To connect to Amazon S3, Transformer uses connection information stored in a Hadoop configuration file. Before you run a local pipeline that connects to Amazon S3, complete the prerequisite tasks.

URI Scheme

You can use the s3 or s3a URI scheme when you specify the bucket to write to. The URI scheme determines the underlying client that the destination uses to write to Amazon S3.

While both URI schemes are supported for EMR clusters, Amazon recommends using the s3 URI scheme with EMR clusters for better performance, security, and reliability. For all other clusters, use the s3a URI scheme.

For more information, see the Amazon documentation.
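The scheme guidance above can be sketched in a few lines of Python. This is only an illustration: the `bucket_uri` helper, the `on_emr` flag, and the bucket and path names are hypothetical and are not part of Transformer.

```python
def bucket_uri(bucket: str, path: str, on_emr: bool) -> str:
    """Build a bucket location, choosing the URI scheme by cluster type."""
    # s3 is recommended for EMR clusters; s3a for all other clusters.
    scheme = "s3" if on_emr else "s3a"
    # Normalize slashes so the result ends with exactly one "/".
    clean_path = path.strip("/")
    suffix = f"{clean_path}/" if clean_path else ""
    return f"{scheme}://{bucket}/{suffix}"


# Hypothetical bucket and path names, for illustration only.
print(bucket_uri("example-bucket", "sales/2019", on_emr=True))
print(bucket_uri("example-bucket", "sales/2019", on_emr=False))
```

The first call yields an s3 location for an EMR cluster, and the second yields an s3a location for any other cluster.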

Authentication Method

You can configure the Amazon S3 destination to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.

For more information about the authentication methods and details on how to configure each method, see Amazon Security.

Server-Side Encryption

You can configure the destination to use Amazon Web Services server-side encryption (SSE) to protect data written to Amazon S3. When configured for server-side encryption, the destination passes required server-side encryption configuration values to Amazon S3. Amazon S3 uses the values to encrypt the data as it is written to Amazon S3.

When you enable server-side encryption for the destination, you select one of the following ways that Amazon S3 manages the encryption keys:
Amazon S3-Managed Encryption Keys (SSE-S3)
When you use server-side encryption with Amazon S3-managed keys, Amazon S3 manages the encryption keys for you.
AWS KMS-Managed Encryption Keys (SSE-KMS)
When you use server-side encryption with AWS Key Management Service (KMS), you specify the Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use.
Customer-Provided Encryption Keys (SSE-C)
When you use server-side encryption with customer-provided keys, you specify the Base64-encoded 256-bit encryption key.

For more information about using server-side encryption to protect data in Amazon S3, see the Amazon S3 documentation.
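As an illustration of what a Base64-encoded 256-bit key for SSE-C looks like, the following Python sketch generates a random key and checks that a value decodes to exactly 32 bytes. The helper names are hypothetical, and how you generate and store real keys depends on your own security practices.

```python
import base64
import os


def generate_sse_c_key() -> str:
    """Return a random 256-bit (32-byte) key, Base64-encoded as text."""
    return base64.b64encode(os.urandom(32)).decode("ascii")


def is_valid_sse_c_key(encoded_key: str) -> bool:
    """Check that the value is valid Base64 and decodes to exactly 32 bytes."""
    try:
        raw = base64.b64decode(encoded_key, validate=True)
    except (ValueError, TypeError):
        # binascii.Error is a subclass of ValueError.
        return False
    return len(raw) == 32


key = generate_sse_c_key()
print(len(key), is_valid_sse_c_key(key))
```

A 32-byte key always encodes to 44 Base64 characters, so a value of any other length cannot be a valid 256-bit key.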

Write Mode

The write mode determines how the Amazon S3 destination writes objects to Amazon S3. When writing objects, the resulting names are based on the selected data format.

The Amazon S3 destination includes the following write modes:
Overwrite files
Removes all objects in the location before creating new objects.
Overwrite related partitions
Removes all objects in a partition before creating new objects for the partition. Partitions with no data to be written are left intact.
For example, say you have a bucket with ten partitions. If the processed data belongs in two partitions, the destination overwrites the two partitions with the new data. The other eight partitions remain unchanged.
Use this mode to overwrite partitions, such as when writing to a slowly changing, partitioned file dimension.
Note: Before using this option, Spark must be configured to allow overwriting data within a partition.
Write new files to new directory
Creates a new bucket and writes new objects to the bucket. Generates an error if the specified bucket exists when you start the pipeline.
Write new or append to existing files
Creates new objects in an existing location. If an object of the same name exists in the location, the destination appends data to the object.
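The difference between the two overwrite modes can be modeled with a small in-memory Python sketch. It does not call Amazon S3 or Spark. The `apply_write` function, the mode strings, and the partition and object names are hypothetical and exist only to mirror the ten-partition example above.

```python
def apply_write(existing, incoming, mode):
    """Return the contents of a location after a write.

    existing and incoming map a partition name to a list of object names.
    """
    if mode == "overwrite_files":
        # All objects in the location are removed before new ones are created.
        return {p: list(objs) for p, objs in incoming.items()}
    if mode == "overwrite_related_partitions":
        # Only partitions that receive data are replaced; others stay intact.
        result = {p: list(objs) for p, objs in existing.items()}
        for p, objs in incoming.items():
            result[p] = list(objs)
        return result
    raise ValueError(f"unsupported mode: {mode}")


# Ten existing partitions; the processed data belongs in two of them.
existing = {f"p{i}": [f"old-{i}"] for i in range(10)}
incoming = {"p3": ["new-3"], "p7": ["new-7"]}

after = apply_write(existing, incoming, "overwrite_related_partitions")
unchanged = [p for p in existing if after[p] == existing[p]]
print(len(unchanged))  # 8
```

With the `overwrite_files` mode, the same call would leave only the two written partitions in the location.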

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.

Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When writing data to Amazon S3, Spark creates one object for each partition. When you configure the destination, you can specify fields to partition by. You can alternatively use a Repartition processor earlier in the pipeline to partition by fields or to specify the number of partitions that you want to use.

When partitioning, you can use the Overwrite Related Partitions write mode to overwrite only the partitions where data must be written, leaving other partitions intact. Note that this replaces the existing objects in those partitions with new objects.

For example, say you want the destination to write data to different partitions based on country codes. In the destination, you specify the countrycode field in the Partition by Fields property and set the Write Mode property to Overwrite Related Partitions. When writing only Belgian data, the destination overwrites existing objects in the BE partition with a single object of the latest data and leaves all other partitions untouched.
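To make the country code example concrete, the following Python sketch groups records by a partition field. It assumes the Hive-style `field=value` naming that Spark's partitioned file writers commonly produce. The exact layout that the destination writes is not stated here, and the `partition_records` helper and sample records are hypothetical.

```python
from collections import defaultdict


def partition_records(records, field):
    """Group records under a partition name built from one field's value."""
    groups = defaultdict(list)
    for record in records:
        # Assumed naming: <field>=<value>, for example countrycode=BE.
        groups[f"{field}={record[field]}"].append(record)
    return dict(groups)


records = [
    {"countrycode": "BE", "city": "Ghent"},
    {"countrycode": "BE", "city": "Bruges"},
]
partitions = partition_records(records, "countrycode")
print(sorted(partitions))  # ['countrycode=BE']
```

Because the batch contains only Belgian data, only the BE partition would be affected by an overwrite of related partitions.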

Overwrite Partition Requirement

When writing to partitioned data, the Amazon S3 destination can overwrite objects within affected partitions rather than overwriting the entire data set. For example, if output data includes only data within a 03-2019 partition, then the destination can overwrite the objects in the 03-2019 partition and leave all other partitions untouched.

To overwrite partitioned data, Spark must be configured to allow overwriting data within a partition. When writing to unpartitioned data, no action is needed.

To enable overwriting partitions, set the spark.sql.sources.partitionOverwriteMode Spark configuration property to dynamic.

You can configure the property in Spark, or you can configure the property in individual pipelines. Configure the property in Spark when you want to enable overwriting partitions for all Transformer pipelines.

To enable overwriting partitions for an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
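As a sketch of the property in name-value form, the following Python snippet renders it as `--conf` arguments for spark-submit, which is one common way to pass a Spark configuration property. The `spark_submit_conf` helper is hypothetical. In a pipeline, you enter the same name and value as an extra Spark configuration property instead.

```python
OVERWRITE_PROPERTY = "spark.sql.sources.partitionOverwriteMode"


def spark_submit_conf(properties):
    """Turn a dict of Spark properties into spark-submit --conf arguments."""
    args = []
    for name, value in sorted(properties.items()):
        args.extend(["--conf", f"{name}={value}"])
    return args


print(spark_submit_conf({OVERWRITE_PROPERTY: "dynamic"}))
```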

Data Formats

The Amazon S3 destination writes records based on the specified data format.

The destination can write using the following data formats:
Avro
The destination writes an object for each partition and includes the Avro schema in each object.
Objects use the following naming convention:
part-<multipart partition number>.avro
When you configure the destination, you must specify the Avro option that matches the version of Spark that runs the pipeline: Spark 2.3, or Spark 2.4 or later.
Delimited
The destination writes an object for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
Objects use the following naming convention:
part-<multipart partition number>.csv
JSON
The destination writes an object for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
Objects use the following naming convention:
part-<multipart partition number>.json
ORC
The destination writes an object for each partition.
Objects use the following naming convention:
part-<multipart partition number>.snappy.orc
Parquet
The destination writes an object for each partition and includes the Parquet schema in every object.
Objects use the following naming convention:
part-<multipart partition number>.snappy.parquet
Text
The destination writes an object for every partition and uses \n as the newline character.
The destination writes data from a single String field. You can specify the field in the record to use.
Objects use the following naming convention:
part-<multipart partition number>.txt
XML
The destination writes an object for every partition. You specify the root and row tags to use in output files.
Objects use the following naming convention:
part-<5 digit number>
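The naming conventions above can be used to infer the data format of an object from its name. The following Python sketch is one way to do that. The `data_format_of` helper and the sample object names are hypothetical, and the exact text between `part-` and the extension is not assumed.

```python
import re

# Longer suffixes are listed first so that ".snappy.orc" and
# ".snappy.parquet" are matched as a whole.
SUFFIXES = {
    ".snappy.parquet": "Parquet",
    ".snappy.orc": "ORC",
    ".avro": "Avro",
    ".csv": "Delimited",
    ".json": "JSON",
    ".txt": "Text",
}


def data_format_of(object_name):
    """Return the data format implied by an object name, or None."""
    if not object_name.startswith("part-"):
        return None
    # XML objects have no extension, only a five-digit number.
    if re.fullmatch(r"part-\d{5}", object_name):
        return "XML"
    for suffix, data_format in SUFFIXES.items():
        if object_name.endswith(suffix):
            return data_format
    return None


print(data_format_of("part-00000.snappy.parquet"))  # Parquet
print(data_format_of("part-00000"))                 # XML
```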

Configuring an Amazon S3 Destination

Configure an Amazon S3 destination to write Amazon S3 objects. Before you run the pipeline, make sure to complete the prerequisite tasks.
  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name

    Stage name.

    Description

    Optional description.

    Stage Library

    Stage library to use to connect to Amazon S3:
    • AWS cluster-provided libraries - The cluster where the pipeline runs has Apache Hadoop Amazon Web Services libraries installed, and therefore has all of the necessary libraries to run the pipeline.
    • AWS Transformer-provided libraries for Hadoop 2.7.7 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 2.7.7.

    • AWS Transformer-provided libraries for Hadoop 3.2.0 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 3.2.0.

    Note: When using additional Amazon stages in the pipeline, ensure that they use the same stage library.
  2. On the Amazon S3 tab, configure the following properties:
    Amazon S3 Property Description
    Connection

    Connection that defines the information required to connect to an external system.

    To connect to an external system, you can select a connection that contains the details, or you can directly enter the details in the pipeline. When you select a connection, Control Hub hides other properties so that you cannot directly enter connection details in the pipeline.

    To create a new connection, click the Add New Connection icon. To view and edit the details of the selected connection, click the Edit Connection icon.

    Authentication Method

    Authentication method used to connect to Amazon Web Services (AWS):
    • AWS Keys - Authenticates using an AWS access key pair.
    • Instance Profile - Authenticates using an instance profile associated with the Transformer EC2 instance.
    • None - Connects to a public bucket using no authentication.

    Access Key ID

    AWS access key ID. Required when using AWS keys to authenticate with AWS.

    Secret Access Key

    AWS secret access key. Required when using AWS keys to authenticate with AWS.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.

    Assume Role

    Temporarily assumes another role to authenticate with AWS.

    Important: Transformer supports assuming another role when the pipeline meets the stage library and cluster type requirements.
    Role ARN

    Amazon resource name (ARN) of the role to assume, entered in the following format:

    arn:aws:iam::<account_id>:role/<role_name>

    Where <account_id> is the ID of your AWS account and <role_name> is the name of the role to assume. You must create and attach an IAM trust policy to this role that allows the role to be assumed.

    Available when assuming another role.

    Role Session Name

    Optional name for the session created by assuming a role. Overrides the default unique identifier for the session.

    Available when assuming another role.

    Session Timeout

    Maximum number of seconds for each session created by assuming a role. The session is refreshed if the pipeline continues to run for longer than this amount of time.

    Set to a value between 3,600 seconds and 43,200 seconds.

    Available when assuming another role.

    Set Session Tags

    Sets a session tag to record the name of the currently logged in Data Collector or Transformer user that starts the pipeline or the Control Hub user that starts the job for the pipeline. AWS IAM verifies that the user account set in the session tag can assume the specified role.

    Select only when the IAM trust policy attached to the role to be assumed uses session tags and restricts the session tag values to specific user accounts.

    When cleared, the connection does not set a session tag.

    Available when assuming another role.

    Use Specific Region

    Specifies the AWS region or endpoint to connect to.

    When cleared, the stage uses the Amazon S3 default global endpoint, s3.amazonaws.com.

    Region

    AWS region to connect to. Select one of the available regions. To specify an endpoint to connect to, select Other.

    Endpoint

    Endpoint to connect to when you select Other for the region. Enter the endpoint name.

    Bucket

    Location to write objects. Use the s3 or s3a URI scheme. As a best practice, use them as follows:
    For EMR clusters:
    s3://<bucket name>/<path to objects>/
    For all other clusters:
    s3a://<bucket name>/<path to objects>/

    The destination creates the path if it does not exist. The instance profile or AWS access key pair used to authenticate with AWS must have write access to the bucket.

    Write Mode

    Mode to write objects:
    • Overwrite files - Removes all objects in the location before creating new objects.
    • Overwrite related partitions - Removes all objects in a partition before creating new objects for the partition. Partitions with no data to be written are left intact.

      Use this mode to overwrite partitions, such as when writing to a slowly changing, partitioned file dimension.

      Note: Before using this option, Spark must be configured to allow overwriting data within a partition.
    • Write new or append to existing files - Creates new objects in an existing location.

      If an object of the same name exists in the location, the destination appends data to the object.

    • Write new files to new directory - Creates a new bucket and writes new objects to the bucket. Generates an error if the specified bucket exists when you start the pipeline.
    Server-Side Encryption Option

    Option that Amazon S3 uses to manage encryption keys for server-side encryption:
    • None - Do not use server-side encryption.
    • SSE-S3 - Use Amazon S3-managed keys.
    • SSE-KMS - Use Amazon Web Services KMS-managed keys.
    • SSE-C - Use customer-provided keys.

    Default is SSE-S3.

    AWS KMS Key ARN

    Amazon resource name (ARN) of the AWS KMS master encryption key that you want to use. Use the following format:
    arn:<partition>:kms:<region>:<account-id>:key/<key-id>

    For example: arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab

    Tip: To secure sensitive information, you can use credential stores or runtime resources.

    Used for SSE-KMS encryption only.

    Customer Encryption Key

    Base64-encoded 256-bit encryption key to use.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.

    Used for SSE-C encryption only.

    Partition by Fields

    Fields to use to partition the data.

    When writing to partitioned dimension files, specify the same partition fields that are listed in the Slowly Changing Dimension processor.

    Exclude Unrelated SCD Master Records

    Excludes slowly changing dimension master records from partitions that are unrelated to the change data.

    Master records in the same partitions as change data are retained. This results in overwriting only the files in the partitions where changes were written.

    Use only in a slowly changing dimension pipeline that writes to a partitioned file dimension. When writing to an unpartitioned file dimension, clear this option to include all master records.

    For guidelines on configuring slowly changing dimension pipelines, see Pipeline Configuration.

  3. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Additional Configuration

    Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files.

    To add properties, click the Add icon and define the HDFS property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by your version of Hadoop.

    Max Threads

    Maximum number of concurrent threads to use for parallel uploads.

    Buffer Hint

    TCP socket buffer size hint, in bytes.

    Default is 8192.

    Maximum Connections

    Maximum number of connections to Amazon S3.

    Default is 1.

    Connection Timeout

    Seconds to wait for a response before closing the connection.

    Socket Timeout

    Seconds to wait for a response to a query.

    Retry Count

    Maximum number of times to retry requests.

    Use Proxy

    Specifies whether to use a proxy to connect.

    Proxy Host

    Proxy host.

    Proxy Port

    Proxy port.

    Proxy User

    User name for proxy credentials.

    Proxy Password

    Password for proxy credentials.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.

    Proxy Domain

    Optional domain name for the proxy server.

    Proxy Workstation

    Optional workstation for the proxy server.
  4. On the Data Format tab, configure the following property:
    Data Format Property Description
    Data Formats

    Format of the data. Select one of the following formats:
    • Avro (Spark 2.4 or later) - For Avro data processed by Spark 2.4 or later.
    • Avro (Spark 2.3) - For Avro data processed by Spark 2.3.
    • Delimited
    • JSON
    • ORC
    • Parquet
    • Text
    • XML
  5. For delimited data, on the Data Format tab, configure the following property:
    Delimited Property Description
    Delimiter Character

    Delimiter character to use in the data. Select one of the available options or select Other to enter a custom character.

    You can enter a Unicode control character using the format \uNNNN, where N is a hexadecimal digit from the numbers 0-9 or the letters A-F. For example, enter \u0000 to use the null character as the delimiter or \u2028 to use a line separator as the delimiter.

    Quote Character

    Quote character to use in the data.

    Escape Character

    Escape character to use in the data.
  6. For text data, on the Data Format tab, configure the following property:
    Text Property Description
    Text Field

    String field in the record that contains the data to be written. All data must be incorporated into the specified field.
  7. For XML data, on the Data Format tab, configure the following properties:
    XML Property Description
    Root Tag

    Tag to use as the root element.

    Default is ROWS, which results in a <ROWS> root element.

    Row Tag

    Tag to use as a record delineator.

    Default is ROW, which results in a <ROW> record delineator element.
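To illustrate how the root and row tags shape XML output, the following Python sketch wraps each record in a row element under a single root element, using the default ROWS and ROW tags. Writing each field as a child element of the row is an assumption made for illustration, and the `to_xml` helper and sample record are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_xml(records, root_tag="ROWS", row_tag="ROW"):
    """Serialize records as one root element with one row element per record."""
    root = ET.Element(root_tag)
    for record in records:
        row = ET.SubElement(root, row_tag)
        for name, value in record.items():
            # Assumed layout: each field becomes a child element of the row.
            ET.SubElement(row, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


print(to_xml([{"id": 1, "country": "BE"}]))
# <ROWS><ROW><id>1</id><country>BE</country></ROW></ROWS>
```

Passing different `root_tag` and `row_tag` values changes only the wrapping elements, not the record fields.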