Amazon S3
The Amazon S3 origin reads objects stored in Amazon Simple Storage Service, also known as Amazon S3. The objects must be fully written, use the same supported data format, and share the same schema.
When reading multiple objects in a batch, the origin reads the oldest object first. Upon successfully reading an object, the origin can delete the object, move it to an archive directory, or leave it in the directory.
When the pipeline stops, the origin notes the last-modified timestamp of the last object that it processed and stores it as an offset. When the pipeline starts again, the origin continues processing from the last-saved offset by default. You can reset pipeline offsets to process all available objects.
The Amazon S3 origin reads from Amazon S3 using connection information stored in a Hadoop configuration file. Complete the prerequisite tasks before using the origin in a local pipeline.
When you configure the origin, you specify the authentication method to use. You define the bucket and path to the objects to read. The origin reads objects from the specified directory and its subdirectories. You also specify the name pattern for the objects to read. You can optionally configure another name pattern to exclude objects from processing and define post-processing actions for successfully read objects.
You can also use a connection to configure the origin.
You select the data format to read and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties. You can also configure advanced properties, such as performance-related properties and proxy server properties.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.
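As a rough illustration of the directory, subdirectory, and name pattern settings described above, the following PySpark sketch shows a conceptually similar batch read performed directly with Spark. It is not the origin's implementation; the bucket name, path, and glob patterns are placeholders, and the pathGlobFilter and recursiveFileLookup options assume Spark 3.0 or later.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-origin-sketch").getOrCreate()

# Hypothetical bucket and path; the read covers the directory and its subdirectories.
df = (spark.read
      .option("recursiveFileLookup", "true")   # include objects in subdirectories
      .option("pathGlobFilter", "*.json")      # name pattern for objects to read
      .json("s3a://example-bucket/sales/"))    # JSON Lines data, one record per line

df.show(5)
```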
Schema Requirement
All objects processed by the Amazon S3 origin must have the same schema.
When objects have different schemas, the resulting behavior depends on the data format and the version of Spark that you use. For example, the origin might skip processing delimited data with a different schema, but add null values to Parquet data with a different schema.
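For the Parquet case, the behavior resembles Spark's own schema merging, which fills columns missing from an individual file with nulls. The sketch below shows that plain Spark behavior under an assumed path; it is only an analogy, not the origin's exact logic.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-sketch").getOrCreate()

# Illustrative only: merge differing Parquet schemas across objects at a hypothetical path.
df = (spark.read
      .option("mergeSchema", "true")             # union the schemas of all objects
      .parquet("s3a://example-bucket/events/"))

# Columns absent from an individual object appear as null in rows read from that object.
df.printSchema()
```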
Authentication Method
You can configure the Amazon S3 origin to authenticate with Amazon Web Services (AWS) using an instance profile or AWS access keys. When accessing a public bucket, you can connect anonymously using no authentication.
For more information about the authentication methods and details on how to configure each method, see Amazon Security.
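When Spark accesses S3 directly through the s3a connector, the same three approaches map onto standard Hadoop properties. The sketch below is an illustration of those properties, not the origin's configuration; the credential values and provider choice are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-auth-sketch").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# AWS access keys (placeholder values).
hconf.set("fs.s3a.access.key", "MY_ACCESS_KEY_ID")
hconf.set("fs.s3a.secret.key", "MY_SECRET_ACCESS_KEY")

# Anonymous access to a public bucket, instead of keys:
# hconf.set("fs.s3a.aws.credentials.provider",
#           "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")

# With an instance profile, no keys are set; credentials come from the instance metadata.
```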
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
- Delimited, JSON, text, or XML
- When reading text-based data, Spark can split the object into multiple partitions for processing, depending on the underlying file system. Multiline JSON files cannot be split.
- Avro, ORC, or Parquet
- When reading Avro, ORC, or Parquet data, Spark can split the object into multiple partitions for processing.
Spark uses these partitions throughout the pipeline unless a processor causes Spark to shuffle the data. When you need to change the partitioning in the pipeline, use the Repartition processor.
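To make the effect concrete, the following PySpark sketch shows how a batch read yields an initial partition count and how repartitioning changes it. It illustrates general Spark behavior with an assumed path; in a Transformer pipeline you would use the Repartition processor rather than call repartition() yourself.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sketch").getOrCreate()

# Hypothetical path; Spark chooses the initial partition count when it reads the objects.
df = spark.read.parquet("s3a://example-bucket/events/")
print(df.rdd.getNumPartitions())              # initial partitions determined by the read

# A shuffle redistributes the data into the requested number of partitions.
repartitioned = df.repartition(40)
print(repartitioned.rdd.getNumPartitions())   # 40
```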
Data Formats
The Amazon S3 origin generates records based on the specified data format.
- Avro
- The origin generates a record for every Avro record in the object. Each object must contain the Avro schema. The origin uses the Avro schema to generate records.
- Delimited
- The origin generates a record for each delimited line in the object. You can specify the custom delimiter, quote, and escape characters used in the data, as illustrated in the sketch after this list.
- JSON
- By default, the origin generates a record for each line in the object. The object must use the JSON Lines format, where each line contains a complete JSON object. For details, see the JSON Lines website.
- ORC
- The origin generates a record for each Optimized Row Columnar (ORC) row in the object.
- Parquet
- The origin generates a record for every Parquet record in the object. The object must contain the Parquet schema. The origin uses the Parquet schema to generate records.
- Text
- The origin generates a record for each text line in the object. The object must use \n as the newline character.
- XML
- The origin generates a record for every row in the object. You specify the root tag used in files and the row tag used to define records.
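As a companion to the delimited and JSON entries above, the following PySpark sketch shows how custom delimiter, quote, and escape characters, and line-per-record JSON, translate into ordinary Spark read options. The paths and character values are placeholders; this is an analogy to the origin's behavior, not its internal code.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-sketch").getOrCreate()

# Delimited data with custom delimiter, quote, and escape characters (placeholder values).
delimited = (spark.read
             .option("header", "true")
             .option("sep", "|")
             .option("quote", '"')
             .option("escape", "\\")
             .csv("s3a://example-bucket/orders/"))

# JSON Lines data: one JSON object per line yields one record per line.
json_lines = spark.read.json("s3a://example-bucket/events/")
```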
Configuring an Amazon S3 Origin
Configure an Amazon S3 origin to read data in Amazon S3. Complete the prerequisite tasks before using the origin in a local pipeline.