Amazon S3 Requirements
Cluster EMR batch and cluster batch mode pipelines can process data from Amazon S3.
The requirements for cluster pipelines that read from Amazon S3 depend on the batch mode that the pipeline uses:
- Cluster EMR batch mode
  Cluster EMR batch mode pipelines use a Hadoop FS origin and run on an Amazon EMR cluster to process data from Amazon S3. Cluster EMR batch mode pipelines require a supported version of an Amazon EMR cluster with Hadoop. For a list of the supported Amazon EMR and Hadoop versions, see Available Stage Libraries.
- Cluster batch mode
  Cluster batch mode pipelines use a Hadoop FS origin and run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from Amazon S3. Cluster batch mode pipelines require a supported version of CDH or HDP. For a list of the supported CDH and HDP versions, see Available Stage Libraries.
Configuring Cluster EMR Batch Mode for Amazon S3
Cluster EMR batch mode pipelines run on an Amazon EMR cluster to process data from Amazon S3.
Cluster EMR batch mode pipelines can run on an existing Amazon EMR cluster or on a new EMR cluster that is provisioned when the pipeline starts. When you provision a new EMR cluster, you can configure whether the cluster remains active or terminates when the pipeline stops.
Data Collector can be installed on a gateway node in an existing Amazon EMR cluster. Alternatively, it can be installed outside of the EMR cluster, either on an on-premises machine or on another Amazon EC2 instance. Regardless of where Data Collector is installed, you'll likely need to modify the Amazon EMR security group to allow Data Collector to access the master node in the EMR cluster. Security groups control inbound and outbound access to EMR cluster instances. For information on configuring security groups for Amazon EMR clusters, see the Amazon EMR documentation.
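As a rough illustration of the security group change described above, the following Python sketch builds an inbound rule and, optionally, applies it with the AWS SDK for Python (boto3). It is an assumption-laden sketch, not a documented Data Collector procedure: the helper names, the security group ID, the CIDR range, and the port are all hypothetical, and the ports that Data Collector actually needs on the master node depend on your EMR setup.

```python
import ipaddress


def build_ingress_rule(cidr, port, description="Data Collector access"):
    """Build one inbound TCP rule in the shape expected by the EC2 API.

    cidr: address range of the machine running Data Collector (hypothetical).
    port: port on the EMR master node to open (hypothetical; check your setup).
    """
    ipaddress.ip_network(cidr)  # raises ValueError if the CIDR is malformed
    if not 0 < port < 65536:
        raise ValueError(f"invalid TCP port: {port}")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": description}],
    }


def authorize_rule(group_id, rule, region):
    """Apply the rule to a security group. Requires boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[rule],
    )


# Hypothetical values for illustration only.
rule = build_ingress_rule("203.0.113.10/32", 8443)
print(rule)
# authorize_rule("sg-0123456789abcdef0", rule, "us-east-1")  # uncomment to apply
```

Restricting the rule to a single /32 address keeps the opening as narrow as possible; a wider CIDR range works the same way but exposes the master node to more hosts.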
All processors and destinations supported in cluster pipelines are supported in a cluster EMR batch pipeline as long as network connectivity is correctly configured from the Amazon EMR cluster to any external system that the processors or destinations use. For example, if you include a JDBC Lookup processor in a cluster EMR batch pipeline, you must ensure that the Amazon EMR cluster can connect to the database.
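One way to check the connectivity requirement above before running the pipeline is a plain TCP reachability test, run from a node in the Amazon EMR cluster against the external system. The sketch below uses only the Python standard library. The function name, host name, and port are hypothetical placeholders, and a successful TCP connection only shows that the network path is open, not that the database credentials or driver are configured correctly.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and opens the socket;
        # the context manager closes it again as soon as the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical database endpoint used by a JDBC Lookup processor.
    host, port = "db.example.com", 5432
    status = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {status}")
```

If the check fails from a cluster node, the likely places to look are the security groups, network ACLs, and routing between the EMR cluster and the external system.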
Complete the following steps to configure a cluster EMR batch mode pipeline to read from Amazon S3:
Configuring Cluster Batch Mode for Amazon S3
Cluster batch mode pipelines run on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from Amazon S3.
Complete the following steps to configure a cluster batch mode pipeline to read from Amazon S3: