MapR Requirements

Cluster mode pipelines that read from a MapR cluster have the following requirements:
Component: Spark Streaming (for cluster streaming mode)
Requirement: Spark version 2.1 or later

Component: MapR
Requirement: One of the following MapR and MapR Ecosystem Pack (MEP) versions:
  • MapR 5.2.0 and MEP 3.0
  • MapR 6.0.0 and MEP 4.0
  • MapR 6.0.1 and MEP 5.0
  • MapR 6.1.0 and MEP 6.0
Important: MapR 5.2.0 is a legacy stage library. MapR has announced the end of maintenance for versions 5.x effective April 2019, and recommends upgrading to the most recent release.

To view the complete list of MEPs supported by MapR core versions, see MEP Support by MapR Core Version.
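
The Spark version requirement above can be checked with a short script. The following is a minimal sketch, not part of Data Collector or MapR: version_at_least is a hypothetical helper, the version string is hard-coded for illustration, and the comparison assumes a sort command that supports the -V (version sort) option, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Sketch: check that a Spark version string meets the 2.1 minimum.
# version_at_least is a hypothetical helper, not a Data Collector or MapR command.
version_at_least() {
  local actual="$1" minimum="$2"
  # sort -V orders version strings; the check passes when the minimum
  # sorts first or ties with the actual version.
  [ "$(printf '%s\n%s\n' "$minimum" "$actual" | sort -V | head -n 1)" = "$minimum" ]
}

# Hard-coded for illustration; on a real node the value would come from
# the output of `spark-submit --version`.
spark_version="2.1.0"

if version_at_least "$spark_version" "2.1"; then
  echo "Spark $spark_version meets the 2.1 minimum"
else
  echo "Spark $spark_version is older than 2.1"
fi
```

The same helper could be reused for any other minimum version by changing the second argument.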

Configuring Cluster Batch Mode for MapR

Complete the following steps to configure a cluster pipeline to read from MapR in cluster batch mode.

  1. Verify the installation of MapR and YARN.
  2. Install Data Collector on a YARN gateway node.
  3. Grant the user defined in the user environment variable write permission on /user/$SDC_USER.
    The user environment variable defines the system user used to run Data Collector as a service. The file that defines the user environment variable depends on your operating system. For more information, see User and Group for Service Start.
    For example, say the user environment variable is defined as sdc and the cluster does not use Kerberos. Then you might use the following commands to create the directory and configure the necessary write permissions:
    $ sudo -u hdfs hadoop fs -mkdir /user/sdc
    $ sudo -u hdfs hadoop fs -chown sdc /user/sdc
  4. To enable Data Collector to submit YARN jobs, perform one of the following tasks:
    • On YARN, set the min.user.id property to a value equal to or lower than the user ID of the Data Collector user, typically named "sdc".
    • On YARN, add the Data Collector user name, typically "sdc", to the allowed.system.users property.
    • After you create the pipeline, specify a Hadoop FS user in the MapR FS origin.

      For the Hadoop FS User property, enter a user whose ID is equal to or higher than the min.user.id value, or whose user name is listed in the allowed.system.users property.

  5. On YARN, verify that the Hadoop logging level is set to a severity of INFO or lower.
    YARN sets the Hadoop logging level to INFO by default. To change the logging level:
    1. Edit the log4j.properties file.
      By default, the file is located in the following directory:
      /opt/mapr/hadoop/<hadoop-version>/conf/
    2. Set the log4j.rootLogger property to a severity of INFO or lower, such as DEBUG or TRACE.
  6. If YARN is configured to use Kerberos authentication, configure Data Collector to use Kerberos authentication.
    When you configure Kerberos authentication for Data Collector, you enable Data Collector to use Kerberos and define the principal and keytab.
    Important: For cluster pipelines, enter an absolute path to the keytab when configuring Data Collector. Standalone pipelines do not require an absolute path.
    Once enabled, Data Collector automatically uses the Kerberos principal and keytab to connect to any YARN cluster that uses Kerberos. For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.
  7. In the pipeline properties, on the General tab, set the Execution Mode property to Cluster Batch.
  8. On the Cluster tab, configure the following properties:
    Cluster Property Description
    Worker Java Options

      Additional Java properties for the pipeline. Separate properties with a space.

      The following properties are set by default:

      • -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC select the Concurrent Mark Sweep (CMS) garbage collector.
      • -Dlog4j.debug enables debug logging for log4j.

      Changing the default properties is not recommended.

      You can add any valid Java property.

    Launcher Env Configuration

      Additional configuration properties for the cluster launcher. Using simple or bulk edit mode, click the Add icon and define the property name and value.

    Worker Memory (MB)

      Maximum amount of memory allocated to each Data Collector worker in the cluster.

      Default is 1024 MB.

  9. In the pipeline, use the MapR FS origin for cluster mode.
    If necessary, select a cluster mode stage library on the General tab of the origin.
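
The logging-level check in step 5 can be scripted. The following is a minimal sketch under stated assumptions: root_logger_level and level_is_info_or_lower are hypothetical helpers, the demonstration runs against a temporary file rather than the real Hadoop configuration, and the property is assumed to hold a literal level. A value written as a variable reference, such as ${hadoop.root.logger}, is not resolved by this sketch.

```shell
#!/usr/bin/env bash
# Sketch: report whether the root logger level in a log4j.properties file
# is INFO or lower. Both functions are hypothetical helpers for illustration.

# Print the level (the first comma-separated token) of the given property.
root_logger_level() {
  local file="$1" property="$2"
  grep -E "^[[:space:]]*${property}[[:space:]]*=" "$file" | tail -n 1 \
    | cut -d '=' -f 2 | cut -d ',' -f 1 \
    | tr -d '[:space:]' | tr '[:lower:]' '[:upper:]'
}

# Succeed when the level is INFO or a more verbose level.
level_is_info_or_lower() {
  case "$1" in
    INFO|DEBUG|TRACE|ALL) return 0 ;;
    *) return 1 ;;
  esac
}

# Demonstration against a temporary file; on a real node, point at the
# log4j.properties file in the Hadoop conf directory instead.
demo_file="$(mktemp)"
printf 'log4j.rootLogger=INFO, console\n' > "$demo_file"

level="$(root_logger_level "$demo_file" "log4j.rootLogger")"
if level_is_info_or_lower "$level"; then
  echo "Root logger level $level is acceptable"
else
  echo "Root logger level $level is too high; set it to INFO, DEBUG, or TRACE"
fi
rm -f "$demo_file"
```

Passing a different property name as the second argument lets the same helper read other root logger properties.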

Configuring Cluster Streaming Mode for MapR

Complete the following steps to configure a cluster pipeline to read from MapR in cluster streaming mode.

  1. Verify the installation of MapR, Spark Streaming, and YARN.
  2. Install Data Collector on a Spark and YARN gateway node.
  3. To enable checkpoint metadata storage, grant the user defined in the user environment variable write permission on /user/$SDC_USER.
    The user environment variable defines the system user used to run Data Collector as a service. The file that defines the user environment variable depends on your operating system. For more information, see User and Group for Service Start.
    For example, say the user environment variable is defined as sdc and the cluster does not use Kerberos. Then you might use the following commands to create the directory and configure the necessary write permissions:
    $ sudo -u hdfs hadoop fs -mkdir /user/sdc
    $ sudo -u hdfs hadoop fs -chown sdc /user/sdc
  4. If necessary, specify the location of the spark-submit script that points to Spark version 2.1 or later.
    Data Collector assumes that the spark-submit script used to submit job requests to Spark Streaming is located at the following path:
    /usr/bin/spark-submit
    If the script is not at this location, use the SPARK_SUBMIT_YARN_COMMAND environment variable to define the location of the script.
    The location of the script may differ depending on the Spark version and distribution that you use.
    For example, say the spark-submit script is at the following path: /opt/mapr/spark/spark-2.1.0/bin/spark-submit. Then, you might use the following command to define the location of the script:
    export SPARK_SUBMIT_YARN_COMMAND=/opt/mapr/spark/spark-2.1.0/bin/spark-submit
    Note: If you change the location of the spark-submit script, you must restart Data Collector to capture the change.
  5. To enable Data Collector to submit YARN jobs, perform one of the following tasks:
    • On YARN, set the min.user.id property to a value equal to or lower than the user ID of the Data Collector user, typically named "sdc".
    • On YARN, add the Data Collector user name, typically "sdc", to the allowed.system.users property.
  6. If necessary, set the Spark logging level to a severity of INFO or lower.
    By default, MapR sets the Spark logging level to WARN. To change the logging level:
    1. Edit the log4j.properties file, located at the following path:
      <spark-home>/conf/log4j.properties
    2. Set the log4j.rootCategory property to a severity of INFO or lower, such as DEBUG or TRACE.
    For example, when using Spark 2.1.0, you would edit /opt/mapr/spark/spark-2.1.0/conf/log4j.properties, and you might set the property as follows:
    log4j.rootCategory=INFO
  7. If YARN is configured to use Kerberos authentication, configure Data Collector to use Kerberos authentication.
    When you configure Kerberos authentication for Data Collector, you enable Data Collector to use Kerberos and define the principal and keytab.
    Important: For cluster pipelines, enter an absolute path to the keytab when configuring Data Collector. Standalone pipelines do not require an absolute path.
    Once enabled, Data Collector automatically uses the Kerberos principal and keytab to connect to any YARN cluster that uses Kerberos. For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication in the Data Collector documentation.
  8. In the pipeline properties, on the General tab, set the Execution Mode property to Cluster YARN Streaming.
  9. On the Cluster tab, configure the following properties:
    Cluster Property Description
    Worker Count

      Number of workers used in a Cluster YARN Streaming pipeline. Use to limit the number of workers spawned for processing.

      Default is 0, which spawns one worker for every partition in the topic.

    Worker Java Options

      Additional Java properties for the pipeline. Separate properties with a space.

      The following properties are set by default:

      • -XX:+UseConcMarkSweepGC and -XX:+UseParNewGC select the Concurrent Mark Sweep (CMS) garbage collector.
      • -Dlog4j.debug enables debug logging for log4j.

      Changing the default properties is not recommended.

      You can add any valid Java property.

    Launcher Env Configuration

      Additional configuration properties for the cluster launcher. Using simple or bulk edit mode, click the Add icon and define the property name and value.

    Worker Memory (MB)

      Maximum amount of memory allocated to each Data Collector worker in the cluster.

      Default is 1024 MB.

    Extra Spark Configuration

      For Cluster YARN Streaming pipelines, additional Spark configurations to pass to the spark-submit script. Enter the Spark configuration name and the value to use.

      The specified configurations are passed to the spark-submit script as follows:

      spark-submit --conf <key>=<value>

      For example, to limit the off-heap memory allocated to each executor, you can use the spark.yarn.executor.memoryOverhead configuration and set it to the number of MB that you want to use.

      Data Collector does not validate the property names or values.

      For details on additional Spark configurations that you can use, see the Spark documentation for the Spark version that you are using.

  10. In the pipeline, use the MapR Streams Consumer origin for cluster mode.
    If necessary, select a cluster mode stage library on the General tab of the origin.
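
To illustrate how Extra Spark Configuration entries map onto the spark-submit command line, the following minimal sketch assembles the --conf arguments from key=value pairs. It only prints the resulting command and does not submit a job. build_conf_args is a hypothetical helper, the value 512 is an arbitrary example, and Data Collector builds the real command itself, possibly with additional arguments not shown here.

```shell
#!/usr/bin/env bash
# Sketch: show how key=value pairs become spark-submit --conf arguments.
# build_conf_args is a hypothetical helper for illustration only.
build_conf_args() {
  local pair
  local args=()
  for pair in "$@"; do
    args+=("--conf" "$pair")
  done
  # Join the arguments with single spaces and print them on one line.
  printf '%s\n' "${args[*]}"
}

# Use SPARK_SUBMIT_YARN_COMMAND when it is set; otherwise fall back to the
# default location that Data Collector assumes.
spark_submit="${SPARK_SUBMIT_YARN_COMMAND:-/usr/bin/spark-submit}"

# 512 (MB) is an example value, not a recommendation.
conf_args="$(build_conf_args "spark.yarn.executor.memoryOverhead=512")"

# Print the command instead of running it.
echo "$spark_submit $conf_args"
```

Because Data Collector does not validate the property names or values, printing the arguments in this way can help catch a mistyped key before the pipeline starts.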