MapR Prerequisites
Due to licensing restrictions, StreamSets cannot distribute MapR libraries with Data Collector. As a result, you must perform additional steps to enable the Data Collector machine to connect to MapR. Until you complete these prerequisites, Data Collector does not display MapR origins and destinations in stage library lists, or the MapR Streams statistics aggregator in the pipeline properties.
MapR prerequisites include installing the required client libraries and then running the command to set up MapR. If the MapR cluster is enabled with built-in security, you also must configure Data Collector to connect to a secure MapR cluster and ensure that a valid ticket exists for the Data Collector user.
Supported Versions
| Supported MapR Version | Supported MEP Version |
|---|---|
| MapR 5.1.0 | N/A |
| MapR 5.2.0 | MEP 4.0 |
| MapR 6.0.0 | MEP 4.0 |
Step 1. Install Client Libraries
Install Data Collector on a node in the MapR cluster or on a client machine.
- MapR client library - Typically named mapr-client_<version>.<ext>. You can download the files for your operating system here:
      http://package.mapr.com/releases/<version>/
- Kafka client library - Typically named mapr-kafka-<version>.<ext>.
  For MapR version 5.1.0, you can download the files for your operating system here:
      http://package.mapr.com/releases/ecosystem-<version>/
  For MapR versions 5.2.0 or later, you can download the files for your operating system here:
      http://package.mapr.com/releases/MEP/MEP-<version>/
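Before moving on to the setup command, it can help to confirm that both package files were actually downloaded. The following is a minimal sketch, assuming the files sit in a single download directory and use the typical name prefixes listed above. The function name `missing_client_packages` is hypothetical and is not part of Data Collector or MapR.

```shell
# Hypothetical helper: report which client package files are missing from a
# download directory, based on the typical file-name prefixes
# (mapr-client and mapr-kafka). The parenthesized body runs in a subshell,
# so the variables do not leak into the calling shell.
missing_client_packages() (
    dir=$1
    missing=""
    for prefix in mapr-client mapr-kafka; do
        # ls fails when the glob matches no file in the directory.
        if ! ls "$dir"/"$prefix"* >/dev/null 2>&1; then
            missing="$missing $prefix"
        fi
    done
    # Print the missing prefixes; empty output means both were found.
    echo "${missing# }"
    [ -z "$missing" ]
)
```

The function returns a non-zero status when a package file is missing, so it can guard the installation step, for example `missing_client_packages ~/Downloads || echo "download the packages first"`.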
Step 2. Run the Command to Set Up MapR
After installing the required client libraries, run the setup-mapr
command. The command modifies configuration files, creates the required symbolic links, and
installs the appropriate MapR stage libraries. You can run the command in interactive or
non-interactive mode.
In interactive mode, the command prompts you for the MapR version and home directory. In non-interactive mode, you define the MapR version and home directory in environment variables before running the command.
In either mode, the command checks if the MapR distribution of Spark is installed in the specified MapR cluster. If a supported version is installed, the command also installs the MapR Spark stage library for you.
Running the Command in Interactive Mode
When you run the setup-mapr command in interactive mode, the command prompts you for the MapR version and home directory.
- Set the following environment variables:

  | Environment Variable | Description |
  |---|---|
  | SDC_HOME | Data Collector home directory. Note: The default home directory for an RPM installation is /opt/streamsets-datacollector. The tarball home directory is the location where you extracted the file. |
  | SDC_CONF | Data Collector configuration directory. |
  | MAPR_MEP_VERSION | Required for MEP 4.0. Use this environment variable to use both the MapR 6.0 and the MapR 6.0 MEP 4 stage libraries, or to use just the MapR 6.0 MEP 4 stage library. This variable is not required when using just the MapR 6.0 stage library. Do not set this environment variable for MapR versions earlier than 6.0. |

  Use the following command to set an environment variable:
      export <environment variable>=<value>
  For example, use the following commands if you used the default home and configuration directories for an RPM installation, and use MEP 4.0:
      export SDC_HOME=/opt/streamsets-datacollector
      export SDC_CONF=/etc/sdc
      export MAPR_MEP_VERSION=4
- Use the following command from the $SDC_HOME directory to set up MapR:
      bin/streamsets setup-mapr
- When prompted, enter 5.1.0, 5.2.0, or 6.0.0 for the MapR version.
- When prompted, enter the absolute path to the MapR home directory, usually /opt/mapr.
- Restart Data Collector and verify that MapR stages appear in stage library lists.
Running the Command in Non-Interactive Mode
When you run the setup-mapr command in non-interactive mode, you define the MapR version and home directory in environment variables before running the command.
- Set the following environment variables:

  | Environment Variable | Description |
  |---|---|
  | SDC_HOME | Data Collector home directory. Note: The default home directory for an RPM installation is /opt/streamsets-datacollector. The tarball home directory is the location where you extracted the file. |
  | SDC_CONF | Data Collector configuration directory. |
  | MAPR_HOME | MapR home directory, usually /opt/mapr. |
  | MAPR_VERSION | MapR version: 5.1.0, 5.2.0, or 6.0.0. |
  | MAPR_MEP_VERSION | Required for MEP 4.0. Use this environment variable to use both the MapR 6.0 and the MapR 6.0 MEP 4 stage libraries, or to use just the MapR 6.0 MEP 4 stage library. This variable is not required when using just the MapR 6.0 stage library. Do not set this environment variable for MapR versions earlier than 6.0. |

  Use the following command to set an environment variable:
      export <environment variable>=<value>
  For example, use the following commands if you used the default home and configuration directories for an RPM installation, the default MapR home directory, MapR 6.0.0, and MEP 4.0:
      export SDC_HOME=/opt/streamsets-datacollector
      export SDC_CONF=/etc/sdc
      export MAPR_HOME=/opt/mapr
      export MAPR_VERSION=6.0.0
      export MAPR_MEP_VERSION=4
- Use the following command from the $SDC_HOME directory to set up MapR:
      bin/streamsets setup-mapr
- Restart Data Collector and verify that MapR stages appear in stage library lists.
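Because non-interactive mode takes all of its input from environment variables, a typo or a stale variable is easy to miss. The following is a minimal sketch of a pre-flight check that applies the rules from the table above. The function name `check_mapr_env` is hypothetical; it is not part of Data Collector, and the setup-mapr command may perform its own validation.

```shell
# Hypothetical pre-flight check: validate the variables described above
# before running setup-mapr in non-interactive mode. The parenthesized body
# runs in a subshell, so "exit" only ends the function, not the caller.
check_mapr_env() (
    for name in SDC_HOME SDC_CONF MAPR_HOME MAPR_VERSION; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            exit 1
        fi
    done
    case $MAPR_VERSION in
        5.1.0|5.2.0)
            # MAPR_MEP_VERSION must not be set for versions earlier than 6.0.
            if [ -n "${MAPR_MEP_VERSION:-}" ]; then
                echo "unset MAPR_MEP_VERSION for MapR $MAPR_VERSION" >&2
                exit 1
            fi ;;
        6.0.0)
            # MAPR_MEP_VERSION is optional here: set for MEP 4.0 only.
            ;;
        *)
            echo "unsupported MAPR_VERSION: $MAPR_VERSION" >&2
            exit 1 ;;
    esac
)
```

One possible use is to chain the check with the setup command: `check_mapr_env && (cd "$SDC_HOME" && bin/streamsets setup-mapr)`.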
Step 3. Connect to a MapR Cluster Secured with Built-in Security
If the MapR cluster is enabled with built-in security, you must configure Data Collector to connect to a secure MapR cluster.
Modify the SDC_JAVA_OPTS environment variable to add the
-Dmaprlogin.password.enabled configuration property.
Modify the environment variable in the required file based on how you start Data Collector. For more information about the required file to edit, see Modifying Environment Variables.
- Manual start - Uncomment the following line in the sdc-env.sh file:
      #export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}"
- Service start on operating systems that use the SysV init system - On CentOS 6, Red Hat Enterprise Linux 6, or Ubuntu 14.04 LTS, uncomment the following line in the sdcd-env.sh file:
      #export SDC_JAVA_OPTS="-Dmaprlogin.password.enabled=true ${SDC_JAVA_OPTS}"
- Service start on operating systems that use the systemd init system - On CentOS 7, Red Hat Enterprise Linux 7, or Ubuntu 16.04 LTS, add the following line to the file that overrides the default settings in the sdc.service file:
      Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
  Override the default values in the sdc.service file using the same procedure that you use to override unit configuration files on a systemd init system. For an example, see "Example 2. Overriding vendor settings" in the systemd.unit manpage.
  After overriding the default values, use the following command to reload the systemd manager configuration:
      systemctl daemon-reload
After modifying the environment variables, restart Data Collector to enable the changes.
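For the systemd case, the override is usually a small drop-in file placed next to the unit. The following is a minimal sketch of writing one. The drop-in directory `/etc/systemd/system/sdc.service.d` and the file name `override.conf` follow the common systemd convention and are assumptions here, as is the hypothetical function name `write_sdc_override`. Only the `Environment=` line comes from the instructions above.

```shell
# Sketch: write a systemd drop-in that adds the MapR login property.
# ASSUMPTION: the default directory and the override.conf name follow the
# usual systemd drop-in convention; adjust them to match your system.
write_sdc_override() (
    dropin_dir=${1:-/etc/systemd/system/sdc.service.d}
    mkdir -p "$dropin_dir" || exit 1
    # A drop-in needs the [Service] section header above the setting.
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
Environment=SDC_JAVA_OPTS=-Dmaprlogin.password.enabled=true
EOF
)
```

Run as root, `write_sdc_override && systemctl daemon-reload` would write the file and reload the systemd manager configuration. Passing a directory argument lets you inspect the generated file somewhere harmless first.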
Step 4. Run Data Collector as a MapR Ticket User
To connect to a secure MapR cluster enabled with built-in security, ensure that a valid user, tenant, or service ticket exists for the Data Collector user.
To generate tickets, see the MapR documentation.
To run MapR commands in the secure cluster, Data Collector must run as the user account granted access in the MapR ticket.
For example, if you ran the following MapR command to generate the service ticket for applications running outside of the cluster:
maprlogin generateticket -type service -out /tmp/longlived_ticket -duration 30:0:0 -renewal 90:0:0
MapR credentials of user 'myappuser' for cluster 'mycluster' are written to '/tmp/longlived_ticket'
Then Data Collector must run as the "myappuser" user account.
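If you script ticket generation, you can read the user name back from the confirmation message instead of hard-coding it. The following is a minimal sketch, assuming the message has exactly the form shown in the example above. The function name `ticket_user_from_message` is hypothetical, and other MapR versions may word the message differently.

```shell
# Hypothetical helper: extract the user name from a maprlogin confirmation
# message of the form shown in the example above. Prints nothing when the
# message does not match that form.
ticket_user_from_message() {
    printf '%s\n' "$1" |
        sed -n "s/^MapR credentials of user '\([^']*\)' for cluster .*/\1/p"
}
```

For the example message above, the function prints `myappuser`, which is the account that Data Collector must run as.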
Configure Data Collector to run as the required user account based on how you start Data Collector:
- Manual start - When Data Collector is started manually, it runs as the system user account logged into the command prompt when you use the following launch command from the $SDC_DIST directory:
      bin/streamsets dc
  To connect to a secure MapR cluster, log into the command prompt as the user account granted access in the MapR ticket. Or, impersonate the required user account by using the following launch command from the $SDC_DIST directory, where <user> is the user account granted access in the MapR ticket:
      sudo -u <user> bin/streamsets dc
  For example:
      sudo -u myappuser /opt/streamsets-datacollector-3.4.0/bin/streamsets dc
- Service start - When Data Collector is started as a service, it runs as the system user account and group defined in environment variables. The default system user and group are named "sdc". To use the default "sdc" system user and group, generate a new MapR user or service ticket for the "sdc" user account, as described in the MapR documentation.
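For the manual start case, the choice between the plain launch command and the sudo form depends only on who is currently logged in. The following is a minimal sketch of that decision. The function name `launch_command` is hypothetical, and it only prints the command to run from the $SDC_DIST directory rather than executing it.

```shell
# Hypothetical helper: print the launch command to use for a manual start,
# given the user account granted access in the MapR ticket.
launch_command() {
    ticket_user=$1
    if [ "$(id -un)" = "$ticket_user" ]; then
        # Already logged in as the ticket user: launch directly.
        echo "bin/streamsets dc"
    else
        # Otherwise impersonate the ticket user with sudo.
        echo "sudo -u $ticket_user bin/streamsets dc"
    fi
}
```

Calling `launch_command myappuser` while logged in as a different account prints the sudo form shown in the list above.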