Java and Security Configuration

Data Collector includes several advanced properties that you can modify to customize the following areas:
  • Java configuration options
  • Security Manager that restricts the runtime permissions of user libraries

Java Configuration Options

You define Java configuration options used by Data Collector in the deployment.

In Control Hub, edit the deployment. In the Configure Engine section, click Advanced Configuration. Then, click Java Configuration.

When defining Java configuration options, avoid defining duplicate options. If you do define duplicates, the last option passed to the JVM usually takes precedence.
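One way to spot duplicates before saving the deployment is to reduce each option to a comparable key and look for repeats. The following is a minimal sketch, not a Data Collector tool: the sample option string and the `option_keys` helper are hypothetical, and the key rules cover only the common `-D`, `-Xms`/`-Xmx`/`-Xss`, and `-XX:` forms.

```shell
# Hypothetical Java Options string with a duplicated -Xmx option.
java_opts="-Xmx2048m -XX:+UseG1GC -Dhttps.proxyPort=3138 -Xmx4096m"

# Reduce each option to a comparable key:
#   -Dname=value       -> -Dname
#   -Xmx2048m          -> -Xmx   (same for -Xms and -Xss)
#   -XX:+Flag, -Flag   -> -XX:Flag
#   -XX:Name=value     -> -XX:Name
option_keys() {
  # $1 is intentionally unquoted so the shell splits it into one option per word.
  printf '%s\n' $1 | sed -E \
    -e 's/^(-D[^=]*)=.*/\1/' \
    -e 's/^(-Xm[sx]|-Xss)[0-9]+[kKmMgG]?$/\1/' \
    -e 's/^-XX:[+-]/-XX:/' \
    -e 's/^(-XX:[^=]*)=.*/\1/'
}

# Keys that appear more than once are duplicates.
duplicates=$(option_keys "$java_opts" | sort | uniq -d)
echo "Duplicate options: ${duplicates:-none}"
```

With the sample string, the check reports -Xmx because the option is defined twice with different values.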

Java Heap Size

Modify the Data Collector Java heap size as necessary, based on the resources available on the host machine. By default, Data Collector uses 50 percent of the available memory on the host machine as the Java heap size. In most cases, the default percentage value is sufficient.

The Java heap size determines how much memory is allocated to Data Collector and limits the amount of memory that Data Collector can use when it runs a pipeline. A running pipeline can use up to 65% of the allocated heap size. For example, with a heap size of 2048 MB, you can configure a pipeline to use up to 65% of the heap, or about 1331 MB of memory.
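The arithmetic behind the 2048 MB example can be reproduced with a short calculation. This is only an illustration of the 65% rule described above; the variable names are made up for the sketch.

```shell
heap_mb=2048       # allocated Java heap size in MB
pipeline_pct=65    # maximum share of the heap that a pipeline can use

# Integer arithmetic truncates 1331.2 to 1331.
pipeline_max_mb=$(( heap_mb * pipeline_pct / 100 ))
echo "Pipeline memory limit: ${pipeline_max_mb} MB"
```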

To modify the Java heap size, first select the JVM memory strategy to use:
  • Percentage - Allocates a percentage of the available memory on the host machine as the Java heap size.
  • Absolute - Allocates an absolute number in megabytes as the Java heap size.

Based on your selection, configure the minimum and maximum Java heap size as a percentage or as an absolute value in megabytes. To avoid constant recalculation of the allocated heap size, set both the minimum and maximum properties to the same value.

Data Collector requires a heap size of at least 1024 MB to run. As a result, the engine always uses a minimum of 1024 MB for the heap size, regardless of the configured size.
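The interaction between the two memory strategies and the 1024 MB minimum can be sketched as follows. This is a simplified model under stated assumptions, not the engine's actual logic: the helper names and the sample host memory values are hypothetical, and how the engine measures available host memory is not covered here.

```shell
min_required_mb=1024    # minimum heap size stated above

# Percentage strategy: the heap is a share of the host's available memory.
heap_from_percentage() {    # args: host_memory_mb percentage
  echo $(( $1 * $2 / 100 ))
}

# Apply the 1024 MB floor to any configured size.
effective_heap_mb() {       # args: configured_mb
  if [ "$1" -lt "$min_required_mb" ]; then
    echo "$min_required_mb"
  else
    echo "$1"
  fi
}

effective_heap_mb "$(heap_from_percentage 8192 50)"   # 50% of 8192 MB
effective_heap_mb "$(heap_from_percentage 1536 50)"   # 768 MB is raised to the minimum
effective_heap_mb 512                                 # absolute 512 MB is raised to the minimum
```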

Note: In the pipeline properties, you can use the jvm:maxMemoryMB() function to help define the percentage of the heap size the pipeline uses.

Using a Proxy Server

You can configure Data Collector to use an authenticated HTTP or HTTPS proxy server for outbound requests made to Control Hub.

Add the following Java options:

  • https.proxyUser
  • https.proxyPassword
  • https.proxyHost
  • https.proxyPort

If the proxy server uses HTTP instead of HTTPS, use http.<property name> for each property.

For example, to configure Data Collector to use an HTTPS proxy server on host 138.0.0.1 and port 3138, enter the following options in the Java Options property:

-Dhttps.proxyUser=MyName -Dhttps.proxyPassword=MyPsswrd -Dhttps.proxyHost=138.0.0.1 -Dhttps.proxyPort=3138

Then on the machine where Data Collector is running, run the following command to set the STREAMSETS_BOOTSTRAP_JAVA_OPTS environment variable to the same values:

export STREAMSETS_BOOTSTRAP_JAVA_OPTS="-Dhttps.proxyUser=MyName -Dhttps.proxyPassword=MyPsswrd -Dhttps.proxyHost=138.0.0.1 -Dhttps.proxyPort=3138"

Note: Oracle JDK disabled HTTP proxy authentication for HTTPS URLs in JDK 8 update 111. If the engine runs on a machine with Java 8u111 or later, consider using an HTTPS proxy server. Or as a workaround, consider adding the following Java property to the Java Options, setting the property to an empty string:

-Djdk.http.auth.tunneling.disabledSchemes=''

However, use this workaround with caution since it exposes credentials by sending them through an unencrypted proxy.
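Because the same proxy options must appear both in the Java Options property and in the STREAMSETS_BOOTSTRAP_JAVA_OPTS environment variable, it can help to build the string once and reuse it. The following sketch assumes the placeholder values from the example above; the shell variable names are made up for illustration.

```shell
# Placeholder values from the HTTPS proxy example.
proxy_user="MyName"
proxy_password="MyPsswrd"
proxy_host="138.0.0.1"
proxy_port="3138"
scheme="https"    # use "http" if the proxy server uses HTTP

proxy_opts="-D${scheme}.proxyUser=${proxy_user} -D${scheme}.proxyPassword=${proxy_password} -D${scheme}.proxyHost=${proxy_host} -D${scheme}.proxyPort=${proxy_port}"

# Paste this value into the Java Options property of the deployment.
echo "$proxy_opts"

# Set the same value for the bootstrap process on the engine machine.
export STREAMSETS_BOOTSTRAP_JAVA_OPTS="$proxy_opts"
```

Building the string from one set of variables keeps the two locations from drifting apart when a host or port changes.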

Remote Debugging

You can enable remote debugging to debug a Data Collector instance running on a remote machine.

Enable remote debugging by modifying the Java Options property in the Java configuration properties. Add the following debugging options to the property, where port_number is an open port number on the remote machine running Data Collector:

-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=<port_number>,suspend=n

For example, to debug Data Collector on a remote machine using port number 2005, define the Java options as follows:

-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=2005,suspend=n
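If you generate the debugging options from a script, a small helper can reject invalid port numbers before the options reach the deployment. This is a hedged sketch: `debug_opts` is a hypothetical helper, and it only checks that the port is a number between 1 and 65535, not that the port is actually open on the remote machine.

```shell
# Build the debugging options for a given port, rejecting invalid port numbers.
debug_opts() {    # args: port_number
  case "$1" in
    ''|*[!0-9]*)
      echo "port must be a number: $1" >&2
      return 1
      ;;
  esac
  if [ "$1" -lt 1 ] || [ "$1" -gt 65535 ]; then
    echo "port out of range: $1" >&2
    return 1
  fi
  echo "-Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=$1,suspend=n"
}

# Prints the same options as the port 2005 example.
debug_opts 2005
```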

Garbage Collector

You can define the Java garbage collector that Data Collector uses. By default, Data Collector uses the Concurrent Mark Sweep (CMS) garbage collector.

For example, if you configure Data Collector to use a large heap size, you might want to use the G1 garbage collector. If you define another garbage collector, test and evaluate Data Collector performance before making the same change in a production environment. Garbage collector performance depends on each particular use case.

Define the garbage collector by modifying the Java Options property in the Java configuration properties. To use the G1 garbage collector, add the following option to the property:

-XX:+UseG1GC

Security Manager

Data Collector includes a Java Security Manager that is enabled by default. For enhanced security, you can enable the Data Collector Security Manager, which prevents stages from accessing files in protected Data Collector directories.

Data Collector can use one of the following security managers:

Java Security Manager

By default, Data Collector uses the Java Security Manager, which restricts the runtime permissions of user libraries. This allows administrators to control the actions of user libraries on production systems. For example, by default, user libraries cannot call out to network resources and potentially cause denial-of-service (DoS) attacks.

The security policy is defined in the Security Policy configuration properties of the deployment. The policy uses standard Java policy file syntax.

Data Collector Security Manager

For enhanced security, enable the Data Collector Security Manager. The Data Collector Security Manager prevents stages from accessing files in protected Data Collector directories, regardless of how you define the Security Policy configuration properties of the deployment.

To enable the Data Collector Security Manager, uncomment the security_manager.sdc_manager.enable property in the Data Collector configuration properties.

Note: If you use an older JVM version, the Data Collector Security Manager might encounter known JVM issues.

Protected Directories

When the Data Collector Security Manager is enabled, the following Data Collector directories are protected directories:
  • $SDC_CONF - Stages cannot access files in the configuration directory.
  • $SDC_DATA - Stages cannot access files in the data directory.
  • $SDC_EXTERNAL_RESOURCES - Stages can read files in the external resources directory, but cannot write to files in the directory.
  • $SDC_RESOURCES - Stages can read files in the resources directory, but cannot write to files in the directory.

If needed, you can allow stages to access specific files in these protected directories by modifying Data Collector Security Manager exception properties in the Security Policy configuration properties of the deployment. However, use caution when configuring exceptions to these protected directories.

You can configure exceptions to protected directories as follows:

Exceptions for all stage libraries

To allow all stage libraries access to files in protected directories, modify the security_manager.sdc_dirs.exceptions property to define the files that can be accessed.

Exceptions for specific stage libraries

To allow a specific stage library access to files in protected directories, add the following property and then define the files that the stage library can access:

security_manager.sdc_dirs.exceptions.<stage_library_name>=<file_path>

For example, the default Data Collector configuration properties include an exception for the Java keystore credential store stage library, defined as follows:

security_manager.sdc_dirs.exceptions.lib.streamsets-datacollector-jks-credentialstore-lib=$SDC_CONF/jks-credentialStore.pkcs12

When you configure a Security Manager exception property, use the appropriate directory environment variable in the file path: $SDC_CONF, $SDC_DATA, or $SDC_RESOURCES. You can enter multiple file paths separated by commas.
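As an illustration of the comma-separated form, the following sketch assembles an exception property for one stage library with two file paths. The stage library name and file names are hypothetical, and the lib. prefix simply mirrors the default keystore example; whether your stage library needs the same prefix is an assumption to verify.

```shell
# Hypothetical stage library name, following the pattern of the default example.
stage_library="lib.my-custom-stage-lib"

# Single quotes keep $SDC_CONF and $SDC_RESOURCES literal so that the
# property contains the variable names rather than their expanded values.
paths='$SDC_CONF/my-keystore.pkcs12,$SDC_RESOURCES/lookup.csv'

property="security_manager.sdc_dirs.exceptions.${stage_library}=${paths}"
echo "$property"
```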