Release Notes

4.2.x Release Notes

The Data Collector 4.2.x releases occurred on the following dates:
  • 4.2.1 on December 23, 2021
  • 4.2.0 on November 9, 2021

New Features and Enhancements

Updated stages
  • Couchbase Lookup processor property name updates - For clarity, the following property names have been changed:
    • Property Name is now Sub-Document Path.
    • Return Properties is now Return Sub-Documents.
    • SDC Field is now Output Field.
    • When performing a key value lookup and configuring multiple return properties, the Property Mappings property is now Sub-Document Mappings.
    • When performing an N1QL lookup and configuring multiple return properties, the Property Mappings property is now Sub-N1QL Mappings.
  • Einstein Analytics destination enhancements:
  • HTTP Client stage statistics - HTTP Client stages provide additional metrics when you monitor the pipeline.
  • PostgreSQL CDC Client origin - You can specify the SSL mode to use on the new Encryption tab of the origin.
  • Salesforce destination - The destination supports performing hard deletes when using the Salesforce Bulk API. Hard deletes permanently delete records, bypassing the Salesforce Recycle Bin.
  • Salesforce stages - Salesforce stages now use version 53.0.0 of the Salesforce API by default.
  • SFTP/FTP/FTPS stages - All SFTP/FTP/FTPS Client stages now support HTTP and SOCKS proxies.
Connections when registered with Control Hub
  • When Data Collector version 4.2.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stage supports using Control Hub connections:
    • Cassandra destination
  • SFTP/FTP/FTPS enhancement - The SFTP/FTP/FTPS connection allows configuring the new SFTP/FTP/FTPS proxy properties.
Additional enhancements
  • Enabling HTTPS for Data Collector - You can now store the keystore file in the Data Collector resources directory, $SDC_RESOURCES, and then enter a path relative to that directory when you define the keystore location. This can have upgrade impact.
  • Google Secret Manager enhancement - You can configure a new enforceEntryGroup Google Secret Manager credential store property to validate a user’s group against a comma-separated list of groups allowed to access each secret.
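
    For example, a minimal sketch of how this property might be configured, assuming the store is registered under the ID gcp and follows the usual credentialStore.<id>.config property prefix (the store ID, group name, and secret name below are illustrative assumptions, not confirmed by these notes):

      # Hypothetical credential store entry; verify the property prefix for your installation
      credentialStore.gcp.config.enforceEntryGroup=true
      # With enforcement enabled, a lookup such as the credential EL below succeeds
      # only when the group argument appears in the secret's allowed-groups list:
      #   ${credential:get("gcp", "devops", "databases/mysql-password")}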
Testing update
With this release, StreamSets no longer tests Data Collector against Cloudera CDH 5.x, which has been deprecated.

Upgrade Impact

Enabling HTTPS for Data Collector
With this release, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, $SDC_RESOURCES. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration file.
In previous releases, you stored the keystore file in the Data Collector configuration directory, $SDC_CONF, and then defined the location of the file using a path relative to that directory. You can continue to store the file in the configuration directory, but StreamSets recommends moving it to the resources directory when you upgrade.
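
For example, a minimal sketch of the relevant properties in the Data Collector configuration file, assuming the standard https.keystore.path and https.keystore.password property names and an example port value (verify the names against your sdc.properties):

  # Enable HTTPS; the port shown is an example value
  https.port=18636
  # With this release, a relative path can resolve against the resources directory,
  # $SDC_RESOURCES; in previous releases, relative paths resolved against $SDC_CONF
  https.keystore.path=keystore.jks
  https.keystore.password=${file("keystore-password.txt")}
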
Tableau CRM destination write behavior change
The write behavior of the Tableau CRM destination, previously known as the Einstein Analytics destination, has changed.

With this release, the destination writes to Salesforce by uploading batches of data, then signaling Salesforce to process the dataset after a configurable interval passes with no new data arriving. You configure the interval with the Dataset Wait Time stage property.

In versions 3.7.0 - 4.1.x, the destination signals Salesforce to process data after uploading each batch, effectively treating each batch as a dataset and making the Dataset Wait Time property irrelevant.

After upgrading from version 3.7.0 - 4.1.x, verify that the destination behavior is as expected. If necessary, update the Dataset Wait Time property to the interval that Salesforce should wait before processing each dataset.

When upgrading from a version prior to 3.7.0, no action is required. Versions prior to 3.7.0 behave like this release.

4.2.1 Fixed Issues

  • To address recently discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.2.1 is packaged with Log4j 2.17.0, the latest available Log4j version at the time of release, which contains fixes for all known issues.
  • Data Collector now sets a Java system property to help address the Apache Log4j known issues.
  • The new permissions validation for the Oracle CDC Client origin added in Data Collector 4.2.0 is too strict. This fix returns the permissions validation to the same level as in 4.1.x.

4.2.0 Fixed Issues

  • Oracle CDC Client origin pipelines can take up to 10 minutes to shut down due to Oracle driver and executor timeout policies. With this fix, those policies are bypassed while allowing all processes to complete gracefully.
  • The Oracle CDC Client origin can fail to recover transactional data when the pipeline stops unexpectedly while the origin is processing overlapping transactions.
  • The JDBC Producer destination does not properly write to partitioned PostgreSQL database tables.
  • The MongoDB destination cannot write null values to MongoDB.
  • The Salesforce Lookup processor does not properly handle SOQL queries that include single quotation marks.
  • Pipeline performance suffers when using the Azure Data Lake Storage Gen2 destination to write large batches of data in the Avro data format.
  • The MapR DB CDC origin does not properly handle records with deleted fields.
  • When configured to return only the first of multiple return values, the Couchbase Lookup processor creates multiple records instead.
  • The Tableau CRM destination, previously known as the Einstein Analytics destination, signals Salesforce to process data after each batch, effectively treating each batch as a dataset. This fix can have upgrade impact.

4.2.x Known Issues

There are no important known issues at this time.

4.1.x Release Notes

The Data Collector 4.1.0 release occurred on August 18, 2021.

New Features and Enhancements

Use DataOps Platform to access Data Collector
Existing customers can continue to access Data Collector downloads using the StreamSets Support Portal.
All other users, including community users, can no longer download Data Collector through StreamSets Accounts. Instead, use StreamSets DataOps Platform to deploy Data Collector engines and to design and run Data Collector pipelines.
New to StreamSets DataOps Platform? Sign up and try it for free.
New stage
  • Google Cloud Storage executor - You can use this executor to create new objects, copy or move objects, or add metadata to new or existing objects.
Stage type enhancements
  • Amazon stages - When you configure the Region property, you can select from several additional regions.
  • Kudu stages - The default value for the Maximum Number of Worker Threads property is now 2. Previously, the default was 0, which used the Kudu default.

    Existing pipelines are not affected by this change.

  • Orchestration stages - You can use an expression when you configure the Control Hub URL property in orchestration stages.
  • Salesforce stages - All Salesforce stages now support using version 52.2.0 of the Salesforce API.
  • Scripting processors - In the Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors, you can select the Script Error as Record Error property to have the stage handle script errors based on how the On Record Error property is configured for the stage.
Origin enhancements
  • Google Cloud Storage origin - You can configure post processing actions to take on objects that the origin reads.
  • MySQL Binary Log origin - The origin now recovers automatically from the following issues:
    • Lost, damaged, or unestablished connections.
    • Exceptions raised when the MySQL binary log is out of sync on some cluster nodes or when the origin cannot communicate with the MySQL server.
  • Oracle CDC Client origin:
    • The origin includes a Batch Wait Time property that determines how long the origin waits for data before sending an empty or partial batch through the pipeline.
    • The origin provides additional LogMiner metrics when you monitor a pipeline.
  • RabbitMQ Consumer origin - You can configure the origin to read from quorum queues by adding x-queue-type as an Additional Client Configuration property and setting it to quorum.
Processor enhancements
  • SQL Parser processor - You can configure the processor to use the Oracle PEG parser instead of the default parser.
Destination enhancements
  • Google BigQuery destination - The destination now supports writing Decimal data to Google BigQuery Decimal columns.
  • MongoDB destination - You can use the Improve Type Conversion property to improve how the destination handles date and decimal data.
  • Splunk destination - You can use the Additional HTTP Headers property to define additional HTTP header key-value pairs to include when writing data to Splunk.
Credential stores
  • New Google Secret Manager support - You can use Google Secret Manager as a credential store for Data Collector.
  • CyberArk enhancement - You can configure the credentialStore.cyberark.config.ws.proxyURI property to define the URI of the proxy used to reach CyberArk services.
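
    For example, a minimal sketch, assuming the property sits alongside the other credentialStore.cyberark.config.* properties in the Data Collector configuration and using a placeholder proxy address:

      # Route CyberArk web service requests through a proxy
      credentialStore.cyberark.config.ws.proxyURI=http://proxy.example.com:3128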
Enterprise Stage Libraries
In October 2021, StreamSets released the following new Enterprise stage library:
  • Google
In September 2021, StreamSets released updates for the following Enterprise stage libraries:
  • Azure Synapse
  • Databricks
  • Oracle
  • Snowflake

For more information about the new features, fixed issues, and known issues in an Enterprise stage library, see the Enterprise stage library release notes. For a list of available Enterprise libraries, see Enterprise Stage Libraries.

Connections when registered with Control Hub
  • When Data Collector version 4.1.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
    • MongoDB stages
    • RabbitMQ stages
    • Redis stages
  • Snowflake enhancement - The Snowflake connection includes the following role properties:
    • Use Snowflake Role
    • Snowflake Role Name
Stage libraries
This release includes the following new stage library:
Stage Library Name                             Description
streamsets-datacollector-apache-kafka_2_8-lib  For Apache Kafka 2.8.0.
Additional enhancements
  • Excel data format enhancement - Stages that support reading the Excel data format include an Include Cells With Empty Value property to include empty cells in records.

4.1.0 Fixed Issues

  • Due to an issue with an underlying library, HTTP connections can fail when Keep-Alive is disabled.
  • Stages that need to parse a large number of JSON, CSV, or XML files might exceed the file descriptor limit because the stages do not release file descriptors appropriately.
  • Data Collector does not properly handle Avro schemas with nested Union fields.
  • Errors occur when using HBase stages with the CDH 6.0.x - 6.3.x or CDP 7.1 stage libraries when the HBase column name includes more than one colon (:).
  • When the HTTP Lookup processor paginates by page number, it can enter an endless retry loop when reading the last page of data.
  • The JDBC Lookup processor does not support expressions for table names when validating column mappings.
    Note: Validating column mappings for multiple tables can slow pipeline performance because all table columns defined in the column mappings must be validated before processing can begin.
  • The Kudu Lookup processor and Kudu destination do not release resources under certain circumstances.
  • When reading data with a query that uses the MAX or MIN operators, the SQL Server CDC Client origin can take a long time to start processing data.

4.1.x Known Issues

There are no important known issues at this time.

4.0.x Release Notes

The Data Collector 4.0.x releases occurred on the following dates:
  • 4.0.2 - June 23, 2021
  • 4.0.1 - June 7, 2021
  • 4.0.0 - May 25, 2021

New Features and Enhancements

Stage enhancements
  • Control Hub orchestration stages - The orchestration stages that connect to Control Hub allow using API User Credentials to log into Control Hub, as an alternative to user name and password. This affects the following stages:
    • Start Job origin
    • Control Hub API processor
    • Start Job processor
    • Wait for Job processor
  • Kafka stages - Kafka stages include an Override Stage Configurations property that enables the Kafka properties defined in the stage to override other stage properties.

    This can impact existing pipelines.

  • MapR Streams stages - MapR Streams stages also include an Override Stage Configurations property that enables the additional MapR or Kafka properties defined in the stage to override other stage properties.

    This can impact existing pipelines.

  • Salesforce stages - The Salesforce origin, processor, destination, and the Tableau CRM destination include the following new timeout properties:
    • Connection Handshake Timeout
    • Subscribe Timeout
  • Oracle CDC Client origin:
    • You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
    • The origin includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
    • Starting with version 4.0.1, the origin includes a Batch Wait Time property.
  • Field Type Converter processor - The Source Field is Empty property enables you to specify the action to take when an input field is an empty string.
  • HTTP Client processor:
    • Two Pass Records properties allow you to pass a record through the pipeline when all retries fail for per-status actions and for timeouts.
    • The following record header attributes are populated when you use one of the Pass Records properties (see the example after this list):
      • httpClientError
      • httpClientStatus
      • httpClientLastAction
      • httpClientTimeoutType
      • httpClientRetries
  • SQL Parser processor:
    • You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
    • The processor includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
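
For example, downstream logic can inspect the HTTP Client record header attributes listed above with the record:attribute function in an expression. A minimal sketch of a Stream Selector condition that routes records passed through after failed requests (the status value is illustrative):

  ${record:attribute('httpClientStatus') == '503'}
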
Connections when registered with Control Hub
When Data Collector version 4.0.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
  • Oracle CDC Client origin
  • SQL Server CDC Client origin
  • SQL Server Change Tracking Client origin
Enterprise stage libraries
In June 2021, StreamSets released new versions of the Databricks and Snowflake Enterprise stage libraries.
For more information about the new features, fixed issues, and known issues for those releases, see their release notes on the StreamSets Release Notes page.
For a list of available Enterprise libraries, see Enterprise Stage Libraries.
Additional features
  • SDC_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as custom stage libraries, external libraries, and runtime resources.

    The default location is $SDC_DIST/externalResources. A sketch of the implied layout appears after this list.

  • Support Bundle - Support bundles now include the System Messages log file when you include log files in the bundle.
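
A sketch of the directory layout implied by the SDC_EXTERNAL_RESOURCES variable described above; aside from resources, which the Upgrade Impact section below confirms, the subdirectory names are illustrative assumptions:

  $SDC_EXTERNAL_RESOURCES/        # defaults to $SDC_DIST/externalResources
    user-libs/                    # custom stage libraries (assumed name)
    streamsets-libs-extras/       # external libraries (assumed name)
    resources/                    # runtime resources; see SDC_RESOURCES under Upgrade Impact
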
Deprecated features
Several features and stages have been deprecated with this release and may be removed in a future release. We recommend that you avoid using these features and stages. For a full list, see the Data Collector documentation.

Upgrade Impact

Conflicting properties in Kafka and MapR Streams stages
In previous releases, if you specified an additional configuration property that conflicted with a stage property setting in a Kafka or MapR Streams stage, the stage property took precedence.
With this release, such a conflict generates an error. You can use the new Override Stage Configurations property to let the Kafka or MapR configuration property take precedence, or you can remove or update the configuration property so that the stage property takes precedence.
Control Hub On-premises prerequisite task
Before using Data Collector 4.0.0 or later versions with Control Hub On-premises, you must complete a prerequisite task. For details, see the StreamSets Support portal.
HTTP Client processor batch wait time change
With this release, the HTTP Client processor performs additional checks against the specified batch wait time. This can affect existing pipelines. For details, see Review HTTP Client Processor Pipelines.
Open source status
Data Collector 4.0.0 and later versions are not open source. This means that StreamSets will not make the source code publicly available.
This change should not impact customers with a paid subscription to Data Collector. Users who download the free, open source version of Data Collector will be able to use the SaaS-based alternative that will be launched soon.
All earlier versions of Data Collector, which are open source, remain available on GitHub.
Stages removed
The following stages have been deprecated for several years and have been removed from Data Collector with this release:
  • HTTP to Kafka origin
  • SDC RPC to Kafka origin
  • UDP to Kafka origin
Updated environment variable default (tarball installation, manual start)
For manually-started tarball installations, the default location for the SDC_RESOURCES environment variable has changed from $SDC_DIST/resources to $SDC_EXTERNAL_RESOURCES/resources, which evaluates to: $SDC_DIST/externalResources/resources.
If your installation has SDC_RESOURCES set to a directory outside of the $SDC_DIST runtime directory as described in the installation instructions, or if you do not use SDC_RESOURCES, no action is required.
If your installation has SDC_RESOURCES set to a directory inside the $SDC_DIST runtime directory, you should move the $SDC_RESOURCES directory outside of the $SDC_DIST runtime directory and set the SDC_RESOURCES variable to the new location, as specified in the upgrade instructions.
For all other installations, the default for the SDC_RESOURCES environment variable remains the same.
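
For example, a minimal sketch of the move for a manually-started tarball installation, assuming $SDC_RESOURCES currently points inside the runtime directory and using a placeholder target path:

  # Move the resources directory out of the runtime directory...
  mv "$SDC_DIST/resources" /opt/sdc-resources
  # ...and export the new location before restarting Data Collector
  export SDC_RESOURCES=/opt/sdc-resources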

4.0.2 Fixed Issues

  • The JDBC Producer destination can round the scale of numeric data when it performs multi-row operations while writing to SQL Server tables.
  • You cannot use API user credentials in orchestration stages.

4.0.1 Fixed Issue

  • In the JDBC Lookup processor, enabling the Validate Column Mappings property when using an expression to represent the lookup table generates an invalid SQL query.

    Though fixed, using column mapping validation with an expression for the table name requires querying the database for all column names. As a result, the response time can be slower than expected.

4.0.0 Fixed Issues

  • The SQL Server CDC Client origin does not process data correctly when configured to generate schema change events.
  • The Hadoop FS destination stages fail to recover temporary files when the directory template includes pipeline parameters or expressions.
  • The Oracle CDC Client origin can generate an exception when trying to process data from a transaction after the same partially-processed transaction has already been flushed for exceeding the maximum transaction length.
  • The Oracle CDC Client origin fails to start when it is configured to start from a timestamp or SCN that is contained in multiple database incarnations.
  • Some conditional expressions in the Field Mapper processor can cause errors when operating on field names.
  • HTTP Client stages log the proxy password when the Data Collector logging mode is set to Debug.
  • The HTTP Client processor can create duplicate requests when Pagination Mode is set to None.
  • The MQTT Subscriber origin does not properly restore a persistent session.
  • The Oracle CDC Client origin generates an exception when Oracle unexpectedly includes an empty string in a redo log statement. With this fix, the origin interprets empty strings as NULL.
  • Data Collector uses a Java version specified in the PATH environment variable over the version defined in the JAVA_HOME environment variable.

4.0.x Known Issues

There are no important known issues at this time.