Release Notes
4.4.x Release Notes
- 4.4.1 on March 24, 2022
- 4.4.0 on February 16, 2022
New Features and Enhancements
- Updated stages
- Amazon S3 stages - You can use an Amazon S3 stage to connect to Amazon S3 using a custom endpoint.
- Amazon S3 destination - You can configure the destination to add tags to the Amazon S3 objects that it creates.
- Base 64 Field Decoder and Encoder processors - You can configure the processors to decode or encode multiple fields.
- Google BigQuery (Legacy) destination - The destination, formerly called Google BigQuery, has been renamed and deprecated with this release and may be removed in a future release. We recommend that you use the Google BigQuery (Enterprise) destination, which supports processing CDC data and handling data drift, to write data to Google BigQuery.
- Hive Query executor - You can use time functions in the SQL queries that execute on Hive or Impala. When using time functions, you can also select the time zone that the executor uses to evaluate the functions.
- HTTP Client stages - You can configure additional security headers to include in the HTTP requests made by the stage. Use additional security headers when you want to include sensitive information, such as user names or passwords, in an HTTP header. For example, you might use the credential:get() function in an additional security header to retrieve a password stored securely in a credential store. See the sketch after this list.
- HTTP Client processor - You can configure the processor to send a single request that contains all records in the batch.
- JMS destination - You can configure the destination to include record header attributes with a jms.header prefix in message attributes.
- Pulsar stages - You can configure a Pulsar stage to use OAuth 2.0 authentication to connect to an Apache Pulsar cluster.
- Pulsar Consumer origin - The origin creates a pulsar.topic record header attribute that includes the topic that the message was read from.
- Salesforce stages - Salesforce stages now use version 53.1.0 of the Salesforce API by default.
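The additional security headers described above are entered as name-value pairs on the HTTP Client stage. A minimal sketch, where the credential store ID ("streamsets"), user group ("all"), and secret name ("apiPassword") are placeholder values for illustration:
  Header Name:  X-Api-Password
  Header Value: ${credential:get("streamsets", "all", "apiPassword")}
Because credential:get() resolves the value at runtime, the secret itself is never stored in the pipeline configuration.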
- Connections
- With this release, the following stages support using Control Hub connections:
- CoAP Client destination
- Influx DB destination
- Influx DB 2.x destination
- Pulsar stages
- Amazon S3 enhancement - The Amazon S3 connection supports connecting to Amazon S3 using a custom endpoint.
- Credential stores
- Google Secret Manager - You can configure Data Collector to authenticate with Google Secret Manager using credentials in a Google Cloud service account credentials JSON file.
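A minimal sketch of this option, assuming the credential store is configured in $SDC_CONF/credential-stores.properties, with placeholder values for the project ID and key file path; confirm the exact property names against the Google Secret Manager credential store documentation:
  credentialStores=gcp
  credentialStore.gcp.config.project.id=my-gcp-project
  credentialStore.gcp.config.credentials.provider=json
  credentialStore.gcp.config.credentials.path=/path/to/service-account.json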
- Enterprise Library
In February 2022, StreamSets released an updated Snowflake Enterprise stage library.
Upgrade Impact
- Encryption JAR file removed from Couchbase stage library
- With Data Collector 4.4.0 and later, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.
4.4.1 Fixed Issues
- In Data Collector version 4.4.0, the HTTP Client processor cannot write HTTP response data to an existing field. Earlier Data Collector versions are not affected by this issue.
- When a Kubernetes pod that contains Data Collector shuts down while a pipeline that includes a MapR FS File Metadata or HDFS File Metadata executor is running, the executor cannot always perform the configured tasks.
- Access to Control Hub through the Data Collector user interface times out.
Though this fix may have resolved the issue, as a best practice, use Control Hub to author pipelines instead of Data Collector.
4.4.0 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.17.0 and earlier 2.x versions, Data Collector 4.4.0 is packaged with Log4j 2.17.1. This is the latest available Log4j version, and contains fixes for all known issues.
- The Oracle CDC Client origin does not correctly handle a daylight saving time change when configured to use a database time zone that uses daylight saving time.
- The MapR DB CDC origin does not properly handle records with null values.
- The Kafka Multitopic Consumer origin does not respect the configured Max Batch Wait Time.
- A state notification webhook always uses the POST request method, even if configured to use a different request method.
- When the HTTP Client origin uses OAuth authentication and a request returns a 401 Unauthorized or 403 Forbidden status, the origin generates a new OAuth token indefinitely.
- The MapR DB CDC origin incorrectly updates the offset during pipeline preview.
- When Amazon stages are configured to assume another role and configured to connect to an endpoint, the stages do not redirect to the correct URL.
- JDBC origins encounter an exception when reading data with an incorrect date format, instead of processing the record as an error record.
- The Directory origin skips reading files that have the same timestamp.
- The JDBC Multitable Consumer origin cannot use a wildcard character (%) in the Schema property.
- The Azure Data Lake Storage Gen2 and Local FS destinations do not correctly shut down threads.
- When using WebSocket tunneling for browser to engine communication, Data Collector cannot use a proxy server for outbound requests made to Control Hub.
4.4.x Known Issues
- In Data Collector 4.4.0, the HTTP Client processor cannot write HTTP response data to an existing field. Earlier Data Collector versions are not affected by this issue.
Workaround: If using Data Collector 4.4.0, upgrade to Data Collector 4.4.1, where this issue is fixed.
- You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.
4.3.x Release Notes
The Data Collector 4.3.0 release occurred on January 13, 2022.
New Features and Enhancements
- Internal update
- This release includes internal updates to support an upcoming StreamSets DataOps Platform Control Hub feature.
4.3.0 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.3.0 is packaged with Log4j 2.17.0. This is the latest available Log4j version, and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues.
4.3.x Known Issues
- You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.
4.2.x Release Notes
- 4.2.1 on December 23, 2021
- 4.2.0 on November 9, 2021
New Features and Enhancements
- New support
- Red Hat Enterprise Linux 8.x - You can now deploy Data Collector engines on RHEL 8.x, in addition to 6.x and 7.x.
- New stage
- InfluxDB 2.x destination - Use the destination to write to InfluxDB 2.x databases.
- Updated stages
- Couchbase Lookup processor property name updates - For clarity, the following property names have been changed:
- Property Name is now Sub-Document Path.
- Return Properties is now Return Sub-Documents.
- SDC Field is now Output Field.
- When performing a key value lookup and configuring multiple return properties, the Property Mappings property is now Sub-Document Mappings.
- When performing an N1QL lookup and configuring multiple return properties, the Property Mappings property is now Sub-N1QL Mappings.
- Einstein Analytics destination enhancements:
- The Einstein Analytics destination has been renamed the Tableau CRM destination to match the Salesforce rebranding.
- The new Tableau CRM destination can perform automatic recovery.
- HTTP Client stage statistics - HTTP Client stages provide additional metrics when you monitor the pipeline.
- PostgreSQL CDC Client origin - You can specify the SSL mode to use on the new Encryption tab of the origin.
- Salesforce destination - The destination supports performing hard deletes when using the Salesforce Bulk API. Hard deletes permanently delete records, bypassing the Salesforce Recycle Bin.
- Salesforce stages - Salesforce stages now use version 53.0.0 of the Salesforce API by default.
- SFTP/FTP/FTPS stages - All SFTP/FTP/FTPS Client stages now support HTTP and Socks proxies.
- Connections
- With this release, the following stage supports using Control Hub connections:
- Cassandra destination
- SFTP/FTP/FTPS enhancement - The SFTP/FTP/FTPS connection allows configuring the new SFTP/FTP/FTPS proxy properties.
- Additional enhancements
- Enabling HTTPS for Data Collector - You can now store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources, and then enter a path relative to that directory when you define the keystore location. This can have upgrade impact.
- Google Secret Manager enhancement - You can configure a new enforceEntryGroup Google Secret Manager credential store property to validate a user's group against a comma-separated list of groups allowed to access each secret.
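A minimal sketch of the group enforcement option in the credential store properties file; the fully qualified property name is an assumption based on the credentialStore.gcp.config prefix used by the store's other options:
  credentialStore.gcp.config.enforceEntryGroup=true
When enabled, a secret lookup succeeds only if the requesting user belongs to one of the groups in the comma-separated list associated with that secret.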
Upgrade Impact
- Enabling HTTPS for Data Collector
- With this release, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration properties, as shown in the sketch after this list.
- Tableau CRM destination write behavior change
- The write behavior of the Tableau CRM destination, previously known as the Einstein Analytics destination, has changed.
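As an illustration of the relative keystore path described above, assuming a keystore file copied to <installation_dir>/externalResources/resources/keystore.jks, the HTTPS properties in the Data Collector configuration might look like the following; verify the property names against your sdc.properties file:
  https.keystore.path=keystore.jks
  https.keystore.password=${file("keystore-password.txt")}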
4.2.1 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.2.1 is packaged with Log4j 2.17.0. This is the latest available Log4j version, and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues.
- The new permissions validation for the Oracle CDC Client origin added in Data Collector 4.2.0 is too strict. This fix returns the permissions validation to the same level as 4.1.x.
4.2.0 Fixed Issues
- Oracle CDC Client origin pipelines can take up to 10 minutes to shut down due to Oracle driver and executor timeout policies. With this fix, those policies are bypassed while allowing all processes to complete gracefully.
- The Oracle CDC Client origin can miss recovering transactional data when the pipeline unexpectedly stops when the origin is processing overlapping transactions.
- The JDBC Producer destination does not properly write to partitioned PostgreSQL database tables.
- The MongoDB destination cannot write null values to MongoDB.
- The Salesforce Lookup processor does not properly handle SOQL queries that include single quotation marks.
- Pipeline performance suffers when using the Azure Data Lake Storage Gen2 destination to write large batches of data in the Avro data format.
- The MapR DB CDC origin does not properly handle records with deleted fields.
- When configured to return only the first of multiple return values, the Couchbase Lookup processor creates multiple records instead.
- The Tableau CRM destination, previously known as the Einstein Analytics destination, signals Salesforce to process data after each batch, effectively treating each batch as a dataset. This fix can have upgrade impact.
4.2.x Known Issues
- You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.
4.1.x Release Notes
The Data Collector 4.1.0 release occurred on August 18, 2021.
New Features and Enhancements
- New stage
- Google Cloud Storage executor - You can use this executor to create new objects, copy or move objects, or add metadata to new or existing objects.
- Stage type enhancements
- Amazon stages - When you configure the Region property, you can select from several additional regions.
- Kudu stages - The default value for the Maximum Number of Worker Threads property is now 2. Previously, the default was 0, which used the Kudu default. Existing pipelines are not affected by this change.
- Orchestration stages - You can use an expression when you configure the Control Hub URL property in orchestration stages.
- Salesforce stages - All Salesforce stages now support using version 52.2.0 of the Salesforce API.
- Scripting processors - In the Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors, you can select the Script Error as Record Error property to have the stage handle script errors based on how the On Record Error property is configured for the stage.
- Origin enhancements
- Google Cloud Storage origin - You can configure post processing actions to take on objects that the origin reads.
- MySQL Binary Log origin - The origin now recovers automatically from the following issues:
- Lost, damaged, or unestablished connections.
- Exceptions raised from MySQL Binary Log being out-of-sync in some cluster nodes, or from being unable to communicate with the MySQL Binary Log origin.
- Oracle CDC Client origin:
- The origin includes a Batch Wait Time property that determines how long the origin waits for data before sending an empty or partial batch through the pipeline.
- The origin provides additional LogMiner metrics when you monitor a pipeline.
- RabbitMQ Consumer origin - You can configure the origin to read from quorum queues by adding x-queue-type as an Additional Client Configuration property and setting it to quorum.
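For reference, the quorum queue setting described in the RabbitMQ Consumer enhancement above amounts to a single name-value pair under the origin's Additional Client Configuration properties:
  x-queue-type=quorum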
- Processor enhancements
- SQL Parser processor - You can configure the processor to use the Oracle PEG parser instead of the default parser.
- Destination enhancements
- Google BigQuery (Legacy) destination - The destination now supports writing Decimal data to Google BigQuery Decimal columns.
- MongoDB destination - You can use the Improve Type Conversion property to improve how the destination handles date and decimal data.
- Splunk destination - You can use the Additional HTTP Headers property to define additional key-value pairs of HTTP headers to include when writing data to Splunk.
- Credential stores
- New Google Secret Manager support - You can use Google Secret Manager as a credential store for Data Collector.
- Cyberark enhancement - You can configure the credentialStore.cyberark.config.ws.proxyURI property to define the URI of the proxy used to reach the CyberArk services.
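A minimal sketch of the CyberArk proxy setting in the credential store properties file; the proxy URL shown is a placeholder:
  credentialStores=cyberark
  credentialStore.cyberark.config.ws.proxyURI=http://proxy.example.com:3128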
- Enterprise Stage Libraries
- In October 2021, StreamSets released the following new Enterprise stage library:
- Connections
- With this release, the following stages support using Control Hub connections:
- MongoDB stages
- RabbitMQ stages
- Redis stages
- Salesforce enhancement - The Salesforce connection includes the following role properties:
- Use Snowflake Role
- Snowflake Role Name
- Stage libraries
- This release includes the following new stage library:
- streamsets-datacollector-apache-kafka_2_8-lib - For Apache Kafka 2.8.0.
- Additional enhancements
- Excel data format enhancement - Stages that support reading the Excel data format include an Include Cells With Empty Value property to include empty cells in records.
4.1.0 Fixed Issues
- Due to an issue with an underlying library, HTTP connections can fail when Keep-Alive is disabled.
- Stages that need to parse a large number of JSON, CSV, or XML files might exceed the file descriptors limit because the stages don't release them appropriately.
- Data Collector does not properly handle Avro schemas with nested Union fields.
- Errors occur when using HBase stages with the CDH 6.0.x - 6.3.x or CDP 7.1 stage libraries when the HBase column name includes more than one colon (:).
- When the HTTP Lookup processor paginates by page number, it can enter an endless retry loop when reading the last page of data.
- The JDBC Lookup processor does not support expressions for table names when validating column mappings. Note: Validating column mappings for multiple tables can slow pipeline performance because all table columns defined in the column mappings must be validated before processing can begin.
- The Kudu Lookup processor and Kudu destination do not release resources under certain circumstances.
- When reading data with a query that uses the MAX or MIN operators, the SQL Server CDC Client origin can take a long time to start processing data.
4.1.x Known Issues
- You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.
4.0.x Release Notes
- 4.0.2 - June 23, 2021
- 4.0.1 - June 7, 2021
- 4.0.0 - May 25, 2021
New Features and Enhancements
- Stage enhancements
- Control Hub orchestration stages - The orchestration stages that connect to Control Hub allow using API User Credentials to log into Control Hub, as an alternative to user name and password. This affects the following stages:
- Start Job origin
- Control Hub API processor
- Start Job processor
- Wait for Job processor
- Kafka stages - Kafka stages include an Override Stage Configurations property that enables the Kafka properties defined in the stage to override other stage properties.
- MapR Streams stages - MapR Streams stages also include an Override Stage Configurations property that enables the additional MapR or Kafka properties defined in the stage to override other stage properties.
- Salesforce stages - The Salesforce origin, processor, destination, and the Tableau CRM destination include the following new timeout properties:
- Connection Handshake Timeout
- Subscribe Timeout
- Oracle CDC Client origin:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The origin includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement. See the sketch after this list.
- Starting with version 4.0.1, the origin includes a Batch Wait Time property.
- Field Type Converter processor - The Source Field is Empty property enables you to specify the action to take when an input field is an empty string.
- HTTP Client processor:
- Two Pass Records properties allow you to pass a record through the pipeline when all retries fail for per-status actions and for timeouts.
- The following record header attributes are populated when you
use one of the Pass Records properties:
- httpClientError
- httpClientStatus
- httpClientLastAction
- httpClientTimeoutType
- httpClientRetries
- SQL Parser processor:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The processor includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
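As a sketch of how the pseudocolumn record header attributes from the Oracle CDC Client origin and SQL Parser processor can be used downstream, any stage that accepts expressions can read them with the record:attribute() function. ROWID is a hypothetical pseudocolumn name used only for illustration:
  ${record:attribute('oracle.cdc.oracle.pseudocolumn.ROWID')}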
- Connections
- With this release, the following stages support using Control Hub connections:
- Oracle CDC Client origin
- SQL Server CDC Client origin
- SQL Server Change Tracking Client origin
- Enterprise stage libraries
- In June 2021, StreamSets released new versions of the Databricks and Snowflake Enterprise stage libraries.
- Additional features
- SDC_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as custom stage libraries, external libraries, and runtime resources. The default location is $SDC_DIST/externalResources. See the sketch after this list.
- Support Bundle - Support bundles now include the System Messages log file when you include log files in the bundle.
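For example, the external resources root can be overridden before starting Data Collector by exporting the variable, typically in the sdc-env.sh environment script for tarball installations; the path shown is a placeholder:
  export SDC_EXTERNAL_RESOURCES=/opt/sdc/externalResources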
4.0.2 Fixed Issues
- The JDBC Producer destination can round the scale of numeric data when it performs multi-row operations while writing to SQL Server tables.
- You cannot use API user credentials in Orchestration stages.
4.0.1 Fixed Issue
- In the JDBC Lookup processor, enabling the Validate Column Mappings property when using an expression to represent the lookup table generates an invalid SQL query.
Though fixed, using column mapping validation with an expression for the table name requires querying the database for all column names. As a result, the response time can be slower than expected.
4.0.0 Fixed Issues
- The SQL Server CDC Client origin does not process data correctly when configured to generate schema change events.
- The Hadoop FS destination stages fail to recover temporary files when the directory template includes pipeline parameters or expressions.
- The Oracle CDC Client origin can generate an exception when trying to process data from a transaction after the same partially-processed transaction has already been flushed after exceeding the maximum transaction length.
- The Oracle CDC Client origin fails to start when it is configured to start from a timestamp or SCN that is contained in multiple database incarnations.
- Some conditional expressions in the Field Mapper processor can cause errors when operating on field names.
- HTTP Client stages should not log the proxy password when the Data Collector logging mode is set to Debug.
- The HTTP Client processor can create duplicate requests when Pagination Mode is set to None.
- The MQTT Subscriber origin does not properly restore a persistent session.
- The Oracle CDC Client origin generates an exception when Oracle includes an empty string in a redo log statement, which is unexpected. With this fix, the origin interprets empty strings as NULL.
- Data Collector uses a Java version specified in the PATH environment variable over the version defined in the JAVA_HOME environment variable.
4.0.x Known Issues
There are no important known issues at this time.