Release Notes
4.4.x Release Notes
- 4.4.1 on March 24, 2022
- 4.4.0 on February 16, 2022
New Features and Enhancements
- Updated stages
- Amazon S3 stages - You can use an Amazon S3 stage to connect to Amazon S3 using a custom endpoint.
- Amazon S3 destination - You can configure the destination to add tags to the Amazon S3 objects that it creates.
- Base 64 Field Decoder and Encoder processors - You can configure the processors to decode or encode multiple fields.
- Google BigQuery (Legacy) destination - The destination, formerly called Google BigQuery, has been renamed and deprecated with this release and may be removed in a future release. We recommend that you use the Google BigQuery (Enterprise) destination, which supports processing CDC data and handling data drift, to write data to Google BigQuery.
- Hive Query executor - You can use time functions in the SQL queries that execute on Hive or Impala. When using time functions, you can also select the time zone that the executor uses to evaluate the functions.
- HTTP Client stages - You can configure additional security headers to include in the HTTP requests made by the stage. Use additional security headers when you want to include sensitive information, such as user names or passwords, in an HTTP header. For example, you might use the credential:get() function in an additional security header to retrieve a password stored securely in a credential store. See the sketch after this list.
- HTTP Client processor - You can configure the processor to send a single request that contains all records in the batch.
- JMS destination - You can configure the destination to include record header attributes with a jms.header prefix in message attributes.
- Pulsar stages - You can configure a Pulsar stage to use OAuth 2.0 authentication to connect to an Apache Pulsar cluster.
- Pulsar Consumer origin - The origin creates a pulsar.topic record header attribute that includes the topic that the message was read from.
- Salesforce stages - Salesforce stages now use version 53.1.0 of the Salesforce API by default.
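The additional security headers described above are entered as name-value pairs on the HTTP Client stage. A minimal sketch, where the credential store ID ("streamsets"), user group ("all"), and secret name ("apiPassword") are placeholder values for illustration:
  Header Name:  X-Api-Password
  Header Value: ${credential:get("streamsets", "all", "apiPassword")}
Because credential:get() resolves the value at runtime, the secret itself is never stored in the pipeline configuration.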
- Connections
- With this release, the following stages support using Control Hub connections:
- CoAP Client destination
- Influx DB destination
- Influx DB 2.x destination
- Pulsar stages
- Amazon S3 enhancement - The Amazon S3 connection supports connecting to Amazon S3 using a custom endpoint.
- Credential stores
- Google Secret Manager - You can configure Data Collector to authenticate with Google Secret Manager using credentials in a Google Cloud service account credentials JSON file.
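A minimal sketch of this option, assuming the credential store is configured in $SDC_CONF/credential-stores.properties, with placeholder values for the project ID and key file path; confirm the exact property names against the Google Secret Manager credential store documentation:
  credentialStores=gcp
  credentialStore.gcp.config.project.id=my-gcp-project
  credentialStore.gcp.config.credentials.provider=json
  credentialStore.gcp.config.credentials.path=/path/to/service-account.json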
- Enterprise Library
In February 2022, StreamSets released an updated Snowflake Enterprise stage library.
Upgrade Impact
- Encryption JAR file removed from Couchbase stage library
- With Data Collector 4.4.0 and later, the Couchbase stage library no longer includes an encryption JAR file that the Couchbase stages do not directly use. Removing the JAR file should not affect pipelines using Couchbase stages.
4.4.1 Fixed Issues
- In Data Collector version 4.4.0, the HTTP Client processor cannot write HTTP response data to an existing field. Earlier Data Collector versions are not affected by this issue.
- When a Kubernetes pod that contains Data Collector shuts down while a pipeline that includes a MapR FS File Metadata or HDFS File Metadata executor is running, the executor cannot always perform the configured tasks.
- Access to Control Hub through the Data Collector user interface times out.
Though this fix may have resolved the issue, as a best practice, use Control Hub to author pipelines instead of Data Collector.
4.4.0 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.17.0 and earlier 2.x versions, Data Collector 4.4.0 is packaged with Log4j 2.17.1. This is the latest available Log4j version, and contains fixes for all known issues.
- The Oracle CDC Client origin does not correctly handle a daylight saving time change when configured to use a database time zone that uses daylight saving time.
- The MapR DB CDC origin does not properly handle records with null values.
- The Kafka Multitopic Consumer origin does not respect the configured Max Batch Wait Time.
- A state notification webhook always uses the POST request method, even if configured to use a different request method.
- When the HTTP Client origin uses OAuth authentication and a request returns a 401 Unauthorized or 403 Forbidden status, the origin generates a new OAuth token indefinitely.
- The MapR DB CDC origin incorrectly updates the offset during pipeline preview.
- When Amazon stages are configured to assume another role and configured to connect to an endpoint, the stages do not redirect to the correct URL.
- JDBC origins encounter an exception when reading data with an incorrect date format, instead of processing the record as an error record.
- The Directory origin skips reading files that have the same timestamp.
- The JDBC Multitable Consumer origin cannot use a wildcard character (%) in the Schema property.
- The Azure Data Lake Storage Gen2 and Local FS destinations do not correctly shut down threads.
- When using WebSocket tunneling for browser to engine communication, Data Collector cannot use a proxy server for outbound requests made to Control Hub.
4.4.x Known Issues
- In Data Collector 4.4.0, the HTTP Client processor cannot write HTTP response data to an existing field. Earlier Data Collector versions are not affected by this issue.
Workaround: If using Data Collector 4.4.0, upgrade to Data Collector 4.4.1, where this issue is fixed.
- You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.
4.3.x Release Notes
The Data Collector 4.3.0 release occurred on January 13, 2022.
New Features and Enhancements
- Internal update
- This release includes internal updates to support an upcoming StreamSets DataOps Platform Control Hub feature.
4.3.0 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.3.0 is packaged with Log4j 2.17.0. This is the latest available Log4j version, and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues.
4.3.x Known Issues
- You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.
4.2.x Release Notes
- 4.2.1 on December 23, 2021
- 4.2.0 on November 9, 2021
New Features and Enhancements
- New support
- Red Hat Enterprise Linux 8.x - You can now deploy Data Collector engines on RHEL 8.x, in addition to 6.x and 7.x.
- New stage
- InfluxDB 2.x destination - Use the destination to write to InfluxDB 2.x databases.
- Updated stages
- Couchbase Lookup processor property name updates - For clarity, the following property names have been changed:
- Property Name is now Sub-Document Path.
- Return Properties is now Return Sub-Documents.
- SDC Field is now Output Field.
- When performing a key value lookup and configuring multiple return properties, the Property Mappings property is now Sub-Document Mappings.
- When performing an N1QL lookup and configuring multiple return properties, the Property Mappings property is now Sub-N1QL Mappings.
- Einstein Analytics destination enhancements:
- The Einstein Analytics destination has been renamed the Tableau CRM destination to match the Salesforce rebranding.
- The new Tableau CRM destination can perform automatic recovery.
- HTTP Client stage statistics - HTTP Client stages provide additional metrics when you monitor the pipeline.
- PostgreSQL CDC Client origin - You can specify the SSL mode to use on the new Encryption tab of the origin.
- Salesforce destination - The destination supports performing hard deletes when using the Salesforce Bulk API. Hard deletes permanently delete records, bypassing the Salesforce Recycle Bin.
- Salesforce stages - Salesforce stages now use version 53.0.0 of the Salesforce API by default.
- SFTP/FTP/FTPS stages - All SFTP/FTP/FTPS Client stages now support HTTP and Socks proxies.
- Connections
- With this release, the following stage supports using Control Hub connections:
- Cassandra destination
- SFTP/FTP/FTPS enhancement - The SFTP/FTP/FTPS connection allows configuring the new SFTP/FTP/FTPS proxy properties.
- Additional enhancements
- Enabling HTTPS for Data Collector - You can now store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources, and then enter a path relative to that directory when you define the keystore location. This can have upgrade impact.
- Google Secret Manager enhancement - You can configure a new enforceEntryGroup Google Secret Manager credential store property to validate a user's group against a comma-separated list of groups allowed to access each secret.
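A minimal sketch of the group enforcement option in the credential store properties file; the fully qualified property name is an assumption based on the credentialStore.gcp.config prefix used by the store's other options:
  credentialStore.gcp.config.enforceEntryGroup=true
When enabled, a secret lookup succeeds only if the requesting user belongs to one of the groups in the comma-separated list associated with that secret.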
Upgrade Impact
- Enabling HTTPS for Data Collector
- With this release, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration properties, as shown in the sketch after this list.
- Tableau CRM destination write behavior change
- The write behavior of the Tableau CRM destination, previously known as the Einstein Analytics destination, has changed.
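As an illustration of the relative keystore path described above, assuming a keystore file copied to <installation_dir>/externalResources/resources/keystore.jks, the HTTPS properties in the Data Collector configuration might look like the following; verify the property names against your sdc.properties file:
  https.keystore.path=keystore.jks
  https.keystore.password=${file("keystore-password.txt")}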
4.2.1 Fixed Issues
- To address recently-discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.2.1 is packaged with Log4j 2.17.0. This is the latest available Log4j version, and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues.
- The new permissions validation for the Oracle CDC Client origin added in Data Collector 4.2.0 is too strict. This fix returns the permissions validation to the same level as 4.1.x.
4.2.0 Fixed Issues
- Oracle CDC Client origin pipelines can take up to 10 minutes to shut down due to Oracle driver and executor timeout policies. With this fix, those policies are bypassed while allowing all processes to complete gracefully.
- The Oracle CDC Client origin can miss recovering transactional data when the pipeline unexpectedly stops when the origin is processing overlapping transactions.
- The JDBC Producer destination does not properly write to partitioned PostgreSQL database tables.
- The MongoDB destination cannot write null values to MongoDB.
- The Salesforce Lookup processor does not properly handle SOQL queries that include single quotation marks.
- Pipeline performance suffers when using the Azure Data Lake Storage Gen2 destination to write large batches of data in the Avro data format.
- The MapR DB CDC origin does not properly handle records with deleted fields.
- When configured to return only the first of multiple return values, the Couchbase Lookup processor creates multiple records instead.
- The Tableau CRM destination, previously known as the Einstein Analytics destination, signals Salesforce to process data after each batch, effectively treating each batch as a dataset. This fix can have upgrade impact.
4.2.x Known Issues
- You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.
4.1.x Release Notes
The Data Collector 4.1.0 release occurred on August 18, 2021.
New Features and Enhancements
- New stage
- Google Cloud Storage executor - You can use this executor to create new objects, copy or move objects, or add metadata to new or existing objects.
- Stage type enhancements
- Amazon stages - When you configure the Region property, you can select from several additional regions.
- Kudu stages - The default value for the Maximum Number of Worker Threads property is now 2. Previously, the default was 0, which used the Kudu default. Existing pipelines are not affected by this change.
- Orchestration stages - You can use an expression when you configure the Control Hub URL property in orchestration stages.
- Salesforce stages - All Salesforce stages now support using version 52.2.0 of the Salesforce API.
- Scripting processors - In the Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors, you can select the Script Error as Record Error property to have the stage handle script errors based on how the On Record Error property is configured for the stage.
- Origin enhancements
- Google Cloud Storage origin - You can configure post processing actions to take on objects that the origin reads.
- MySQL Binary Log origin - The origin now recovers automatically from the following issues:
- Lost, damaged, or unestablished connections.
- Exceptions raised from MySQL Binary Log being out-of-sync in some cluster nodes, or from being unable to communicate with the MySQL Binary Log origin.
- Oracle CDC Client origin:
- The origin includes a Batch Wait Time property that determines how long the origin waits for data before sending an empty or partial batch through the pipeline.
- The origin provides additional LogMiner metrics when you monitor a pipeline.
- RabbitMQ Consumer origin - You can configure the origin to read from quorum queues by adding x-queue-type as an Additional Client Configuration property and setting it to quorum.
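For reference, the quorum queue setting described in the RabbitMQ Consumer enhancement above amounts to a single name-value pair under the origin's Additional Client Configuration properties:
  x-queue-type=quorum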
- Processor enhancements
- SQL Parser processor - You can configure the processor to use the Oracle PEG parser instead of the default parser.
- Destination enhancements
- Google BigQuery (Legacy) destination - The destination now supports writing Decimal data to Google BigQuery Decimal columns.
- MongoDB destination - You can use the Improve Type Conversion property to improve how the destination handles date and decimal data.
- Splunk destination - You can use the Additional HTTP Headers property to define additional key-value pairs of HTTP headers to include when writing data to Splunk.
- Credential stores
- New Google Secret Manager support - You can use Google Secret Manager as a credential store for Data Collector.
- Cyberark enhancement - You can configure the credentialStore.cyberark.config.ws.proxyURI property to define the URI of the proxy used to reach the CyberArk services.
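A minimal sketch of the CyberArk proxy setting in the credential store properties file; the proxy URL shown is a placeholder:
  credentialStores=cyberark
  credentialStore.cyberark.config.ws.proxyURI=http://proxy.example.com:3128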
- Enterprise Stage Libraries
- In October 2021, StreamSets released the following new Enterprise stage library:
- Connections
- With this release, the following stages support using Control Hub connections:
- MongoDB stages
- RabbitMQ stages
- Redis stages
- Salesforce enhancement - The Salesforce connection includes the following role properties:
- Use Snowflake Role
- Snowflake Role Name
- Stage libraries
- This release includes the following new stage library:
- streamsets-datacollector-apache-kafka_2_8-lib - For Apache Kafka 2.8.0.
- Additional enhancements
- Excel data format enhancement - Stages that support reading the Excel data format include an Include Cells With Empty Value property to include empty cells in records.
4.1.0 Fixed Issues
- Due to an issue with an underlying library, HTTP connections can fail when Keep-Alive is disabled.
- Stages that need to parse a large number of JSON, CSV, or XML files might exceed the file descriptors limit because the stages don't release them appropriately.
- Data Collector does not properly handle Avro schemas with nested Union fields.
- Errors occur when using HBase stages with the CDH 6.0.x - 6.3.x or CDP 7.1 stage libraries when the HBase column name includes more than one colon (:).
- When the HTTP Lookup processor paginates by page number, it can enter an endless retry loop when reading the last page of data.
- The JDBC Lookup processor does not support expressions for table names when validating column mappings. Note: Validating column mappings for multiple tables can slow pipeline performance because all table columns defined in the column mappings must be validated before processing can begin.
- The Kudu Lookup processor and Kudu destination do not release resources under certain circumstances.
- When reading data with a query that uses the MAX or MIN operators, the SQL Server CDC Client origin can take a long time to start processing data.
4.1.x Known Issues
- You can only use the Google Secret Manager credential store with self-managed deployments and GCE deployments. You cannot use the credential store with Amazon EC2 deployments at this time.
4.0.x Release Notes
- 4.0.2 - June 23, 2021
- 4.0.1 - June 7, 2021
- 4.0.0 - May 25, 2021
New Features and Enhancements
- Stage enhancements
- Control Hub orchestration stages - The orchestration stages that connect to Control Hub allow using API User Credentials to log into Control Hub, as an alternative to user name and password. This affects the following stages:
- Start Job origin
- Control Hub API processor
- Start Job processor
- Wait for Job processor
- Kafka stages - Kafka stages include an Override Stage Configurations property that enables the Kafka properties defined in the stage to override other stage properties.
- MapR Streams stages - MapR Streams stages also include an Override Stage Configurations property that enables the additional MapR or Kafka properties defined in the stage to override other stage properties.
- Salesforce stages - The Salesforce origin, processor, destination, and the Tableau CRM destination include the following new timeout properties:
- Connection Handshake Timeout
- Subscribe Timeout
- Oracle CDC Client origin:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The origin includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement. See the sketch after this list.
- Starting with version 4.0.1, the origin includes a Batch Wait Time property.
- Field Type Converter processor - The Source Field is Empty property enables you to specify the action to take when an input field is an empty string.
- HTTP Client processor:
- Two Pass Records properties allow you to pass a record through the pipeline when all retries fail for per-status actions and for timeouts.
- The following record header attributes are populated when you
use one of the Pass Records properties:
- httpClientError
- httpClientStatus
- httpClientLastAction
- httpClientTimeoutType
- httpClientRetries
- SQL Parser processor:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The processor includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
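As a sketch of how the pseudocolumn record header attributes from the Oracle CDC Client origin and SQL Parser processor can be used downstream, any stage that accepts expressions can read them with the record:attribute() function. ROWID is a hypothetical pseudocolumn name used only for illustration:
  ${record:attribute('oracle.cdc.oracle.pseudocolumn.ROWID')}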
- Connections
- With this release, the following stages support using Control Hub connections:
- Oracle CDC Client origin
- SQL Server CDC Client origin
- SQL Server Change Tracking Client origin
- Enterprise stage libraries
- In June 2021, StreamSets released new versions of the Databricks and Snowflake Enterprise stage libraries.
- Additional features
- SDC_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as custom stage libraries, external libraries, and runtime resources. The default location is $SDC_DIST/externalResources. See the sketch after this list.
- Support Bundle - Support bundles now include the System Messages log file when you include log files in the bundle.
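For example, the external resources root can be overridden before starting Data Collector by exporting the variable, typically in the sdc-env.sh environment script for tarball installations; the path shown is a placeholder:
  export SDC_EXTERNAL_RESOURCES=/opt/sdc/externalResources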
4.0.2 Fixed Issues
- The JDBC Producer destination can round the scale of numeric data when it performs multi-row operations while writing to SQL Server tables.
- You cannot use API user credentials in Orchestration stages.
4.0.1 Fixed Issue
- In the JDBC Lookup processor, enabling the Validate Column Mappings property when using an expression to represent the lookup table generates an invalid SQL query.
Though fixed, using column mapping validation with an expression for the table name requires querying the database for all column names. As a result, the response time can be slower than expected.
4.0.0 Fixed Issues
- The SQL Server CDC Client origin does not process data correctly when configured to generate schema change events.
- The Hadoop FS destination stages fail to recover temporary files when the directory template includes pipeline parameters or expressions.
- The Oracle CDC Client origin can generate an exception when trying to process data from a transaction after the same partially-processed transaction has already been flushed after exceeding the maximum transaction length.
- The Oracle CDC Client origin fails to start when it is configured to start from a timestamp or SCN that is contained in multiple database incarnations.
- Some conditional expressions in the Field Mapper processor can cause errors when operating on field names.
- HTTP Client stages should not log the proxy password when the Data Collector logging mode is set to Debug.
- The HTTP Client processor can create duplicate requests when Pagination Mode is set to None.
- The MQTT Subscriber origin does not properly restore a persistent session.
- The Oracle CDC Client origin generates an exception when Oracle includes an empty string in a redo log statement, which is unexpected. With this fix, the origin interprets empty strings as NULL.
- Data Collector uses a Java version specified in the PATH environment variable over the version defined in the JAVA_HOME environment variable.
4.0.x Known Issues
There are no important known issues at this time.