Release Notes
4.2.x Release Notes
- 4.2.1 on December 23, 2021
- 4.2.0 on November 9, 2021
New Features and Enhancements
- New support
- Red Hat Enterprise Linux 8.x - Data Collector now supports installation on RHEL 8.x, in addition to 6.x and 7.x.
- New stage
- InfluxDB 2.x destination - Use the destination to write to InfluxDB 2.x databases.
- Updated stages
- Couchbase Lookup processor property name updates - For clarity, the following property names have been changed:
- Property Name is now Sub-Document Path.
- Return Properties is now Return Sub-Documents.
- SDC Field is now Output Field.
- When performing a key value lookup and configuring multiple return properties, the Property Mappings property is now Sub-Document Mappings.
- When performing an N1QL lookup and configuring multiple return properties, the Property Mappings property is now Sub-N1QL Mappings.
- Einstein Analytics destination enhancements:
- The Einstein Analytics destination has been renamed the Tableau CRM destination to match the Salesforce rebranding.
- The new Tableau CRM destination can perform automatic recovery.
- HTTP Client stage statistics - HTTP Client stages provide additional metrics when you monitor the pipeline.
- PostgreSQL CDC Client origin - You can specify the SSL mode to use on the new Encryption tab of the origin.
- Salesforce destination - The destination supports performing hard deletes when using the Salesforce Bulk API. Hard deletes permanently delete records, bypassing the Salesforce Recycle Bin. (See the sketch after this list.)
- Salesforce stages - Salesforce stages now use version 53.0.0 of the Salesforce API by default.
- SFTP/FTP/FTPS stages - All SFTP/FTP/FTPS Client stages now support HTTP and Socks proxies.
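For context, a hard delete at the Salesforce Bulk API level is simply a job created with the hardDelete operation. A minimal sketch using the Salesforce WSC async client; the destination manages this internally, and the Account object and helper method here are hypothetical:

```java
import com.sforce.async.AsyncApiException;
import com.sforce.async.BulkConnection;
import com.sforce.async.JobInfo;
import com.sforce.async.OperationEnum;

public class HardDeleteSketch {
    // Creates a Bulk API job that hard-deletes Account records,
    // bypassing the Recycle Bin. The caller supplies an authenticated
    // BulkConnection; "Account" is a hypothetical target object.
    static JobInfo createHardDeleteJob(BulkConnection conn) throws AsyncApiException {
        JobInfo job = new JobInfo();
        job.setObject("Account");
        job.setOperation(OperationEnum.hardDelete); // vs. OperationEnum.delete
        return conn.createJob(job);
    }
}
```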
- Connections when registered with Control Hub
- When Data Collector version 4.2.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stage supports using Control Hub connections:
- Cassandra destination
- SFTP/FTP/FTPS enhancement - The SFTP/FTP/FTPS connection allows configuring the new SFTP/FTP/FTPS proxy properties.
- Additional enhancements
- Enabling HTTPS for Data Collector - You can now store the keystore file in the Data Collector resources directory, $SDC_RESOURCES, and then enter a path relative to that directory when you define the keystore location. This can have upgrade impact.
- Google Secret Manager enhancement - You can configure the new enforceEntryGroup Google Secret Manager credential store property to validate a user's group against a comma-separated list of groups allowed to access each secret, as sketched below.
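The notes don't spell out the matching rules, but the intent is a group-membership check against the allow list. A minimal sketch, assuming exact group-name matching against the comma-separated list (the helper is hypothetical; the credential store performs an equivalent check internally):

```java
import java.util.Arrays;
import java.util.List;

public class EntryGroupCheck {
    // Returns true if any of the user's groups appears in the secret's
    // comma-separated allow list (hypothetical helper for illustration).
    static boolean allowed(String allowListCsv, List<String> userGroups) {
        List<String> allowList = Arrays.asList(allowListCsv.split("\\s*,\\s*"));
        return userGroups.stream().anyMatch(allowList::contains);
    }

    public static void main(String[] args) {
        System.out.println(allowed("devops, data-eng", List.of("data-eng"))); // true
        System.out.println(allowed("devops, data-eng", List.of("analysts"))); // false
    }
}
```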
- Testing update
- With this release, StreamSets no longer tests Data Collector against Cloudera CDH 5.x, which has been deprecated.
Upgrade Impact
- Enabling HTTPS for Data Collector
- With this release, when you enable HTTPS for Data Collector, you can store the keystore file in the Data Collector resources directory, $SDC_RESOURCES. You can then enter a path relative to that directory when you define the keystore location in the Data Collector configuration file. (See the sketch after this section.)
- Tableau CRM destination write behavior change
- The write behavior of the Tableau CRM destination, previously known as the Einstein Analytics destination, has changed.
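A minimal sketch of the keystore path resolution described above, assuming a relative value simply resolves against $SDC_RESOURCES (the configured value and file name are hypothetical):

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class KeystorePathSketch {
    // Absolute paths are used as-is; relative paths resolve against the
    // $SDC_RESOURCES directory (assumed behavior, per the notes above).
    static Path resolveKeystore(String configuredLocation) {
        Path keystore = Paths.get(configuredLocation);
        return keystore.isAbsolute()
            ? keystore
            : Paths.get(System.getenv("SDC_RESOURCES")).resolve(keystore);
    }

    public static void main(String[] args) {
        System.out.println(resolveKeystore("keystore.jks")); // hypothetical file name
    }
}
```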
4.2.1 Fixed Issues
- To address recently discovered vulnerabilities in Apache Log4j 2.16.x and earlier 2.x versions, Data Collector 4.2.1 is packaged with Log4j 2.17.0. This is the latest available Log4j version and contains fixes for all known issues.
- Data Collector now sets a Java system property to help address the Apache Log4j known issues. (See the sketch after this list.)
- The new permissions validation for the Oracle CDC Client origin added in Data Collector 4.2.0 is too strict. This fix returns the permissions validation to the same level as 4.1.x.
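The notes do not name the property. As an assumption, the widely published Log4j mitigation flag, log4j2.formatMsgNoLookups, is shown below only to illustrate what setting such a system property looks like; Data Collector applies its property automatically, so no manual action is needed:

```java
public class Log4jFlagSketch {
    public static void main(String[] args) {
        // Assumption: the commonly used Log4j mitigation flag. Shown for
        // illustration only; Data Collector sets its property itself.
        System.setProperty("log4j2.formatMsgNoLookups", "true");
        System.out.println(System.getProperty("log4j2.formatMsgNoLookups"));
    }
}
```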
4.2.0 Fixed Issues
- Oracle CDC Client origin pipelines can take up to 10 minutes to shut down due to Oracle driver and executor timeout policies. With this fix, those policies are bypassed while allowing all processes to complete gracefully.
- The Oracle CDC Client origin can fail to recover transactional data when the pipeline unexpectedly stops while the origin is processing overlapping transactions.
- The JDBC Producer destination does not properly write to partitioned PostgreSQL database tables.
- The MongoDB destination cannot write null values to MongoDB.
- The Salesforce Lookup processor does not properly handle SOQL queries that include single quotation marks.
- Pipeline performance suffers when using the Azure Data Lake Storage Gen2 destination to write large batches of data in the Avro data format.
- The MapR DB CDC origin does not properly handle records with deleted fields.
- When configured to return only the first of multiple return values, the Couchbase Lookup processor creates multiple records instead.
- The Tableau CRM destination, previously known as the Einstein Analytics destination, signals Salesforce to process data after each batch, effectively treating each batch as a dataset. This fix can have upgrade impact.
4.2.x Known Issues
There are no important known issues at this time.
4.1.x Release Notes
The Data Collector 4.1.0 release occurred on August 18, 2021.
New Features and Enhancements
- Use DataOps Platform to access Data Collector
- Existing customers can continue to access Data Collector downloads using the StreamSets Support Portal.
- New stage
- Google Cloud Storage executor - You can use this executor to create new objects, copy or move objects, or add metadata to new or existing objects.
- Stage type enhancements
- Amazon stages - When you configure the Region property, you can select from several additional regions.
- Kudu stages - The default value for the Maximum Number of Worker Threads property is now 2. Previously, the default was 0, which used the Kudu default. Existing pipelines are not affected by this change.
- Orchestration stages - You can use an expression when you configure the Control Hub URL property in orchestration stages.
- Salesforce stages - All Salesforce stages now support using version 52.2.0 of the Salesforce API.
- Scripting processors - In the Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors, you can select the Script Error as Record Error property to have the stage handle script errors based on how the On Record Error property is configured for the stage.
- Origin enhancements
- Google Cloud Storage origin - You can configure post processing actions to take on objects that the origin reads.
- MySQL Binary Log origin - The origin now recovers automatically from the following issues:
- Lost, damaged, or unestablished connections.
- Exceptions raised from MySQL Binary Log being out-of-sync in some cluster nodes, or from being unable to communicate with the MySQL Binary Log origin.
- Oracle CDC Client origin:
- The origin includes a Batch Wait Time property that determines how long the origin waits for data before sending an empty or partial batch through the pipeline.
- The origin provides additional LogMiner metrics when you monitor a pipeline.
- RabbitMQ Consumer origin - You can configure the origin to read from quorum queues by adding x-queue-type as an Additional Client Configuration property and setting it to quorum, as sketched below.
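For context, here is what the x-queue-type argument does at the RabbitMQ Java client level: declaring a queue with x-queue-type set to quorum creates a quorum queue. A minimal standalone sketch, not Data Collector code; the broker host and queue name are hypothetical:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.Map;

public class QuorumQueueSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // hypothetical broker host
        try (Connection conn = factory.newConnection()) {
            Channel channel = conn.createChannel();
            // x-queue-type=quorum makes this a quorum queue; quorum
            // queues must be durable (second argument).
            channel.queueDeclare("orders", true, false, false,
                Map.of("x-queue-type", "quorum"));
        }
    }
}
```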
- Processor enhancements
- SQL Parser processor - You can configure the processor to use the Oracle PEG parser instead of the default parser.
- Destination enhancements
- Google BigQuery destination - The destination now supports writing Decimal data to Google BigQuery Decimal columns.
- MongoDB destination - You can use the Improve Type Conversion property to improve how the destination handles date and decimal data.
- Splunk destination - You can use the Additional HTTP Headers property to define additional key-value pairs of HTTP headers to include when writing data to Splunk.
- Credential stores
- New Google Secret Manager support - You can use Google Secret Manager as a credential store for Data Collector.
- CyberArk enhancement - You can configure the credentialStore.cyberark.config.ws.proxyURI property to define the URI of the proxy used to reach the CyberArk services.
- Enterprise Stage Libraries
- In October 2021, StreamSets released the following new Enterprise stage library:
- Connections when registered with Control Hub
- When Data Collector version 4.1.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- MongoDB stages
- RabbitMQ stages
- Redis stages
- Snowflake enhancement - The Snowflake connection includes the following role properties:
- Use Snowflake Role
- Snowflake Role Name
- Stage libraries
- This release includes the following new stage library:
- streamsets-datacollector-apache-kafka_2_8-lib - For Apache Kafka 2.8.0.
- Additional enhancements
- Excel data format enhancement - Stages that support reading the Excel data format include an Include Cells With Empty Value property to include empty cells in records.
4.1.0 Fixed Issues
- Due to an issue with an underlying library, HTTP connections can fail when Keep-Alive is disabled.
- Stages that need to parse a large number of JSON, CSV, or XML files might exceed the file descriptors limit because the stages don't release them appropriately.
- Data Collector does not properly handle Avro schemas with nested Union fields.
- Errors occur when using HBase stages with the CDH 6.0.x - 6.3.x or CDP 7.1 stage libraries when the HBase column name includes more than one colon (:).
- When the HTTP Lookup processor paginates by page number, it can enter an endless retry loop when reading the last page of data.
- The JDBC Lookup processor does not support expressions for table names when validating column mappings. Note: Validating column mappings for multiple tables can slow pipeline performance because all table columns defined in the column mappings must be validated before processing can begin.
- The Kudu Lookup processor and Kudu destination do not release resources under certain circumstances.
- When reading data with a query that uses the MAX or MIN operators, the SQL Server CDC Client origin can take a long time to start processing data.
4.1.x Known Issues
There are no important known issues at this time.
4.0.x Release Notes
- 4.0.2 - June 23, 2021
- 4.0.1 - June 7, 2021
- 4.0.0 - May 25, 2021
New Features and Enhancements
- Stage enhancements
- Control Hub orchestration stages - The orchestration stages that connect to Control Hub allow using API User Credentials to log into Control Hub, as an alternative to user name and password. This affects the following stages:
- Start Job origin
- Control Hub API processor
- Start Job processor
- Wait for Job processor
- Kafka stages - Kafka stages include an Override Stage Configurations property that enables the Kafka properties defined in the stage to override other stage properties. This can impact existing pipelines.
- MapR Streams stages - MapR Streams stages also include an Override Stage Configurations property that enables the additional MapR or Kafka properties defined in the stage to override other stage properties. This can impact existing pipelines.
- Salesforce stages - The Salesforce origin, processor, and destination, and the Tableau CRM destination include the following new timeout properties:
- Connection Handshake Timeout
- Subscribe Timeout
- Oracle CDC Client origin:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The origin includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement. (See the sketch after this list.)
- Starting with version 4.0.1, the origin includes a Batch Wait Time property.
- Field Type Converter processor - The Source Field is Empty property enables you to specify the action to take when an input field is an empty string.
- HTTP Client processor:
- Two Pass Records properties allow you to pass a record through the pipeline when all retries fail for per-status actions and for timeouts.
- The following record header attributes are populated when you use one of the Pass Records properties:
- httpClientError
- httpClientStatus
- httpClientLastAction
- httpClientTimeoutType
- httpClientRetries
- SQL Parser processor:
- You can now remove Oracle pseudocolumns from parsed redo logs and place the information in record header attributes.
- The processor includes an oracle.cdc.oracle.pseudocolumn.<pseudocolumn name> attribute for each pseudocolumn in the original statement.
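A minimal sketch of reading one of these pseudocolumn header attributes downstream, for example in a custom processor built on the Data Collector API; the ROWID pseudocolumn name and helper are hypothetical:

```java
import com.streamsets.pipeline.api.Record;

public class PseudocolumnSketch {
    // Reads the header attribute written for a pseudocolumn. "ROWID" is
    // a hypothetical pseudocolumn name used only for illustration.
    static String readRowId(Record record) {
        return record.getHeader().getAttribute("oracle.cdc.oracle.pseudocolumn.ROWID");
    }
}
```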
- Connections when registered with Control Hub
- When Data Collector version 4.0.0 is registered with Control Hub cloud or with Control Hub on-premises version 3.19.x or later, the following stages support using Control Hub connections:
- Oracle CDC Client origin
- SQL Server CDC Client origin
- SQL Server Change Tracking Client origin
- Enterprise stage libraries
- In June 2021, StreamSets released new versions of the Databricks and Snowflake Enterprise stage libraries.
- Additional features
- SDC_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as custom stage libraries, external libraries, and runtime resources. The default location is $SDC_DIST/externalResources. (See the sketch after this list.)
- Support Bundle - Support bundles now include the System Messages log file when you include log files in the bundle.
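A minimal sketch of the documented default, assuming the variable simply falls back to $SDC_DIST/externalResources when unset:

```java
public class ExternalResourcesSketch {
    public static void main(String[] args) {
        // Falls back to $SDC_DIST/externalResources when
        // SDC_EXTERNAL_RESOURCES is not set (assumed default per the notes).
        String external = System.getenv().getOrDefault(
            "SDC_EXTERNAL_RESOURCES",
            System.getenv("SDC_DIST") + "/externalResources");
        System.out.println(external);
    }
}
```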
- Deprecated features
- Several features and stages have been deprecated with this release and may be removed in a future release. We recommend that you avoid using these features and stages. For a full list, see the Data Collector documentation.
Upgrade Impact
- Conflicting properties in Kafka and MapR Streams stages
- In previous releases, if you specified an additional configuration property that conflicted with a stage property setting in a Kafka or MapR Streams stage, the stage property took precedence. Starting with this release, you can enable the Override Stage Configurations property to have the properties defined in the stage take precedence.
- Control Hub On-premises prerequisite task
- Before using Data Collector 4.0.0 or later versions with Control Hub On-premises, you must complete a prerequisite task. For details, see the StreamSets Support portal.
- HTTP Client processor batch wait time change
- With this release, the HTTP Client processor performs additional checks against the specified batch wait time. This can affect existing pipelines. For details, see Review HTTP Client Processor Pipelines.
- Open source status
- Data Collector 4.0.0 and later versions are not open source. This means that StreamSets will not make the source code publicly available.
- Stages removed
- The following stages have been deprecated for several years and have been removed from Data Collector with this release:
- HTTP to Kafka origin
- SDC RPC to Kafka origin
- UDP to Kafka origin
- Updated environment variable default (tarball installation, manual start)
- For manually-started tarball installations, the default location for the SDC_RESOURCES environment variable has changed from $SDC_DIST/resources to $SDC_EXTERNAL_RESOURCES/resources, which evaluates to: $SDC_DIST/externalResources/resources.
4.0.2 Fixed Issues
- The JDBC Producer destination can round the scale of numeric data when it performs multi-row operations while writing to SQL Server tables.
- You cannot use API user credentials in Orchestration stages.
4.0.1 Fixed Issue
- In the JDBC Lookup processor, enabling the Validate Column Mappings property when using an expression to represent the lookup table generates an invalid SQL query. Though fixed, using column mapping validation with an expression for the table name requires querying the database for all column names. As a result, the response time can be slower than expected.
4.0.0 Fixed Issues
- The SQL Server CDC Client origin does not process data correctly when configured to generate schema change events.
- The Hadoop FS destination stages fail to recover temporary files when the directory template includes pipeline parameters or expressions.
- The Oracle CDC Client origin can generate an exception when trying to process data from a transaction after the same partially-processed transaction has already been flushed after exceeding the maximum transaction length.
- The Oracle CDC Client origin fails to start when it is configured to start from a timestamp or SCN that is contained in multiple database incarnations.
- Some conditional expressions in the Field Mapper processor can cause errors when operating on field names.
- HTTP Client stages log the proxy password when the Data Collector logging mode is set to Debug.
- The HTTP Client processor can create duplicate requests when Pagination Mode is set to None.
- The MQTT Subscriber origin does not properly restore a persistent session.
- The Oracle CDC Client origin generates an exception when Oracle includes an empty string in a redo log statement, which is unexpected. With this fix, the origin interprets empty strings as NULL.
- Data Collector uses a Java version specified in the PATH environment variable over the version defined in the JAVA_HOME environment variable.
4.0.x Known Issues
There are no important known issues at this time.