What's New

What's New in 3.6.0

Data Collector version 3.6.0 includes the following new features and enhancements:

Data Collector Edge (SDC Edge)

Register SDC Edge with Control Hub - You can now use the command line to register SDC Edge with Control Hub.
Delimited data format - Stages in edge pipelines can now process the delimited data format.
Functions - The sdc:hostname() function can now return the host name of a Data Collector or Data Collector Edge machine and can be used within edge pipelines.

What's New in 3.5.0

Data Collector version 3.5.0 includes the following new features and enhancements:

Origins

New Pulsar Consumer origin - A new origin that reads messages from one or more topics in an Apache Pulsar cluster.
JDBC Multitable Consumer and JDBC Query Consumer origin enhancements - These origins now include an option to convert timestamp data to the String data type instead of to the Datetime data type to ensure that the precision is maintained.
Salesforce origin enhancement - When using the Bulk API, the origin can now execute an SOQL query that includes one or more subqueries.
WebSocket Client and WebSocket Server origin enhancements - When included in a microservice pipeline, the origins can now send responses back to the originating endpoint when used with microservice destinations in the same pipeline.

Processors

New Encrypt and Decrypt Fields processor - A new processor that encrypts or decrypts individual field values.
New MongoDB Lookup processor - A new processor that performs lookups in MongoDB and passes all values from the returned document to a new list-map field. Use the MongoDB Lookup to enrich records with additional data.
New HTTP Router processor - A new processor that passes records to streams based on the HTTP method and URL path in record header attributes. Use the HTTP Router processor in pipelines with an origin that creates HTTP method and path record header attributes - including the HTTP Server origin and the REST Service origin.
Field Type Converter processor enhancement - The processor can now convert the Boolean data type to the Integer, Long, or Short data type.
Salesforce Lookup processor enhancements - The processor includes the following enhancements:
- The processor can now return multiple values. You can configure the lookup to return the first value or to return all matches as separate records.
- You can now configure how the processor handles a lookup that returns no value in fields with no default value defined. Upgraded pipelines continue to send records with no return value and no default value to error.

Destinations

New Pulsar Producer destination - A new destination that writes data to topics in an Apache Pulsar cluster.
New Syslog destination - A new destination that writes data to a Syslog server.
HTTP Client, Kafka Producer, and Kinesis Producer destination enhancements - When included in a microservice pipeline, the destinations can now send responses to the microservice origin in the pipeline.

Executors

New Databricks executor - A new executor that starts a Databricks job each time it receives an event.
With the addition of this new executor, Data Collector has removed the ability to use the Spark executor with Databricks. If you upgrade pipelines that include the Spark executor with Databricks, you must update the pipeline to use the Databricks executor after you upgrade.

Hive Stages

JDBC Credentials - The following Hive stages now allow you to enter credentials separately from the JDBC URL for Hive:
- Hive Metadata processor
- Hive Metastore destination
- Hive Query executor

Salesforce Stages

API version - Data Collector now ships with version 43.0 of the Salesforce Web Services Connector libraries used by the following Salesforce stages:

Technology Preview Functionality

Data Collector now includes certain new features and stages with the Technology Preview designation. Technology Preview functionality is available for use in development and testing, but is not meant for use in production.

Technology Preview stages display a Technology Preview icon on the upper left corner of the preview stage, as follows:

When Technology Preview functionality becomes approved for use in production, the release notes and documentation reflect the change, and the Technology Preview icon is removed from the UI.

The following Technology Preview stages are available in this release:

Databricks ML Evaluator processor - A new processor that uses a machine learning model exported with Databricks ML Model Export to generate evaluations, scoring, or classifications of data.
MLeap Evaluator processor - A new processor that uses a machine learning model stored in an MLeap bundle to generate evaluations, scoring, or classifications of data.
PMML Evaluator processor - A new processor that uses a machine learning model stored in the Predictive Model Markup Language (PMML) format to generate predictions or classifications of data.
TensorFlow Evaluator processor - A new processor that uses TensorFlow machine learning models to generate predictions or classifications of data.

Data Formats

Delimited data format enhancement - When reading delimited data that contains headers with empty values, Data Collector now replaces the empty values with the string “empty-” plus the column number starting from zero. For example, if the 3rd column header is empty, then the field name in Data Collector becomes “empty-2”. Previously, Data Collector retained the empty field name.
Excel data format enhancement - When reading Excel data, Data Collector now processes the underlying raw values for numeric columns in a spreadsheet, rather than the displayed values. For example, if a cell contains 3.14159 but the display format is set to 2 decimals such that the spreadsheet displays 3.14, Data Collector still processes the full value of 3.14159. Previously, Data Collector encountered errors when processing an Excel spreadsheet that contained displayed values.

Data Collector Edge (SDC Edge)

Download an installer for Windows - You can now download a Microsoft installer to install SDC Edge on a Windows operating system.
Run SDC Edge as a service - You can now register SDC Edge to run as a system service on Darwin, Linux, or Windows operating systems.
System Metrics origin enhancement - The origin can now read metrics from specific processes running on the edge device.
Windows Event Log origin enhancement - The origin can now read from a custom Windows log.
Dev Data Generator origin supported - Edge pipelines now support the Dev Data Generator origin.
TensorFlow Evaluator processor supported - Edge pipelines support the new TensorFlow Evaluator processor.
Functions - Edge pipelines now support all job functions and the pipeline:startTime() function.
Disable the ability to manage production edge pipelines - By default, you can use the Data Collector UI or REST API to manage edge pipelines deployed to an SDC Edge - including previewing, validating, starting, stopping, resetting the origin, and monitoring the pipelines. You can now disable the ability to manage edge pipelines in a production environment using the Data Collector UI or REST API. When disabled, you manage edge pipelines using Control Hub or by starting pipelines when you start SDC Edge.
Skip verifying trusted certificates - In a test or development environment, you can start SDC Edge so that it skips verifying trusted certificates. For example, you might want to skip verifying trusted certificates when SDC Edge is registered with a Control Hub on-premises installation enabled for HTTPS and you want to temporarily avoid configuring the truststore file for SDC Edge.StreamSets highly recommends that you configure SDC Edge to verify trusted certificates in a production environment.

Working with Control Hub

Automate registering and unregistering Data Collectors - You can now use an automation tool such as Ansible, Chef, or Puppet to automate the registering and unregistering of Data Collectors using the following commands:
```
streamsets sch register
streamsets sch unregister
```

Microservice Pipelines

Origins for microservice pipelines - The following origins can now send responses back to the originating REST API client when used with destinations that send records to the origin in the same microservice pipeline:
- WebSocket Client origin
- WebSocket Server origin
Destinations for microservice pipelines - The following destinations can now send records to the origin in the microservice pipeline with the specified response:
- HTTP Client destination
- Kafka Producer destination
- Kines Producer destination
Sample microservice pipeline - When you create a microservice pipeline, the sample microservice pipeline now includes the new HTTP Router processor instead of the Stream Selector processor to route data to different streams based on the request method.

Data Governance Tools

Supported stages - Data Collector can now publish metadata to data governance tools for the following stages:
- Amazon S3 origin
- Kafka Multitopic Consumer origin
- SFTP/FTP Client origin
- Kafka Producer destination
Cloudera Navigator versions - Data Collector can now publish metadata to Cloudera Navigator running on Cloudera Manager versions 5.10 to 5.15.
Previously, publishing metadata to Cloudera Navigator was supported only on Cloudera Manager version 5.10 or 5.11.
Secure connections to Cloudera Navigator - If Cloudera Navigator is configured for TLS/SSL, Data Collector requires a local truststore file to verify the identity of the Cloudera Navigator Metadata Server. You now configure the truststore file location and password in the $SDC_CONF/sdc.properties file when you configure the connection to Cloudera Navigator.

Credential Stores

New Microsoft Azure Key Vault credential store - You can now define credentials in Microsoft Azure Key Vault and then use the Data Collector credential functions in stage properties to retrieve those values.
Commands for a Java keystore credential store - You now use the stagelib-cli jks-credentialstore command to add, list, and delete credentials in a Java keystore credential store. Previously you used the jks-cs command, which has now been deprecated.

Expression Language

String functions - This release includes the following new function:
- str:split() - Splits a string into a list of string values.
Pipeline functions - This release includes the following new function:
- pipeline:startTime() - Returns the start time of the pipeline as a Datetime value.
Job functions - This release includes the following new functions:
- job:id() - Returns the ID of the job if the pipeline was run from a Control Hub job. Otherwise, returns "UNDEFINED".
- job:name() - Returns the name of the job if the pipeline was run from a Control Hub job. Otherwise, returns "UNDEFINED".
- job:startTime() - Returns the start time of the job if the pipeline was run from a Control Hub job. Otherwise, returns the start time of the pipeline.
- job:user() - Returns the user who started the job if the pipeline was run from a Control Hub job. Otherwise, returns "UNDEFINED".

Stage Libraries

New stage libraries - This release includes the following new stage libraries:

Stage Library Name	Description
streamsets-datacollector-apache-kafka_1_1-lib	Apache Kafka version 1.1.x
streamsets-datacollector-apache-kafka_2_0-lib	Apache Kafka version 2.0.x
streamsets-datacollector-apache-pulsar_2-lib	Apache Pulsar version 2.1.0-incubating
streamsets-datacollector-azure-keyvault-credentialstore-lib	Microsoft Azure Key Vault credential store system
streamsets-datacollector-cdh_6_0-lib	Cloudera CDH version 6.0 distribution of Apache Hadoop Note: Does not include the following stages: HBase Lookup processor Spark Evaluator processor HBase destination
streamsets-datacollector-crypto-lib	For cryptography stages, including the Encrypt and Decrypt Fields processor
streamsets-datacollector-mapr_6_0-mep5-lib	MapR Ecosystem Pack (MEP) version 5 for MapR 6.0.1
streamsets-datacollector-tensorflow-lib	TensorFlow

Legacy stage libraries - The following stage libraries are now legacy stage libraries:

Stage Library Name	Description
streamsets-datacollector-apache-kafka_0_9-lib	Apache Kafka version 0.9.x
streamsets-datacollector-apache-kafka_0_10-lib	Apache Kafka version 0.10.x
streamsets-datacollector-cdh_5_8-lib	Cloudera CDH version 5.8 distribution of Apache Hadoop
streamsets-datacollector-cdh_5_9-lib	Cloudera CDH version 5.9 distribution of Apache Hadoop
streamsets-datacollector-cdh_kafka_2_0-lib	Cloudera distribution of Apache Kafka 2.0.x (0.9.0)
streamsets-datacollector-hdp_2_4-lib	Hortonworks version 2.4 distribution of Apache Hadoop
streamsets-datacollector-hdp_2_4-hive1-lib	Hortonworks version 2.4.x distribution of Apache Hive version 1.x
streamsets-datacollector-hdp_2_5-lib	Hortonworks version 2.5.x distribution of Apache Hadoop
streamsets-datacollector-hdp_2_5-flume-lib	Hortonworks version 2.5.x distribution of Apache Flume
streamsets-datacollector-mapr_5_1-lib	MapR version 5.1

Legacy stage libraries that are more than two years old are not included with Data Collector. Though not recommended, you can still download and install the older stage libraries as custom stage libraries.

If you have pipelines that use these legacy stage libraries, you will need to update the pipelines to use a more current stage library or install the legacy stage library manually. For more information see Update Pipelines using Legacy Stage Libraries.

Miscellaneous

Import pipelines from an external HTTP URL - You can now use Data Collector to import pipelines from an external HTTP URL. For example, you can import pipelines from the StreamSets GitHub repository.
Collection of usage statistics - When you log in to Data Collector as the admin/admin user for the first time, you can now choose to improve Data Collector by sending anonymized usage data. Previously, the ui.enable.usage.data.collection property in the Data Collector configuration file determined whether usage data was collected. This property has been removed.

What's New in 3.4.0

Data Collector version 3.4.0 includes the following new features and enhancements:

Origins

New PostgreSQL CDC Client origin - Use the PostgreSQL CDC Client origin to process change data capture information for a PostgreSQL database.
New Test origin- You can now configure a virtual test origin to provide test data for data preview to aid in pipeline development. In Control Hub, you can also use test origins when developing pipeline fragments.
Amazon S3, Directory, SFTP/FTP Client, Google Cloud Storage enhancements - The listed origins can now process Microsoft Excel files.
Dev Data Generator origin enhancement - The development origin can now generate additional types of data for testing purposes - such as sample address data, names, or prices.
Hadoop FS origin enhancements - The origin includes the following enhancements:
- Process Amazon S3 data in cluster EMR batch mode - Use the origin in a cluster EMR batch pipeline that runs on an Amazon EMR cluster to process data from Amazon S3.
- Process Amazon S3 data in cluster batch mode - Use the origin in a cluster batch pipeline that runs on a Cloudera distribution of Hadoop (CDH) or Hortonworks Data Platform (HDP) cluster to process data from Amazon S3.
HTTP Client origin enhancements - The origin includes the following changes and enhancements:
- The origin now uses buffered request transfer encoding by default. Upgraded pipelines retain their previous configuration.
- HEAD request responses create an empty record. Information returned from the HEAD appear in record header attributes.
HTTP Server origin enhancement - The origin now includes the name of the client or proxy that made the request in the remoteHost record header attribute.
MongoDB origin enhancement - You can now use a date field as the offset field.
Oracle CDC Client origin enhancements - The origin includes the following changes and enhancements:
- Multithreaded parsing - When using local caching and parsing the SQL query, the origin can now use multiple threads to parse transactions.
- PEG Parser - To improve performance for very wide tables, you can try our experimental PEG parser.
- With this release, the Query Timeout property has been removed. You can no longer configure a query to timeout before the end of a LogMiner session. The existing LogMiner Session Window property defines how long the session lasts.
Salesforce origin enhancement - When using the SOAP API, the origin can now execute an SOQL query that includes one or more subqueries. Support for subqueries using the Bulk API will be added in a future release.

Processors

New Whole File Transformer processor - Use the Whole File Transformer processor to convert fully written Avro files to Parquet in a whole file pipeline.
Field Hasher processor enhancement - The processor can now add a user-defined field separator to fields before hashing.
HTTP Client processor enhancements - The processor includes the following changes and enhancements:
- The processor now uses buffered request transfer encoding by default. Upgraded pipelines retain their previous configuration.
- HEAD request responses create an empty record. Information returned from the HEAD appear in record header attributes.
- The resolved request URL is now written to the Data Collector log when Data Collector logging is set to debug or higher.
JDBC Lookup processor enhancement - When using local caching, the processor can now use additional cores to prepopulate the cache to enhance pipeline performance.

Destinations

New Couchbase destination - A new destination that writes data to a Couchbase database.
New Splunk destination - A new destination that writes data to Splunk using the Splunk HTTP Event Collector (HEC).
Cassandra destination enhancement - You can now use SSL/TLS to connect to Cassandra.
HTTP Client destination enhancement - The destination now uses buffered request transfer encoding by default. Upgraded pipelines retain their previous configuration.

Executors

Amazon S3 executor enhancements - The executor includes the following enhancements:
- The executor can now copy objects to a new location and optionally delete the original object.
- The executor can now generate event records each time the executor creates a new object, adds tags to an existing object, or completes copying an object to a new location.

Data Collector Edge (SDC Edge)

New System Metrics origin - A new origin that reads system metrics - such as CPU and memory usage - from the edge device where SDC Edge is installed.
HTTP Client origin supported - Edge sending pipelines now support the HTTP Client origin. However, the origin does not currently support batch processing mode, pagination, or OAuth2 authorization in edge pipelines.
WebSocket Client origin supported - Edge sending pipelines now support the WebSocket Client origin.
Pipeline functions - Edge pipelines now support the following pipeline functions:
- pipeline:id()
- pipeline:title()
- pipeline:user()
Preview and validate edge pipelines - You can now use the Data Collector UI or the command line and REST API to preview and validate edge pipelines.
Publish multiple edge pipelines to SDC Edge - You can now use the Data Collector Home page to directly publish multiple edge pipelines at one time to an SDC Edge that is running. Previously, you could only publish a single edge pipeline at a time.
Download edge pipelines from SDC Edge - You can now use the Data Collector UI to download all edge pipelines deployed to an SDC Edge in addition to all sample edge pipelines included with SDC Edge.
Filter the Home page by edge pipelines - You can now select Edge Pipelines as a category on the Data Collector Home page to view all available edge pipelines.

Microservice pipelines

You can now create microservices using microservice pipelines. Use the following new stages in microservice pipelines:

New REST Service origin - Listens on an HTTP endpoint, parses the contents of all authorized requests, and sends responses back to the originating REST API. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Send Response to Origin destination - Sends records with the specified response to the microservice origin in the pipeline.

Pipelines

Notifications - You can now configure a pipeline to send an email or webhook when the pipeline changes to the Running_Error state.
Error records - Error records now include an errorJobID internal header attribute when the pipeline that generated the error record was started by a Control Hub job.
Install external libraries from the properties panel - You can now select a stage in the pipeline canvas and then install external libraries for that stage from the properties panel. Previously, you had to navigate to the Package Manager page to install external libraries.

Cluster Pipelines

New cluster EMR batch mode - Data Collector can now use the cluster EMR batch mode to run on an Amazon EMR cluster to process data from Amazon S3. Data Collector runs as an application on top of MapReduce in the EMR cluster.
Data Collector can run on an existing EMR cluster or on a new EMR cluster that is provisioned when the cluster pipeline starts. When you provision a new EMR cluster, you can configure whether the cluster remains active or terminates when the pipeline stops.

Use the Hadoop FS origin to process data from Amazon S3 in cluster EMR batch mode.
Logs - You can now configure the Data Collector on the master gateway node to use the log4j rolling file appender to write log messages to an sdc.log file. This configuration is propagated to the worker nodes such that each Data Collector worker writes log messages to an sdc.log file within the YARN application directory.

Data Formats

New Excel data format - You can now use the following file-based origins to process Microsoft Excel files:
- Amazon S3 origin
- Directory origin
- Google Cloud Storage origin
- SFTP/FTP Client origin
Avro and Protobuf data formats - To preserve the ordering of fields, the Avro and Protobuf data formats now use the list-map root field type instead of the map root field type.

Stage Libraries

This version of Data Collector includes the following new stage libraries:

streamsets-datacollector-cdh_5_15-lib - The Cloudera CDH 5.15 distribution of Hadoop.
streamsets-datacollector-emr_hadoop_2_8_3-lib - Includes the Hadoop FS origin for cluster EMR batch mode pipelines that run on an Amazon EMR cluster to process data from Amazon S3.

Miscellaneous

Cloudera Manager CSD enhancement - The Cloudera Manager CSD now enables specifying a StreamSets Customer ID, used when generating support bundles. The customer ID is generated by the StreamSets Support team for users with a paid subscription.
Postgres rename - Postgres CSV and Postgres Text delimited format types are now known as PostgreSQL CSV and PostgreSQL Text, respectively The Postgres Metadata processor is now known as the PostgreSQL Metadata processor. And the Drift Synchronization Solution for Postgres is now known as the Drift Synchronization Solution for PostgreSQL.
Documentation enhancement - The online help has a new look and feel. All of the previous documentation remains exactly where you expect it, but it is now easier to view and navigate on smaller devices like your tablet or mobile phone.

What's New in 3.3.1

Data Collector version 3.3.1 includes the following new features and enhancements:

Origin

JDBC Multitable Consumer origin enhancement - You can now optionally define a schema exclusion pattern to exclude some schemas from being read. The schema exclusion pattern uses a Java-based regular expression, or regex.

Processor

Kudu Lookup processor enhancements:
- You can now configure the Maximum Number of Worker Threads property to limit the number of threads that the processor uses.
- You can now configure an Admin Operation Timeout property to determine how many milliseconds to allow for admin-type operations, such as opening a table or getting a table schema.

Destination

Kudu destination enhancements:
- You can now configure the Maximum Number of Worker Threads property to limit the number of threads that the destination uses.
- You can now configure an Admin Operation Timeout property to determine how many milliseconds to allow for admin-type operations, such as opening a table or getting a table schema.

Environment Variables

Data Collector now includes a SPARK_KAFKA_VERSION environment variable that is set to 0.10 by default in the Data Collector environment configuration file - sdc.env.sh or sdcd.env.sh. Do not change this environment variable value. This variable is used only when you run cluster streaming mode pipelines on a Cloudera CDH cluster.

What's New in 3.3.0

Data Collector version 3.3.0 includes the following new features and enhancements:

Cluster Pipelines

When using Spark 2.1 or later and Kafka 0.10.0.0 or later in a cluster pipeline that reads from a Kafka cluster on YARN, you can now enable the pipeline to use Kafka security features such as SSL/TLS and Kerberos authentication.

Origins

WebSocket Client Origin enhancement - You can now configure the origin to send an initial message or command after connecting to the WebSocket server.

Processors

New SQL Parser processor - A processor that parses SQL queries. For example, if you set the Parse SQL Query property to false in the Oracle CDC origin, the origin writes the SQL query to an “sql” field that can be parsed by the SQL Parser.
Field Zip processor enhancement - The Continue option for the Field Does Not Exist property is now named Include without Processing.

Pipelines

Notifications - You can now configure a pipeline to send an email or webhook when the pipeline changes to the Stop_Error state.
Preview - The default value of the Preview Timeout property has been increased to 30,000 milliseconds. Previously the default was 10,000 milliseconds.

Edge Pipelines

Sensor Reader origin enhancement - This development stage can now generate records with thermal data such as that generated by BCM2835 onboard thermal sensors.

Stage Libraries

This version of Data Collector includes several new, changed, and removed stage libraries because of the introduction of cluster streaming mode with support for Kafka security features using Spark 2.1 or later and Kafka 0.10.0.0 or later.

For more information about the changed stage libraries, see Upgrade to Spark 2.1 or Later.

What's New in 3.2.0.0

Data Collector version 3.2.0.0 includes the following new features and enhancements:

Origins

New Hadoop FS Standalone origin - Similar to the Directory origin, the Hadoop FS Standalone origin can use multiple threads to read fully-written files. Use this origin in standalone execution mode pipelines to read files in HDFS.
New MapR FS Standalone origin - Similar to the Directory origin, the MapR FS Standalone origin can use multiple threads to read fully-written files. Use this origin in standalone execution mode pipelines to read files in MapR FS.
New Dev Snapshot Replaying origin - The Dev Snapshot Replaying origin is a development stage that reads records from a downloaded snapshot file.
HTTP Client origin enhancement - You can now configure the origin to process JSON files that include multiple JSON objects or a single JSON array
JDBC Multitable Consumer origin enhancements - The origin can now generate table-finished and schema-finished events when it completes processing all rows in a table or schema. You can also configure the number of seconds that the origin delays generating the no-more-data event. You might want to configure a delay if you want the table-finished or schema-finished events to appear in the event stream before the no-more-data event.
Oracle CDC Client origin enhancements - The origin includes the following enhancements:
- You can set a new Parse SQL Query property to false to skip parsing the SQL queries. Instead, the origin writes the SQL query to a “sql” field that can be parsed later. Default is true, which retains the previous behavior of parsing the SQL queries.
- The Send Redo Query property has been renamed. The new name is Send Redo Query in Headers.

TCP Server origin enhancement - You can now use the origin to read the supported Data Collector data formats when passed in Flume events as Avro messages.

Processors

HTTP Client processor enhancement - You can now use the PATCH method with the processor.
JDBC Lookup processor enhancement - The Retry on Cache Miss property has been renamed to Retry on Missing Value.
Kudu Lookup processor enhancement - You can now configure the processor behavior when a lookup returns no value.

Destinations

Hadoop FS, Local FS, and MapR FS destination enhancements - These destinations now support writing records using the SDC Record format.
HTTP Client destination enhancement - You can now use the PATCH method with the destination.

Executors

MapReduce executor enhancement - You can now use the new Avro to ORC job to convert Avro files to ORC files.

Data Collector Edge (SDC Edge)

SDC Edge includes the following enhancements:

JavaScript Evaluator processor supported - Both edge sending pipelines and edge receiving pipelines now support the JavaScript Evaluator processor.
Publish edge pipelines to SDC Edge - You can now use the Data Collector UI to directly publish edge pipelines to an SDC Edge that is running. Previously, you had to first export edge pipelines from Data Collector, and then move them to the SDC Edge installed on the edge device.
Manage edge pipelines from the Data Collector UI - You can now use the Data Collector UI to start, monitor, stop, and reset the origin for edge pipelines running on a remote SDC Edge. Previously, you had to use the command line and REST API to manage edge pipelines on SDC Edge.

Miscellaneous

Pipeline error handling enhancement - You can now configure pipelines to write error records to Azure Event Hub.
Pipeline runner idle time enhancement - You can configure the number of seconds that a pipeline runner waits before sending an empty batch.
Runtime statistics enhancement - Runtime statistics now include the number of empty or idle batches that are generated by the pipeline.
Snapshot enhancement - Snapshots now include record header attributes for error records. Previously, snapshots included only the record fields in an error record.

Stages

This version of Data Collector includes the following new stage library:

Apache Kudu version 1.7

What's New in 3.1.2.0

Data Collector version 3.1.2.0 includes the following new features and enhancements:

Directory origin enhancement - When processing files using the Last Modified Timestamp read order, the Directory origin now assesses the change timestamp in addition to the last modified timestamp to establish file processing order.
Impersonation enhancement for Control Hub - You can now configure Data Collector to use a partial Control Hub user ID for Hadoop impersonation mode and shell impersonation mode. Use this feature when Data Collector is registered with Control Hub, and when the Hadoop or target operating system has user name requirements that do not allow using the full Control Hub user ID.
NetFlow 9 processing enhancement - When processing NetFlow 9 data, Data Collector now includes FIELD_SENDER and FIELD_RECIPIENT fields to include sender and receiver information.

What's New in 3.1.1.0

Data Collector version 3.1.1.0 includes the following new features and enhancements:

Origins

Directory origin enhancements - The origin includes the following enhancements:
- The Max Files in Directory property has been renamed to Max Files Soft Limit. As the name indicates, the property is now a soft limit rather than a hard limit. As such, if the directory contains more files than the configured Max Files Soft Limit, the origin can temporarily exceed the soft limit and the pipeline can continue running.
  Previously, this property was a hard limit. When the directory contained more files, the pipeline failed.
- The origin includes a new Spooling Period property that determines the number of seconds to continue adding files to the processing queue after the maximum files soft limit has been exceeded.

Destinations

Einstein Analytics destination enhancement - The Append Timestamp to Alias property is now disabled by default for new pipelines. When disabled, the destination can append, delete, overwrite, or upsert data to an existing dataset. When enabled, the destination creates a new dataset for each upload of data.
The property was added in version 3.1.0.0 and was enabled by default. Pipelines upgraded from versions earlier than 3.1.0.0 have the property enabled by default.
Solr destination enhancements - The destination includes the following enhancements:
- The destination now includes an Ignore Optional Fields property that allows ignoring null values in optional fields when writing records.
- The destination allows you to configure Wait Flush, Wait Searcher, and Soft Commit properties to tune write performance.

What's New in 3.1.0.0

Data Collector version 3.1.0.0 includes the following new features and enhancements:

Data Synchronization Solution for Postgres

This release includes a beta version of the Data Synchronization Solution for Postgres. The solution uses the new Postgres Metadata processor to detect drift in incoming data and automatically create or alter corresponding PostgreSQL tables as needed before the data is written. The solution also leverages the JDBC Producer destination to perform the writes.

As a beta feature, use the Data Synchronization Solution for Postgres for development or testing only. Do not use the solution in production environments.

Support for additional databases is planned for future releases. To state a preference, leave a comment on this issue.

Data Collector Edge (SDC Edge)

SDC Edge includes the following enhancements:

Edge pipelines now support the following stages:
- Dev Raw Data Source origin
- Kafka Producer destination
Edge pipelines now support the following functions:
- emptyList()
- emptyMap()
- isEmptyMap()
- isEmptyList()
- length()
- record:attribute()
- record:attributeOrDefault()
- size()
When you start SDC Edge, you can now change the default log directory.

Origins

HTTP Client origin enhancement - You can now configure the origin to use the Link in Response Field pagination type. After processing the current page, this pagination type uses a field in the response body to access the next page.
HTTP Server origin enhancement - You can now use the origin to process the contents of authorized HTTP PUT requests.
Kinesis Consumer origin enhancement - You can now define tags to apply to the DynamoDB lease table that the origin creates to store offsets.
MQTT Subscriber origin enhancement - The origin now includes a TOPIC_HEADER_NAME record header attribute that includes the topic information for each record.
MongoDB origin enhancement - The origin now generates a no-more-data event when it has processed all available documents and the configured batch wait time has elapsed.
Oracle CDC Client origin enhancement - You can now specify the tables to process by using SQL-like syntax in table inclusion patterns and exclusion patterns.
Salesforce origin enhancements - The origin includes the following enhancements:
- The origin can now subscribe to Salesforce platform events.
- You can now configure the origin to use Salesforce PK Chunking.
- When necessary, you can disable query validation.
- You can now use Mutual Authentication to connect to Salesforce.

Processors

New Field Replacer processor - A new processor that replaces values in fields with nulls or with new values.
The Field Replacer processor replaces the Value Replacer processor which has been deprecated. The Field Replacer processor lets you define more complex conditions to replace values. For example, the Field Replacer can replace values which fall within a specified range. The Value Replacer cannot replace values that fall within a specified range.

StreamSets recommends that you update Value Replacer pipelines as soon as possible.
New Postgres Metadata processor - A new processor that determines when changes in data structure occur and creates and alters PostgreSQL tables accordingly. Use as part of the Drift Synchronization Solution for Postgres in development or testing environments only.
Aggregator processor enhancements - The processor includes the following enhancements:
- Event records now include the results of the aggregation.
- You can now specify the root field for event records. You can use a String or Map root field. Upgraded pipelines retain the previous behavior, writing aggregation data to a String root field.
JDBC Lookup processor enhancement - The processor includes the following enhancements:
- You can now configure a Missing Values Behavior property that defines processor behavior when a lookup returns no value. Upgraded pipelines continue to send records with no return value to error.
- You can now enable the Retry on Cache Miss property so that the processor retries lookups for known missing values. By default, the processor always returns the default value for known missing values to avoid unnecessary lookups.
Kudu Lookup processor enhancement - The processor no longer requires that you add a primary key column to the Key Columns Mapping. However, adding only non-primary keys can slow the performance of the lookup.
Salesforce Lookup processor enhancement - You can now use Mutual Authentication to connect to Salesforce.

Destinations

New Aerospike destination - A new destination that writes data to Aerospike.
New Named Pipe destination- A new destination that writes data to a UNIX named pipe.
Einstein Analytics destination enhancements - The destination includes the following enhancements:
- You can specify the name of the edgemart container that contains the dataset.
- You can define the operation to perform: Append, Delete, Overwrite, or Upsert.
- You can now use Mutual Authentication to connect to Salesforce.
Elasticsearch destination enhancement - You can now configure the destination to merge data which performs an update with doc_as_upsert.
Salesforce destination enhancement - The destination includes the following enhancements:
- The destination can now publish Salesforce platform events.
- You can now use Mutual Authentication to connect to Salesforce.

Data Formats

Log data format enhancement - Data Collector can now process data using the following log format types:
- Common Event Format (CEF)
- Log Event Extended Format (LEEF)

Expression Language

Error record functions - This release includes the following new function:
- record:errorStackTrace() - Returns the error stack trace for the record.
Time functions - This release includes the following new functions:
- time:dateTimeZoneOffset() - Returns the time zone offset in milliseconds for the specified date and time zone.
- time:timeZoneOffset() - Returns the time zone offset in milliseconds for the specified time zone.
Miscellaneous functions - This release includes the following changed and new functions:
- runtime:loadResource() - This function has been changed to trim any leading or trailing whitespace characters from the file before returning the value in the file. Previously, the function did not trim white space characters - you had to avoid including unnecessary characters in the file.
- runtime:loadResourceRaw() - New function that returns the value in the specified file, including any leading or trailing whitespace characters in the file.

Additional Stage Libraries

This release includes the following additional stage libraries:

Apache Kudu 1.6
Cloudera 5.13 distribution of Apache Kafka 2.1
Cloudera 5.14 distribution of Apache Kafka 2.1
Cloudera CDH 5.14 distribution of Hadoop
Kinetica 6.1

Miscellaneous

Data Collector classpath validation - Data Collector now performs a classpath health check upon starting up. The results of the health check are written to the Data Collector log. When necessary, you can configure Data Collector to skip the health check or to stop upon errors.
Support bundle Data Collector property - You can configure a property in the Data Collector configuration file to have Data Collector automatically upload support bundles when problems occur. The property is disabled by default.
Redirect registered Data Collector user logins property - You can enable a property in the Control Hub configuration file, dpm.properties, to redirect Data Collector user logins to Control Hub using the HTML meta refresh method.
Runtime properties enhancement - You can now use environment variables in runtime properties.

What's New in 3.0.3.0

Data Collector version 3.0.3.0 includes the following enhancements:

MySQL Binary Log origin enhancement - You can now use a Keep Alive thread to connect to the MySQL server by configuring the Enable KeepAlive Thread and the KeepAlive Interval advanced properties.
By default, the origin uses Keep Alive threads with an interval of one minute. Upgraded pipelines also use the new defaults.
HTTP Client processor enhancement - The processor can now process compressed data.
Scripting processors enhancement - The Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors can use a new boolean sdcFunction.isPreview() method to determine if the pipeline is in preview mode.

What's New in 3.0.2.0

Data Collector version 3.0.2.0 includes the following enhancement:

SFTP/FTP Client origin enhancement - The origin can now generate events when starting and completing processing for a file and when all available files have been processed.

What's New in 3.0.1.0

Data Collector version 3.0.1.0 includes the following enhancements:

Azure IoT/Event Hub Consumer origin enhancement - The Azure Event Hub Consumer origin has been renamed to the Azure IoT/Event Hub Consumer origin.
HTTP Server origin enhancement - The HTTP Server origin now includes path and queryString record header attributes, as well as any other HTTP header attributes included in the request.
MongoDB origins enhancement - Both the MongoDB origin and the MongoDB Oplog origin now support delegated authentication and the BSON data type for binary data.
SQL Server CDC origin enhancement - The SQL Server CDC origin now includes information from the SQL Server CDC __$command_id column in a record header attribute named jdbc. __$command_id.
MongoDB destination enhancement - The MongoDB destination now supports delegated authentication.

What's New in 3.0.0.0

Data Collector version 3.0.0.0 includes the following new features:

Installation

Java requirement - Data Collector now supports both Oracle Java 8 and OpenJDK 8.
RPM packages - StreamSets now provides the following Data Collector RPM packages:
- EL6 - Use to install Data Collector on CentOS 6, Oracle Linux 6, or Red Hat Enterprise Linux 6
- EL7 - Use to install Data Collector on CentOS 7, Oracle Linux 7, or Red Hat Enterprise Linux 7.
Previously, StreamSets provided a single RPM package used to install Data Collector on any of these operating systems.

Edge Pipelines

You can now design and run edge pipelines to read data from or send data to an edge device. Edge pipelines are bidirectional. They can send edge data to other Data Collector pipelines for further processing. Or, they can receive data from other pipelines and then act on that data to control the edge device.

Edge pipelines run in edge execution mode on StreamSets Data Collector Edge (SDC Edge). SDC Edge is a lightweight agent without a UI that runs pipelines on edge devices. Install SDC Edge on each edge device where you want to run edge pipelines.

You design edge pipelines in Data Collector, export the edge pipelines, and then use commands to run the edge pipelines on an SDC Edge installed on an edge device.

Origins

New Amazon SQS Consumer origin - An origin that reads messages from Amazon Simple Queue Service (SQS). Can create multiple threads to enable parallel processing in a multithreaded pipeline.
New Google Cloud Storage origin - An origin that reads fully written objects in Google Cloud Storage.
New MapR DB CDC origin - An origin that reads changed MapR DB data that has been written to MapR Streams. Can create multiple threads to enable parallel processing in a multithreaded pipeline.
New MapR Multitopic Streams Consumer origin - An origin that reads messages from multiple MapR Streams topics. It can create multiple threads to enable parallel processing in a multithreaded pipeline.
New UDP Multithreaded Source origin - The origin listens for UDP messages on one or more ports and queues incoming packets on an intermediate queue for processing. It can create multiple threads to enable parallel processing in a multithreaded pipeline.
New WebSocket Client origin - An origin that reads data from a WebSocket server endpoint.
New Windows Event Log origin - An origin that reads data from Microsoft Windows event logs. You can use this origin only in pipelines configured for edge execution mode.
New Sensor Reader development origin - A development origin that generates sample atmospheric data for edge pipelines.
Amazon S3 origin enhancements:
- The origin now produces no-more-data events and includes a new socket timeout property.
- You can now specify the number of times the origin retries a query. The default is three.
Directory origin enhancement - The origin can now use multiple threads to perform parallel processing of files.
HTTP Client origin enhancement - The origin can now log request and response data to the Data Collector log.
JDBC Multitable Consumer origin enhancements:
- The origin can now use non-incremental processing for tables with no primary key or offset column.
- You can now specify an Init Query to be executed after establishing a connection to the database, before performing other tasks. This can be used, for example, to modify session attributes.
- A new Queries Per Second property determines how many queries can be run every second.
  This property replaces the Query Interval property. For information about possible upgrade impact, see JDBC Multitable Consumer Query Interval Change.
JDBC Query Consumer origin enhancements:
- You can now specify an Init Query to be executed after establishing a connection to the database, before performing other tasks. This can be used, for example, to modify session attributes.
- The Microsoft SQL Server CDC functionality in the JDBC Query Consumer origin is now deprecated and will be removed from the origin in a future release. For upgrade information, see Update JDBC Query Consumer Pipelines used for SQL Server CDC Data.
Kafka Multitopic Consumer origin enhancement - The origin is now available in the following stage libraries, in addition to the Apache Kafka 0.10 stage library:
- Apache Kafka 0.9
- CDH Kafka 2.0 (0.9.0) and 2.1 (0.9.0)
- HDP 2.5 and 2.6
Kinesis Consumer origin enhancement - You can now specify the number of times the origin retries a query. The default is three.
Oracle CDC Client origin enhancements:
- When using SCNs for the initial change, the origin now treats the specified SCN as a starting point rather than looking for an exact match.
- The origin now passes raw data to the pipeline as a byte array.
- The origin can now include unparsed strings from the parsed SQL query for unsupported data types in records.
- The origin now uses local buffering instead of Oracle LogMiner buffering by default. Upgraded pipelines require no changes.
- The origin now supports reading the Timestamp with Timezone data type. When reading Timestamp with Timezone data, the origin includes the offset with the datetime data in the Data Collector Zoned Datetime data type. It does not include the time zone ID.
SQL Server CDC Client origin enhancements - You can now perform the following tasks with the SQL Server CDC Client origin:
- Process CDC tables that appear after the pipeline starts.
- Check for schema changes and generate events when they are found.
- In addition, a new Capture Instance Name property replaces the Schema and Table Name Pattern properties from earlier releases.
  
  You can simply use the schema name and table name pattern for the capture instance name. Or, you can specify the schema name and a capture instance name pattern, which allows you to specify specific CDC tables to process when you have multiple CDC tables for a single data table.
  
  Upgraded pipelines require no changes.
UDP Source origin enhancement - The Enable Multithreading property that enabled using multiple epoll receiver threads is now named Use Native Transports (epoll).

Processors

New Aggregator processor - A processor that aggregates data within a window of time. Displays the results in Monitor mode and can write the results to events.
New Delay processor - A processor that can delay processing a batch of records for a specified amount of time.
Field Type Converter processor enhancement - You can now convert strings to the Zoned Datetime data type, and vice versa. You can also specify the format to use.
Hive Metadata processor enhancement - You can now configure additional JDBC configuration properties to pass to the JDBC driver.
HTTP Client processor enhancements:
- The processor can now log request and response data to the Data Collector log.
- The Rate Limit now defines the minimum amount of time between requests in milliseconds. Previously, it defined the time between requests in seconds. Upgraded pipelines require no changes.
JDBC Lookup and JDBC Tee processor enhancements - You can now specify an Init Query to be executed after establishing a connection to the database, before performing other tasks. This can be used, for example, to modify session attributes.
Kudu Lookup processor enhancement - The Cache Kudu Table property is now named Enable Table Caching. The Maximum Entries to Cache Table Objects property is now named Maximum Table Entries to Cache.
Salesforce Lookup processor enhancement - You can use a new Retrieve lookup mode to look up data for a set of records instead of record-by-record. The mode provided in previous releases is now named SOQL Query. Upgraded pipelines require no changes.

Destinations

New Google Cloud Storage destination - A new destination that writes data to objects in Google Cloud Storage. The destination can generate events for use as dataflow triggers.
New KineticaDB destination - A new destination that writes data to a Kinetica table.
Amazon S3 destination enhancement - You can now specify the number of times the destination retries a query. The default is three.
Hive Metastore destination enhancement - You can now configure additional JDBC configuration properties to pass to the JDBC driver.
HTTP Client destination enhancements:
- The destination can now log request and response data to the Data Collector log.
- You can now use the HTTP Client destination to write Avro, Delimited, and Protobuf data in addition to the previous data formats.
JDBC Producer destination enhancement - You can now specify an Init Query to be executed after establishing a connection to the database, before performing other tasks. This can be used, for example, to modify session attributes.
Kudu destination enhancement - If the destination receives a change data capture log from the following source systems, you now must specify the source system in the Change Log Format property so that the destination can determine the format of the log: Microsoft SQL Server, Oracle CDC Client, MySQL Binary Log, or MongoDB Oplog.
MapR DB JSON destination enhancement - The destination now supports writing to MapR DB based on the CRUD operation in record header attributes and the Insert API and Set API properties.
MongoDB destination enhancements - With this release, the Upsert operation is no longer supported by the destination. Instead, the destination includes the following enhancements:
- Support for the Replace and Update operations.
- Support for an Upsert flag that, when enabled, is used with both the Replace and Update operations.
For information about upgrading existing upsert pipelines, see Update MongoDB Destination Upsert Pipelines.
Redis destination enhancement - The destination now supports processing data using CRUD operations stored in record header attributes.
Salesforce destination enhancement - When using the Salesforce Bulk API to update, insert, or upsert data, you can now use a colon (:) or period (.) as a field separator when defining the Salesforce field to map the Data Collector field to. For example, Parent__r:External_Id__c or Parent__r.External_Id__c are both valid Salesforce fields.
Wave Analytics destination rename - With this release, the Wave Analytics destination is now named the Einstein Analytics destination, following the recent Salesforce rebranding. All of the properties and functionality of the destination remain the same.

Executor

Hive Query executor enhancement - You can now configure additional JDBC configuration properties to pass to the JDBC driver.
JDBC Query executor enhancement - You can now specify an Init Query to be executed after establishing a connection to the database, before performing other tasks. This can be used, for example, to modify session attributes.

Cloudera Navigator

Cloudera Navigator integration is now released as part of the StreamSets Commercial Subscription. The beta version included in earlier releases is no longer available with Data Collector. For information about the StreamSets Commercial Subscription, contact us.

For information about upgrading a version of Data Collector with Cloudera Navigator integration enabled, see Disable Cloudera Navigator Integration.

Credential Stores

CyberArk - Data Collector now provides a credential store implementation for CyberArk Application Identity Manager. You can define the credentials required by external systems - user names or passwords - in CyberArk. Then you use credential expression language functions in JDBC stage properties to retrieve those values, instead of directly entering credential values in stage properties.
Supported stages - You can now use the credential functions in all stages that require you to enter sensitive information. Previously, you could only use the credential functions in JDBC stages.

Data Collector Configuration

By default when Data Collector restarts, it automatically restarts all pipelines that were running before Data Collector shut down. You can now disable the automatic restart of pipelines by configuring the runner.boot.pipeline.restart property in the $SDC_CONF/sdc.properties file.

Dataflow Performance Manager / StreamSets Control Hub

StreamSets Control Hub - With this release, we have created a new product called StreamSets Control Hub^TM that includes a number of new cloud-based dataflow design, deployment, and scale-up features. Since this release is now our core service for controlling dataflows, we have renamed the StreamSets cloud experience from "Dataflow Performance Manager (DPM)" to "StreamSets Control Hub".
DPM now refers to the performance management functions that reside in the cloud such as live metrics and data SLAs. Customers who have purchased the StreamSets Enterprise Edition will gain access to all Control Hub functionality and continue to have access to all DPM functionality as before.

To understand the end-to-end StreamSets Data Operations Platform and how the products fit together, visit https://streamsets.com/products/.
Aggregated statistics - When working with Control Hub, you can now configure a pipeline to write aggregated statistics to MapR Streams.

Data Formats

New NetFlow 9 support - Data Collector now supports processing NetFlow 9 template-based messages. Stages that previously processed NetFlow 5 data can now process NetFlow 9 data as well.
Datagram data format enhancement - The Datagram Data Format property is now named the Datagram Packet Format.
Delimited data format enhancement - Data Collector can now process data using the Postgres CSV and Postgres Text delimited format types.

Expression Language

New field path expressions - You can use field path expressions in certain stages to specify the fields to use in an expression.
New field functions - You can use the following new field functions in field path expressions:
- f:attribute() - Returns the value of the specified attribute.
- f:path() - Returns the path of a field.
- f:type() - Returns the data type of a field.
- f:value() - Returns the value of a field.
New string functions - The release includes the following new functions:
- str:isNullOrEmpty() - Returns true or false based on whether a string is null or is the empty string.
- str:splitKV() - Splits key-value pairs in a string into a map of string values.

Stage Libraries

New stage libraries - This release includes the following new stage libraries:
- Apache Kafka 1.0
- Apache Kafka 0.11
- Apache Kudu 1.5
- Cloudera CDH 5.13
- Cloudera Kafka 3.0.0 (0.11.0)
- Hortonworks 2.6.1, including Hive 1.2
- Hortonworks 2.6.2, including Hive 1.2 and 2.0
- MapR version 6.0 (MEP 4)
- MapR Spark 2.1 (MEP 3)
Legacy stage libraries - Stage libraries that are more than two years old are no longer included with Data Collector. Though not recommended, you can still download and install the older stage libraries as custom stage libraries.
If you have pipelines that use these legacy stage libraries, you will need to update the pipelines to use a more current stage library or install the legacy stage library manually, For more information see Update Pipelines using Legacy Stage Libraries.
Statistics stage library enhancement - The statistics stage library is now included in the core Data Collector installation.

Miscellaneous

New data type - Data Collector now supports the Zoned Datetime data type.
New Data Collector metrics - JVM metrics have been renamed Data Collector Metrics and now include general Data Collector metrics in addition to JVM metrics. The JVM Metrics menu item has also been renamed SDC Metrics.
Pipeline error records - You can now write error records to Google Pub/Sub, Google Cloud Storage, or an MQTT broker.
Snapshot enhancements:
- Standalone pipelines can now automatically take a failure snapshot when the pipeline fails due to a data-related exception.
- You can now download snapshots through the UI and the REST API.
Time zone enhancement - Time zones have been organized and updated to use JDK 8 names. This should make it easier to select time zones in stage properties. In the rare case that your pipeline uses a format not supported by JDK 8, edit the pipeline to select a compatible time zone.

What's New in 2.7.2.0

Data Collector version 2.7.2.0 includes the following new features:

Origins

New Kafka Multitopic Consumer origin - A new origin that reads messages from multiple Kafka topics. Creates multiple threads to enable parallel processing in a multithreaded pipeline.
Kinesis Consumer origin enhancement - You can now configure the origin to start reading messages from a specified timestamp.

Destinations

New Google BigQuery destination - A new destination that streams data into Google BigQuery.

What's New in 2.7.1.1

Data Collector version 2.7.1.1 includes the following new features and enhancements:

Origins

Amazon S3 origin enhancement - You can now specify a Connection Timeout advanced property.
JDBC Multitable Consumer origin enhancement - You can now use the origin to read from views in addition to tables.
OPC UA Client origin enhancement - You can now configure channel properties, such as the maximum chunk or message size.
Oracle CDC Client origin enhancement - You can now configure a JDBC Fetch Size property to determine the minimum number of records that the origin waits for before passing data to the pipeline. When writing to the destination is slow, use the default of 1 record to improve performance. Previously, the origin used the Oracle JDBC driver default of 10 records.

Executor

New MapR FS File Metadata executor - The new executor can change file metadata, create an empty file, or remove a file or directory in MapR each time it receives an event.

What's New in 2.7.1.0

Data Collector version 2.7.1.0 includes the following new features and enhancements:

Stage Libraries

Data Collector includes the following upgraded stage library:

Jython 2.7.1

Origins

New Azure Event Hub Consumer origin - A multithreaded origin that reads data from Microsoft Azure Event Hub.
OPC UA Client origin enhancement - You can now specify node information in a file. Or have the origin browse for nodes to use based on a specified root node.

Processors

New Schema Generator processor - A processor that generates a schema for each record and writes the schema to a record header attribute.

Destinations

New Azure Event Hub Producer destination - A destination that writes data to Microsoft Azure Event Hub.
New Azure IoT Hub Producer destination - A destination that writes data to Microsoft Azure IoT Hub.

What's New in 2.7.0.0

Data Collector version 2.7.0.0 includes the following new features and enhancements:

Credential Stores

Data Collector now has a credential store API that integrates with the following credential store systems:

Java keystore
Hashicorp Vault

You define the credentials required by external systems - user names, passwords, or access keys - in a Java keystore file or in Vault. Then you use credential expression language functions in JDBC stage properties to retrieve those values, instead of directly entering credential values in stage properties.

The following JDBC stages can use the new credential functions:

JDBC Multitable Consumer origin
JDBC Query Consumer origin
Oracle CDC Client origin
SQL Server CDC Client origin
SQL Server Change Tracking origin
JDBC Lookup processor
JDBC Tee processor
JDBC Producer destination
JDBC Query executor

Publish Pipeline Metadata to Cloudera Navigator (Beta)

Data Collector now provides beta support for publishing metadata about running pipelines to Cloudera Navigator. You can then use Cloudera Navigator to explore the pipeline metadata, including viewing lineage diagrams of the metadata.

Feel free to try out this feature in a development or test Data Collector, and send us your feedback. We are continuing to refine metadata publishing as we gather input from the community and work with Cloudera.

Stage Libraries

Data Collector includes the following new stage libraries:

Apache Kudu version 1.4.0
Cloudera CDH version 5.12 distribution of Hadoop
Cloudera version 5.12 distribution of Apache Kafka 2.1
Google Cloud - Includes the Google BigQuery origin, Google Pub/Sub Subscriber origin, and Google Pub/Sub Publisher destination.
Java keystore credential store - For use with credential stores.
Vault credential store - For use with credential stores.

Data Collector Configuration

Access Hashicorp Vault secrets - The Data Collector Vault integration now relies on Vault's App Role authentication backend. Previously, Data Collector relied on Vault's App ID authentication backend. Hashicorp has deprecated the App ID authentication backend.
New Hadoop user impersonation property - When you enable Data Collector to impersonate the current Data Collector user when writing to Hadoop, you can now also configure Data Collector to make the username lowercase. This can be helpful with case-sensitive implementations of LDAP.
New Java security properties - The Data Collector configuration file now includes properties with a "java.security." prefix, which you can use to configure Java security properties.
New property to define the amount of time to cache DNS lookups - By default, the java.security.networkaddress.cache.ttl property is set to 0 so that the JVM uses the Domain Name Service (DNS) time to live value, instead of caching the lookups for the lifetime of the JVM.
SDC_HEAPDUMP_PATH enhancement - The new default file name, $SDC_LOG/sdc_heapdump_${timestamp}.hprof, includes a timestamp so you can write multiple heap dump files to the specified directory.

Dataflow Triggers

Pipeline events - The event framework now generates pipeline lifecycle events when the pipeline stops and starts. You can pass each pipeline event to an executor or to another pipeline for more complex processing. Use pipeline events to trigger tasks before pipeline processing begins or after it stops.

Origins

New Google BigQuery origin - An origin that executes a query job and reads the result from Google BigQuery.
New Google Pub/Sub Subscriber origin - A multithreaded origin that consumes messages from a Google Pub/Sub subscription.
New OPC UA Client origin - An origin that processes data from an OPC UA server.
New SQL Server CDC Client origin - A multithreaded origin that reads data from Microsoft SQL Server CDC tables.
New SQL Server Change Tracking origin - A multithreaded origin that reads data from Microsoft SQL Server change tracking tables and generates the latest version of each record.
Directory origin event enhancements - The Directory origin can now generate no-more-data events when it completes processing all available files and the batch wait time has elapsed without the arrival of new files. Also, the File Finished event now includes the number of records and files processed.
Hadoop FS origin enhancement - The Hadoop FS origin now allows you to read data from other file systems using the Hadoop FileSystem interface. Use the Hadoop FS origin in cluster batch pipelines.
HTTP Client origin enhancement - The HTTP Client origin now allows time functions and datetime variables in the request body. It also allows you to specify the time zone to use when evaluating the request body.
HTTP Server origin enhancement - The HTTP Server origin can now process Avro files.
JDBC Query Consumer origin enhancement - You can now configure the behavior for the origin when it encounters data of an unknown data type.
JDBC Multitable Consumer origin enhancements:
- You can now use the origin to perform multithreaded processing of partitions within a table. Use partition processing to handle even larger volumes of data. This enhancement also includes new JDBC header attributes.
  By default, all new pipelines use partition processing when possible. Upgraded pipelines use multithreaded table processing to preserve previous behavior.
- You can now configure the behavior for the origin when it encounters data of an unknown data type.
Oracle CDC Client origin enhancements:
- The origin can now buffer data locally rather than utilizing Oracle LogMiner buffers.
- You can now specify the behavior when the origin encounters an unsupported field type - send to the pipeline, send to error, or discard.
- You can configure the origin to include null values passed from the LogMiner full supplemental logging. By default, the origin ignores null values.
- You now must select the target server time zone for the origin.
- You can now configure a query timeout for the origin.
- The origin now includes the row ID in the oracle.cdc.rowId record header attribute and can include the LogMiner redo query in the oracle.cdc.query record header attribute.
RabbitMQ Consumer origin enhancement - When available, the origin now provides attributes generated by RabbitMQ, such as contentType, contentEncoding, and deliveryMode, as record header attributes.
TCP Server origin enhancement - The origin can now process character-based data that includes a length prefix.
UDP Source origin enhancement - The origin can now process binary and character-based raw data.
New last-modified time record header attribute - Directory, File Tail, and SFTP/FTP Client origins now include the last modified time for the originating file for a record in an mtime record header attribute.

Processors

New Data Parser processor - Use the new processor to extract NetFlow or syslog messages as well as other supported data formats that are embedded in a field.
New JSON Generator processor - Use the new processor to serialize data from a record field to a JSON-encoded string.
New Kudu Lookup processor - Use the new processor to perform lookups in Kudu to enrich records with additional data.
Hive Metadata processor enhancement - You can now configure custom record header attributes for metadata records.

Destinations

New Google Pub/Sub Publisher destination - A destination that publishes messages to Google Pub/Sub.
New JMS Producer destination - A destination that writes data to JMS.
Amazon S3 destination enhancements:
- You can now use expressions in the Bucket property for the Amazon S3 destination. This enables you to write records dynamically based expression evaluation.
- The Amazon S3 object written event record now includes the number of records written to the object.
Azure Data Lake Store destination enhancement - The Client ID and Client Key properties have been renamed Application ID and Application Key to align with the updated property names in the new Azure portal.
Cassandra destination enhancement - The destination now supports Kerberos authentication if you have installed the DataStax Enterprise Java driver.
Elasticsearch destination enhancement - The destination can now create parent-child relationships between documents in the same index.
Hive Metastore destination - You can now configure the destination to create custom record header attributes.
Kafka Producer destination enhancement - The destination can now write XML documents.
Solr destination enhancement - You can now configure the destination to skip connection validation when the Solr configuration file, solrconfig.xml, does not define the default search field (“df”) parameter.

Executors

New Amazon S3 executor - Use the Amazon S3 executor to create new Amazon S3 objects for the specified content or add tags to existing objects each time it receives an event.
HDFS File Metadata executor enhancement - The executor can now remove a file or directory when it receives an event.

Dataflow Performance Manager

Revert changes to published pipelines - If you update a published pipeline but decide not to publish the updates to DPM as a new version, you can revert the changes made to the pipeline configuration.

Pipelines

Pipeline error handling enhancements:
- Use the new Error Record Policy to specify the version of the record to include in error records.
- You can now write error records to Amazon Kinesis Streams.
Error records enhancement - Error records now include the user-defined stage label in the errorStageLabel header attribute.
Pipeline state enhancements - Pipelines can now display the following new states: STARTING_ERROR, STOPPING_ERROR, and STOP_ERROR.

Data Formats

Writing XML - You can now use the Google Pub/Sub Publisher, JMS Producer, and Kafka Producer destinations to write XML documents to destination systems. Note the record structure requirement before you use this data format.
Avro:
- Origins now write the Avro schema to an avroSchema record header attribute.
- Origins now include precision and scale field attributes for every Decimal field.
- Data Collector now supports the time-based logical types added to Avro in version 1.8.
Delimited - Data Collector can now continue processing records with delimited data when a row has more fields than the header. Previously, rows with more fields than the header were sent to error.

Cluster Pipelines

This release includes the following Cluster Yarn Streaming enhancements:

Use a new Worker Count property to limit the number of worker nodes used in Cluster Yarn Streaming pipelines. By default, a Data Collector worker is spawned for each partition in the topic.
You can now define Spark configuration properties to pass to the spark-submit script.

Expression Language

This release includes the following new functions:

credential:get() - Returns credential values from a credential store.
credential:getWithOptions() - Returns credential values from a credential store using additional options to communicate with the credential store.
record:errorStageLabel() - Returns the user-defined name of the stage that generated the error record.
list:join() - Merges elements in a List field into a String field, using the specified separator between elements.
list:joinSkipNulls() - Merges elements in a List field into a String field, using the specified separator between elements and skipping null values.

str:indexOf() - Returns the index within a string of the first occurrence of the specified subset of characters.

Miscellaneous

Global bulk edit mode - In any property where you would previously click an Add icon to add additional configurations, you can now switch to bulk edit mode to enter a list of configurations in JSON format.
Snapshot enhancement - Snapshots no longer produce empty batches when waiting for data.
Webhooks enhancement - You can use several new pipeline state notification parameters in webhooks.

What's New in 2.6.0.1

Data Collector version 2.6.0.1 includes the following enhancement:

Kinesis Consumer origin - You can now reset the origin for Kinesis Consumer pipelines. Resetting the origin for Kinesis Consumer differs from other origins, so please note the requirement and guidelines.

What's New in 2.6.0.0

Data Collector version 2.6.0.0 includes the following new features and enhancements:

Installation

MapR prerequisites - You can now run the setup-mapr command in interactive or non-interactive mode. In interactive mode, the command prompts you for the MapR version and home directory. In non-interactive mode, you define the MapR version and home directory in environment variables before running the command.

Stage Libraries

Data Collector now supports the following stage libraries:

Hortonworks version 2.6 distribution of Apache Hadoop
Cloudera distribution of Spark 2.1
MapR distribution of Spark 2.1

Data Collector Configuration

New buffer size configuration - You can now use a new parser.limit configuration property to increase the Data Collector parser buffer size. The parser buffer is used by the origin to process many data formats, including Delimited, JSON, and XML. The parser buffer size limits the size of the records that origins can process. The Data Collector parser buffer size is 1048576 bytes by default.

Drift Synchronization Solution for Hive

Parquet support - You can now use the Drift Synchronization Solution for Hive to generate Parquet files. Previously, the Data Synchronization Solution supported only Avro data. This enhancement includes the following updates:
- Hive Metadata processor data format property - Use the new data format property to indicate the data format to use.
- Parquet support in the Hive Metastore destination - The destination can now create and update Parquet tables in Hive. The destination no longer includes a data format property since that information is now configured in the Hive Metadata processor.

See the documentation for implementation details and a Parquet case study.

Multithreaded Pipelines

The multithreaded framework includes the following enhancements:

Origins for multithreaded pipelines - You can now use the following origins to create multithreaded pipelines:
- CoAP Server origin
- TCP Server origin
Multithreaded origin icons - The icons for multithreaded origins now include the following multithreaded indicator:
For example, here’s the updated Elasticsearch origin icon:

Dataflow Triggers / Event Framework

New executors - You can now use the following executors to perform tasks upon receiving an event:
- Email executor
- Shell executor

Dataflow Performance Manager (DPM)

Pipeline statistics - You can now configure a pipeline to write statistics directly to DPM. Write statistics directly to DPM when you run a job for the pipeline on a single Data Collector.
When you run a job on multiple Data Collectors, a remote pipeline instance runs on each of the Data Collectors. To view aggregated statistics for the job within DPM, you must configure the pipeline to write the statistics to a Kafka cluster, Amazon Kinesis Streams, or SDC RPC.
Update published pipelines - When you update a published pipeline, Data Collector now displays a red asterisk next to the pipeline name to indicate that the pipeline has been updated since it was last published.

Origins

New CoAP Server origin - An origin that listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. The origin performs parallel processing and can generate multithreaded pipelines.
New TCP Server origin - An origin that listens at the specified ports, establishes TCP sessions with clients that initiate TCP connections, and then processes the incoming data. The origin can process NetFlow, syslog, and most Data Collector data formats as separated records. You can configure custom acknowledgement messages and use a new batchSize variable, as well as other expressions, in the messages.
SFTP/FTP Client origin enhancement - You can now specify the first file to process. This enables you to skip processing files with earlier timestamps.

Processors

Groovy, JavaScript, and Jython Evaluator processor enhancements:
- You can now include some methods of the sdcFunctions scripting object in the initialization and destroy scripts for the processors.
- You can now use runtime parameters in the code developed for a Groovy Evaluator processor.
Hive Metadata processor enhancements:
- The Hive Metadata processor can now process Parquet data as part of the Drift Synchronization Solution for Hive.
- You can now specify the data format to use: Avro or Parquet.
- You can now configure an expression that defines comments for generated columns.
JDBC Lookup processor enhancements:
- The JDBC Lookup processor can now return multiple values. You can now configure the lookup to return the first value or to return all matches as separate records.
- When you monitor a pipeline that includes the JDBC Lookup processor, you can now view stage statistics about the number of queries the processor makes and the average time of the queries.
Spark Evaluator enhancement - The Spark Evaluator now supports Spark 2.x.

Destinations

New CoAP Client destination - A destination that writes to a CoAP endpoint.
Hive Metastore destination enhancements:
- The destination can now create and update Parquet tables in Hive.
- Also, the data format property has been removed. You now specify the data format in the Hive Metadata processor.
  Since the Hive Metastore previously supported only Avro data, there is no upgrade impact.
Kudu destination enhancement - You can use the new Mutation Buffer Space property to set the buffer size that the Kudu client uses to write each batch.

Executors

New Email executor - Use to send custom emails upon receiving an event. For a case study, see Case Study: Sending Email.
- New Shell executor - Use to execute shell scripts upon receiving an event.
- JDBC Query executor enhancement - A new Batch Commit property allows the executor to commit to the database after each batch. Previously, the executor did not call commits by default.
  
  For new pipelines, the property is enabled by default. For upgraded pipelines, the property is disabled to prevent changes in pipeline behavior.
- Spark executor enhancement - The executor now supports Spark 2.x.

REST API / Command Line Interface

Offset management - Both the REST API and command line interface can now retrieve the last-saved offset for a pipeline and set the offset for a pipeline when it is not running. Use these commands to implement pipeline failover using an external storage system. Otherwise, pipeline offsets are managed by Data Collector and there is no need to update the offsets.

Expression Language

vault:read enhancement - The vault:read function now supports returning the value for a key nested in a map.

General

Support bundles - You can now use Data Collector to generate a support bundle. A support bundle is a ZIP file that includes Data Collector logs, environment and configuration information, pipeline JSON files, resource files, and pipeline snapshots.
You upload the generated file to the StreamSets support team so that we can use the information to troubleshoot your support tickets.
TLS property enhancements - Stages that support SSL/TLS now provide the following enhanced set of properties that enable more specific configuration:
- Keystore and truststore type - You can now choose between Java Keystore (JKS) and PKCS #12 (p-12). Previously, Data Collector only supported JKS.
- Transport protocols - You can now specify the transport protocols that you want to allow. By default, Data Collector allows only TLSv1.2.
- Cipher suites - You can now specify the cipher suites to allow. Data Collector provides a modern set of default cipher suites. Previously, Data Collector always allowed the default cipher suites for the JRE.
To avoid upgrade impact, all SSL/TLS/HTTPS properties in existing pipelines are preserved during upgrade.
Cluster mode enhancement - Cluster streaming mode now supports Spark 2.x. For information about using Spark 2.x stages with cluster mode, see Cluster Pipeline Limitations.
Precondition enhancement - Stages with user-defined preconditions now process all preconditions before passing a record to error handling. This allows error records to include all precondition failures in the error message.
Pipeline import/export enhancement - When you export multiple pipelines, Data Collector now includes all pipelines in a single zip file. You can also import multiple pipelines from a single zip file.

What's New in 2.5.1.0

Data Collector version 2.5.1.0 includes the following enhancement:

New stage library - Data Collector now supports the Cloudera CDH version 5.11 distribution of Hadoop and the Cloudera version 5.11 distribution of Apache Kafka 2.1.
Upgrading to this version can require updating existing pipelines. For details, see Working with Cloudera CDH 5.11 or Later.

What's New in 2.5.0.0

Data Collector version 2.5.0.0 includes the following new features and enhancements:

Multithreaded Pipelines

The multithreaded framework includes the following enhancements:

Origins for multithreaded pipelines - You can now use the following origins to create multithreaded pipelines:
- Elasticsearch origin
- JDBC Multitable Consumer origin
- Kinesis Consumer origin
- WebSocket Server origin
Maximum pipeline runners - You can now configure a maximum number of pipeline runners to use in a pipeline. Previously, Data Collector generated pipeline runners based on the number of threads created by the origin. This allows you to tune performance and resource usage. By default, Data Collector still generates runners based on the number of threads that the origin uses.
Record Deduplicator processor enhancement - The processor can now deduplicate records across all pipeline runners in a multithreaded pipeline.
Pipeline validation enhancement - The pipeline now displays duplicate errors generated by using multiple threads as one error message.
Log enhancement - Multithreaded pipelines now include the runner ID in log information.
Monitoring - Monitoring now displays a histogram of available pipeline runners, replacing the information previously included in the Runtime Statistics list.

Pipelines

Data Collector pipeline permissions change - With this release, pipeline permissions are no longer enabled by default. To enable pipeline permissions, edit the pipeline.access.control.enabled Data Collector configuration property.
Stop pipeline execution - You can configure pipelines to transfer data and automatically stop execution based on an event such as reaching the end of a table. The JDBC and Salesforce origins can generate events when they reach the end of available data that the Pipeline Finisher uses to stop the pipeline. Click here for a case study.
Pipeline runtime parameters - You can now define runtime parameters when you configure a pipeline, and then call the parameters from within that pipeline. When you start the pipeline from the user interface, the command line, or the REST API, you specify the values to use for those parameters. Use pipeline parameters to represent any stage or pipeline property with a value that must change for each pipeline run - such as batch sizes and timeouts, directories, or URI.
In previous versions, pipeline runtime parameters were named pipeline constants. You defined the constant values in the pipeline, and could not pass different values when you started the pipeline.
Pipeline ID enhancement - Data Collector now prefixes the pipeline ID with the alphanumeric characters entered for the pipeline title. For example, if you enter “Oracle To HDFS” as the pipeline title, then the pipeline ID has the following value: OracleToHDFStad9f592-5f02-4695-bb10-127b2e41561c.
Webhooks for pipeline state changes and alerts - You can now configure pipeline state changes and metric and data alerts to call webhooks in addition to sending email. For example, you can configure an incoming webhook in Slack so that an alert can be posted to a Slack channel. Or, you can configure a webhook to start another pipeline when the pipeline state is changed to Finished or Stopped.
Force a pipeline to stop from the command line - If a pipeline remains in a Stopping state, you can now use the command line to force stop the pipeline immediately.

Stage Libraries

Data Collector now supports the Apache Kudu version 1.3.x. stage library.

Salesforce Stages

The following Salesforce stages include several enhancements:

Salesforce origin and Salesforce Lookup processor
- The origin and processor can use a proxy to connect to Salesforce.
- You can now specify SELECT * FROM <object> in a SOQL query. The origin or processor expands * to all fields in the Salesforce object that are accessible to the configured user.
- The origin and processor generate Salesforce field attributes that provide additional information about each field, such as the data type of the Salesforce field.
- The origin and processor can now additionally retrieve deleted records from the Salesforce recycle bin.
- The origin can now generate events when it completes processing all available data.
Salesforce destination - The destination can now use a CRUD operation record header attribute to indicate the operation to perform for each record. You can also configure the destination to use a proxy to connect to Salesforce.
Wave Analytics destination - You can now configure the authentication endpoint and the API version that the destination uses to connect to Salesforce Wave Analytics. You can also configure the destination to use a proxy to connect to Salesforce.

Origins

New Elasticsearch origin - An origin that reads data from an Elasticsearch cluster. The origin uses the Elasticsearch scroll API to read documents using a user-defined Elasticsearch query. The origin performs parallel processing and can generate multithreaded pipelines.
New MQTT Subscriber origin - An origin that subscribes to a topic on an MQTT broker to read messages from the broker.
New WebSocket Server origin - An origin that listens on a WebSocket endpoint and processes the contents of all authorized WebSocket requests. The origin performs parallel processing and can generate multithreaded pipelines.
Dev Data Generator origin enhancement - When you configure the origin to generate events to test event handling functionality, you can now specify the event type to use.
HTTP Client origin enhancements - When using pagination, the origin can include all response fields in the resulting record in addition to the fields in the specified result field path. The origin can now also process the following new data formats: Binary, Delimited, Log, and SDC Record.
HTTP Server origin enhancement - The origin requires that HTTP clients include the application ID in all requests. You can now configure HTTP clients to send data to a URL that includes the application ID in a query parameter, rather than including the application ID in request headers.
JDBC Multitable Consumer origin enhancements - The origin now performs parallel processing and can generate multithreaded pipelines. The origin can generate events when it completes processing all available data.
You can also configure the quote character to use around table, schema, and column names in the query. And you can configure the number of times a thread tries to read a batch of data after receiving an SQL error.
JDBC Query Consumer origin enhancements - The origin can now generate events when it completes processing all available data, and when it successfully completes or fails to complete a query.
To handle transient connection or network errors, you can now specify how many times the origin should retry a query before stopping the pipeline.
Kinesis Consumer origin enhancement - The origin now performs parallel processing and can generate multithreaded pipelines.
MongoDB origin and MongoDB Oplog origin enhancements - The origins can now use LDAP authentication in addition to username/password authentication to connect to MongoDB. You can also now include credentials in the MongoDB connection string.

Processors

New Field Order processor - A processor that orders fields in a map or list-map field and outputs the fields into a list-map or list root field.
Field Flattener enhancement - You can now flatten a field in place to raise it to the parent level.
Groovy, JavaScript, and Jython Evaluator processor enhancement - You can now develop an initialization script that the processor runs once when the pipeline starts. Use an initialization script to set up connections or resources required by the processor.
You can also develop a destroy script that the processor runs once when the pipeline stops. Use a destroy script to close any connections or resources opened by the processor.
JDBC Lookup enhancement - Default value date formats. When the default value data type is Date, use the following format: yyyy/MM/dd . When the default value data type is Datetime, use the following format: yyyy/MM/dd HH:mm:ss.
Record Deduplicator processor enhancement - The processor can now deduplicate records across all pipeline runners in a multithreaded pipeline.
Spark Evaluator processor enhancements - The processor is now included in the MapR 5.2 stage library.
The processor also now provides beta support of cluster mode pipelines. In a development or test environment, you can use the processor in pipelines that process data from a Kafka or MapR cluster in cluster streaming mode. Do not use the Spark Evaluator processor in cluster mode pipelines in a production environment.

Destinations

New HTTP Client destination - A destination that writes to an HTTP endpoint.
New MQTT Publisher destination - A destination that publishes messages to a topic on an MQTT broker.
New WebSocket Client destination - A destination that writes to a WebSocket endpoint.
Azure Data Lake Store destination enhancement - You can now configure an idle timeout for output files.
Cassandra destination enhancements - The destination now supports the Cassandra uuid and timeuuid data types. And you can now specify the Cassandra batch type to use: Logged or Unlogged. Previously, the destination used the Logged batch type.
JDBC Producer enhancements - The origin now includes a Schema Name property for entering the schema name. For information about possible upgrade impact, see Configure JDBC Producer Schema Names.
You can also use the Enclose Object Name property to enclose the database/schema, table, and column names in quotation marks when writing to the database.
MapR DB JSON destination enhancement - You can now enter an expression that evaluates to the name of the MapR DB JSON table to write to.
MongoDB destination enhancements - The destination can now use LDAP authentication in addition to username/password authentication to connect to MongoDB. You can also now include credentials in the MongoDB connection string.
SDC RPC destination enhancements - The Back Off Period value that you enter now increases exponentially after each retry, until it reaches the maximum wait time of 5 minutes. Previously, there was no limit to the maximum wait time. The maximum value for the Retries per Batch property is now unlimited - previously it was 10 retries.
Solr destination enhancement - You can now configure the action that the destination takes when it encounters missing fields in the record. The destination can discard the fields, send the record to error, or stop the pipeline.

Executors

New Spark executor - The executor starts a Spark application on a YARN or Databricks cluster each time it receives an event.
New Pipeline Finisher executor - The executor stops the pipeline and transitions it to a Finished state when it receives an event. Can be used with the JDBC Query Consumer, JDBC Multitable Consumer, and Salesforce origins to perform batch processing of available data.
HDFS File Metadata executor enhancement - The executor can now create an empty file upon receiving an event. The executor can also generate a file-created event when generating events.
MapReduce executor enhancement - When starting the provided Avro to Parquet job, the executor can now overwrite any temporary files created from a previous run of the job.

Functions

New escape XML functions - Three new string functions enable you to escape and unescape XML.
New pipeline user function - A new pipeline user function enables you to determine the user who started the pipeline.
New function to generate UUIDs - A new function that enables you generate UUIDs.
New function returns the number of available processors - The runtime:availableProcessors() function returns the number of processors available to the Java virtual machine.

General Enhancements

Data Collector Hadoop impersonation enhancement - You can use the stage.conf_hadoop.always.impersonate.current.user Data Collector configuration property to ensure that Data Collector uses the current Data Collector user to read from or write to Hadoop systems.
When enabled, you cannot configure alternate users in the following Hadoop-related stages:
- Hadoop FS origin and destination
- MapR FS origin and destination
- HBase lookup and destination
- MapR DB destination
- HDFS File Metadata executor
- Map Reduce executor
Stage precondition property enhancement - Records that do not meet all preconditions for a stage are now processed based on error handling configured in the stage. Previously, they were processed based on error handling configured for the pipeline. See Evaluate Precondition Error Handling for information about upgrading.
XML parsing enhancements - You can include field XPath expressions and namespaces in the record with the Include Field XPaths property. And use the new Output Field Attributes property to write XML attributes and namespace declarations to field attributes rather than including them in the record as fields.
Wrap long lines in properties - You can now configure Data Collector to wrap long lines of text that you enter in properties, instead of displaying the text with a scroll bar.

What's New in 2.4.1.0

Data Collector version 2.4.1.0 includes the following new features and enhancements:

Salesforce origin enhancement - When the origin processes existing data and is not subscribed to notifications, it can now repeat the specified query at regular intervals. The origin can repeat a full or incremental query.
Log data display - You can stop and restart the automatic display of the most recent log data on the Data Collector Logs page.
New time function - The time:createDateFromStringTZ function enables creating Date objects adjusted for time zones from string datetime values.
New stage library stage-type icons - The stage library now displays icons to differentiate between different stage types.

Note: The Hive Drift Solution is now known as the "Drift Synchronization Solution for Hive" in the documentation.

What's New in 2.4.0.0

Data Collector version 2.4.0.0 includes the following new features and enhancements:

Pipeline Sharing and Permissions

Data Collector now provides pipeline-level permissions. Permissions determine the access level that users and groups have on pipelines. To create a multitenant environment, create groups of users and then share pipelines with the groups to grant different levels of access.

With this change, only the pipeline owner and users with the Admin role can view a pipeline by default. If upgrading from a previous version of Data Collector, see the following post-upgrade task, Configure Pipeline Permissions.

This feature includes the following components:

Pipeline permissions - Pipelines now have read, write, and execute permissions. Pipeline permissions overlay existing Data Collector roles to provide greater security. For information, see Roles and Permissions.
Pipeline sharing - The pipeline owner and users with the Admin role can configure pipeline permissions for users and groups.
Data Collector pipeline access control property - You can enable and disable the use of pipeline permissions with the pipeline.access.control.enabled configuration property. By default, this property is enabled.
Permissions transfer - You can transfer all pipeline permissions associated with a user or group to a different user or group. Use pipeline transfer to easily migrate permissions after registering with DPM or after a user or group becomes obsolete.

Dataflow Performance Manager (DPM)

Register Data Collectors with DPM - If Data Collector uses file-based authentication and if you register the Data Collector from the Data Collector UI, you can now create DPM user accounts and groups during the registration process.
Aggregated statistics for DPM - When working with DPM, you can now configure a pipeline to write aggregated statistics to SDC RPC. Write statistics to SDC RPC for development purposes only. For a production environment, use a Kafka cluster or Amazon Kinesis Streams to aggregate statistics.

Origins

Dev SDC RPC with Buffering origin - A new development stage that receives records from an SDC RPC destination, temporarily buffering the records to disk before passing the records to the next stage in the pipeline. Use as the origin in an SDC RPC destination pipeline.
Amazon S3 origin enhancement - You can configure a new File Pool Size property to determine the maximum number of files that the origin stores in memory for processing after loading and sorting all files present on S3.

Other

New stage libraries - This release supports the following new stage libraries:
- Kudu versions 1.1 and 1.2
- Cloudera CDH version 5.10 distribution of Hadoop
- Cloudera version 5.10 distribution of Apache Kafka 2.1
Install external libraries using the Data Collector user interface - You can now use the Data Collector user interface to install external libraries to make them available to stages. For example, you can install JDBC drivers for stages that use JDBC connections. Or, you can install external libraries to call external Java code from the Groovy, Java, and Jython Evaluator processors.
Custom header enhancement - You can now use HTML in the ui.header.title configuration property to configure a custom header for the Data Collector UI. This allows you to specify the look and feel for any text that you use, and to include small images in the header.
Groovy enhancement - You can configure the processor to use the invokedynamic bytecode instruction.
Pipeline renaming - You can now rename a pipeline by clicking directly on the pipeline name when editing the pipeline, in addition to editing the Title general pipeline property.

What's New in 2.3.0.1

Data Collector version 2.3.0.1 includes the following new features and enhancements:

Oracle CDC Client origin enhancement - The origin can now track and adapt to schema changes when reading the dictionary from redo logs. When using the dictionary in redo logs, the origin can also generate events for each DDL that it reads.
New Data Collector property - The http.enable.forwarded.requests property in the Data Collector configuration file enables handling X-Forwarded-For, X-Forwarded-Proto, X-Forwarded-Port request headers issued by a reverse proxy or load balancer.
MongoDB origin enhancement The origin now supports using any string field as the offset field.

What's New in 2.3.0.0

Data Collector version 2.3.0.0 includes the following new features and enhancements:

Multithreaded Pipelines

You can use a multithreaded origin to generate multithreaded pipelines to perform parallel processing.

The new multithreaded framework includes the following changes:

HTTP Server origin - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST requests. Use the HTTP Server origin to receive high volumes of HTTP POST requests using multiple threads.
Enhanced Dev Data Generator origin - Can create multiple threads for testing multithreaded pipelines.
Enhanced runtime statistics - Monitoring a pipeline displays aggregated runtime statistics for all threads in the pipeline. You can also view the number of runners, i.e. threads and pipeline instances, being used.

CDC/CRUD Enhancements

With this release, certain Data Collector stages enable you to easily process change data capture (CDC) or transactional data in a pipeline. The sdc.operation.type record header attribute is now used by all CDC-enabled origins and CRUD-enabled stages:

CDC-enabled origins:

The MongoDB Oplog and Salesforce origins are now enabled for processing changed data by including the CRUD operation type in the sdc.operation.type record header attribute.
Though previously CDC-enabled, the Oracle CDC Client and JDBC Query Consumer for Microsoft SQL Server now include CRUD operation type in the sdc.operation.type record header attribute.

Previous operation type header attributes are still supported for backward-compatibility.

CRUD-enabled stages:

The JDBC Tee processor and JDBC Producer can now process changed data based on CRUD operations in record headers. The stages also include a default operation and unsupported operation handling.
The MongoDB and Elasticsearch destinations now look for the CRUD operation in the sdc.operation.type record header attribute. The Elasticsearch destination includes a default operation and unsupported operation handling.

Multitable Copy

You can use the new JDBC Multitable Consumer origin when you need to copy multiple tables to a destination system or for database replication. The JDBC Multitable Consumer origin reads database data from multiple tables through a JDBC connection. The origin generates SQL queries based on the table configurations that you define.

Configuration

Groups for file-based authentication - If you use file-based authentication, you can now create groups of users when multiple users use Data Collector. You configure groups in the associated realm.properties file located in the Data Collector configuration directory, $SDC_CONF.
If you use file-based authentication, you can also now view all user accounts granted access to the Data Collector, including the roles and groups assigned to each user.
LDAP authentication enhancements - You can now configure Data Collector to use StartTLS to make secure connections to an LDAP server. You can also configure the userFilter property to define the LDAP user attribute used to log in to Data Collector. For example, a username, uid, or email address.
Proxy configuration for outbound requests - You can now configure Data Collector to use an authenticated HTTP proxy for outbound requests to Dataflow Performance Manager (DPM).
Java garbage collector logging - Data Collector now enables logging for the Java garbage collector by default. Logs are written to $SDC_LOG/gc.log. You can disable the logging if needed.
Heap dump for out of memory errors - Data Collector now produces a heap dump file by default if it encounters an out of memory error. You can configure the location of the heap dump file or you can disable this default behavior.
Modifying the log level - You can now use the Data Collector UI to modify the log level to display messages at another severity level.

Pipelines

Pipeline renaming - You can now rename pipelines by editing the Title general pipeline property.
Field attributes - Data Collector now supports field-level attributes. Use the Expression Evaluator to add field attributes.

Origins

New HTTP Server origin - A multithreaded origin that listens on an HTTP endpoint and processes the contents of all authorized HTTP POST requests. Use the HTTP Server origin to read high volumes of HTTP POST requests using multiple threads.
New HTTP to Kafka origin - Listens on a HTTP endpoint and writes the contents of all authorized HTTP POST requests directly to Kafka. Use to read high volumes of HTTP POST requests and write them to Kafka.
New MapR DB JSON origin - Reads JSON documents from MapR DB JSON tables.
New MongoDB Oplog origin - Reads entries from a MongoDB Oplog. Use to process change information for data or database operations.
Directory origin enhancement - You can use regular expressions in addition to glob patterns to define the file name pattern to process files.
HTTP Client origin enhancement - You can now configure the origin to use the OAuth 2 protocol to connect to an HTTP service.
JDBC Query Consumer origin enhancements - The JDBC Consumer origin has been renamed to the JDBC Query Consumer origin. The origin functions the same as in previous releases. It reads database data using a user-defined SQL query through a JDBC connection. You can also now configure the origin to enable auto-commit mode for the JDBC connection and to disable validation of the SQL query.
MongoDB origin enhancements - You can now use a nested field as the offset field. The origin supports reading the MongoDB BSON timestamp for MongoDB versions 2.6 and later. And you can configure the origin to connect to a single MongoDB server or node.

Processors

Field Type Converter processor enhancement - You can now configure the processor to convert timestamp data in a long field to a String data type. Previously, you had to use one Field Type Converter processor to convert the long field to a datetime, and then use another processor to convert the datetime field to a string.
HTTP Client processor enhancements - You can now configure the processor to use the OAuth 2 protocol to connect to an HTTP service. You can also configure a rate limit for the processor, which defines the maximum number of requests to make per second.
JDBC Lookup processor enhancements - You can now configure the processor to enable auto-commit mode for the JDBC connection. You can also configure the processor to use a default value if the database does not return a lookup value for a column.
Salesforce Lookup processor enhancement - You can now configure the processor to use a default value if Salesforce does not return a lookup value for a field.
XML Parser enhancement - A new Multiple Values Behavior property allows you to specify the behavior when you define a delimiter element and the document includes more than one value: Return the first value as a record, return one record with a list field for each value, or return all values as records.

Destinations

New MapR DB JSON destination - Writes data as JSON documents to MapR DB JSON tables.
Azure Data Lake Store destination enhancement - You can now use the destination in cluster batch pipelines. You can also process binary and protobuf data, use record header attributes to write records to files and roll files, and configure a file suffix and the maximum number of records that can be written to a file.
Elasticsearch destination enhancement - The destination now uses the Elasticsearch HTTP API. With this API, the Elasticsearch version 5 stage library is compatible with all versions of Elasticsearch. Earlier stage library versions have been removed. Elasticsearch is no longer supported on Java 7. You’ll need to verify that Java 8 is installed on the Data Collector machine and remove this stage from the blacklist property in $SDC_CONF/sdc.properties before you can use it.

You can also now configure the destination to perform any of the following CRUD operations: create, update, delete, or index.
Hive Metastore destination enhancement - New table events now include information about columns and partitions in the table.
Hadoop FS, Local FS, and MapR FS destination enhancement - The destinations now support recovery after an unexpected stop of the pipeline by renaming temporary files when the pipeline restarts.
Redis destination enhancement - You can now configure a timeout for each key that the destination writes to Redis.

Executors

Hive Query executor enhancements:
- The executor can now execute multiple queries for each event that it receives.
- It can also generate event records each time it processes a query.

JDBC Query executor enhancement - You can now configure the executor to enable auto-commit mode for the JDBC connection.

Data Formats

Whole File enhancement - You can now specify a transfer rate to help control the resources used to process whole files. You can specify the rate limit in all origins that process whole files.

Expression Language

New pipeline functions - You can use the following new pipeline functions to return pipeline information:
- pipeline:id() - Returns the pipeline ID, a UUID that is automatically generated and used by Data Collector to identify the pipeline.
  Note: The existing pipeline:name() function now returns the pipeline ID instead of the pipeline name since pipeline ID is the correct way to identify a pipeline.
- pipeline:title() - Returns the pipeline title or name.
New record functions - You can use the following new record functions to work with field attributes:
- record:fieldAttribute (<field path>, <attribute name>) - Returns the value of the specified field attribute.
- record:fieldAttributeOrDefault (<field path>, <attribute name>, <default value>) - Returns the value of the specified field attribute. Returns the default value if the attribute does not exist or contains no value.
New string functions - You can use the following new string functions to transform string data:
- str:urlEncode (<string>, <encoding>) - Returns a URL encoded string from a decoded string using the specified encoding format.
- str:urlDecode (<string>, <encoding>) - Returns a decoded string from a URL encoded string using the specified encoding format.
New time functions - You can use the following new time functions to transform datetime data:
- time:dateTimeToMilliseconds (<Date object>) - Converts a Date object to an epoch or UNIX time in milliseconds.
- time:extractDateFromString(<string>, <format string>) - Extracts a Date object from a String, based on the specified date format.
- time:extractStringFromDateTZ (<Date object>, <timezone>, <format string>) - Extracts a string value from a Date object based on the specified date format and time zone.
New and enhanced miscellaneous functions - You can use the following new and enhanced miscellaneous functions:
- offset:column(<position>) - Returns the value of the positioned offset column for the current table. Available only in the additional offset column conditions of the JDBC Multitable Consumer origin.
- every function - You can now use the function with the hh() datetime variable in directory templates. This allows you to create directories based on the specified interval for hours.

What's New in 2.2.1.0

Data Collector version 2.2.1.0 includes the following new features and enhancements:

Processors

New Field Zip processor - Merges two List fields or two List-Map fields in the same record.
New Salesforce Lookup processor - Performs lookups in a Salesforce object and passes the lookup values to fields. Use the Salesforce Lookup to enrich records with additional data.
Value Replacer enhancement - You can now replace field values with nulls using a condition.

Destination

Whole file support in the Azure Data Lake Store destination - You can now use the whole file data format to stream whole files to Azure Data Lake Store.

What's New in 2.2.0.0

Data Collector version 2.2.0.0 includes the following new features and enhancements:

Event Framework

The Data Collector event framework enables the pipeline to trigger tasks in external systems based on actions that occur in the pipeline, such as running a MapReduce job after the pipeline writes a file to HDFS. You can also use the event framework to store event information, such as when an origin starts or completes reading a file.

For details, see the Event Framework chapter.

The event framework includes the following new features and enhancements:

New executor stages. A new type of stage that performs tasks in external systems upon receiving an event. This release includes the following executors:
- HDFS File Metadata executor - Changes file metadata such as the name, location, permissions, and ACLs.
- Hive Query executor - Runs a Hive or Impala query.
- JDBC Query executor - Runs a SQL query.
- MapReduce executor - Runs a custom MapReduce job or an Avro to Parquet MapReduce job.
Event generation. The following stages now generate events that you can use in a pipeline:
- Directory and File Tail origins - Generate events when they start and complete reading a file.
- Amazon S3 destination - Generates events when it completes writing to an object or streaming a whole file.
- Hadoop FS, Local FS, and MapR FS destinations - Generate events when they close an output file or complete streaming a whole file.
- Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors - Can run scripts that generate events.
- HDFS File Metadata executor - Generates events when it changes file metadata.
- MapReduce executor - Generates events when it starts a MapReduce job.
Dev stages. You can use the following stages to develop and test event handling:
- Dev Data Generator enhancement - You can now configure the Dev Data Generator to generate event records as well as data records. You can also specify the number of records in a batch.
- To Event - Generates event records using the incoming record as the body of the event record.

Installation

Installation requirements:
- Java requirement - Oracle Java 7 is supported but now deprecated. Oracle announced the end of public updates for Java 7 in April 2015. StreamSets recommends migrating to Java 8, as Java 7 support will be removed in a future Data Collector release.
- File descriptors requirement - Data Collector now requires a minimum of 32,768 open file descriptors.

Core installation includes the basic stage library only - The core RPM and tarball installations now include the basic stage library only, to allow Data Collector to use less disk space. Install additional stage libraries using the Package Manager for tarball installations or the command line for RPM and tarball installations.
Previously, the core installation also included the Groovy, Jython, and statistics stage libraries.

Configuration

New stage libraries. Data Collector now supports the following stage libraries:
- Apache Kudu version 1.0.x - Earlier Kudu versions are no longer supported.
- Cloudera CDH version 5.9 distribution of Apache Hadoop.
- Cloudera version 5.9 distribution of Apache Kafka 2.0.
- Elasticsearch version 5.0.x.
- Google Cloud Bigtable.
- Hortonworks HDP version 2.5 distribution of Apache Hadoop.
- MySQL Binary Log.
- Salesforce.
LDAP authentication - If you use LDAP authentication, you can now configure Data Collector to connect to multiple LDAP servers. You can also configure Data Collector to support an LDAP deployment where members are defined by uid or by full DN.
Java garbage collector - Data Collector now uses the Concurrent Mark Sweep (CMS) garbage collector by default. You can configure Data Collector to use a different garbage collector by modifying Java configuration options in the Data Collector environment configuration file.
Environment variables for Java configuration options. Data Collector now uses three environment variables to define Java configuration options:
- SDC_JAVA_OPTS - Includes configuration options for all Java versions. SDC_JAVA7_OPTS - Includes configuration options used only when Data Collector is running Java 7.
- SDC_JAVA8_OPTS - Includes configuration options used only when Data Collector is running Java 8.
New time zone property - You can configure the Data Collector UI to use UTC, the browser time zone, or the Data Collector time zone. The time zone property affects how dates and times display in the UI. The default is the browser time zone.

Origins

New MySQL Binary Log origin - Reads MySQL binary logs to generate records with change data capture information.
New Salesforce origin - Reads data from Salesforce. The origin can execute a SOQL query to read existing data from Salesforce. The origin can also subscribe to the Force.com Streaming API to receive notifications for changes to Salesforce data.
Directory origin enhancement - You can configure the Directory origin to read files from all subdirectories when using the last-modified timestamp for the read order.
JDBC Query Consumer and Oracle CDC Client origin enhancement - You can now configure the transaction isolation level that the JDBC Query Consumer and Oracle CDC Client origins use to connect to the database. Previously, the origins used the default transaction isolation level configured for the database.

Processors

New Spark Evaluator processor - Processes data based on a Spark application that you develop. Use the Spark Evaluator processor to develop a Spark application that performs custom processing within a pipeline.
Field Flattener processor enhancements - In addition to flattening the entire record, you can also now use the Field Flattener processor to flatten specific list or map fields in the record.
Field Type Converter processor enhancements - You can now use the Field Type Converter processor to change the scale of a decimal field. Or, if you convert a field with another data type to the Decimal data type, you can configure the scale to use in the conversion.
Field Pivoter processor enhancements - The List Pivoter processor has been renamed to the Field Pivoter processor. You can now use the processor to pivot data in a list, map, or list-map field. You can also use the processor to save the field name of the first-level item in the pivoted field.
JDBC Lookup and JDBC Tee processor enhancement - You can now configure the transaction isolation level that the JDBC Lookup and JDBC Tee processors use to connect to the database. Previously, the origins used the default transaction isolation level configured for the database.
Scripting processor enhancements - The Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors can generate event records and work with record header attributes. The sample scripts now include examples of both and a new tip for generating unique record IDs.
XML Flattener processor enhancement - You can now configure the XML Flattener processor to write the flattened data to a new output field. Previously, the processor wrote the flattened data to the same field.
XML Parser processor enhancement. You can now generate records from XML documents using simplified XPath expressions. This enables reading records from deeper within XML documents.

Destination

New Azure Data Lake Store destination - Writes data to Microsoft Azure Data Lake Store.
New Google Bigtable destination - Writes data to Google Cloud Bigtable.
New Salesforce destination - Writes data to Salesforce. New Wave Analytics destination. Writes data to Salesforce Wave Analytics. The destination creates a dataset with external data.
Amazon S3 destination change - The AWS KMS Key ID property has been renamed AWS KMS Key ARN. Data Collector upgrades existing pipelines seamlessly.
File suffix enhancement. You can now configure a file suffix, such as txt or json, for output files generated by Hadoop FS, Local FS, MapR FS, and the Amazon S3 destinations.
JDBC Producer destination enhancement - You can now configure the transaction isolation level that the JDBC Producer destination uses to connect to the database. Previously, the destination used the default transaction isolation level configured for the database.
Kudu destination enhancement - You can now configure the destination to perform one of the following write operations: insert, update, delete, or upsert.

Data Formats

XML processing enhancement - You can now generate records from XML documents using simplified XPath expressions with origins that process XML data and the XML Parser processor. This enables reading records from deeper within XML documents.
Consolidated data format properties - You now configure the data format and related properties on a new Data Format tab. Previously, data formats had individual configuration tabs, e.g., Avro, Delimited, Log.
Related properties, such as Charset, Compression Format, and Ignore Control Characters now appear on the Data Format tab as well.
Checksum generation for whole files - Destinations that stream whole files can now generate checksums for the files so you can confirm the accurate transmission of the file.

Pipeline Maintenance

Add labels to pipelines from the Home page - You can now add labels to multiple pipelines from the Data Collector Home page. Use labels to group similar pipelines. For example, you might want to group pipelines by database schema or by the test or production environment.
Reset the origin for multiple pipelines from the Home page - You can now reset the origin for multiple pipelines at the same time from the Data Collector Home page.

Rules and Alerts

Metric rules and alerts enhancements - The gauge metric type can now provide alerts based on the number of input, output, or error records for the last processed batch.

Expression Language Functions

New file functions - You can use the following new file functions to work with file paths:
- file:fileExtension(<filepath>) - Returns the file extension from a path.
- file:fileName(<filepath>) - Returns a file name from a path.
- file:parentPath(<filepath>) - Returns the parent path of the specified file or directory.
- file:pathElement(<filepath>, <integer>) - Returns the portion of the file path specified by a positive or negative integer.
- file:removeExtension(<filepath>) - Removes the file extension from a path.
New pipeline functions - You can use the following new pipeline functions to determine information about a pipeline:
- pipeline:name() - Returns the pipeline name.
- pipeline:version() - Returns the pipeline version when the pipeline has been published to Dataflow Performance Manager (DPM).
New time functions - You can use the following new time functions to transform datetime data:
- time:extractLongFromDate(<Date object>, <string>) - Extracts a long value from a Date object, based on the specified date format.
- time:extractStringFromDate(<Date object>, <string>) - Extracts a string value from a Date object, based on the specified date format.
- time:millisecondsToDateTime(<long>) - Converts an epoch or UNIX time in milliseconds to a Date object.