Data Collector User Guide

What is StreamSets Data Collector?

StreamSets Data CollectorTM is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in your data streams.

What is StreamSets Data Collector Edge?

StreamSets Data Collector EdgeTM (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.

What is StreamSets Control Hub?

StreamSets Control HubTM is a central point of control for all of your dataflow pipelines. Use Control Hub to allow your teams to build and execute large numbers of complex dataflows at scale.

Logging In and Creating a Pipeline in Data Collector

After you start Data Collector, you can log in to Data Collector and create your first pipeline.

Data Collector User Interface

Data Collector provides a web-based user interface (UI) to configure pipelines, preview data, monitor pipelines, and review snapshots of data.

Data Collector UI - Pipelines on the Home Page

Data Collector displays a list of all available pipelines and related information on the Home page. You can select a category of pipelines, such as Running Pipelines, to view a subset of all available pipelines.

What's New in 3.2.0.0

What's New in 3.1.2.0

What's New in 3.1.1.0

What's New in 3.1.0.0

What's New in 3.0.3.0

What's New in 3.0.2.0

What's New in 3.0.1.0

What's New in 3.0.0.0

What's New in 2.7.2.0

What's New in 2.7.1.1

What's New in 2.7.1.0

What's New in 2.7.0.0

What's New in 2.6.0.1

What's New in 2.6.0.0

What's New in 2.5.1.0

What's New in 2.5.0.0

What's New in 2.4.1.0

What's New in 2.4.0.0

What's New in 2.3.0.1

What's New in 2.3.0.0

What's New in 2.2.1.0

What's New in 2.2.0.0

Installation

You can install Data Collector and start it manually or run it as a service.

Full Installation and Launch (Manual Start)

Full Installation and Launch (Service Start)

Core Installation

You can download and install a core version of Data Collector, and then install individual stage libraries as needed. Use the core installation to install only the stage libraries that you want to use. The core installation allows Data Collector to use less disk space.

Install Additional Stage Libraries

Installation with Cloudera Manager

Run Data Collector from Docker

Installation with Cloud Service Providers

MapR Prerequisites

Due to licensing restrictions, StreamSets cannot distribute MapR libraries with Data Collector. As a result, you must perform additional steps to enable the Data Collector machine to connect to MapR. Data Collector does not display MapR origins and destinations in stage library lists nor the MapR Streams statistics aggregator in the pipeline properties until you perform these prerequisites.

Creating Another Data Collector Instance

Uninstallation

User Authentication

Data Collector can authenticate user accounts based on LDAP or files. Best practice is to use LDAP if your organization has it. By default, Data Collector uses file-based authentication.

Roles and Permissions

Enabling HTTPS

Data Collector Configuration

You can edit the Data Collector configuration file, $SDC_CONF/sdc.properties, to configure properties such as the host name and port number and account information for email alerts.

Data Collector Environment Configuration

Install External Libraries

Custom Stage Libraries

Credential Stores

Data Collector pipeline stages communicate with external systems to read and write data. Many of these external systems require credentials - user names or passwords - to access the data. When you configure pipeline stages for these external systems, you define the credentials that the stage uses to connect to the system.

Accessing Hashicorp Vault Secrets with Vault Functions (Deprecated)

Working with Data Governance Tools

You can configure Data Collector to integrate with data governance tools, giving you visibility into data movement - where the data came from, where it’s going to, and who is interacting with it.

Enabling External JMX Tools

Data Collector uses JMX metrics to generate the graphical display of the status of a running pipeline. You can provide the same JMX metrics to external tools if desired.

Upgrade

Pre Upgrade Tasks

In some situations, you must complete tasks before you upgrade.

Upgrade an Installation from the Tarball

Upgrade an Installation from the RPM Package

When you upgrade an installation from the RPM package, the new version uses the default configuration, data, log, and resource directories. If the previous version used the default directories, the new version has access to the files created in the previous version.

Upgrade an Installation with Cloudera Manager

Post Upgrade Tasks

In some situations, you must complete tasks within Data Collector or your Control Hub on-premises installation after you upgrade.

Working with Upgraded External Systems

When an external system is upgraded to a new version, you can continue to use existing Data Collector pipelines that connected to the previous version of the external system. You simply configure the pipelines to work with the upgraded system.

Troubleshooting an Upgrade

What is a Pipeline?

Data in Motion

Data passes through the pipeline in batches. This is how it works:

Designing the Data Flow

You can branch and merge streams in the pipeline.

Dropping Unwanted Records

You can drop records from the pipeline at each stage by defining required fields or preconditions for a record to enter a stage.

Error Record Handling

Record Header Attributes

Field Attributes

Field attributes are attributes that provide additional information about each field that you can use in pipeline logic, as needed.

Processing Changed Data

Control Character Removal

Development Stages

Technology Preview Functionality

Data Collector includes certain new features and stages with the Technology Preview designation. Technology Preview functionality is available for use in development and testing, but is not meant for use in production.

Test Origin for Preview

A test origin can provide test data for data preview to aid in pipeline development. In Control Hub, you can also use test origins when developing pipeline fragments. Test origins are not used when running a pipeline.

Understanding Pipeline States

Data Collector UI - Edit Mode

Pipeline Types and Icons in Documentation

In Data Collector, you can configure pipelines that are run by Data Collector and pipelines that are run by Data Collector Edge.

Retrying the Pipeline

Rate Limit

Simple and Bulk Edit Mode

Runtime Values

Runtime values are values that you define outside of the pipeline and use for stage and pipeline properties. You can change the values for each pipeline run without having to edit the pipeline.

Event Generation

Webhooks

Notifications

SSL/TLS Configuration

Implicit and Explicit Validation

Expression Configuration

Configuring a Pipeline

Data Formats Overview

Avro Data Format

Binary Data Format

Datagram Data Format

Delimited Data Format

Excel Data Format

Log Data Format

When you use an origin to read log data, you define the format of the log files to be read.

NetFlow Data Processing

Protobuf Data Format Prerequisites

SDC Record Data Format

Text Data Format with Custom Delimiters

Whole File Data Format

You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.

Reading and Processing XML Data

Writing XML Data

Origins

An origin stage represents the source for the pipeline. You can use a single origin stage in a pipeline.

Amazon S3

Amazon SQS Consumer

Azure IoT/Event Hub Consumer

CoAP Server