What is StreamSets Data Collector?
         
            StreamSets Data Collector™ is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in your data streams.
            
         
 What is StreamSets Data Collector Edge?
            StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
            
         
 What is StreamSets Control Hub?
            StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Use Control Hub to allow your teams to build and execute large numbers of complex dataflows at scale.
            
         
 Logging In and Creating a Pipeline in Data Collector
            After you start Data Collector, you can log in to Data Collector and create your first pipeline.
            
         
 Data Collector User Interface
            Data Collector provides a web-based user interface (UI) to configure pipelines, preview data, monitor pipelines, and review snapshots of data.
            
         
 Data Collector UI - Pipelines on the Home Page
            Data Collector displays a list of all available pipelines and related information on the Home page. You can select a category of pipelines, such as Running Pipelines, to view a subset of all available pipelines.
            
         
 Tutorials and Sample Pipelines
            StreamSets provides multiple tutorials and sample pipelines to help you learn about using Data Collector.
            
         
 Pipeline Types and Icons in Documentation
            In Data Collector, you can configure pipelines that are run by Data Collector and pipelines that are run by Data Collector Edge.
            
         
 Advanced Options
            Pipelines and most pipeline stages include advanced options with default values that should work in most cases. By default, each pipeline and stage hides the advanced options. Advanced options can include individual properties or complete tabs.
            
         
 Runtime Values
            Runtime values are values that you define outside of the pipeline and use for stage and pipeline properties. You can change the values for each pipeline run without having to edit the pipeline.
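            For example, a stage property can reference a runtime value through the expression language. The parameter, property, and file names below are illustrative placeholders, not values shipped with Data Collector:

                Runtime parameter:  ${JDBCConnectionString}
                Runtime property:   ${runtime:conf('jdbcUrl')}
                Runtime resource:   ${runtime:loadResource('jdbc-url.txt', false)}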
            
         
 SSL/TLS Encryption
            Many stages can use SSL/TLS encryption to securely connect to the external system. 
         
 Log Data Format
            When you use an origin to read log data, you define the format of the log files to be read.
         
 Whole File Data Format
            You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.
            
         
 Converting Data to the Parquet Data Format
            This solution describes how to convert Avro files to the columnar format, Parquet.
         
 Automating Impala Metadata Updates for Drift Synchronization for Hive
            This solution describes how to configure a Drift Synchronization Solution for Hive pipeline to automatically refresh the Impala metadata cache each time changes occur in the Hive metastore.
            
         
 Managing Output Files
            This solution describes how to design a pipeline that writes output files to a destination, moves the files to a different location, and then changes the permissions for the files.
            
         
 Stopping a Pipeline After Processing All Available Data
            This solution describes how to design a pipeline that stops automatically after it finishes processing all available data.
            
         
 Offloading Data from Relational Sources to Hadoop
            This solution describes how to offload data from relational database tables to Hadoop.
         
 Sending Email During Pipeline Processing
            This solution describes how to design a pipeline to send email notifications at different moments during pipeline processing.
            
         
 Preserving an Audit Trail of Events
            This solution describes how to design a pipeline that preserves an audit trail of pipeline and stage events that occur.
         
 Loading Data into Databricks Delta Lake
            You can use several solutions to load data into a Delta Lake table on Databricks. 
         
 Drift Synchronization Solution for Hive
            The Drift Synchronization Solution for Hive detects drift in incoming data and updates corresponding Hive tables. 
            
         
 Drift Synchronization Solution for PostgreSQL
            The Drift Synchronization Solution for PostgreSQL detects drift in incoming data and automatically creates or alters corresponding PostgreSQL tables as needed before the data is written.
            
         
 Meet StreamSets Control Hub
            StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Control Hub allows teams to build and execute large numbers of complex dataflows at scale.
            
         
 Register Data Collector with Control Hub
            You must register a Data Collector to work with StreamSets Control Hub. When you register a Data Collector, Data Collector generates an authentication token that it uses to issue authenticated requests to Control Hub.
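            As a rough sketch only: on a registered Data Collector, the Control Hub connection is typically reflected in its dpm.properties configuration file. The property names, URL, and token file reference below are version-dependent assumptions, so verify them against your installation:

                dpm.enabled=true
                dpm.base.url=https://cloud.streamsets.com
                dpm.appAuthToken=@application-token.txt@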
            
         
 Pipeline Management with Control Hub
            After you register a Data Collector with StreamSets Control Hub, you can manage how the pipelines work with Control Hub.
            
         
 Unregister Data Collector from Control Hub
            You can unregister a Data Collector from StreamSets Control Hub when you no longer want to use that Data Collector installation with Control Hub.
            
         
 Meet StreamSets Data Collector Edge
            StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices with limited resources. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
            
         
 Install SDC Edge
            Download and install SDC Edge on each edge device where you want to run edge pipelines.
            
         
 Getting Started with SDC Edge
            Data Collector Edge (SDC Edge) includes several sample pipelines that make it easy to get started. You simply import one of the sample edge pipelines, create the appropriate Data Collector receiving pipeline, download and install SDC Edge on the edge device, and then run the sample edge pipeline.
            
         
 Design Edge Pipelines
            Edge pipelines run in edge execution mode. You design edge pipelines in Data Collector.
            
         
 Administer SDC Edge
            Administering SDC Edge involves configuring, starting, shutting down, and viewing logs for the agent. When using StreamSets Control Hub, you can also use the SDC Edge command line interface to register SDC Edge with Control Hub.
            
         
 Deploy Pipelines to SDC Edge
            After designing edge pipelines in Data Collector, you deploy the edge pipelines to SDC Edge installed on an edge device. You run the edge pipelines on SDC Edge.
            
         
 Manage Pipelines on SDC Edge
            After designing edge pipelines in Data Collector and then deploying the edge pipelines to SDC Edge, you can manage the pipelines on SDC Edge. Managing edge pipelines includes previewing, validating, starting, stopping, and monitoring the pipelines, as well as resetting the origin for the pipelines.
            
         
 Microservice Pipelines
            A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
            
         
 Deployment Architecture
            When using SDC RPC pipelines, consider your needs and environment carefully as you design the deployment architecture.
               
            
         
 Configuring the Delivery Guarantee
            The delivery guarantee determines when a pipeline commits the offset. When configuring the delivery guarantee for SDC RPC pipelines, use the same option in origin and destination pipelines.
            
         
 Defining the RPC ID
            The RPC ID is a user-defined identifier that allows an SDC RPC origin and SDC RPC destination to recognize each other.
               
            
         
 Enabling Encryption
            You can enable SDC RPC pipelines to transfer data securely using SSL/TLS. To use SSL/TLS, enable TLS in both the SDC RPC destination and the SDC RPC origin.
            
         
 Amazon S3 Requirements
            Cluster EMR batch and cluster batch mode pipelines can process data from Amazon S3.
         
 Data Collector UI - Preview Mode
            You can use Data Collector to view how data passes through the pipeline.
            
         
 Preview Codes
            Data preview displays different colors for different types of data. Preview also uses other codes and formatting to highlight changed fields.
            
         
 Previewing Multiple Stages
            You can preview data for a group of linked stages within a pipeline.
         
 Editing Preview Data
            You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
            
         
 Editing Properties
            In data preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the expression in an Expression Evaluator to see how the expression alters data.
            
         
 Duplicating a Pipeline
            Duplicate a pipeline when you want to keep the existing version of a pipeline while continuing to configure a duplicate version. A duplicate is an exact copy of the original pipeline.
            
         
 Viewing Data Collector Directories
            You can view the directories that Data Collector uses. You might check the directories in use to locate a file or to increase the amount of available space for a directory.
            
         
 Viewing Data Collector Metrics
            You can view metrics about Data Collector, such as the CPU usage or the number of pipeline runners in the thread pool.
            
         
 Viewing Users and Groups
            If you use file-based authentication, you can view all user accounts granted access to this Data Collector instance, including the roles and groups assigned to each user.
            
         
 REST Response
            You can view REST response JSON data for different aspects of the Data Collector, such as pipeline configuration information or monitoring details.
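            For example, assuming the default port and a file-based admin account, a request along these lines returns the pipeline list as JSON. The endpoint path and credentials are assumptions to verify against your Data Collector version:

                curl -u admin:admin http://localhost:18630/rest/v1/pipelines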
            
         
 Command Line Interface
            Data Collector provides a command line interface that includes a basic cli command. Use the command to perform some of the same actions that you can complete from the Data Collector UI. Data Collector must be running before you can use the cli command.
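            As an illustrative sketch, a cli invocation takes the Data Collector URL plus a subcommand. The subcommand and credentials shown here are assumptions to check against the cli help output for your version:

                bin/streamsets cli -U http://localhost:18630 -u admin -p admin store list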
            
         
 Basic Tutorial
            The basic tutorial creates a pipeline that reads a file from a directory, processes the data in two branches, and writes all data to a file system. You'll use data preview to help configure the pipeline, and you'll create a data alert and run the pipeline.
            
         
 Extended Tutorial
            The extended tutorial builds on the basic tutorial, using an additional set of stages to perform some data transformations and write to the Trash development destination. We'll also use data preview to test stage configuration.