What is StreamSets Data Collector?
StreamSets Data Collector™ is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in your data streams.
What is StreamSets Data Collector Edge?
StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
What is StreamSets Control Hub?
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Use Control Hub to allow your teams to build and execute large numbers of complex dataflows at scale.
Logging In and Creating a Pipeline in Data Collector
After you start Data Collector, you can log in and create your first pipeline.
Data Collector User Interface
Data Collector provides a web-based user interface (UI) to configure pipelines, preview data, monitor pipelines, and review snapshots of data.
Data Collector UI - Pipelines on the Home Page
Data Collector displays a list of all available pipelines and related information on the Home page. You can select a category of pipelines, such as Running Pipelines, to view a subset of all available pipelines.
Tutorials and Sample Pipelines
StreamSets provides multiple tutorials and sample pipelines to help you learn about using Data Collector.
Microservice Pipelines
A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
Stages for Microservice Pipelines
Sample Pipeline
Creating a Microservice Pipeline
SDC RPC Pipeline Overview (deprecated)
Deployment Architecture
When using SDC RPC pipelines, consider your needs and environment carefully as you design the deployment architecture.
Configuring the Delivery Guarantee
The delivery guarantee determines when a pipeline commits the offset. When configuring the delivery guarantee for SDC RPC pipelines, use the same option in origin and destination pipelines.
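The effect of committing the offset at different times can be illustrated with a small simulation. This is a simplified model and not Data Collector code: the option names `AT_LEAST_ONCE` and `AT_MOST_ONCE`, the `run_batch` function, and the `write` callback are hypothetical names chosen for this sketch. It assumes that "at least once" commits the offset only after the destination confirms the write, and that "at most once" commits as soon as the batch is received.

```python
from enum import Enum


class DeliveryGuarantee(Enum):
    # Hypothetical names for this sketch, not Data Collector identifiers.
    AT_LEAST_ONCE = "at_least_once"
    AT_MOST_ONCE = "at_most_once"


def run_batch(offset, batch_size, guarantee, write):
    """Process one batch and return the offset that ends up committed.

    `write(start, end)` stands in for the destination and raises IOError
    when the write fails.
    """
    next_offset = offset + batch_size

    if guarantee is DeliveryGuarantee.AT_MOST_ONCE:
        # Commit on receipt, before writing: a failed write loses the
        # batch, but the batch is never processed twice.
        committed = next_offset
        try:
            write(offset, next_offset)
        except IOError:
            pass
        return committed

    # AT_LEAST_ONCE: commit only after the write succeeds. A failed write
    # leaves the offset unchanged, so the batch is read again later and
    # may be delivered more than once.
    try:
        write(offset, next_offset)
    except IOError:
        return offset
    return next_offset


def failing_write(start, end):
    raise IOError("destination unavailable")


def ok_write(start, end):
    return None
```

In this model, a failed write leaves the committed offset at its starting value under `AT_LEAST_ONCE` and advances it under `AT_MOST_ONCE`. If the origin and destination pipelines used different options, the two sides would disagree about whether a failed batch is resent, which is one way to read the recommendation to use the same option in both.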
Defining the RPC ID
The RPC ID is a user-defined identifier that allows an SDC RPC origin and SDC RPC destination to recognize each other.
Enabling Encryption
You can enable SDC RPC pipelines to transfer data securely using SSL/TLS. To use SSL/TLS, enable TLS in both the SDC RPC destination and the SDC RPC origin.
Configuration Guidelines for SDC RPC Pipelines
Data Preview Overview
Data Collector UI - Preview Mode
You can use Data Collector to view how data passes through the pipeline.
Preview Codes
Data preview displays different colors for different types of data. Preview also uses other codes and formatting to highlight changed fields.
Previewing a Single Stage
Previewing Multiple Stages
You can preview data for a group of linked stages within a pipeline.
Editing Preview Data
You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
Editing Properties
In data preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the expression in an Expression Evaluator to see how the expression alters data.
Understanding Pipeline States
Starting Pipelines
Stopping Pipelines
Importing Pipelines
Sharing Pipelines
Adding Labels to Pipelines
Exporting Pipelines
Exporting Pipelines for Control Hub
Duplicating a Pipeline
Duplicate a pipeline when you want to keep the existing version of a pipeline while continuing to configure a duplicate version. A duplicate is an exact copy of the original pipeline.
Deleting Pipelines
Providing an Activation Code
Viewing Data Collector Configuration Properties
Viewing Data Collector Directories
You can view the directories that Data Collector uses. For example, you might check which directories are in use when you need to access a file in one of them or increase the space available to it.
Viewing Data Collector Metrics
You can view metrics about Data Collector, such as the CPU usage or the number of pipeline runners in the thread pool.
Viewing Data Collector Logs
Shutting Down Data Collector
Restarting Data Collector
Viewing Users and Groups
If you use file-based authentication, you can view all user accounts granted access to this Data Collector instance, including the roles and groups assigned to each user.
Managing Usage Statistics Collection
Support Bundles
Health Inspector
REST Response
You can view REST response JSON data for different aspects of Data Collector, such as pipeline configuration information or monitoring details.
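The same kind of JSON data can in principle be requested from a script. The sketch below only builds the request and sends nothing until `fetch_json` is called. Everything specific in it is an assumption rather than something stated on this page: the base URL `http://localhost:18630`, the path `/rest/v1/pipelines`, the use of HTTP Basic authentication, and the `admin`/`admin` credentials. Check the REST API listing for your own Data Collector instance before relying on any of them.

```python
import base64
import json
import urllib.request

# Assumed values for illustration only; adjust for your installation.
BASE_URL = "http://localhost:18630"
PIPELINES_PATH = "/rest/v1/pipelines"


def build_request(base_url, path, user, password):
    """Build a GET request with an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    url = base_url.rstrip("/") + path
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Basic {token}",
            "Accept": "application/json",
        },
    )


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request has no side effects; nothing is sent here.
request = build_request(BASE_URL, PIPELINES_PATH, "admin", "admin")
```

Calling `fetch_json(request)` against a running instance would return the decoded JSON, provided the assumed path and authentication scheme match your configuration.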
Command Line Interface
Data Collector provides a command line interface that includes a basic cli command. Use the command to perform some of the same actions that you can complete from the Data Collector UI. Data Collector must be running before you can use the cli command.
© 2022 StreamSets, Inc.