What is StreamSets Data Collector?
StreamSets Data Collector™ is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in your data streams.
What is StreamSets Data Collector Edge?
StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
What is StreamSets Control Hub?
StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Use Control Hub to allow your teams to build and execute large numbers of complex dataflows at scale.
Logging In and Creating a Pipeline in Data Collector
After you start Data Collector, you can log in to Data Collector and create your first pipeline.
Data Collector User Interface
Data Collector provides a web-based user interface (UI) to configure pipelines, preview data, monitor pipelines, and review snapshots of data.
Data Collector UI - Pipelines on the Home Page
Data Collector displays a list of all available pipelines and related information on the Home page. You can select a category of pipelines, such as Running Pipelines, to view a subset of all available pipelines.
Tutorials and Sample Pipelines
StreamSets provides multiple tutorials and sample pipelines to help you learn about using Data Collector.
Resetting the Origin
You can reset the origin when you want Data Collector to process all available data instead of processing data from the last-saved offset. Reset the origin only when the pipeline is not running.
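The effect of resetting the origin can be illustrated with a small sketch. This is not Data Collector code: `read_batch`, `reset_origin`, and the in-memory `offset_store` are hypothetical names, and the sketch assumes an offset is simply a position in a list of records.

```python
from typing import Optional


def read_batch(records: list, offset: Optional[int], batch_size: int = 2):
    """Return the next batch and the new offset.

    A saved offset of None means "no offset saved", so reading starts
    from the first available record (hypothetical convention).
    """
    start = 0 if offset is None else offset
    batch = records[start:start + batch_size]
    return batch, start + len(batch)


def reset_origin(offset_store: dict, pipeline_id: str) -> None:
    """Discard the saved offset so the next run processes all available data."""
    offset_store[pipeline_id] = None


records = ["r1", "r2", "r3", "r4"]
offset_store = {"demo": None}

# First run: reads from the beginning and saves the offset.
first, offset_store["demo"] = read_batch(records, offset_store["demo"])

# A later run resumes from the last-saved offset.
resumed, _ = read_batch(records, offset_store["demo"])

# After a reset (done while the pipeline is stopped), reading starts over.
reset_origin(offset_store, "demo")
after_reset, _ = read_batch(records, offset_store["demo"])
```

In this sketch the resumed run picks up at the third record, while the run after the reset starts again at the first.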
Microservice Pipelines
A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
Deployment Architecture
When using SDC RPC pipelines, consider your needs and environment carefully as you design the deployment architecture.
Configuring the Delivery Guarantee
The delivery guarantee determines when a pipeline commits the offset. When configuring the delivery guarantee for SDC RPC pipelines, use the same option in origin and destination pipelines.
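To see why the commit point matters, the following sketch contrasts committing the offset before and after a write. It is an illustration only, not Data Collector code: `run_batch`, `failing_write`, and the `AT_LEAST_ONCE` / `AT_MOST_ONCE` strings are hypothetical names chosen to mirror the two usual delivery guarantee options.

```python
def run_batch(batch, offset, write, guarantee):
    """Process one batch and return the offset that ends up committed.

    Assumed semantics for this sketch:
      AT_MOST_ONCE  - commit before writing; a failed write loses the batch.
      AT_LEAST_ONCE - commit after writing; a failed write leaves the old
                      offset, so the batch is read again (possible duplicates).
    """
    if guarantee not in ("AT_LEAST_ONCE", "AT_MOST_ONCE"):
        raise ValueError(f"unknown delivery guarantee: {guarantee}")

    new_offset = offset + len(batch)
    committed = offset

    if guarantee == "AT_MOST_ONCE":
        committed = new_offset  # commit first, then attempt the write

    try:
        write(batch)
        if guarantee == "AT_LEAST_ONCE":
            committed = new_offset  # commit only after a successful write
    except IOError:
        pass  # the write failed; 'committed' keeps whatever was set above

    return committed


def failing_write(batch):
    """Hypothetical destination that always fails."""
    raise IOError("destination unavailable")


def collecting_write(sink):
    """Hypothetical destination that appends records to a list."""
    return lambda batch: sink.extend(batch)


sink = []
ok_offset = run_batch(["a", "b"], 0, collecting_write(sink), "AT_LEAST_ONCE")
retry_offset = run_batch(["a", "b"], 0, failing_write, "AT_LEAST_ONCE")
lost_offset = run_batch(["a", "b"], 0, failing_write, "AT_MOST_ONCE")
```

If an origin pipeline and a destination pipeline commit at different points, a failure between them can drop or duplicate records in ways neither setting alone would predict, which is one way to read the advice to use the same option on both sides.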
Defining the RPC ID
The RPC ID is a user-defined identifier that allows an SDC RPC origin and SDC RPC destination to recognize each other.
Enabling Encryption
You can enable SDC RPC pipelines to transfer data securely using SSL/TLS. To use SSL/TLS, enable TLS in both the SDC RPC destination and the SDC RPC origin.
Data Collector UI - Preview Mode
You can preview data in Data Collector to view how data passes through the pipeline.
Preview Codes
Data preview displays different colors for different types of data. Preview also uses other codes and formatting to highlight changed fields.
Previewing Multiple Stages
You can preview data for a group of linked stages within a pipeline.
Editing Preview Data
You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
Editing Properties
In data preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the expression in an Expression Evaluator to see how the expression alters data.
Duplicating a Pipeline
Duplicate a pipeline when you want to keep the existing version of a pipeline while continuing to configure a duplicate version. A duplicate is an exact copy of the original pipeline.
Viewing Data Collector Directories
You can view the directories that Data Collector uses. You might check the directories to locate a file in one of them or to increase the amount of available space for a directory.
Viewing Data Collector Metrics
You can view metrics about Data Collector, such as the CPU usage or the number of pipeline runners in the thread pool.
Viewing Users and Groups
If you use file-based authentication, you can view all user accounts granted access to this Data Collector instance, including the roles and groups assigned to each user.
REST Response
You can view REST response JSON data for different aspects of Data Collector, such as pipeline configuration information or monitoring details.
Command Line Interface
Data Collector provides a command line interface that includes a basic cli command. Use the command to perform some of the same actions that you can complete from the Data Collector UI. Data Collector must be running before you can use the cli command.