What is StreamSets Data Collector?
         
            StreamSets Data Collector™ is a lightweight, powerful design and execution engine that streams data in real time. Use Data Collector to route and process data in your data streams.
            
         
 What is StreamSets Data Collector Edge?
            StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
            
         
 What is StreamSets Control Hub?
            StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Use Control Hub to allow your teams to build and execute large numbers of complex dataflows at scale.
            
         
 Logging In and Creating a Pipeline in Data Collector
            After you start Data Collector, you can log in to Data Collector and create your first pipeline.
            
         
 Data Collector User Interface
            Data Collector provides a web-based user interface (UI) to configure pipelines, preview data, monitor pipelines, and review snapshots of data.
            
         
 Data Collector UI - Pipelines on the Home Page
            Data Collector displays a list of all available pipelines and related information on the Home page. You can select a category of pipelines, such as Running Pipelines, to view a subset of all available pipelines.
            
         
 Tutorials and Sample Pipelines
            StreamSets provides multiple tutorials and sample pipelines to help you learn about using Data Collector.
            
         
 Pipeline Types and Icons in Documentation
            In Data Collector, you can configure pipelines that are run by Data Collector and pipelines that are run by Data Collector Edge.
            
         
 Advanced Options
            Pipelines and most pipeline stages include advanced options with default values that should work in most cases. By default, each pipeline and stage hides the advanced options. Advanced options can include individual properties or complete tabs.
            
         
 Runtime Values
            Runtime values are values that you define outside of the pipeline and use for stage and pipeline properties. You can change the values for each pipeline run without having to edit the pipeline.
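            For example, a stage property can reference a runtime value through the expression language. The parameter, property, and file names below are illustrative placeholders, not values shipped with Data Collector:

                Runtime parameter:  ${JDBCConnectionString}
                Runtime property:   ${runtime:conf('jdbcUrl')}
                Runtime resource:   ${runtime:loadResource('jdbc-url.txt', false)}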
            
         
 SSL/TLS Encryption
            Many stages can use SSL/TLS encryption to securely connect to the external system. 
         
 Log Data Format
            When you use an origin to read log data, you define the format of the log files to be read.
         
 Whole File Data Format
            You can use the whole file data format to transfer entire files from an origin system to a destination system. With the whole file data format, you can transfer any type of file.
            
         
 Converting Data to the Parquet Data Format
            This solution describes how to convert Avro files to the columnar format, Parquet.
         
 Automating Impala Metadata Updates for Drift Synchronization for Hive
            This solution describes how to configure a Drift Synchronization Solution for Hive pipeline to automatically refresh the Impala metadata cache each time changes occur in the Hive metastore.
            
         
 Managing Output Files
            This solution describes how to design a pipeline that writes output files to a destination, moves the files to a different location, and then changes the permissions for the files.
            
         
 Stopping a Pipeline After Processing All Available Data
            This solution describes how to design a pipeline that stops automatically after it finishes processing all available data.
            
         
 Offloading Data from Relational Sources to Hadoop
            This solution describes how to offload data from relational database tables to Hadoop.
         
 Sending Email During Pipeline Processing
            This solution describes how to design a pipeline to send email notifications at different moments during pipeline processing.
            
         
 Preserving an Audit Trail of Events
            This solution describes how to design a pipeline that preserves an audit trail of pipeline and stage events that occur.
         
 Loading Data into Databricks Delta Lake
            You can use several solutions to load data into a Delta Lake table on Databricks. 
         
 Drift Synchronization Solution for Hive
            The Drift Synchronization Solution for Hive detects drift in incoming data and updates corresponding Hive tables. 
            
         
 Drift Synchronization Solution for PostgreSQL
            The Drift Synchronization Solution for PostgreSQL detects drift in incoming data and automatically creates or alters corresponding PostgreSQL tables as needed before the data is written.
            
         
 Meet StreamSets Control Hub
            StreamSets Control Hub™ is a central point of control for all of your dataflow pipelines. Control Hub allows teams to build and execute large numbers of complex dataflows at scale.
            
         
 Register Data Collector with Control Hub
            You must register a Data Collector to work with StreamSets Control Hub. When you register a Data Collector, Data Collector generates an authentication token that it uses to issue authenticated requests to Control Hub.
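            As a rough sketch only: on a registered Data Collector, the Control Hub connection is typically reflected in its dpm.properties configuration file. The property names, URL, and token file reference below are version-dependent assumptions, so verify them against your installation:

                dpm.enabled=true
                dpm.base.url=https://cloud.streamsets.com
                dpm.appAuthToken=@application-token.txt@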
            
         
 Pipeline Management with Control Hub
            After you register a Data Collector with StreamSets Control Hub, you can manage how the pipelines work with Control Hub.
            
         
 Unregister Data Collector from Control Hub
            You can unregister a Data Collector from StreamSets Control Hub when you no longer want to use that Data Collector installation with Control Hub.
            
         
 Meet StreamSets Data Collector Edge
            StreamSets Data Collector Edge™ (SDC Edge) is a lightweight execution agent without a UI that runs pipelines on edge devices with limited resources. Use SDC Edge to read data from an edge device or to receive data from another pipeline and then act on that data to control an edge device.
            
         
 Install SDC Edge
            Download and install SDC Edge on each edge device where you want to run edge pipelines.
            
         
 Getting Started with SDC Edge
            Data Collector Edge (SDC Edge) includes several sample pipelines that make it easy to get started. You simply import one of the sample edge pipelines, create the appropriate Data Collector receiving pipeline, download and install SDC Edge on the edge device, and then run the sample edge pipeline.
            
         
 Design Edge Pipelines
            Edge pipelines run in edge execution mode. You design edge pipelines in Data Collector.
            
         
 Administer SDC Edge
            Administering SDC Edge involves configuring, starting, shutting down, and viewing logs for the agent. When using StreamSets Control Hub, you can also use the SDC Edge command line interface to register SDC Edge with Control Hub.
            
         
 Deploy Pipelines to SDC Edge
            After designing edge pipelines in Data Collector, you deploy the edge pipelines to SDC Edge installed on an edge device. You run the edge pipelines on SDC Edge.
            
         
 Manage Pipelines on SDC Edge
            After designing edge pipelines in Data Collector and then deploying the edge pipelines to SDC Edge, you can manage the pipelines on SDC Edge. Managing edge pipelines includes previewing, validating, starting, stopping, and monitoring the pipelines, as well as resetting the origin for the pipelines.
            
         
 Microservice Pipelines
            A microservice pipeline is a pipeline that creates a fine-grained service to perform a specific task.
            
         
 Deployment Architecture
            When using SDC RPC pipelines, consider your needs and environment carefully as you design the deployment architecture.
               
            
         
 Configuring the Delivery Guarantee
            The delivery guarantee determines when a pipeline commits the offset. When configuring the delivery guarantee for SDC RPC pipelines, use the same option in origin and destination pipelines.
            
         
 Defining the RPC ID
            The RPC ID is a user-defined identifier that allows an SDC RPC origin and SDC RPC destination to recognize each other.
               
            
         
 Enabling Encryption
            You can enable SDC RPC pipelines to transfer data securely using SSL/TLS. To use SSL/TLS, enable TLS in both the SDC RPC destination and the SDC RPC origin.
            
         
 Amazon S3 Requirements
            Cluster EMR batch and cluster batch mode pipelines can process data from Amazon S3.
         
 Data Collector UI - Preview Mode
            You can use Data Collector to view how data passes through the pipeline.
            
         
 Preview Codes
            Data preview displays different colors for different types of data. Preview also uses other codes and formatting to highlight changed fields.
            
         
 Previewing Multiple Stages
            You can preview data for a group of linked stages within a pipeline.
         
 Editing Preview Data
            You can edit preview data to view how a stage or group of stages processes the changed data. Edit preview data to test for data conditions that might not appear in the preview data set.
            
         
 Editing Properties
            In data preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the expression in an Expression Evaluator to see how the expression alters data.
            
         
 Duplicating a Pipeline
            Duplicate a pipeline when you want to keep the existing version of a pipeline while continuing to configure a duplicate version. A duplicate is an exact copy of the original pipeline.
            
         
 Viewing Data Collector Directories
            You can view the directories that Data Collector uses. You might check the directories in use to locate a file or to increase the amount of available space for a directory.
            
         
 Viewing Data Collector Metrics
            You can view metrics about Data Collector, such as the CPU usage or the number of pipeline runners in the thread pool.
            
         
 Viewing Users and Groups
            If you use file-based authentication, you can view all user accounts granted access to this Data Collector instance, including the roles and groups assigned to each user.
            
         
 REST Response
            You can view REST response JSON data for different aspects of the Data Collector, such as pipeline configuration information or monitoring details.
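            For example, assuming the default port and a file-based admin account, a request along these lines returns the pipeline list as JSON. The endpoint path and credentials are assumptions to verify against your Data Collector version:

                curl -u admin:admin http://localhost:18630/rest/v1/pipelines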
            
         
 Command Line Interface
            Data Collector provides a command line interface that includes a basic cli command. Use the command to perform some of the same actions that you can complete from the Data Collector UI. Data Collector must be running before you can use the cli command.
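            As an illustrative sketch, a cli invocation takes the Data Collector URL plus a subcommand. The subcommand and credentials shown here are assumptions to check against the cli help output for your version:

                bin/streamsets cli -U http://localhost:18630 -u admin -p admin store list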
            
         
 Basic Tutorial
            The basic tutorial creates a pipeline that reads a file from a directory, processes the data in two branches, and writes all data to a file system. You'll use data preview to help configure the pipeline, and you'll create a data alert and run the pipeline.
            
         
 Extended Tutorial
            The extended tutorial builds on the basic tutorial, using an additional set of stages to perform some data transformations and write to the Trash development destination. We'll also use data preview to test stage configuration.