Pipeline Maintenance
Understanding Pipeline States
A pipeline state is the current condition of the pipeline, such as "running" or "stopped". The pipeline state displays in the All Pipelines list and can also appear in the Data Collector log.
- EDITED - The pipeline has been created or modified, and has not run since the last modification.
- FINISHED - The pipeline has completed all expected processing and has stopped running.
- RUN_ERROR - The pipeline encountered an error while running and stopped.
- RUNNING - The pipeline is running.
- STOPPED - The pipeline was manually stopped.
- START_ERROR - The pipeline encountered an error while starting and failed to start.
- STOP_ERROR - The pipeline encountered an error while stopping.
- CONNECT_ERROR - When running a cluster-mode pipeline, Data Collector cannot connect to the underlying cluster manager, such as Mesos or YARN.
- CONNECTING - The pipeline is preparing to restart after a Data Collector restart.
- DISCONNECTED - The pipeline is disconnected from external systems, typically because Data Collector is restarting or shutting down.
- DISCONNECTING - The pipeline is in the process of disconnecting from external systems, typically because Data Collector is restarting or shutting down.
- FINISHING - The pipeline is in the process of finishing all expected processing.
- RETRY - The pipeline is trying to run after encountering an error while running. This occurs only when the pipeline is configured for a retry upon error.
- RUNNING_ERROR - The pipeline encounters errors while running.
- STARTING - The pipeline is initializing, but hasn't started yet.
- STARTING_ERROR - The pipeline encounters errors while starting.
- STOPPING - The pipeline is in the process of stopping after a manual request to stop.
- STOPPING_ERROR - The pipeline encounters errors while stopping.
State Transition Examples
- Starting a pipeline
- When you successfully start a pipeline for the first time, a pipeline transitions through the following states:
(EDITED)... STARTING... RUNNING
- Stopping or restarting Data Collector
- When Data Collector shuts down, running pipelines transition through the following states:
(RUNNING)... DISCONNECTING... DISCONNECTED
- Retrying a pipeline
- When a pipeline is configured to retry upon error, Data Collector performs the specified number of retries when the pipeline encounters errors while running.
- Stopping a pipeline
- When you successfully stop a pipeline, a pipeline transitions through the following states:
(RUNNING)... STOPPING... STOPPED
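The transitions above can be sketched as a small state machine. This is an illustrative simplification for reasoning about the states listed in this section, not Data Collector's actual implementation; the event names (`start`, `stopped`, and so on) are hypothetical.

```python
# Simplified sketch of the pipeline state transitions described above.
# State names come from this document; the event names are assumptions.
TRANSITIONS = {
    "EDITED":        {"start": "STARTING"},
    "STARTING":      {"started": "RUNNING", "error": "STARTING_ERROR"},
    "RUNNING":       {"stop": "STOPPING", "finish": "FINISHING",
                      "error": "RUNNING_ERROR", "disconnect": "DISCONNECTING"},
    "STOPPING":      {"stopped": "STOPPED", "error": "STOPPING_ERROR"},
    "FINISHING":     {"finished": "FINISHED"},
    "DISCONNECTING": {"disconnected": "DISCONNECTED"},
    "RUNNING_ERROR": {"retry": "RETRY", "fail": "RUN_ERROR"},
    "RETRY":         {"started": "RUNNING"},
}

def next_state(state: str, event: str) -> str:
    """Return the next pipeline state, or raise on an invalid transition."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event!r}")

# Starting a pipeline for the first time: EDITED -> STARTING -> RUNNING
state = "EDITED"
for event in ("start", "started"):
    state = next_state(state, event)
print(state)  # RUNNING
```

The same table walks the stop path: `RUNNING` with `stop` yields `STOPPING`, then `stopped` yields `STOPPED`, matching the example above.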
Starting Pipelines
You can start Data Collector pipelines when they are valid. When you start a pipeline, Data Collector runs the pipeline until you stop the pipeline or shut down Data Collector.
For most origins, when you restart a pipeline, Data Collector starts the pipeline from where it last stopped by default. You can reset the origin to read all available data.
The Kafka Consumer origin starts processing data based on the offset stored in Kafka or ZooKeeper.
You can start pipelines from the following locations:
- From the Home page, select pipelines in the list and then click the Start icon.
- From the pipeline canvas, click the Start icon.
If the Start icon is not enabled, the pipeline is not valid.
Starting Pipelines with Parameters
If you defined runtime parameters for a pipeline, you can specify the parameter values to use when you start the pipeline.
For more information, see Using Runtime Parameters.
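Runtime parameters are referenced in pipeline properties and resolved with the values supplied at start time. The following sketch shows the general idea with a simple `${name}` substitution; it is illustrative only, and Data Collector's own expression language is richer than this.

```python
import re

def resolve_parameters(config_value: str, parameters: dict) -> str:
    """Substitute ${name} references with values supplied at start time.
    Illustrative sketch only, not Data Collector's expression language."""
    def replace(match):
        name = match.group(1)
        if name not in parameters:
            raise KeyError(f"runtime parameter {name!r} was not supplied")
        return str(parameters[name])
    return re.sub(r"\$\{(\w+)\}", replace, config_value)

# Hypothetical parameter values passed when starting the pipeline:
print(resolve_parameters("/data/${env}/input", {"env": "test"}))  # /data/test/input
```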
Resetting the Origin
You can reset the origin when you want Data Collector to process all available data instead of data from the last-saved offset. Reset the origin when the pipeline is not running. You can reset the origin for the following origins:
- Amazon S3
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Directory
- Elasticsearch
- File Tail
- Google Cloud Storage
- Groovy Scripting
- Hadoop FS Standalone
- HTTP Client
- JavaScript Scripting
- JDBC Multitable Consumer
- JDBC Query Consumer
- Jython Scripting
- Kinesis Consumer
- MapR DB JSON
- MapR FS Standalone
- MongoDB
- MongoDB Oplog
- MySQL Binary Log
- Salesforce
- SAP HANA Query Consumer
- SFTP/FTP/FTPS Client
- SQL Server 2019 BDC Multitable Consumer
- SQL Server CDC Client
- SQL Server Change Tracking
- Teradata Consumer
- Windows Event Log
For these origins, when you stop the pipeline, Data Collector notes where it stopped processing data. When you restart the pipeline, processing continues from where it left off by default. To process all available data instead, reset the origin. For details unique to the Kinesis Consumer origin, see Resetting the Kinesis Consumer Origin.
You can configure the Kafka Consumer and MapR Streams Consumer origins to process all available data by specifying an additional Kafka configuration property. You can reset the Azure IoT/Event Hub Consumer origin by deleting the offset details in the Microsoft Azure portal. For the remaining origin stages, resetting the origin has no effect because they process transient data.
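The last-saved-offset behavior can be pictured as a small offset store: restarting reads from the saved offset, and resetting the origin discards it so the next run processes everything. This is a hypothetical helper for illustration; Data Collector persists offsets internally.

```python
from typing import Optional

class OffsetStore:
    """Sketch of last-saved-offset behavior for origins that support it.
    Hypothetical helper; Data Collector manages offsets itself."""

    def __init__(self):
        self._offsets = {}  # pipeline name -> last-saved offset

    def save(self, pipeline: str, offset: str) -> None:
        # Recorded when the pipeline stops.
        self._offsets[pipeline] = offset

    def starting_offset(self, pipeline: str) -> Optional[str]:
        # None means "process all available data from the beginning".
        return self._offsets.get(pipeline)

    def reset(self, pipeline: str) -> None:
        # Resetting the origin discards the saved offset, so the
        # next run processes all available data.
        self._offsets.pop(pipeline, None)

store = OffsetStore()
store.save("orders", "file-042::line-1000")  # hypothetical offset format
store.reset("orders")
print(store.starting_offset("orders"))  # None
```

As the section notes, reset the origin only while the pipeline is not running, so there is no in-progress batch whose offset could be overwritten.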
You can reset the origin for multiple pipelines at the same time from the Home page. Or, you can reset the origin for a single pipeline from the pipeline canvas.
To reset the origin:
- Select multiple pipelines from the Home page, or view a single pipeline in the pipeline canvas.
- Click the More icon, and then click Reset Origin.
- In the Reset Origin Confirmation dialog box, click Yes to reset the origin.
Stopping Pipelines
Stop pipelines when you want Data Collector to stop processing data for the pipelines.
When stopping a pipeline, Data Collector waits for the pipeline to gracefully complete all tasks for the in-progress batch. In some situations, this can take several minutes.
For example, if a scripting processor includes code with a timed wait, Data Collector waits for the scripting processor to complete its task. Then, Data Collector waits for the rest of the pipeline to complete all tasks before stopping the pipeline.
When Data Collector runs a pipeline, it displays in the Data Collector UI in Monitor mode by default.
Forcing a Pipeline to Stop
When necessary, you can force Data Collector to stop a pipeline.
When forcing a pipeline to stop, Data Collector often stops processes before they complete, which can lead to unexpected results.
Importing Pipelines
Import pipelines to use pipelines developed on a different Data Collector or to restore backup files.
You can import pipelines that were developed on the same version of Data Collector or on an earlier version of Data Collector. Data Collector does not support importing a pipeline developed on a later version of Data Collector.
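The compatibility rule above amounts to a version comparison: a pipeline is importable when its Data Collector version is the same as or earlier than the target's. The sketch below illustrates that check with a plain numeric comparison; it is an assumption for illustration and does not handle version strings with suffixes.

```python
def can_import(pipeline_version: str, collector_version: str) -> bool:
    """A pipeline developed on the same or an earlier Data Collector
    version can be imported; one from a later version cannot.
    Illustrative sketch; real version strings may carry suffixes."""
    def parse(version: str):
        return tuple(int(part) for part in version.split("."))
    return parse(pipeline_version) <= parse(collector_version)

print(can_import("3.19.0", "3.22.2"))  # True  (earlier version)
print(can_import("4.0.0", "3.22.2"))   # False (later version)
```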
You can import pipelines from individual pipeline files, from a ZIP file containing multiple pipeline files, or from an external HTTP URL. Pipeline files are JSON files exported from a Data Collector.
Importing a Pipeline
- To import a single pipeline, from the Home page, click the Import Pipeline icon.
- In the Import Pipeline dialog box, enter a pipeline title and optional description.
- Browse and select the pipeline file, and then click Open.
- Click Import.
Importing a Set of Pipelines from an Archive File
You can import a set of pipelines from a ZIP file that contains multiple pipeline JSON files. When you import a set of pipelines, Data Collector imports the pipelines with their existing names. If necessary, you can rename the pipelines after the import.
- To import a set of pipelines, from the Home page, click the Import Pipelines from Archive icon.
- In the Import Pipelines from Archive dialog box, browse and select the ZIP file that contains the pipeline files, and then click Open.
- To import all pipelines in the file, click Import.
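Since an archive import is just a ZIP of exported pipeline JSON files, you can inspect one before importing it. The sketch below lists the pipeline titles found in such an archive; the `title` field name is an assumption about the export format, used here for illustration.

```python
import io
import json
import zipfile

def pipelines_in_archive(zip_bytes: bytes) -> list:
    """List pipeline titles found in a ZIP of exported pipeline JSON
    files. The 'title' field name is an assumed part of the format."""
    titles = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                pipeline = json.loads(archive.read(name))
                titles.append(pipeline.get("title", name))
    return titles

# Build a small in-memory archive to demonstrate:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("orders.json", json.dumps({"title": "Orders Pipeline"}))
    z.writestr("logs.json", json.dumps({"title": "Logs Pipeline"}))
print(pipelines_in_archive(buf.getvalue()))  # ['Orders Pipeline', 'Logs Pipeline']
```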
Importing a Pipeline from an HTTP URL
When you import a pipeline from an HTTP URL, you can rename the pipeline during the import.
Sharing Pipelines
When you create a pipeline, you become the owner of the pipeline. As the owner of a pipeline, you have all permissions for the pipeline, you can configure pipeline sharing, and you can change the owner of the pipeline. A pipeline can have a single user as the owner.
Like the pipeline owner, a user with the Admin role also has all permissions for all pipelines, can configure pipeline sharing, and can change the pipeline owner.
By default, all other users have no access to pipelines. To allow other users to work with a pipeline, you must share the pipeline with the users or their groups, and configure pipeline permissions.
| Permission | Description |
|---|---|
| Read | View and monitor the pipeline, and see alerts. View existing snapshot data. |
| Write | Edit the pipeline and alerts. |
| Execute | Start and stop the pipeline. Preview data and take a snapshot. |
When someone shares a pipeline with you, it displays under the Shared With Me label in the pipeline library.
For more information about roles and permissions, see Roles and Permissions.
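The rules above combine into an effective-permission check: the owner and Admin users hold every permission, while other users hold the union of what was shared with them and with their groups. The sketch below illustrates that logic; the data shapes (`owner`, `acl`) are hypothetical, not Data Collector's internal representation.

```python
ALL_PERMISSIONS = {"read", "write", "execute"}

def effective_permissions(user, groups, pipeline, is_admin=False):
    """Sketch of the permission model in the table above.
    The pipeline dict shape here is an illustrative assumption."""
    # Owner and Admin implicitly hold all permissions.
    if is_admin or user == pipeline["owner"]:
        return set(ALL_PERMISSIONS)
    # Others hold the union of grants to them and to their groups.
    granted = set(pipeline["acl"].get(user, ()))
    for group in groups:
        granted |= set(pipeline["acl"].get(group, ()))
    return granted

pipeline = {"owner": "ana", "acl": {"ops": ["read", "execute"], "bo": ["write"]}}
print(sorted(effective_permissions("bo", ["ops"], pipeline)))  # ['execute', 'read', 'write']
```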
Sharing a Pipeline
You can share a pipeline if you are the owner of the pipeline or a user with the Admin role.
You can configure pipeline sharing at any time, but pipeline permissions are enforced only when Data Collector is enabled to use pipeline access controls. The sharing configuration goes into effect once access controls are enabled.
Changing the Pipeline Owner
The pipeline owner has all permissions for the pipeline and can configure sharing for other users and groups. There can only be one pipeline owner.
Adding Labels to Pipelines
You can add labels to pipelines to group similar pipelines. For example, you might want to group pipelines by database schema or by the test or production environment.
You can use nested labels to create a hierarchy of pipeline groupings. Enter nested labels using the following format:
<label1>/<label2>/<label3>
For example, to group pipelines in the test environment by the origin system, you might add the labels Test/HDFS and Test/Elasticsearch to the appropriate pipelines.
You can add labels to pipelines from the following locations:
- From the Home page, select pipelines in the list, click the More icon, and then click Add Labels. Enter labels and then click Save. Note: Labels that have already been added to the pipeline are ignored.
- From the pipeline canvas, click the General tab and then enter labels for the Labels property.
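Because nested labels share a common first segment, grouping pipelines by environment is a matter of splitting each label on `/`. The sketch below groups a few hypothetical pipelines by their top-level label, using the Test/HDFS and Test/Elasticsearch labels from the example above.

```python
from collections import defaultdict

def group_by_top_label(pipelines):
    """Group pipeline names by the first segment of each nested label
    (<label1>/<label2>/...). A pipeline may carry several labels."""
    groups = defaultdict(list)
    for name, labels in pipelines.items():
        for label in labels:
            top = label.split("/", 1)[0]
            groups[top].append(name)
    return dict(groups)

# Hypothetical pipelines and their labels:
pipelines = {
    "ingest-hdfs": ["Test/HDFS"],
    "ingest-es": ["Test/Elasticsearch"],
    "billing": ["Production/JDBC"],
}
print(group_by_top_label(pipelines))
```

Splitting with `maxsplit=1` keeps deeper levels (such as `HDFS`) available for further grouping if needed.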
Exporting Pipelines
Export pipelines to create backups or to use the pipelines with another Data Collector. You can export pipelines with or without plain text credentials configured in the pipeline. You can export a single pipeline or a set of pipelines.
Exporting Pipelines for Control Hub
If you develop pipelines in a Data Collector that is not registered with Control Hub, export valid pipelines for use in Control Hub.
If you develop pipelines in a Data Collector that is registered with Control Hub, publish the pipelines directly to Control Hub.
You can export a single pipeline or a set of pipelines. When you export pipelines for Control Hub, Data Collector exports the pipelines without plain text credentials.
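Exporting without plain text credentials amounts to blanking sensitive values in the exported pipeline JSON. The sketch below illustrates that idea; which fields count as credentials is an assumption here (Data Collector determines this from stage definitions), and the JSON shape is a simplified stand-in for the real export format.

```python
import json

# Assumed set of credential field names, for illustration only.
SENSITIVE_FIELDS = {"password", "accessKey", "secretKey", "token"}

def scrub(node):
    """Return a copy of an exported pipeline JSON structure with
    assumed credential fields blanked out."""
    if isinstance(node, dict):
        return {k: ("" if k in SENSITIVE_FIELDS else scrub(v))
                for k, v in node.items()}
    if isinstance(node, list):
        return [scrub(item) for item in node]
    return node

# Simplified stand-in for an exported pipeline file:
exported = {"title": "Orders", "stages": [{"name": "jdbc", "password": "s3cret"}]}
print(json.dumps(scrub(exported)))
# {"title": "Orders", "stages": [{"name": "jdbc", "password": ""}]}
```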
Duplicating a Pipeline
Duplicate a pipeline when you want to keep the existing version of a pipeline while continuing to configure a duplicate version. A duplicate is an exact copy of the original pipeline.
When you duplicate a pipeline, you can rename the pipeline and specify the number of copies to make.