What is StreamSets Transformer?
StreamSets Transformer™ is a design and execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing
framework. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform transformations that require heavy processing
on the entire data set in batch or streaming mode.
Pipeline Processing on Spark
Transformer functions as a Spark client that launches distributed Spark applications.
Batch Case Study
Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then stops.
Streaming Case Study
Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes data
as it becomes available. The pipeline runs continuously until you manually stop it.
Installing Transformer
Transformer can work with Apache Spark that runs locally on a single machine or that runs on a cluster. The steps that you use
to install Transformer depend on your Spark installation.
What is a Transformer Pipeline?
A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform the data
along the way.
Transformer for Data Collector Users
For users already familiar with StreamSets Data Collector pipelines, here's how Transformer pipelines are similar... and different.
Execution Mode
Transformer pipelines can run in batch or streaming mode.
Local Pipelines
Local pipelines run on the local Spark installation on the Transformer machine.
Cluster Pipelines on Hadoop YARN
Cluster pipelines run on a cluster, where Spark distributes the processing across nodes in the cluster.
Cluster Pipelines on Databricks
Cluster pipelines run on a Databricks cluster, where Spark distributes the processing across nodes in the cluster.
Extra Spark Configuration
When you create a pipeline, you can define extra Spark configuration properties that determine how the pipeline runs
on Spark. Transformer passes the configuration properties to Spark when it launches the Spark application.
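The extra properties you define are standard Spark configuration keys. As a sketch, the key/value pairs might look like the following (the property names are standard Spark settings; the values are illustrative only, not recommendations):

```python
# Hypothetical example: extra Spark configuration properties a pipeline
# might pass to Spark when Transformer launches the application.
# Keys are standard Spark settings; values are illustrative only.
extra_spark_config = {
    "spark.executor.memory": "4g",              # memory per executor
    "spark.executor.cores": "2",                # cores per executor
    "spark.dynamicAllocation.enabled": "true",  # let Spark scale executors
}

for key, value in sorted(extra_spark_config.items()):
    print(f"{key}={value}")
```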
Partitioning
When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline
data into partitions and performing operations on the partitions in parallel.
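The idea of splitting data into partitions and operating on them in parallel can be sketched in plain Python. This is a conceptual model only, not Spark's or Transformer's actual implementation; the transformation applied to each partition is a stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(records, num_partitions):
    """Divide records into roughly equal partitions, as Spark does with pipeline data."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def process_partition(partition):
    """Apply an operation to one partition; partitions run in parallel."""
    return [value * 2 for value in partition]  # stand-in transformation

records = list(range(10))
partitions = split_into_partitions(records, num_partitions=3)

# Each partition is processed independently, so the work parallelizes.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_partition, partitions))

processed = sorted(v for part in results for v in part)
print(processed)
```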
Caching Data
You can configure most origins and processors to cache data. You might enable caching when a stage passes data to
more than one downstream stage.
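Why caching helps in that case can be sketched with simple memoization: when the same upstream output feeds two downstream stages, a cached result is computed once instead of twice. This is a conceptual sketch, not Transformer's caching mechanism:

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=None)
def origin_read(batch_id):
    """Stand-in for an upstream stage. With caching, its output is computed
    once even when multiple downstream stages request the same data."""
    global call_count
    call_count += 1
    return tuple(range(batch_id * 3, batch_id * 3 + 3))

# Two downstream "stages" consume the same upstream output:
stage_a = sum(origin_read(0))   # first call computes the data
stage_b = max(origin_read(0))   # second call is served from the cache
print(call_count)  # 1 — the cached result served both stages
```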
Performing Lookups
To look up data in a Transformer pipeline, use an additional origin in the pipeline to read the lookup data, then use a Join processor to join the lookup data with the primary pipeline data.
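The lookup pattern described above — a second origin reads the lookup data, and a join combines it with the primary data on a key — can be sketched in plain Python. The record and field names (customer_id, region) are hypothetical:

```python
# Conceptual sketch of the lookup pattern: one "origin" supplies primary
# records, a second supplies lookup data, and a join combines them on a key.
primary_records = [
    {"customer_id": 1, "amount": 250},
    {"customer_id": 2, "amount": 75},
]
lookup_records = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
]

# Index the lookup data by the join key for constant-time matching.
lookup_index = {r["customer_id"]: r for r in lookup_records}

# Inner join: enrich each primary record with its matching lookup fields.
joined = [
    {**record, "region": lookup_index[record["customer_id"]]["region"]}
    for record in primary_records
    if record["customer_id"] in lookup_index
]
print(joined)
```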
Ludicrous Performance Optimization
You can configure Transformer to run a pipeline in ludicrous mode to optimize pipeline performance.
Technology Preview Functionality
Transformer includes certain new features and stages with the Technology Preview designation. Technology Preview functionality
is available for use in development and testing, but is not meant for use in production.
Configuring a Pipeline
Configure a pipeline to define the flow of data. After you configure a pipeline, you can start it.
Origins Overview
An origin stage represents a source for the pipeline. A pipeline can use a single origin stage, or it can use multiple
origin stages and then join them using the Join processor.
Custom Schemas
When reading delimited or JSON data, you can configure an origin to use a custom schema to process the data. By default,
origins infer the schema from the data.
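What a custom schema adds over inference can be sketched with Python's csv module: delimited values arrive as strings, and a declared schema states each field's type so the values can be coerced accordingly. The field names (id, price) and types here are hypothetical, and this is a conceptual illustration rather than Transformer's schema handling:

```python
import csv
import io

# Delimited data; every value is read as a string before any schema applies.
raw = "id,price\n1,19.99\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# A custom schema explicitly declares each field's type.
schema = {"id": int, "price": float}

# Apply the schema: coerce each field to its declared type.
typed = [
    {field: cast(row[field]) for field, cast in schema.items()}
    for row in rows
]
print(typed[0])
```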
Preview Overview
You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.
Preview Codes
In Preview mode, Transformer displays different colors for different types of data. Transformer uses other codes and formatting to highlight changed fields.
Processor Output Order
When previewing data for a processor, you can preview both the input and the output data. You can display the output
records in the order that matches the input records or in the order produced by the processor.
Editing Properties
When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might
edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different
output streams.
Pipeline Monitoring Overview
When Transformer runs a pipeline, you can view real-time statistics about the pipeline.
Pipeline and Stage Statistics
When you monitor a pipeline, you can view real-time summary statistics for the pipeline and for stages in the pipeline.
Spark Web UI
As you monitor a pipeline, you can also access the Spark web UI for the application launched for the pipeline. Use
the Spark web UI to monitor the Spark jobs executed for the launched application, just as you monitor any other Spark
application.
Pipeline Run History
You can view the run history of a pipeline when you configure or monitor a pipeline. View the run history from either
the Summary or History tab.