What is StreamSets Transformer?
StreamSets Transformer™ is a design and execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing
framework. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform transformations that require heavy processing
on the entire data set in batch or streaming mode.
Pipeline Processing on Spark
Transformer functions as a Spark client that launches distributed Spark applications.
Batch Case Study
Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then stops.
Streaming Case Study
Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes data
as it becomes available. The pipeline runs continuously until you manually stop it.
Installing Transformer
Transformer can work with Apache Spark that runs locally on a single machine or that runs on a cluster. The steps that you use
to install Transformer depend on your Spark installation.
What is a Transformer Pipeline?
A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform the data
along the way.
Transformer for Data Collector Users
For users already familiar with StreamSets Data Collector pipelines, here's how Transformer pipelines are similar... and different.
Execution Mode
Transformer pipelines can run in batch or streaming mode.
Local Pipelines
Local pipelines run on the local Spark installation on the Transformer machine.
Cluster Pipelines on Hadoop YARN
Cluster pipelines run on a cluster, where Spark distributes the processing across nodes in the cluster.
Cluster Pipelines on Databricks
Cluster pipelines run on a Databricks cluster, where Spark distributes the processing across nodes in the cluster.
Extra Spark Configuration
When you create a pipeline, you can define extra Spark configuration properties that determine how the pipeline runs
on Spark. Transformer passes the configuration properties to Spark when it launches the Spark application.
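The extra properties you define are standard Spark configuration keys. As a sketch, the key/value pairs might look like the following (the property names are standard Spark settings; the values are illustrative only, not recommendations):

```python
# Hypothetical example: extra Spark configuration properties a pipeline
# might pass to Spark when Transformer launches the application.
# Keys are standard Spark settings; values are illustrative only.
extra_spark_config = {
    "spark.executor.memory": "4g",              # memory per executor
    "spark.executor.cores": "2",                # cores per executor
    "spark.dynamicAllocation.enabled": "true",  # let Spark scale executors
}

for key, value in sorted(extra_spark_config.items()):
    print(f"{key}={value}")
```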
Partitioning
When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline
data into partitions and performing operations on the partitions in parallel.
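The idea of splitting data into partitions and operating on them in parallel can be sketched in plain Python. This is a conceptual model only, not Spark's or Transformer's actual implementation; the transformation applied to each partition is a stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(records, num_partitions):
    """Divide records into roughly equal partitions, as Spark does with pipeline data."""
    return [records[i::num_partitions] for i in range(num_partitions)]

def process_partition(partition):
    """Apply an operation to one partition; partitions run in parallel."""
    return [value * 2 for value in partition]  # stand-in transformation

records = list(range(10))
partitions = split_into_partitions(records, num_partitions=3)

# Each partition is processed independently, so the work parallelizes.
with ThreadPoolExecutor() as pool:
    results = list(pool.map(process_partition, partitions))

processed = sorted(v for part in results for v in part)
print(processed)
```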
Caching Data
You can configure most origins and processors to cache data. You might enable caching when a stage passes data to
more than one downstream stage.
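Why caching helps in that case can be sketched with simple memoization: when the same upstream output feeds two downstream stages, a cached result is computed once instead of twice. This is a conceptual sketch, not Transformer's caching mechanism:

```python
from functools import lru_cache

call_count = 0

@lru_cache(maxsize=None)
def origin_read(batch_id):
    """Stand-in for an upstream stage. With caching, its output is computed
    once even when multiple downstream stages request the same data."""
    global call_count
    call_count += 1
    return tuple(range(batch_id * 3, batch_id * 3 + 3))

# Two downstream "stages" consume the same upstream output:
stage_a = sum(origin_read(0))   # first call computes the data
stage_b = max(origin_read(0))   # second call is served from the cache
print(call_count)  # 1 — the cached result served both stages
```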
Performing Lookups
To look up data in a Transformer pipeline, use an additional origin in the pipeline to read the lookup data, then use a Join processor to join the lookup data with the primary pipeline data.
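The lookup pattern described above — a second origin reads the lookup data, and a join combines it with the primary data on a key — can be sketched in plain Python. The record and field names (customer_id, region) are hypothetical:

```python
# Conceptual sketch of the lookup pattern: one "origin" supplies primary
# records, a second supplies lookup data, and a join combines them on a key.
primary_records = [
    {"customer_id": 1, "amount": 250},
    {"customer_id": 2, "amount": 75},
]
lookup_records = [
    {"customer_id": 1, "region": "EMEA"},
    {"customer_id": 2, "region": "APAC"},
]

# Index the lookup data by the join key for constant-time matching.
lookup_index = {r["customer_id"]: r for r in lookup_records}

# Inner join: enrich each primary record with its matching lookup fields.
joined = [
    {**record, "region": lookup_index[record["customer_id"]]["region"]}
    for record in primary_records
    if record["customer_id"] in lookup_index
]
print(joined)
```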
Ludicrous Performance Optimization
You can configure Transformer to run a pipeline in ludicrous mode to optimize pipeline performance.
Technology Preview Functionality
Transformer includes certain new features and stages with the Technology Preview designation. Technology Preview functionality
is available for use in development and testing, but is not meant for use in production.
Configuring a Pipeline
Configure a pipeline to define the flow of data. After you configure a pipeline, you can start it.
Origins Overview
An origin stage represents a source for the pipeline. A pipeline can use a single origin stage, or it can use multiple
origin stages and then join them using the Join processor.
Custom Schemas
When reading delimited or JSON data, you can configure an origin to use a custom schema to process the data. By default,
origins infer the schema from the data.
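What a custom schema adds over inference can be sketched with Python's csv module: delimited values arrive as strings, and a declared schema states each field's type so the values can be coerced accordingly. The field names (id, price) and types here are hypothetical, and this is a conceptual illustration rather than Transformer's schema handling:

```python
import csv
import io

# Delimited data; every value is read as a string before any schema applies.
raw = "id,price\n1,19.99\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# A custom schema explicitly declares each field's type.
schema = {"id": int, "price": float}

# Apply the schema: coerce each field to its declared type.
typed = [
    {field: cast(row[field]) for field, cast in schema.items()}
    for row in rows
]
print(typed[0])
```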
Preview Overview
You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.
Preview Codes
In Preview mode, Transformer displays different colors for different types of data. Transformer uses other codes and formatting to highlight changed fields.
Processor Output Order
When previewing data for a processor, you can preview both the input and the output data. You can display the output
records in the order that matches the input records or in the order produced by the processor.
Editing Properties
When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might
edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different
output streams.
Pipeline Monitoring Overview
When Transformer runs a pipeline, you can view real-time statistics about the pipeline.
Pipeline and Stage Statistics
When you monitor a pipeline, you can view real-time summary statistics for the pipeline and for stages in the pipeline.
Spark Web UI
As you monitor a pipeline, you can also access the Spark web UI for the application launched for the pipeline. Use
the Spark web UI to monitor the Spark jobs executed for the launched application, just as you monitor any other Spark
application.
Pipeline Run History
You can view the run history of a pipeline when you configure or monitor a pipeline. View the run history from either
the Summary or History tab.