Transformer User Guide

What is StreamSets Transformer?

StreamSets Transformer™ is an execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform transformations that require heavy processing on the entire data set in batch or streaming mode.

Pipeline Processing on Spark

Transformer functions as a Spark client that launches distributed Spark applications.

Batch Case Study

Transformer can run pipelines in batch mode. A batch pipeline processes all available data in a single batch, and then stops.

Streaming Case Study

Transformer can run pipelines in streaming mode. A streaming pipeline maintains connections to origin systems and processes data at user-defined intervals. The pipeline runs continuously until you manually stop it.

Transformer for Data Collector Users

For users already familiar with StreamSets Data Collector pipelines, here's how Transformer pipelines are similar... and different.

Tutorials and Sample Pipelines

StreamSets provides tutorials and sample pipelines to help you learn about using Transformer.

Installation Requirements

Installing Transformer

Protecting Sensitive Data in Configuration Files

Customization with Environment Variables

Enabling HTTPS

Using a Reverse Proxy

Credential Stores

Stage-Related Prerequisites

External Libraries

User Authentication

Starting and Logging in to Transformer

Uninstallation

Overview

Pre Upgrade Tasks

Upgrade an Installation from the Tarball

Upgrade an Installation from the RPM Package

Post Upgrade Tasks

Troubleshooting an Upgrade

Registration with Control Hub Overview

Register Transformer

To register Transformer with Control Hub, you generate an authentication token and modify the Transformer configuration files.

Unregister Transformer

You can unregister a Transformer from Control Hub when you no longer want to use that Transformer installation with Control Hub.

Disconnected Mode

Control Hub Configuration File

You can customize how a registered Transformer works with Control Hub by editing the Control Hub configuration file, $TRANSFORMER_CONF/dpm.properties, located in the Transformer installation.

Overview

Amazon EMR

Amazon EMR Serverless

Cloudera Data Engineering

Databricks

Google Dataproc

Hadoop YARN

SQL Server 2019 Big Data Cluster

What is a Transformer Pipeline?

A Transformer pipeline describes the flow of data from origin systems to destination systems and defines how to transform the data along the way.

Sample Pipelines

Transformer provides sample pipelines that you can use to learn about Transformer features or as a template for building your own pipelines.

Stage Library Match Requirement

Local Pipelines

Typically, you run a Transformer pipeline on a cluster. You can also run a pipeline on a Spark installation on the Transformer machine. This is known as a local pipeline.

Spark Executors

A Transformer pipeline runs on one or more Spark executors.

Partitioning

When you start a pipeline, StreamSets Transformer launches a Spark application. Spark runs the application just as it runs any other application, splitting the pipeline data into partitions and performing operations on the partitions in parallel.
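
As a rough illustration of the idea of splitting data into partitions and operating on them in parallel, the following minimal Python sketch simulates it with a thread pool. It does not use Spark or any Transformer API. The function names, the round-robin split, and the doubling operation are all hypothetical choices made for this example, and Spark's actual partitioning strategy depends on the origin and configuration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Distribute records round-robin across num_partitions lists (illustrative only)."""
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        partitions[index % num_partitions].append(record)
    return partitions


def transform_partition(partition):
    """Hypothetical per-partition operation: double every value."""
    return [value * 2 for value in partition]


def run_in_parallel(records, num_partitions=3):
    """Apply the same operation to every partition, one worker per partition."""
    partitions = split_into_partitions(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # pool.map returns results in the same order as the input partitions.
        return list(pool.map(transform_partition, partitions))
```

Calling `run_in_parallel(list(range(6)))` splits six records into three partitions and transforms each partition independently, which is the same shape of work that Spark distributes across executors.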

Offset Handling

Batch Header Attributes

Batch header attributes are attributes in batch headers that you can use in pipeline logic.

Delivery Guarantee

Transformer's offset handling ensures that a Transformer pipeline does not lose data when a sudden failure occurs: it processes data at least once. After such a failure, up to one batch of data may be reprocessed. This is an at-least-once delivery guarantee.
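
To make the at-least-once behavior concrete, the following Python sketch simulates one common way such a guarantee arises: the offset is saved only after a batch has been written, so a failure between the write and the save causes that one batch to be processed again on restart. This is an assumption made for illustration, not a description of Transformer's internal implementation. The names `run_pipeline`, `offset_store`, and `fail_before_commit_at` are hypothetical.

```python
class SimulatedFailure(Exception):
    """Stands in for a sudden pipeline failure."""


def run_pipeline(source, destination, offset_store, batch_size,
                 fail_before_commit_at=None):
    """Process batches starting from the stored offset.

    The offset is saved only after the batch is written, so a failure
    between the two steps leads to reprocessing, never to data loss.
    """
    offset = offset_store.get("offset", 0)
    while offset < len(source):
        batch = source[offset:offset + batch_size]
        destination.extend(batch)              # 1. write the batch
        if fail_before_commit_at == offset:
            raise SimulatedFailure(f"failed after writing batch at offset {offset}")
        offset += len(batch)
        offset_store["offset"] = offset        # 2. save the new offset


def demo():
    """Run once with a failure, then restart from the stored offset."""
    source = list(range(6))
    destination, offset_store = [], {}
    try:
        run_pipeline(source, destination, offset_store, batch_size=2,
                     fail_before_commit_at=2)
    except SimulatedFailure:
        pass  # restart below, resuming from the last saved offset
    run_pipeline(source, destination, offset_store, batch_size=2)
    return destination
```

In this simulation, `demo()` returns every source record at least once, with the single batch that was in flight during the failure appearing twice.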

Caching Data

You can configure most origins and processors to cache data. You might enable caching when a stage passes data to more than one downstream stage.
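
The following Python sketch illustrates why caching can help when one stage feeds several downstream stages. It assumes a lazy evaluation model, as in Spark, where a stage's output is recomputed for each consumer unless it is cached. The `Stage` class and its attributes are hypothetical and do not correspond to a Transformer API.

```python
class Stage:
    """Hypothetical lazily evaluated stage that counts how often it runs."""

    def __init__(self, compute, cache=False):
        self.compute = compute    # function that produces the stage output
        self.cache = cache        # whether to keep the output after the first run
        self.runs = 0             # number of times compute was actually called
        self._cached = None
        self._has_cached = False

    def output(self):
        """Return the stage output, recomputing it unless a cached copy exists."""
        if self.cache and self._has_cached:
            return self._cached
        self.runs += 1
        result = self.compute()
        if self.cache:
            self._cached = result
            self._has_cached = True
        return result


def feed_two_downstream_stages(stage):
    """Two downstream consumers each request the upstream output."""
    total = sum(stage.output())
    largest = max(stage.output())
    return total, largest, stage.runs
```

With `Stage(lambda: [1, 2, 3])` the upstream computation runs twice, once per downstream consumer. With `cache=True` it runs once and the second consumer reuses the stored output.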

Performing Lookups

Expressions in Pipeline and Stage Properties

Data Types

Deprecated Functionality

Execution Mode

Cluster Callback URL

Preprocessing Script

Extra Spark Configuration

Ludicrous Processing Mode

Runtime Values

Simple and Bulk Edit Mode

Validation

Amazon Security

Security in Kafka Stages

Kafka Message Keys

SQL Server 2019 JDBC Connection Information

Configuring a Pipeline

PostgreSQL JDBC Table

Snowflake

SQL Server JDBC Table

Slowly Changing Dimension

Surrogate Key Generator

You can preview data to help build or fine-tune a pipeline. You can preview complete or incomplete pipelines.

Preview Codes

In Preview mode, Transformer displays different colors for different types of data. Transformer uses other codes and formatting to highlight changed fields.

Processor Output Order

When previewing data for a processor, you can preview both the input and the output data. You can display the output records in the order that matches the input records or in the order produced by the processor.

Previewing a Pipeline

Editing Properties

When running preview, you can edit stage properties to see how the changes affect preview data. For example, you might edit the condition in a Stream Selector processor to see how the condition alters which records pass to the different output streams.

Overview

When Transformer runs a pipeline, you can view real-time statistics about the pipeline.

Pipeline and Stage Statistics

When you monitor a pipeline, you can view real-time summary statistics for the pipeline and for stages in the pipeline.

Cluster and Spark URLs

In monitor mode, the Monitoring panel provides URLs for the cluster or the Spark application that runs the pipeline.

Pipeline Run History

You can view the run history of a pipeline when you configure or monitor a pipeline. View the run history from either the Summary or History tab.

Providing an Activation Code

Viewing Transformer Configuration Properties

Viewing Transformer Directories

You can view the directories that Transformer uses. You might check the directories in use when you need to access a file in a directory or increase the amount of available space for a directory.

Viewing Transformer Metrics

Log Files

Shutting Down Transformer

You can shut down and then manually launch Transformer to apply changes to the Transformer configuration file, environment configuration file, or user logins.

Restarting Transformer

You can restart Transformer to apply changes to the Transformer configuration file, environment configuration file, or user logins. During the restart process, Transformer shuts down and then automatically restarts.

Opting Out of Usage Statistics Collection

You can help to improve Transformer by allowing StreamSets to collect usage statistics about Transformer system performance and features that you use. This information helps StreamSets to improve product performance and to make product development decisions.

Pipelines

Origins

StreamSets Expression Language