Release Notes

5.0.x Release Notes

The Transformer 5.0.0 release occurred on May 30, 2022.

New Features and Enhancements

Stage enhancements
  • Snowflake stages property rename and enhancement - The Additional Snowflake Configuration Properties property is now named Connection Properties and is moved from the Advanced tab to the Connection tab. In addition, you can specify credential functions in the property value to retrieve secrets stored in a credential store. This change affects the following stages:
    • Snowflake origin
    • Snowflake Lookup processor
    • Snowflake destination
Transformer logs
Transformer uses the Apache Log4j 2.17.2 library to write log data. In previous releases, Transformer used the Apache Log4j 1.x library, which is now end-of-life.
Proxy server configuration
To configure Transformer to use a proxy server for outbound network requests, define proxy properties when you set up the deployment.
Previously, you configured Transformer to use a proxy server by defining Java configuration options for the deployment and then setting the STREAMSETS_BOOTSTRAP_JAVA_OPTS environment variable on the Transformer machine.

5.0.0 Fixed Issues

  • You cannot preview or validate a pipeline using embedded Spark libraries.
  • Transformer 4.0.0 or later cannot load runtime resources for a pipeline running on a Hadoop YARN Cloudera distribution cluster.

5.0.x Known Issues

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.

    Redshift pipelines fail with the following error:

    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    GC overhead limit exceeded
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.
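
For the Dataproc Conscrypt workaround for new clusters listed above, the property can be supplied when the cluster is created. The following is a minimal sketch that only assembles a gcloud command line and does not run it. The helper name, cluster name, and region are hypothetical placeholders; the property name and value come from the workaround.

```python
from typing import List


def dataproc_create_args(cluster_name: str, region: str) -> List[str]:
    """Build a `gcloud dataproc clusters create` argument list that disables Conscrypt.

    cluster_name and region are placeholder values supplied by the caller.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        # Property named in the workaround above, set to false.
        "--properties=dataproc:dataproc.conscrypt.provider.enable=false",
    ]


if __name__ == "__main__":
    # Print the command so it can be reviewed before anyone runs it.
    print(" ".join(dataproc_create_args("example-cluster", "us-central1")))
```

Returning a list rather than a single string keeps the arguments safe to pass to a process launcher without shell quoting.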

4.3.x Release Notes

The Transformer 4.3.0 release occurred on April 29, 2022.

New Features and Enhancements

Cluster support
New stage
Stage enhancements
  • Amazon S3 origin property rename - The Bucket property is now named Bucket and Path. It has always allowed entering a path that includes the asterisk (*) and question mark (?) wildcards.
  • New empty dataframe behavior for JDBC Table origins - When there is no data to be read, the following origins now pass the table schema in an empty dataframe:
    • JDBC Table origin
    • MySQL JDBC Table origin
    • Oracle JDBC Table origin
    • PostgreSQL JDBC Table origin

    In previous releases, these origins passed an empty schema with empty dataframes. This change has no upgrade impact because the release adds a Use Empty Schemas property that, when enabled, passes an empty schema with empty dataframes, as in previous releases.

    To preserve backward compatibility, the Use Empty Schemas property is enabled for all upgraded pipelines. For new pipelines, this property is disabled by default.

  • Partition Base Path origin property - The following origins now allow specifying a base path for partitions in a Partition Base Path property:
    • ADLS Gen1 origin
    • ADLS Gen2 origin
    • Amazon S3 origin
    • File origin
    • Google Cloud Storage origin
    • MapR FS origin
  • Skip Empty Batches destination property - The following destinations can now skip writing empty batches when you select the Skip Empty Batches property:
    • ADLS Gen1 destination
    • ADLS Gen2 destination
    • Amazon S3 destination
    • File destination
    • Google Cloud Storage destination
    • MapR FS destination

4.3.0 Fixed Issues

  • When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
    RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.

4.3.x Known Issues

  • Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.

    Redshift pipelines fail with the following error:

    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    GC overhead limit exceeded
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.
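
For the Databricks memory workaround listed above, the four Spark properties can be tuned together. The following is a minimal sketch that renders them as spark-submit style `--conf` pairs. The values are illustrative assumptions, not recommendations, and the helper name is hypothetical; how the properties are supplied to a given pipeline is not covered here.

```python
from typing import Dict, List

# Illustrative values only; appropriate settings depend on the cluster and workload.
EXAMPLE_MEMORY_CONF: Dict[str, str] = {
    "spark.driver.memory": "4g",
    "spark.driver.cores": "2",
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
}


def to_conf_args(conf: Dict[str, str]) -> List[str]:
    """Render Spark properties as repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key in sorted(conf):  # sorted for a stable, reviewable order
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_conf_args(EXAMPLE_MEMORY_CONF)))
```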

4.2.x Release Notes

The Transformer 4.2.0 release occurred on January 21, 2022.

New Features and Enhancements

Internal update
This release includes internal updates to support an upcoming StreamSets DataOps Platform Control Hub feature.
Note: All new Transformer deployments on StreamSets DataOps Platform will use Transformer version 4.2.0 or higher. Existing deployments are not affected.
Clusters
Stage enhancements
  • Amazon S3 destination - When configuring the destination, you can now use the s3 URI scheme, in addition to the s3a scheme. Best practice is to use s3 with EMR clusters and s3a with all other clusters.
  • Field Replacer processor - Use Spark SQL expressions to generate new values for specified fields. You can use quotation marks to specify a string.
  • Google BigQuery origin - The origin can now read from Google BigQuery views.
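
The s3 and s3a best practice noted for the Amazon S3 destination can be sketched as a small helper. This is an illustration only: the function name and the cluster type labels are hypothetical, and only the rule itself (s3 for EMR, s3a otherwise) comes from the note above.

```python
def s3_uri(bucket: str, path: str, cluster_type: str) -> str:
    """Build a destination URI, using s3 for EMR clusters and s3a for all others.

    cluster_type is a free-form label chosen for this sketch, such as "emr".
    """
    scheme = "s3" if cluster_type.strip().lower() == "emr" else "s3a"
    # Drop leading slashes so the bucket and path join with a single separator.
    return f"{scheme}://{bucket}/{path.lstrip('/')}"
```
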
Connections
  • Amazon EMR cluster connections include the following enhancements:
    • You can configure Amazon EMR cluster connections to assume another role.
    • You can specify an SSH EC2 Key ID property for the EC2 SSH key to be used on nodes of the cluster.
Deprecation and testing update
  • The Cloudera CDH 5.x stage libraries are now deprecated. As a result, StreamSets no longer tests Transformer against Cloudera CDH 5.x.
Additional enhancements
  • CyberArk credential store support - You can use CyberArk as a credential store for Transformer.
  • Cluster URL access - When you view the job summary while monitoring a Control Hub job for a Databricks pipeline, you can now access the Databricks cluster job URL.

4.2.0 Fixed Issues

  • Redshift destinations fail to write partitioned data when running on Databricks cluster versions 7.x and later. The pipeline fails with the following error:
    java.sql.SQLException: Invalid operation: Mandatory url is not present in manifest file.
  • The Scala processor always checks if a batch is empty instead of checking only when the Skip Empty Batches property is enabled. This slows performance.
  • Runtime resources are not accessible from Transformer pipelines.
  • When provisioning a Databricks cluster, the user-defined tags defined in cluster configuration properties are not being set.
  • When provisioning a Databricks cluster, the policy_id parameter defined in the cluster configuration properties is ignored.
  • For pipelines run on Databricks clusters, resources are staged in the EBS volumes instead of the Databricks distributed file system (DBFS), and the resources are not being removed when no longer needed.

4.2.x Known Issues

  • As noted in the StreamSets Technical Service Bulletin, Transformer 3.12.0 and later are not vulnerable to the Apache Log4j zero-day vulnerability documented in CVE-2021-44228.

    However, StreamSets highly recommends that you update all clusters that run Transformer pipelines to protect against the zero-day vulnerability.

  • When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.

    SQL Server pipelines fail with the following error:

    Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.

    Redshift pipelines fail with the following error:

    Could not connect to Redshift DB due to error: [JDBC Driver] null

    Workaround: Use one of the following methods to disable Conscrypt:

    • For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
    • For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
  • Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
    Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC.
    
    GC overhead limit exceeded
    
    Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
    Workaround: To address the memory issues, try one or both of the following solutions:
    • Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
    • Increase the memory on Spark cluster nodes.
  • When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
  • When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
    RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.
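
To find values affected by the BIGNUMERIC issue listed above before they reach the Google BigQuery destination, the scale of each value can be checked upstream. The following is a minimal sketch using Python's standard decimal module. The function names are hypothetical, and the threshold of 9 is taken from the issue description.

```python
from decimal import Decimal

MAX_SAFE_SCALE = 9  # threshold taken from the known issue above


def scale_of(value: Decimal) -> int:
    """Return the number of digits after the decimal point (0 for NaN or infinity)."""
    if not value.is_finite():
        return 0
    exponent = value.as_tuple().exponent
    return -exponent if exponent < 0 else 0


def exceeds_safe_scale(value: Decimal) -> bool:
    """Report whether a value has a scale greater than 9."""
    return scale_of(value) > MAX_SAFE_SCALE
```

Trailing zeros count toward the scale here, so `Decimal("1.0000000000")` is reported as exceeding the threshold. Whether that matches the destination's behavior is an assumption.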

4.1.x Release Notes

The Transformer 4.1.0 release occurred on September 27, 2021.

New Features and Enhancements

Clusters
  • Amazon EMR:
    • AWS tags - When provisioning an Amazon EMR cluster, you can specify AWS tags for the cluster.
    • Regions - You can now specify additional regions for EMR clusters.
  • Google Dataproc:
    • Dataproc labels - When provisioning a Google Dataproc cluster, you can specify Dataproc labels for the cluster.
    • Credentials files - You can now specify relative paths in addition to absolute paths to service account credentials files.
    • Regions - You can now specify additional regions for Dataproc clusters.
  • Databricks:
    • Job submission - When you start a pipeline, Transformer now submits the pipeline to a Databricks cluster directly as a workload, creating an ephemeral job.
      Previously, Transformer created one-time jobs, which counted against the job limit on the account. Ephemeral jobs do not count towards the job limit.
      Note: The details of ephemeral jobs do not display with regular jobs through the Databricks job menu. For details, see Upgrade Impact.
    • Init script enhancement - When provisioning a Databricks cluster on Azure, you can now use Azure cluster-scoped init scripts stored on Azure Blob File System that are accessible using an ADLS Gen2 storage account.
Stages
  • New JDBC Query origin - Use the JDBC Query origin to read data from database tables with a custom query.
  • JDBC origin renamed - To clarify the difference between this existing origin and the new JDBC Query origin, the JDBC origin is now known as the JDBC Table origin.
Credential stores
  • HashiCorp Vault credential store - You can use HashiCorp Vault as a credential store for Transformer.
Additional enhancements
  • Job functions - You can now use job functions when you configure any pipeline property that allows expressions.
  • Enabling HTTPS for Transformer - You can now store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources, and then enter a path relative to that directory when you define the keystore and truststore location. This can have upgrade impact.

Upgrade Impact

Java JDK 11 enforcement for Scala 2.12 installations
With this release, when Transformer is prebuilt with Scala 2.12, it requires a Java JDK 11 installation. In previous releases, Transformer prebuilt with Scala 2.12 also required Java JDK 11, but the requirement was not enforced.
If you upgrade to Transformer 4.1.0 prebuilt with Scala 2.12, you must have Java JDK 11 installed on the Transformer machine for Transformer to start.
Databricks job submission change
With this release, Transformer submits jobs to Databricks differently from previous releases.
In previous releases, with each pipeline run, Transformer created a standard Databricks job but used it only once. This job counted toward the Databricks jobs limit.
With this release, Transformer submits ephemeral jobs to Databricks. An ephemeral job runs only once, and does not count towards the Databricks job limit. However, the details for the ephemeral jobs are only available for 60 days, and are not available through the Databricks jobs menu. For information about accessing job details, see Accessing Databricks Job Details.
HDInsight pipelines with ADLS stages
With this release, when you include an ADLS Gen1 or Gen2 stage in a pipeline that runs on an Apache Spark for HDInsight cluster, the stage must use the ADLS cluster-provided libraries stage library.
Enabling HTTPS for Transformer
With this release, when you enable HTTPS for Transformer, you can store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore and truststore location in the Transformer configuration properties.
In previous releases, you could store the keystore and truststore files in the Transformer configuration directory, <installation_dir>/etc, and then define the location to the file using a path relative to that directory. You can continue to store the file in the configuration directory, but StreamSets recommends moving it to the resources directory when you upgrade.
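
The relative-path convention for keystore and truststore files can be illustrated as follows. This is a sketch of the path arithmetic only, not of how Transformer itself looks up the files. The function name is hypothetical, and returning absolute paths unchanged is an assumption made for the example.

```python
from pathlib import Path


def resolve_store_path(installation_dir: str, configured: str) -> Path:
    """Resolve a keystore or truststore location.

    Relative paths are resolved against
    <installation_dir>/externalResources/resources.
    Absolute paths are returned unchanged (an assumption of this sketch).
    """
    candidate = Path(configured)
    if candidate.is_absolute():
        return candidate
    return Path(installation_dir) / "externalResources" / "resources" / candidate
```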

4.1.0 Fixed Issues

  • Pipelines with ADLS stages that run on Azure HDInsight 4.0 clusters with Transformer built for Spark 2.4 fail to start. This fix might cause upgrade impact.
  • Transformer does not enforce the Java JDK 11 requirement for Transformer prebuilt with Scala 2.12. This fix might cause upgrade impact.
  • When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, the job can hang in a failover Transformer in a STARTING state when the Spark job completes before the failover Transformer fully takes over the Control Hub job.
  • Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
  • A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.

4.1.x Known Issues

  • When provisioning a Databricks cluster, user-defined tags defined in cluster configuration properties are not being set.
  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.
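
The "wait a few seconds" workaround listed above for starting or restarting a cluster pipeline can be enforced in automation that starts pipelines. The following is a minimal sketch. The `start_pipeline` callable is a hypothetical stand-in for whatever actually starts the pipeline, and the 5-second default is an assumed value, since the workaround does not give an exact delay.

```python
import time
from typing import Callable, Optional


class StartGuard:
    """Enforce a minimum delay between consecutive pipeline starts."""

    def __init__(self, min_interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_start: Optional[float] = None

    def start(self, start_pipeline: Callable[[], None]) -> float:
        """Call start_pipeline, sleeping first if the previous start was too recent.

        Returns the number of seconds waited.
        """
        waited = 0.0
        if self._last_start is not None:
            remaining = self._min_interval_s - (self._clock() - self._last_start)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_start = self._clock()
        start_pipeline()
        return waited
```

The clock and sleep functions are injectable so that the guard can be tested without real delays.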

4.0.x Release Notes

The Transformer 4.0.0 release occurred on June 21, 2021.

New Features and Enhancements

Spark 3 and Scala 2.12 support

Transformer supports using Spark 3.0 and Scala 2.12 for some cluster types. As a result, StreamSets now provides different installation packages for Transformer.

For information about the clusters that support Spark 3.0, see Cluster Compatibility Matrix. For information about the features available in different versions of Spark, see Spark Versions and Available Features.

Stages
  • New Amazon Redshift origin - Use the Amazon Redshift origin to read data from an Amazon Redshift table.
Clusters
  • Amazon EMR enhancements:
    • Additional EMR support - You can run pipelines on EMR 6.1.x or later 6.x.x clusters. For all supported versions, see Cluster Compatibility Matrix.
    • Bootstrap actions support - When you provision a cluster, you can define bootstrap actions scripts in cluster configuration properties or you can use bootstrap actions scripts stored on Amazon S3.
  • Databricks clusters:
    • Additional Databricks support - You can run pipelines on Databricks 7.x and 8.x clusters. For all supported versions, see Cluster Compatibility Matrix.
    • Cluster-scoped init script support - When you provision a cluster, you can define cluster-scoped init scripts in cluster configuration properties. You can also use cluster-scoped init scripts stored on DBFS or S3. Specifying a location on Azure is not available at this time.
    • Databricks failover support - You can configure pipeline failover for Databricks pipelines.
  • Application Name enhancement - When specifying an application name for a cluster, you can now use underscores in addition to alphanumeric characters.
Connections

With this release, the following stages support using connections:

  • MySQL JDBC Table origin
  • Oracle JDBC Table origin
  • PostgreSQL JDBC Table origin
  • SQL Server JDBC Table origin
  • Amazon Redshift origin and destination - Available after the Data Collector 4.1.0 release.
Additional enhancements
  • TRANSFORMER_EXTERNAL_RESOURCES environment variable - An optional root directory for external resources, such as external libraries and runtime resources.

    The default location is $TRANSFORMER_DIST/externalResources.
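
The relationship between the TRANSFORMER_EXTERNAL_RESOURCES variable and its default location can be sketched as a lookup with a fallback. The function name is hypothetical, and treating an empty value as unset is an assumption of this example.

```python
import os
from typing import Mapping


def external_resources_dir(env: Mapping[str, str] = os.environ) -> str:
    """Return the external resources root directory.

    Uses TRANSFORMER_EXTERNAL_RESOURCES when set, otherwise falls back to
    $TRANSFORMER_DIST/externalResources.
    """
    override = env.get("TRANSFORMER_EXTERNAL_RESOURCES")
    if override:
        return override
    # If TRANSFORMER_DIST is also unset, this yields a relative path.
    return os.path.join(env.get("TRANSFORMER_DIST", ""), "externalResources")
```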

4.0.0 Fixed Issues

  • When you force stop an EMR pipeline, the Spark job on EMR continues to run until the last batch is written.

    With this fix, when you force stop an EMR pipeline, Transformer first tries to stop the Spark job through the YARN service in the cluster. If the YARN service is not reachable, Transformer sends a new step to the EMR cluster with the stop command.

    As a result, if the YARN service is not reachable, Transformer can only force stop the pipeline when all of the following are true:
    • The pipeline runs on EMR 5.28 or later with support for step concurrency.
    • The Step Concurrency property in the pipeline is set to 2 or higher.
    • A step becomes available.
  • When a Databricks pipeline successfully completes, Transformer indicates that it has finished running. However, on the Databricks cluster, the Spark job seems to be cancelled instead.
  • Pipelines that require passing resources such as private key files to Spark executors fail to run on EMR clusters.
  • Upgrading pipelines with the Amazon S3 destination created on Transformer 3.15.0 or earlier to Transformer 3.16.x - 3.18.x can generate errors related to the Partition by Fields stage property.
  • Errors occur when using the Amazon S3 origin and destination in the same pipeline when reading from and writing to different regions.
  • The Field Renamer processor does not rename fields for empty batches.
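
The three conditions listed above for force stopping an EMR pipeline when the YARN service is not reachable can be expressed as a single predicate. This is a sketch of the stated conditions only. The function names are hypothetical, and the version string is assumed to look like "5.28.0".

```python
from typing import Tuple


def parse_emr_version(version: str) -> Tuple[int, int]:
    """Parse a version string such as "5.28.0" into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def can_force_stop_via_step(emr_version: str, step_concurrency: int,
                            step_available: bool) -> bool:
    """Check the conditions for force stopping when YARN is not reachable."""
    return (
        parse_emr_version(emr_version) >= (5, 28)  # EMR 5.28 or later
        and step_concurrency >= 2                  # Step Concurrency is 2 or higher
        and step_available                         # a step becomes available
    )
```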

4.0.x Known Issues

  • A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed status in Databricks.

    Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.

  • At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
  • When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, and the Spark job completes before a failover Transformer fully takes over the Control Hub job, the Control Hub job can hang in the failover Transformer in a STARTING state with the following error:

    CONTAINER_0102 - Cannot change state from STARTING to FINISHING

    Workaround: To correctly finish the Control Hub job, use Control Hub to force stop the job and wait until the job reaches an INACTIVE_ERROR state. Then, acknowledge the error.

  • Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:

    Transformer Spark app is already running. Waiting for callback...

    Workaround: Wait a few seconds before starting the pipeline again.

  • Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
  • A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.
  • The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
  • Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
  • Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.

    Workaround: Wait a few seconds before starting the pipeline again.

  • The File origin processes files with mixed schemas.