Release Notes
5.0.x Release Notes
The Transformer 5.0.0 release occurred on May 30, 2022.
New Features and Enhancements
- Stage enhancements
-
- Snowflake stages property rename and enhancement - The
Additional Snowflake Configuration Properties property is now named
Connection Properties and is moved from the Advanced tab to the
Connection tab. In addition, you can specify credential functions in the
property value to retrieve secrets stored in a credential store. This
change affects the following stages:
- Snowflake origin
- Snowflake Lookup processor
- Snowflake destination
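For example, assuming a credential store with the ID cyberark and a secret named snowflake/password (both placeholder names), a Connection Properties value could call the StreamSets credential function instead of embedding the secret directly:
${credential:get("cyberark", "all", "snowflake/password")}
The exact function arguments depend on the credential store configured for your Transformer installation.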
- Transformer logs
- Transformer uses the Apache Log4j 2.17.2 library to write log data. In previous releases, Transformer used the Apache Log4j 1.x library, which has reached end of life.
- Proxy server configuration
- To configure Transformer to use a proxy server for outbound network requests, define proxy properties when you set up the deployment.
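For example, the deployment Java options might include the standard JVM proxy system properties; the host, port, and bypass list below are placeholders, and the exact properties to define depend on your deployment and proxy setup:
-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=3128
-Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=3128
-Dhttp.nonProxyHosts="localhost|*.internal.example.com"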
5.0.0 Fixed Issues
- You cannot preview or validate a pipeline using embedded Spark libraries.
- Transformer 4.0.0 or later cannot load runtime resources for a pipeline running on a Hadoop YARN Cloudera distribution cluster.
5.0.x Known Issues
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt (see the examples below):
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
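For example, on an existing cluster you might comment out the Conscrypt provider entry in $JAVA_HOME/jre/lib/security/java.security (the provider number varies by cluster image):
#security.provider.1=org.conscrypt.OpenSSLProvider
For a new cluster, you might pass the property when creating the cluster with the gcloud CLI; the cluster name and region below are placeholders:
gcloud dataproc clusters create my-cluster --region=us-central1 --properties dataproc:dataproc.conscrypt.provider.enable=false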
- Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC. GC overhead limit exceeded Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
Workaround: To address the memory issues, try one or both of the following solutions:
- Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores (see the example below).
- Increase the memory on Spark cluster nodes.
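For example, Spark properties along the following lines could be added to the pipeline's Spark configuration or set directly on the cluster; the values shown are illustrative only and should be sized for your cluster:
spark.driver.memory=8g
spark.driver.cores=2
spark.executor.memory=8g
spark.executor.cores=4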
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- A successful Transformer
pipeline run on a provisioned Databricks cluster displays a Cancelled or Failed
status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Starting a cluster pipeline multiple times, in quick succession, can cause the
pipeline to hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.
4.3.x Release Notes
The Transformer 4.3.0 release occurred on April 29, 2022.
New Features and Enhancements
- Cluster support
-
- Cloudera Data Engineering cluster support - Transformer now supports running pipelines on Cloudera Data Engineering virtual clusters.
- Cloudera CDP Private Cloud Base 7.1.x support - Transformer now supports Hadoop YARN clusters on Cloudera CDP Private Cloud Base 7.1.x.
- EMR connection retry properties - You can configure the
following new properties to define how Transformer retries a failed request or throttling error for an EMR cluster:
- Max Retries
- Retry Base Delay
- Throttling Retry Base Delay
- Max Backoff
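Conceptually, these properties combine in a standard exponential backoff pattern: the delay starts at the base delay (with a separate base delay for throttling errors), grows with each retry, and is capped at the max backoff, for up to the maximum number of retries. The following Scala sketch only illustrates that pattern and is not Transformer's actual implementation:
object RetrySketch {
  // Delay doubles with each attempt from the base delay, capped at the max backoff.
  def backoffDelayMs(attempt: Int, baseDelayMs: Long, maxBackoffMs: Long): Long =
    math.min(baseDelayMs * (1L << attempt), maxBackoffMs)

  // Retries a call up to maxRetries times, sleeping between failed attempts.
  // A throttling error would typically use a larger base delay than other failures.
  def retry[T](maxRetries: Int, baseDelayMs: Long, maxBackoffMs: Long)(call: () => T): T = {
    var attempt = 0
    while (true) {
      try {
        return call()
      } catch {
        case e: Exception if attempt < maxRetries =>
          Thread.sleep(backoffDelayMs(attempt, baseDelayMs, maxBackoffMs))
          attempt += 1
      }
    }
    sys.error("unreachable")
  }
}
In Transformer, these values come from the retry properties listed above rather than from code.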
- New stage
-
- New JSON Parser processor - Use the JSON Parser processor to parse a JSON object embedded in a string field.
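For example, given a string field named payload with the hypothetical value {"city":"Berlin","zip":"10115"}, the processor could produce a parsed field whose city and zip subfields hold Berlin and 10115. The field names and values here are illustrative only.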
- Stage enhancements
-
- Amazon S3 origin property rename - The Bucket property is now named Bucket and Path, to reflect that the property has always allowed entering a path that includes the asterisk (*) and question mark (?) wildcards.
- New empty dataframe behavior for JDBC Table origins - When there is
no data to be read, the following origins now pass the table schema
in an empty dataframe:
- JDBC Table origin
- MySQL JDBC Table origin
- Oracle JDBC Table origin
- PostgreSQL JDBC Table origin
In previous releases, these origins passed an empty schema with empty dataframes. This change has no upgrade impact because it includes a new Use Empty Schemas property that passes an empty schema with empty dataframes.
To preserve backward compatibility, the Use Empty Schemas property is enabled for all upgraded pipelines. For new pipelines, this property is disabled by default.
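The difference matters to downstream stages that depend on column names. The following Spark (Scala) sketch, using hypothetical column names, contrasts an empty dataframe that carries the table schema with one that has an empty schema:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("empty-df-sketch").master("local[*]").getOrCreate()

// Empty dataframe that still carries the table schema (new default behavior).
val tableSchema = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))
val emptyWithSchema = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], tableSchema)
emptyWithSchema.printSchema()   // shows the id and name columns, with zero rows

// Empty dataframe with an empty schema (previous behavior).
val emptyNoSchema = spark.emptyDataFrame
emptyNoSchema.printSchema()     // shows no columns
With Use Empty Schemas enabled, as in upgraded pipelines, the origins continue to behave like the second case.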
- Partition Base Path origin property - The following origins now
allow specifying a base path for partitions in a Partition Base Path
property:
- ADLS Gen1 origin
- ADLS Gen2 origin
- Amazon S3 origin
- File origin
- Google Cloud Storage origin
- MapR FS origin
- Skip Empty Batches destination property - The following destinations
can now skip writing empty batches when you select the Skip Empty
Batches property:
- ADLS Gen1 destination
- ADLS Gen2 destination
- Amazon S3 destination
- File destination
- Google Cloud Storage destination
- MapR FS destination
4.3.0 Fixed Issues
- When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.
4.3.x Known Issues
- Pipelines that run on a Spark 3.x Cloudera Data Engineering cluster can fail when they include a File stage that reads from or writes to HDFS with Kerberos enabled.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC. GC overhead limit exceeded Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
Workaround: To address the memory issues, try one or both of the following solutions:
- Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
- Increase the memory on Spark cluster nodes.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or
Failed status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Starting a cluster pipeline multiple times, in quick succession, can cause the
pipeline to hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.
4.2.x Release Notes
The Transformer 4.2.0 release occurred on January 21, 2022.
New Features and Enhancements
- Internal update
- This release includes internal updates to support an upcoming StreamSets DataOps Platform Control Hub feature.
- Clusters
-
- Amazon EMR:
- You can specify an SSH EC2 Key ID property for the EC2 SSH key to be used on nodes of the cluster.
- You can configure pipelines on an EMR cluster to assume another role.
- Databricks:
- You can run pipelines on Databricks 9.1 LTS clusters.
- You can access the Databricks job details when you view a job run summary from the job History tab.
- Google Dataproc 2.0 - You can run pipelines on Google Dataproc 2.0.
- Amazon EMR:
- Stage enhancements
-
- Amazon S3 destination - When configuring the destination, you can now use the s3 URI scheme, in addition to the s3a scheme. Best practice is to use s3 with EMR clusters and s3a with all other clusters (see the example below).
- Field Replacer processor - Use Spark SQL expressions to generate new values for specified fields. You can use quotation marks to specify a string.
- Google Big Query origin - The origin can now read from Google BigQuery views.
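For example, the same bucket and path (placeholder names) written with each URI scheme:
s3://my-bucket/output/events/ for EMR clusters
s3a://my-bucket/output/events/ for all other clusters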
- Connections
-
- Amazon EMR cluster connections include the following
enhancements:
- You can configure Amazon EMR cluster connections to assume another role.
- You can specify an SSH EC2 Key ID property for the EC2 SSH key to be used on nodes of the cluster.
- Deprecation and testing update
-
- The Cloudera CDH 5.x stage libraries are now deprecated. As a result, StreamSets no longer tests Transformer against Cloudera CDH 5.x.
- Additional enhancements
-
- CyberArk credential store support - You can use CyberArk as a credential store for Transformer.
- Cluster URL access - When monitoring a Control Hub job for a Databricks pipeline, when you view the job summary, you can now access the Databricks cluster job URL.
4.2.0 Fixed Issues
- Redshift
destinations fail to write partitioned data when running on Databricks cluster
versions 7.x and later. The pipeline fails with the following error:
java.sql.SQLException: Invalid operation: Mandatory url is not present in manifest file.
- The Scala processor always checks if a batch is empty instead of checking only when the Skip Empty Batches property is enabled. This slows performance.
- Runtime resources are not accessible from Transformer pipelines.
- When provisioning a Databricks cluster, the user-defined tags defined in cluster configuration properties are not being set.
- When provisioning a Databricks cluster, the policy_id parameter defined in the cluster configuration properties is ignored.
- For pipelines run on Databricks clusters, resources are staged in the EBS volumes instead of the Databricks distributed file system (DBFS), and the resources are not being removed when no longer needed.
4.2.x Known Issues
- As noted in the StreamSets
Technical Service Bulletin, Transformer 3.12.0 and later are not vulnerable to the Apache Log4j zero-day
vulnerability documented in CVE-2021-44228.
However, StreamSets highly recommends that you update all clusters that run Transformer pipelines to protect against the zero-day vulnerability.
- When trying to access Amazon Redshift or Microsoft SQL Server from a Google Dataproc pipeline, the pipeline fails due to a known issue with the default Dataproc Java security provider, Conscrypt.
SQL Server pipelines fail with the following error:
Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval' : 30000 ms. Check logs for error.
Redshift pipelines fail with the following error:
Could not connect to Redshift DB due to error: [JDBC Driver] null
Workaround: Use one of the following methods to disable Conscrypt:
- For existing clusters, on all cluster nodes, edit the $JAVA_HOME/jre/lib/security/java.security security configuration file, and disable the org.conscrypt.OpenSSLProvider property.
- For new clusters, provision clusters with the dataproc:dataproc.conscrypt.provider.enable property set to false.
- Due to memory issues in older Databricks clusters, communication failures can occur when running pipelines on those clusters. The memory issues can generate error messages such as the following:
Failed to start the pipeline, Databricks Cluster Driver is up but is not responsive, likely due to GC. GC overhead limit exceeded Driver Error: Last callback from Transformer Spark application exceeds the specified time configured in 'transformer.driver.max.inactive.interval'
Workaround: To address the memory issues, try one or both of the following solutions:
- Fine tune the Spark configuration properties related to memory, such as spark.driver.memory, spark.driver.cores, spark.executor.memory, and spark.executor.cores.
- Increase the memory on Spark cluster nodes.
- When previewing a Dataproc pipeline that is configured both to provision and to terminate a cluster, Transformer fails to terminate the cluster when preview completes.
- When a job has failover enabled and the Transformer that runs the pipeline fails, and the same Transformer attempts to recover the pipeline, the pipeline enters a RUN_ERROR state and fails with the following error message:
RUN_ERROR:Error while trying to check the status of disconnected driver pipeline: st.shaded.org.glassfish.jersey.message.internal.MessageBodyProviderNotFoundException: MessageBodyReader not found for media type=text/html;charset=utf-8, type=interface com.streamsets.datacollector.execution.PipelineState, genericType=interface com.streamsets.datacollector.execution.PipelineState.
- A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or
Failed status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Starting a
cluster pipeline multiple times, in quick succession, can cause the pipeline to
hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.
4.1.x Release Notes
The Transformer 4.1.0 release occurred on September 27, 2021.
New Features and Enhancements
- Clusters
-
- Amazon EMR:
- AWS tags - When provisioning an Amazon EMR cluster, you can specify AWS tags for the cluster.
- Regions - You can now specify additional regions for EMR clusters.
- Google Dataproc:
- Dataproc labels - When provisioning a Google Dataproc cluster, you can specify Dataproc labels for the cluster.
- Credentials files - You can now specify relative paths in addition to absolute paths to service account credentials files.
- Regions - You can now specify additional regions for Dataproc clusters.
- Databricks:
- Job submission - When you start a pipeline, Transformer now submits the pipeline to a Databricks cluster directly as a workload, creating an ephemeral job. Previously, Transformer created one-time jobs, which counted against the job limit on the account. Ephemeral jobs do not count towards the job limit.
Note: The details of ephemeral jobs do not display with regular jobs through the Databricks job menu. For details, see Upgrade Impact.
- Init script enhancement - When provisioning a Databricks cluster on Azure, you can now use Azure cluster-scoped init scripts stored on Azure Blob File System that are accessible using an ADLS Gen2 storage account.
- Amazon EMR:
- Stages
-
- New JDBC Query origin - Use the JDBC Query origin to read data from database tables with a custom query.
- JDBC origin renamed - To clarify the difference between this existing origin and the new JDBC Query origin, the JDBC origin is now known as the JDBC Table origin.
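For example, a custom query for the JDBC Query origin might look like the following, with placeholder table and column names:
SELECT order_id, customer_id, total FROM orders WHERE status = 'SHIPPED'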
- Credential stores
-
- Hashicorp Vault credential store - You can use Hashicorp Vault as a credential store for Transformer.
- Additional enhancements
-
- Job functions - You can now use job functions when you configure any pipeline property that allows expressions.
- Enabling HTTPS for Transformer - You can now store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources, and then enter a path relative to that directory when you define the keystore and truststore location. This can have upgrade impact.
Upgrade Impact
- Java JDK 11 enforcement for Scala 2.12 installations
- With this release, when Transformer is prebuilt with Scala 2.12, it requires a Java JDK 11 installation. In previous releases, Transformer prebuilt with Scala 2.12 also required Java JDK 11, but the requirement was not enforced.
- Databricks job submission change
- With this release, Transformer submits pipelines to Databricks clusters as ephemeral jobs instead of one-time jobs, as described in New Features and Enhancements.
- HDInsight pipelines with ADLS stages
- With this release, when you include an ADLS Gen1 or Gen2 stage in a pipeline that runs on an Apache Spark for HDInsight cluster, the stage must use the ADLS cluster-provided libraries stage library.
- Enabling HTTPS for Transformer
- With this release, when you enable HTTPS for Transformer, you can store the keystore and truststore files in the Transformer resources directory, <installation_dir>/externalResources/resources. You can then enter a path relative to that directory when you define the keystore and truststore location in the Transformer configuration properties.
4.1.0 Fixed Issues
- Pipelines with ADLS stages that run on Azure HDInsight 4.0 clusters with Transformer built for Spark 2.4 fail to start. This fix might cause upgrade impact.
- When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, the job can hang in a failover Transformer in a STARTING state when the Spark job completes before the failover Transformer fully takes over the Control Hub job.
- Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
- A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.
4.1.x Known Issues
- When provisioning a Databricks cluster, user-defined tags defined in cluster configuration properties are not being set.
- A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or
Failed status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- Starting a cluster
pipeline multiple times, in quick succession, can cause the pipeline to hang
with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.
4.0.x Release Notes
The Transformer 4.0.0 release occurred on June 21, 2021.
New Features and Enhancements
- Spark 3 and Scala 2.12 support
-
Transformer supports using Spark 3.0 and Scala 2.12 for some cluster types. As a result, StreamSets now provides different installation packages for Transformer.
For information about the clusters that support Spark 3.0, see Cluster Compatibility Matrix. For information about the features available in different versions of Spark, see Spark Versions and Available Features.
- Stages
-
- New Amazon Redshift origin - Use the Amazon Redshift origin to read data from an Amazon Redshift table.
- Clusters
-
- Amazon EMR enhancements:
- Additional EMR support - You can run pipelines on EMR 6.1.x or later 6.x.x clusters. For all supported versions, see Cluster Compatibility Matrix.
- Bootstrap actions support - When you provision a cluster, you can define bootstrap actions scripts in cluster configuration properties or you can use bootstrap actions scripts stored on Amazon S3.
- Databricks clusters:
- Additional Databricks support - You can run pipelines on Databricks 7.x and 8.x clusters. For all supported versions, see Cluster Compatibility Matrix.
- Cluster-scoped init script support - When you provision a cluster, you can define cluster-scoped init scripts in cluster configuration properties. You can also use cluster-scoped init scripts stored on DBFS or S3. Specifying a location on Azure is not available at this time.
- Databricks failover support - You can configure pipeline failover for Databricks pipelines.
- Application Name enhancement - When specifying an application name for a cluster, you can now use underscores in addition to alphanumeric characters.
- Amazon EMR enhancements:
- Connections
-
With this release, the following stages support using connections:
- Additional enhancements
-
- TRANSFORMER_EXTERNAL_RESOURCES environment variable - An optional root
directory for external resources, such as external libraries and runtime
resources.
The default location is $TRANSFORMER_DIST/externalResources.
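For example, the variable might be set in the environment used to start Transformer; the path below is a placeholder:
export TRANSFORMER_EXTERNAL_RESOURCES=/opt/transformer/externalResources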
4.0.0 Fixed Issues
- When you force
stop an EMR pipeline, the Spark job on EMR continues to run until the last batch
is written.
With this fix, when you force stop an EMR pipeline, Transformer first tries to stop the Spark job through the YARN service in the cluster. If the YARN service is not reachable, Transformer sends a new step to the EMR cluster with the stop command.
As a result, if the YARN service is not reachable, Transformer can only force stop the pipeline when all of the following are true:
- The pipeline runs on EMR 5.28 or later with support for step concurrency.
- The Step Concurrency property in the pipeline is set to 2 or higher.
- A step becomes available.
- When a Databricks pipeline successfully completes, Transformer indicates that it has finished running. However, on the Databricks cluster, the Spark job seems to be cancelled instead.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on EMR clusters.
- Upgrading pipelines with the Amazon S3 destination created on Transformer 3.15.0 or earlier to Transformer 3.16.x - 3.18.x can generate errors related to the Partition by Fields stage property.
- Errors occur when using the Amazon S3 origin and destination in the same pipeline when reading from and writing to different regions.
- The Field Renamer processor does not rename fields for empty batches.
4.0.x Known Issues
- A successful Transformer pipeline run on a provisioned Databricks cluster displays a Cancelled or
Failed status in Databricks.
Workaround: When the Databricks job status for a pipeline on a provisioned Databricks cluster differs from job status in Control Hub, trust the status reported by Control Hub.
- At this time, you cannot preview data using the cluster manager configured for the pipeline when the pipeline includes Delta Lake, Hive, or MapR Hive stages. Pipelines containing Hive and MapR Hive stages can produce results, but use the metastore URI in the Hive configuration file, ignoring the optional Metastore URI stage property.
- When pipeline failover is enabled for a Control Hub job that runs a Transformer pipeline, and the Spark job completes before a failover Transformer fully takes over the Control Hub job, the Control Hub job can hang in the failover Transformer in a STARTING state with the following error:
CONTAINER_0102 - Cannot change state from STARTING to FINISHING
Workaround: To correctly finish the Control Hub job, use Control Hub to force stop the job and wait until the job reaches an INACTIVE_ERROR state. Then, acknowledge the error.
- Starting a cluster pipeline multiple times, in quick succession, can cause the pipeline to hang with the following error:
Transformer Spark app is already running. Waiting for callback...
Workaround: Wait a few seconds before starting the pipeline again.
- Pipelines that require passing resources such as private key files to Spark executors fail to run on any cluster type except EMR, where this issue has been fixed.
- A pipeline fails to start when a Kafka origin is configured to read messages starting from a specified offset.
- The Google BigQuery destination fails to write numeric data with a scale greater than 9 to BigQuery columns defined as BIGNUMERIC data type.
- Record header attributes are added to Hive tables as new columns during pipeline preview if the Write to Destinations preview property is enabled.
- Restarting a cluster pipeline shortly after starting it can cause the cluster to use the same Spark application ID for both pipeline runs, leading to errors.
Workaround: Wait a few seconds before starting the pipeline again.
- The File origin processes files with mixed schemas.