File
The File destination writes files to Hadoop Distributed File System (HDFS) or a local file system. The File destination writes data based on the specified data format and creates a separate file for every partition.
The File destination writes to HDFS using connection information stored in a Hadoop configuration file.
When you configure the File destination, you specify the output directory and write mode to use. When overwriting related partitions, first complete the overwrite partition requirement.
You select the data format to write and configure related properties. You can specify fields to use for partitioning files. You can also drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.
You can also specify HDFS configuration properties for an HDFS-compatible system. Any specified properties override those defined in the Hadoop configuration file.
Directory Path
The File destination writes files to a directory in HDFS or a local file system.
To specify the directory, enter the path to the directory. The format of the directory path depends on the file system that you want to write to:
- HDFS
- To write files to HDFS, use the following format for the directory path: hdfs://<path to directory>
- Local file system
- To write files to a local file system, use the following format for the directory path: file:///<path to directory>
The destination creates the directory if it doesn't exist. The user that Transformer uses to launch the Spark application must have write access to the root of the specified directory path. For a cluster pipeline, the user that launches the Spark application depends on the cluster type configured for the pipeline. For a local pipeline, the user that launches the Spark application is the user that runs the Transformer process.
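For illustration, here is a minimal PySpark sketch of the two path formats as Spark itself resolves them. The namenode host, port, and directory names are placeholders, not values from this documentation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-dest-paths").getOrCreate()
df = spark.createDataFrame([(1, "BE"), (2, "US")], ["id", "countrycode"])

# HDFS: the hdfs:// scheme plus the path. The host and port shown here are
# placeholders; they typically come from the Hadoop configuration file.
df.write.mode("overwrite").parquet("hdfs://namenode:8020/data/output")

# Local file system: the file:// scheme, then an absolute path (three slashes).
df.write.mode("overwrite").parquet("file:///tmp/output")
```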
Write Mode
The write mode determines how the File destination writes files to the destination system. The names of the resulting files are based on the data format that the destination writes.
- Overwrite files
- Removes all files in the directory before creating new files.
- Overwrite related partitions
- Removes all files in a partition before creating new files for the partition. Partitions with no data to be written are left intact.
- Write new files to new directory
- Creates a new directory and writes new files to the directory. Generates an error if the specified directory exists when you start the pipeline.
- Write new or append to existing files
- Creates new files in an existing directory. If a file of the same name exists in the directory, the destination appends data to the file.
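Transformer runs pipelines as Spark applications, and these write modes correspond roughly to Spark's save modes. A hedged sketch of the underlying Spark behavior, with an illustrative path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-modes").getOrCreate()
df = spark.range(10).toDF("id")
path = "file:///tmp/write-mode-demo"  # illustrative path

# Write new files to new directory: fails if the directory already exists.
df.write.mode("errorifexists").parquet(path)

# Overwrite files: removes existing files in the directory before writing.
df.write.mode("overwrite").parquet(path)

# Write new or append to existing files: adds data alongside existing files.
df.write.mode("append").parquet(path)

# Overwrite related partitions: "overwrite" combined with dynamic partition
# overwrite; see Overwrite Partition Requirement below.
```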
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.
When writing to a file system, Spark creates one output file per partition. When you configure the destination, you can specify fields to partition by. You can alternatively use a Repartition processor earlier in the pipeline to partition by fields or to specify the number of partitions that you want to use.
When partitioning, you can use the Overwrite Related Partitions write option to overwrite only the partitions where data must be written, leaving other partitions intact. Note that this results in replacing the existing files in those partitions with new files.
For example, say you want the destination to write data to different partitions based on country codes. In the destination, you specify the countrycode field in the Partition by Field property and set the Write Mode property to Overwrite Related Partitions. When writing only Belgian data, the destination overwrites existing files in the BE partition with a single file of the latest data and leaves all other partitions untouched.
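A hedged PySpark sketch of this example, assuming an illustrative orders dataset; the dynamic partition overwrite setting it relies on is described in the next section:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("partition-overwrite")
         .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
         .getOrCreate())

# Only Belgian data in this run; field and path names are illustrative.
df = spark.createDataFrame([(1, "BE"), (2, "BE")], ["order_id", "countrycode"])

# Rewrites only the countrycode=BE partition directory; partition directories
# for other country codes keep their existing files.
(df.write
   .mode("overwrite")
   .partitionBy("countrycode")
   .parquet("file:///tmp/orders"))
```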
Overwrite Partition Requirement
When writing to partitioned files, the File destination can overwrite files within affected partitions rather than overwriting the entire data set. For example, if output data includes only data within a 03-2019 partition, then the destination can overwrite the files in the 03-2019 partition and leave all other partitions untouched.
To overwrite partitioned files, Spark must be configured to allow overwriting data within a partition. When writing to unpartitioned files, no action is needed.
To enable overwriting partitions, set the spark.sql.sources.partitionOverwriteMode Spark configuration property to dynamic.
You can configure the property in Spark, or you can configure the property in individual pipelines. Configure the property in Spark when you want to enable overwriting partitions for all Transformer pipelines.
To enable overwriting partitions for an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
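As a sketch, the pipeline-level setting is equivalent to configuring the property on the Spark session at runtime. The property name is a standard Spark configuration property; everything around it is illustrative:

```python
from pyspark.sql import SparkSession

# Cluster-wide alternative: add the following line to spark-defaults.conf so
# that all pipelines can overwrite partitions:
#   spark.sql.sources.partitionOverwriteMode  dynamic

spark = SparkSession.builder.getOrCreate()

# Per-pipeline equivalent: set the property as a runtime SQL configuration.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
print(spark.conf.get("spark.sql.sources.partitionOverwriteMode"))  # dynamic
```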
Data Formats
The File destination writes records based on the specified data format.
- Avro
- The destination writes an Avro file for each partition and includes the Avro schema in each file.
- Delimited
- The destination writes a delimited file for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
- JSON
- The destination writes a file for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
- ORC
- The destination writes an ORC file for each partition.
- Parquet
- The destination writes a Parquet file for each partition and includes the Parquet schema in every file.
- Text
- The destination writes a text file for every partition and uses \n as the newline character.
- XML
- The destination writes an XML file for every partition. You specify the root and row tags to use in output files.
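Several of these format properties map to standard Spark DataFrameWriter options. A brief sketch, assuming illustrative data and paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("formats").getOrCreate()
df = spark.createDataFrame([(1, "a;b"), (2, 'c"d')], ["id", "value"])

# Delimited: header line plus custom delimiter, quote, and escape characters.
(df.write.mode("overwrite")
   .option("header", True)
   .option("sep", ";")
   .option("quote", '"')
   .option("escape", "\\")
   .csv("file:///tmp/out-delimited"))

# JSON: each record on a separate line (JSON Lines) in each partition's file.
df.write.mode("overwrite").json("file:///tmp/out-json")
```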