Hive

The Hive destination writes files of a specified file format to a Hive table. Hive is a transactional storage layer that works on top of Hadoop Distributed File System (HDFS) and Apache Spark. Hive stores files in tables on HDFS.

By default, the destination writes to Hive using connection information stored in Hive configuration files on the Transformer machine. You can alternatively specify the location of an external Hive Metastore where the configuration information is stored.

The destination can write to a new or existing Hive table. If the table doesn't exist, the destination creates the table. If the table exists, the destination can either append data to the table, overwrite all existing data, or overwrite related partitions in the table.

When you configure the Hive destination, you specify the schema and table to write to and the write mode to use. You can configure the table columns to partition the data by, and you can define how the destination handles schemas that have changed. When needed, you can specify URIs for an external Hive Metastore where configuration information is stored.

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.

Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When writing files to Hive, you can configure the Hive destination to partition the data by columns.

Partitioning by columns is particularly useful when the destination writes to a Hive table used by data scientists who run analytic queries on the data. To increase the analytics performance, partition the data by the columns that the data scientists frequently include in the WHERE or GROUP BY clauses in their queries. For example, if the data scientists plan to run queries where the date column is a certain value, sending all records with the same date value to the same output file and subdirectory improves the efficiency of those queries.

The destination handles the configured partition columns differently, based on whether the destination writes to a new table or an existing table:

Writes to a new table

When the Hive destination writes to a new table and partition columns are not defined, the destination uses the same number of partitions that Spark uses to process the upstream pipeline stages. The destination randomly redistributes the data to balance the data across the partitions, and then writes one output file for each partition to the specified table path. For example, if Spark splits the pipeline data into 20 partitions, the destination writes 20 output files to the specified table path.

When the destination writes to a new table and partition columns are defined, the destination redistributes the data by the specified column, placing records with the same value for the specified column in the same partition. The destination creates a single file for each partition, writing each file to a subfolder within the table path.

For example, let's say that you configure the Hive destination to write to the table path /hive/ordersand to partition the data by the month column. The destination creates 12 partitions, placing all records where month is january in one partition, all records where month is february in the next partition, and so on. The destination creates the following subdirectories and writes a single file to each subdirectory:

/hive/orders/january
/hive/orders/february
...
/hive/orders/december

Writes to an existing table

When the Hive destination writes to an existing table and partition columns are not defined, the destination automatically uses the same partitioning as the existing table.

When the destination writes to an existing table with defined partition columns, the destination writes based on how the Write Mode stage property is defined:

When set to Append rows to existing table or Overwrite related partitions in existing table, the destination uses existing partition columns in the table, ignoring any parition columns defined in stage properties.
When set to Overwrite complete existing table, the destination deletes all data in the table and writes to the table using partition columns defined in stage properties.

Data Formats

The Hive destination can write files of the following formats to a Hive table:

Avro
ORC
Parquet
Record Columnar
Sequence
Text

Hive determines the exact format and file name of the output files. For more information, see the Hive documentation.

Additional Hive Configuration Properties

When needed, you can pass additional Hive configuration properties to Hive as additional Spark configuration properties. Configure additional Spark configuration properties on the Cluster tab of the pipeline properties panel.

Hive configuration properties defined in the pipeline override those defined in Hive configuration files.

Configuring a Hive Destination

Use the Hive destination to write files to a Hive table.

In the Properties panel, on the General tab, configure the following properties:

General Property	Description
Name	Stage name.
Description	Optional description.

On the Hive tab, configure the following properties:

Hive Property	Description
Schema	Schema where the Hive table is located.
Table	Hive table to write to.
Hive Metastore URI	Optional comma-separated list of URIs to an external Hive metastore that contains the Hive connection information to use. Use the following URI format: `thrift://<metastore-host>:<metastore:port>` When not defined, the destination uses connection information defined in the `hive-site.xml` and `core-site.xml` Hive configuration files on the Transformer machine.
Partition Columns	List of table columns to partition the data by. The destination partitions the data so that records with the same value for the specified columns are in the same partition. Click the Add icon to specify another column to partition by.
Write Mode	Mode to write to an existing Hive table: Append data to existing table - Appends new data to the existing table. Overwrite related partitions in existing table - Replaces all data in a partition before writing new data to the partition. To use this option, you must add an extra Spark configuration property named `spark.sql.sources.partitionOverwriteMode`, and set it to `dynamic`. Define extra Spark configuration properties on the Cluster tab of the pipeline properties panel. Overwrite complete existing table - Replaces all data in the existing table with new data. By default, the destination appends data to existing tables.
Data Format	Format of the data. Select one of the following options: Avro ORC Parquet Record Columnar Sequence Text
Additional Configuration	Additional HDFS configuration properties to use. Specified properties override those in Hadoop configuration files. To add properties, click the Add icon and define the HDFS property name and value. Use the property names and values as expected by your version of Hadoop. Tip: You can specify additional Hive configuration properties by defining pipeline runtime properties.