MapR Event Store

The MapR Event Store origin reads data from one or more topics in MapR Streams. All messages in a batch must use the same schema. MapR Streams uses Kafka APIs to process messages. Use the origin only in pipelines that run on MapR distributions of Hadoop YARN clusters.

MapR is now HPE Ezmeral Data Fabric. This documentation uses "MapR" to refer to both MapR and HPE Ezmeral Data Fabric.

When configuring the MapR Event Store origin, you specify the topics the origin reads, and where to start reading each topic. The origin can start processing from the first message, the last message, or a specified offset.

You specify the maximum number of messages to read from any partition in each batch. You can define additional configuration properties to pass to MapR Streams. You can configure the origin to include Kafka message keys in records.

You select the format of the data to read and configure related properties. When processing delimited or JSON data, you can define a custom schema for reading the data and configure related properties.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream stages efficiently. You can also configure the origin to skip tracking offsets.

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. When the pipeline starts processing a new batch, Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.

For a MapR Event Store origin, Spark determines the partitioning based on the number of partitions in the topics being read. For example, if a MapR Event Store origin is configured to read from 10 topics that each have 5 partitions, Spark creates a total of 50 partitions to read from MapR Streams.
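The arithmetic in this example can be sketched in a few lines of Python. The topic names and the `initial_partition_count` helper are hypothetical and only illustrate the count; Spark performs the actual calculation internally.

```python
# Hypothetical topic metadata: topic name -> number of partitions in the topic.
topic_partitions = {f"topic_{i}": 5 for i in range(10)}


def initial_partition_count(partitions_by_topic):
    """Return the number of initial Spark partitions: one per topic partition."""
    return sum(partitions_by_topic.values())


total = initial_partition_count(topic_partitions)
print(total)  # prints 50
```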

Spark uses these partitions while the pipeline processes the batch unless a processor causes Spark to shuffle the data. To change the partitioning in the pipeline, use the Repartition processor.

Topics and Offsets

The MapR Event Store origin reads messages from one or more topics that you specify. You define the starting offset to indicate the first message to read in each partition of a topic.

Use one of the following methods to identify the starting offset:

Earliest: The origin reads all available messages, starting with the first message in each partition of each topic.
Latest: The origin reads the last message in each partition of each topic and any subsequent messages added to those topics after the pipeline starts.
Specific offsets: The origin reads messages starting from a specified offset for each partition in each topic. If an offset is not specified for a partition in a topic, the origin returns an error.

When reading the last message in a batch, the origin saves the offset from that message. In the subsequent batch, the origin starts reading from the next message.
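The following minimal sketch models this offset handling for a single partition. The `read_batch` function and the in-memory message list are hypothetical stand-ins for the origin and a topic partition, and `max_messages` plays the role of the Max Messages per Partition property.

```python
def read_batch(partition_messages, last_saved_offset, max_messages):
    """Read up to max_messages, starting after the last saved offset.

    Returns the messages read and the offset of the last message read,
    which is the offset saved for the next batch.
    """
    start = last_saved_offset + 1
    batch = partition_messages[start:start + max_messages]
    if batch:
        last_saved_offset = start + len(batch) - 1
    return batch, last_saved_offset


# The list index doubles as the message offset in this sketch.
messages = ["m0", "m1", "m2", "m3", "m4"]

# A saved offset of -1 models the Earliest starting offset: nothing read yet.
first_batch, saved = read_batch(messages, last_saved_offset=-1, max_messages=2)
second_batch, saved = read_batch(messages, last_saved_offset=saved, max_messages=2)
```

After the first batch the saved offset is 1, so the second batch starts at offset 2.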

For example, suppose the orders_exp and orders_reg topics each have two partitions, 0 and 1. To have the origin read from each partition starting with the third message, which has an offset of 2, set Starting Offset to Specific Offsets. Then, for each topic, add partitions 0 and 1 and enter 2 as the offset for each partition.
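The offsets in this example can also be written as a mapping of topic to partition to starting offset. The sketch below builds that mapping, checks that every partition has an offset, and serializes the result as JSON in the shape that Spark's Kafka source accepts for its `startingOffsets` option. Whether Transformer passes the offsets to Spark in exactly this form is an assumption, and the `missing_partitions` helper is hypothetical.

```python
import json

# Topic -> partition -> starting offset, taken from the example above.
specific_offsets = {
    "orders_exp": {"0": 2, "1": 2},
    "orders_reg": {"0": 2, "1": 2},
}

# Partitions that exist in each topic, as stated in the example.
partitions_by_topic = {"orders_exp": [0, 1], "orders_reg": [0, 1]}


def missing_partitions(offsets, partitions):
    """List (topic, partition) pairs that have no starting offset defined."""
    missing = []
    for topic, partition_ids in partitions.items():
        for partition in partition_ids:
            if str(partition) not in offsets.get(topic, {}):
                missing.append((topic, partition))
    return missing


# The origin returns an error when a partition has no offset, so check first.
unset = missing_partitions(specific_offsets, partitions_by_topic)
if unset:
    raise ValueError(f"No starting offset for: {unset}")

starting_offsets_json = json.dumps(specific_offsets, sort_keys=True)
```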

Data Formats

The MapR Event Store origin generates records based on the specified data format.

The origin can read the following data formats:

Avro

The origin generates a record for every message.

Note: To use the Avro data format, Apache Spark version 2.4 or later must be installed on the Transformer machine and on each node in the cluster.

You can use one of the following methods to specify the location of the Avro schema definition:

In Pipeline Configuration - Use the schema defined in the stage properties.
Confluent Schema Registry - Retrieve the schema from Confluent Schema Registry. Confluent Schema Registry is a distributed storage layer for Avro schemas. You specify the URL to Confluent Schema Registry and whether to look up the schema by the schema ID or subject.
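To illustrate the two lookup methods, the sketch below builds the request URLs that the Confluent Schema Registry REST API exposes for a lookup by schema ID and for the latest version of a subject. The host name, port, and subject name are hypothetical, and that the origin calls these exact endpoints is an assumption.

```python
from urllib.parse import quote


def schema_lookup_url(registry_url, subject=None, schema_id=None):
    """Build the Schema Registry REST URL for a lookup by subject or schema ID."""
    if (subject is None) == (schema_id is None):
        raise ValueError("Specify exactly one of subject or schema_id")
    base = registry_url.rstrip("/")
    if schema_id is not None:
        # A schema ID identifies exactly one schema.
        return f"{base}/schemas/ids/{int(schema_id)}"
    # A subject can have several versions; "latest" selects the newest one.
    return f"{base}/subjects/{quote(subject, safe='')}/versions/latest"


# Hypothetical registry location and subject name.
registry = "http://registry.example.com:8081"
by_subject = schema_lookup_url(registry, subject="orders-value")
by_id = schema_lookup_url(registry, schema_id=42)
```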

Delimited

The origin generates a record for every message. You can specify a custom delimiter, quote, and escape character used in the data.

By default, the origin names the first field _c0, the second field _c1, and so on. The origin also infers data types from the data by default. You can rename the fields downstream with a Field Renamer processor, or you can specify a custom schema in the origin.

When you specify a custom schema, the origin uses the field names and data types defined in the schema, applying the first field in the schema to the first field in the record, and so on.

By default, when the origin encounters parsing errors, it stops the pipeline. When processing data with a custom schema, the origin handles parsing errors based on the configured error handling.
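The default field naming and the positional use of a custom schema can be sketched with Python's `csv` module. The `parse_delimited` helper and the sample message are hypothetical, and the sketch leaves every value as a string rather than modeling data type inference.

```python
import csv
import io


def parse_delimited(message, delimiter=",", quotechar='"', escapechar=None,
                    field_names=None):
    """Parse one delimited message into a record (field name -> value)."""
    reader = csv.reader(io.StringIO(message), delimiter=delimiter,
                        quotechar=quotechar, escapechar=escapechar)
    row = next(reader)
    if field_names is None:
        # Default naming: _c0 for the first field, _c1 for the second, and so on.
        field_names = [f"_c{i}" for i in range(len(row))]
    # A custom schema is applied by position: first name to first field.
    return dict(zip(field_names, row))


message = '1001|"Smith, J"|42.50'
default_record = parse_delimited(message, delimiter="|")
custom_record = parse_delimited(
    message, delimiter="|", field_names=["order_id", "customer", "total"])
```

Here `default_record` uses the keys `_c0`, `_c1`, and `_c2`, while `custom_record` uses the names from the custom schema in the same field order.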

JSON

The origin generates a record for every message.

By default, the origin uses the field names, field order, and data types in the message.

When you specify a custom schema, the origin matches the field names in the schema to those in the data, then applies the data types and field order defined in the schema.

By default, when the origin encounters parsing errors, it stops the pipeline. When processing data with a custom schema, the origin handles parsing errors based on the configured error handling.
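For JSON data, a custom schema is matched by field name rather than by position. The sketch below shows that matching under stated assumptions: the `apply_custom_schema` helper is hypothetical, the schema is reduced to an ordered list of name and Python type pairs, and fields that are not in the schema are simply left out of the record.

```python
import json


def apply_custom_schema(message, schema):
    """Build a record from a JSON message using an ordered custom schema.

    schema is a list of (field name, type) pairs. Fields are matched by
    name, then emitted in schema order with the schema's data types.
    """
    data = json.loads(message)
    record = {}
    for name, cast in schema:
        value = data.get(name)
        record[name] = cast(value) if value is not None else None
    return record


# The message lists the fields in a different order than the schema.
message = '{"total": "42.50", "order_id": 1001, "note": "rush"}'
schema = [("order_id", int), ("total", float)]
record = apply_custom_schema(message, schema)
```

The resulting record lists `order_id` before `total`, and `total` is a float even though the message carried it as a string.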

Text

The origin generates a record for every message.

The record includes a single field named Value where the origin writes the string data.

Configuring a MapR Event Store Origin

Configure a MapR Event Store origin to read data from MapR Streams. Use the origin only in pipelines that run on MapR distributions of Hadoop YARN clusters.

On the Properties panel, on the General tab, configure the following properties:

General Property	Description
Name	Stage name.
Description	Optional description.
Load Data Only Once	Reads data while processing the first batch of a pipeline run and caches the results for reuse throughout the pipeline run. Select this property for lookup origins. When configuring lookup origins, do not limit the batch size. All lookup data should be read in a single batch.
Cache Data	Caches data processed for a batch so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages. Caching can limit pushdown optimization when the pipeline runs in ludicrous mode. Available when Load Data Only Once is not enabled. When the origin loads data once, the origin caches data for the entire pipeline run.
Skip Offset Tracking	Skips tracking offsets. The origin reads all available data for each batch that the pipeline processes, limited by any batch-size configuration for the origin.

On the MapR Event Store tab, configure the following properties:

MapR Event Store Property	Description
Topic List	List of topics to read. Click the Add icon to add additional topics. You can use simple or bulk edit mode to configure the topics.
Include Message Keys	Includes Kafka message keys in a String field named `key`. Can be used with all data formats except Delimited. To process Kafka message keys with the JSON data format, Apache Spark version 2.4 or later must be installed on the Transformer machine and on each node in the cluster.
Starting Offset	The first message to read: Earliest - Reads messages starting with the first message in each partition of each topic. Latest - Reads messages starting with the last message in each partition of each topic. Specific Offsets - For each topic, reads messages starting from a specified offset in each partition.
Specific Offsets	For each topic, the first message to read from each partition. For the first topic, enter the topic name, then if needed, click the Add Partition icon to add fields for specifying the partition names and starting positions for the topic. For additional topics, click the Add Topic icon to add another topic field, and add partition information as needed. You must specify an offset for each partition in a topic. Available when Starting Offset is set to Specific Offsets.
Max Messages per Partition	In each batch, the maximum number of messages the origin reads from each partition in a topic.
Additional Configurations	Additional Kafka configuration properties to pass to MapR Streams. The properties must be supported by MapR Streams. To add properties, click the Add icon and define the property name and value. Use `kafka.` as a prefix for the property names, as follows: `kafka.<kafka property name>`
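As a small illustration of the `kafka.` prefix, the sketch below converts plain Kafka consumer property names into the form expected by the Additional Configurations property. The `to_origin_properties` helper is hypothetical, and the two sample properties are standard Kafka consumer settings chosen only as examples; whether MapR Streams supports them is an assumption to verify.

```python
def to_origin_properties(kafka_properties):
    """Prefix each Kafka property name with "kafka." for the origin."""
    return {f"kafka.{name}": value for name, value in kafka_properties.items()}


# Example Kafka consumer properties (support in MapR Streams is assumed).
kafka_properties = {
    "fetch.max.bytes": "52428800",
    "max.partition.fetch.bytes": "1048576",
}

origin_properties = to_origin_properties(kafka_properties)
```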

On the Data Format tab, configure the following property:

Data Format Property	Description
Data Format	Format of the data in messages. Select one of the following formats: Avro (Spark 2.4 or later), Delimited, JSON, or Text.

For Avro data, click the Schema tab and configure the following properties:

Avro Property	Description
Avro Schema Location	Location of the Avro schema definition to use to process data: In Pipeline Configuration - Uses the schema specified in the Avro Schema property. Confluent Schema Registry - Retrieves the schema from Confluent Schema Registry.
Avro Schema	Avro schema definition used to process the data. Overrides any existing schema definitions associated with the data. You can optionally use the `runtime:loadResource` function to use a schema definition stored in a runtime resource file. Available when Avro Schema Location is set to In Pipeline Configuration.
Register Schema	Registers the specified Avro schema with Confluent Schema Registry. Available when Avro Schema Location is set to In Pipeline Configuration.
Schema Registry URLs	Confluent Schema Registry URLs used to look up the schema. To add a URL, click Add. Use the following format to enter the URL: `http://<host name>:<port number>` Available when Avro Schema Location is set to Confluent Schema Registry.
Basic Auth User Info	Confluent Schema Registry `basic.auth.user.info` credential. Available when Avro Schema Location is set to Confluent Schema Registry.
Lookup Schema By	Method used to look up the schema in Confluent Schema Registry: Subject - Looks up the specified Avro schema subject. Schema ID - Looks up the specified Avro schema ID. Available when Avro Schema Location is set to Confluent Schema Registry.
Schema Subject	Avro schema subject to look up in Confluent Schema Registry. If the specified subject has multiple schema versions, the origin uses the latest schema version for that subject. To use an older version, find the corresponding schema ID, and then set the Lookup Schema By property to Schema ID. Available when Avro Schema Location is set to Confluent Schema Registry.
Schema ID	Avro schema ID to look up in Confluent Schema Registry. Available when Avro Schema Location is set to Confluent Schema Registry.

For delimited data, on the Data Format tab, optionally configure the following properties:

Delimited Property	Description
Delimiter Character	Delimiter character used in the data. Select one of the available options or select Other to enter a custom character. You can enter a Unicode control character using the format `\uNNNN`, where N is a hexadecimal digit from the numbers 0-9 or the letters A-F. For example, enter `\u0000` to use the null character as the delimiter or `\u2028` to use a line separator as the delimiter.
Quote Character	Quote character used in the data.
Escape Character	Escape character used in the data.
Includes Header	Indicates that the data includes a header line. When selected, the origin uses the first line to create field names and begins reading with the second line.
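The `\uNNNN` notation for the delimiter character can be illustrated with a short sketch that converts the notation into the character it names. The `decode_delimiter` helper is hypothetical and only shows how the notation maps to a character, not how the origin parses the property.

```python
import re

# Matches the \uNNNN notation: a backslash, "u", and four hexadecimal digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_delimiter(text):
    """Return the delimiter character for a literal or \\uNNNN value."""
    match = _UNICODE_ESCAPE.fullmatch(text)
    if match:
        return chr(int(match.group(1), 16))
    return text


line_separator = decode_delimiter(r"\u2028")  # U+2028 LINE SEPARATOR
null_character = decode_delimiter(r"\u0000")  # U+0000 NULL
fields = "1001\u2028Smith\u202842.50".split(line_separator)
```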

To use a custom schema for delimited or JSON data, click the Schema tab and configure the following properties:

Schema Property	Description
Schema Mode	Mode that determines the schema to use when processing data: Infer from Data: The origin infers the field names and data types from the data. Use Custom Schema - JSON Format: The origin uses a custom schema defined in the JSON format. Use Custom Schema - DDL Format: The origin uses a custom schema defined in the DDL format. Note that the schema is applied differently depending on the format of the data.
Schema	Custom schema to use to process the data. Enter the schema in DDL or JSON format, depending on the selected schema mode.
Error Handling	Determines how the origin handles parsing errors: Permissive - When the origin encounters a problem parsing any field in the record, it creates a record with the field names defined in the schema, but with null values in every field. Drop Malformed - When the origin encounters a problem parsing any field in the record, it drops the entire record from the pipeline. Fail Fast - When the origin encounters a problem parsing any field in the record, it stops the pipeline.
Original Data Field	Field where the data from the original record is written when the origin cannot parse the record. When writing the original record to a field, you must add the field to the custom schema as a String field. Available when using permissive error handling.
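
The three error handling modes can be compared with a minimal sketch that parses JSON messages against a custom schema. The `parse_with_error_handling` function, the mode names, and the `_raw` field are hypothetical. The sketch also passes the original data field as a separate argument, whereas the origin requires that field to be part of the custom schema as a String field.

```python
import json


def parse_record(message, schema):
    """Parse one JSON message, casting each schema field to its type."""
    data = json.loads(message)
    return {name: (cast(data[name]) if data.get(name) is not None else None)
            for name, cast in schema}


def parse_with_error_handling(messages, schema, mode="permissive",
                              original_field=None):
    """Apply permissive, drop_malformed, or fail_fast error handling."""
    if mode not in ("permissive", "drop_malformed", "fail_fast"):
        raise ValueError(f"Unknown mode: {mode}")
    records = []
    for message in messages:
        try:
            record = parse_record(message, schema)
            if original_field:
                record[original_field] = None
        except (ValueError, TypeError):
            if mode == "fail_fast":
                raise  # stop processing, as the pipeline would stop
            if mode == "drop_malformed":
                continue  # drop the entire record
            # Permissive: keep the schema fields, but with null values.
            record = {name: None for name, _ in schema}
            if original_field:
                record[original_field] = message
        records.append(record)
    return records


schema = [("order_id", int), ("total", float)]
messages = ['{"order_id": 1, "total": "9.5"}',
            '{"order_id": "oops", "total": "1"}']

permissive = parse_with_error_handling(messages, schema, original_field="_raw")
dropped = parse_with_error_handling(messages, schema, mode="drop_malformed")
```

In this sketch, `permissive` contains two records, the second with null values and the unparsed message in `_raw`, while `dropped` contains only the first record.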