MySQL JDBC Table
The MySQL JDBC Table origin reads data from a MySQL table. The origin can read all of the columns in a table or only specified columns. To read from one or more tables using a custom query, use the JDBC Query origin.
When you configure the MySQL JDBC Table origin, you specify database connection information and any additional JDBC configuration properties you want to use. You configure the table to read, and optionally define the columns to read from the table. You can also use a connection to configure the origin.
You specify the offset column and the number of partitions used to read from the database table. The data type allowed for the offset column depends on the specified number of partitions. You can also configure an additional predicate for the query.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.
You can optionally configure advanced properties such as specifying the fetch size, custom offset queries, and the JDBC driver to bundle with the pipeline.
Before you use the MySQL JDBC Table origin, install a JDBC driver.
StreamSets has tested this origin on MySQL 5.7 with the MySQL Connector/J 8.0.12 driver.
Installing the MySQL JDBC Driver
Before using the MySQL JDBC Table origin, you must install a MySQL JDBC driver. Install the driver as an external library for the JDBC stage library.
If you install a driver provided by MySQL, the origin automatically detects the JDBC driver class name from the configured JDBC connection string. If you install a third-party driver, you must specify the driver class name on the Advanced tab of the origin.
By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.
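For reference, here is a minimal sketch of the values involved, assuming MySQL Connector/J 8.x; the host, port, and database below are placeholders, not required settings:

```scala
// Illustrative values only; host, port, and database are placeholders.
val connectionString = "jdbc:mysql://mysql.example.com:3306/sales"

// Driver class shipped with MySQL Connector/J 8.x. You only need to enter a
// class name on the Advanced tab when you install a third-party driver.
val driverClassName = "com.mysql.cj.jdbc.Driver"
```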
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.
For the MySQL JDBC Table origin, Transformer determines the partitioning based on the number of partitions that you configure for the origin. Spark creates one connection to the database for each partition.
Spark uses these partitions throughout the pipeline unless a processor causes Spark to shuffle the data. When you need to change the partitioning in the pipeline, use the Repartition processor.
The best number of partitions to configure depends on several factors, including:
- The size and configuration of the cluster.
- The amount of data being processed.
- The number of concurrent connections that can be made to the database.
If the pipeline fails because the origin encounters an out-of-memory error, you likely need to increase the number of partitions for the origin.
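As a rough analogy for how this works, the sketch below shows a partitioned JDBC read in plain Spark: the partition (offset) column, its bounds, and the number of partitions determine how the read is split, with one database connection per partition. This is a conceptual sketch only; the URL, credentials, table, and column names are placeholders, and the origin builds the equivalent read from its stage properties.

```scala
import org.apache.spark.sql.SparkSession

object PartitionedReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mysql-partitioned-read").getOrCreate()

    // Spark opens one connection per partition and reads a range of offset
    // values on each connection.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql.example.com:3306/sales") // placeholder
      .option("dbtable", "orders")                                // placeholder table
      .option("user", "reader")
      .option("password", "secret")
      .option("partitionColumn", "order_id") // offset column: unique, incremental, non-null
      .option("lowerBound", "1")             // minimum offset value
      .option("upperBound", "1000")          // maximum offset value
      .option("numPartitions", "2")          // number of partitions and connections
      .load()

    println(s"Partitions created: ${df.rdd.getNumPartitions}")
  }
}
```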
Offset Column Requirement
The MySQL JDBC Table origin uses a single offset column. The offset column should contain unique, incremental values, and should not contain null values. The origin does not process records with null offset values. The offset column must also be of a supported data type.
When a table includes a single primary key column, the origin uses it as the offset column by default.
You can configure the origin to use a different offset column. You might specify an alternate offset column when the table uses a composite key or when the data type of the primary key column is not supported.
The origin uses the offset column to perform the following tasks:
- Create partitions
- When creating partitions, the origin determines the data to be processed and then divides the data into partitions based on ranges of offset values.
For example, say you have rows with integer offsets from 1 to 1000 and you configure the origin to create two partitions. The first partition might include records with offsets 1-500, and the second partition, the offsets 501-1000. A minimal sketch of this kind of range splitting follows this list.
- Track processing
- The origin tracks processing using values in the offset column. When reading the last row in a batch, the origin saves the value from the offset column. In the subsequent batch, the origin starts reading from the following row.
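The following is a minimal, hypothetical sketch of that kind of range splitting. It is not the origin's actual implementation, but the arithmetic matches the two-partition example above.

```scala
// Hypothetical sketch: divide an inclusive offset range [min, max] into
// roughly equal partitions.
def offsetRanges(min: Long, max: Long, partitions: Int): Seq[(Long, Long)] = {
  // Ceiling of (max - min + 1) / partitions, so every offset lands in a range.
  val span = (max - min + partitions) / partitions
  (0 until partitions).map { i =>
    val lower = min + i * span
    val upper = math.min(lower + span - 1, max)
    (lower, upper)
  }
}

// offsetRanges(1, 1000, 2) returns Seq((1, 500), (501, 1000))
```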
Supported Offset Data Types
The supported data types for an offset column differ based on the number of partitions that you want the origin to use when reading the data.
| Partitions | Supported Offset Data Types |
| --- | --- |
| One partition | |
| One or more partitions | |
Null Offset Value Handling
By default, the MySQL JDBC Table origin does not process records with null offset values. You can configure the origin to process those records by enabling the Partition for NULL property on the Advanced tab.
When you enable the Partition for NULL property, the origin queries the table for rows with null offset values, then groups the resulting records into a single partition. As a result, when the table includes null offset values, each batch of data contains a partition of records with null offset values.
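Illustratively, the additional query takes the shape shown below, assuming a hypothetical table named orders with offset column order_id; the origin generates the actual query for you.

```scala
// Illustrative only: the shape of the query used to pick up rows whose
// offset column is NULL. Table and column names are hypothetical.
val nullOffsetQuery = "SELECT * FROM orders WHERE order_id IS NULL"
```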
Default Offset Queries
The MySQL JDBC Table origin uses two offset queries to determine the offset values to use when querying the database. The default queries work for most cases. On the rare occasion when you want to produce different results, you can configure custom offset queries to override the default queries.
- Min/max offset query
- This query returns the minimum and maximum values in the offset column. The origin uses these values to process all existing data in the table.
- Max offset query
- This query returns the maximum offset in the offset column. The origin uses this value along with the last-saved offset to process all new data that arrived since processing the last batch.
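For a hypothetical table named orders with offset column order_id, default-style offset queries take the following shape; the origin generates these queries for you.

```scala
// Illustrative only: default-style offset queries for hypothetical names.
val minMaxOffsetQuery = "SELECT MIN(order_id), MAX(order_id) FROM orders"
val maxOffsetQuery    = "SELECT MAX(order_id) FROM orders"
```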
Custom Offset Queries
Configure custom offset queries to override the default offset queries for the MySQL JDBC Table origin.
- Custom Min/Max Query
- Returns the minimum and maximum values to use as offsets when querying the database. Configure this query to override the default min/max query that generates the first batch processed by the pipeline.
- Custom Max Query
- Returns a maximum value to use as an offset when querying the database. Configure this query to override the default max query that generates the max offset to process after the first batch.
Specify a custom max query along with the custom min/max query to define a range of data for the pipeline to process, such as data generated in 2019.
For example, say you want to process only the data with offsets 1000 to 8000, inclusive, and you want the first batch to process one thousand records. To do this, you configure the custom min/max query to return 1000 and 2000. This sets the lower boundary of the data that is processed and defines the size of the first batch. To set the upper boundary of the data that is processed, you set the custom max query to 8000.
In the first batch, the origin processes records with offsets between 1000 and 2000, inclusive. In the second batch, the origin processes any new records with offsets between 2001 and 8000, inclusive. Now, say the last record in the second batch has an offset value of 2500. Then, the third batch processes any new records with offsets between 2501 and 8000, and so on.
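One way to express that example, sketched under the assumption that a custom offset query may return literal values, is shown below; adapt the queries to your own table and offset column.

```scala
// Illustrative only: custom offset queries matching the example above.
// Assumes the min/max query returns one row with two values and the max
// query returns a single value.
val customMinMaxQuery = "SELECT 1000, 2000" // lower boundary and first-batch upper offset
val customMaxQuery    = "SELECT 8000"       // upper boundary for the remaining data
```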
MySQL Data Types
The following table lists the MySQL data types that the MySQL JDBC Table origin supports and the Transformer data types they are converted to.
MySQL data types not listed in the table are not supported.
| MySQL Data Type | Transformer Data Type |
| --- | --- |
| Binary, Blob, Varbinary | Binary |
| Bigint Unsigned | Decimal |
| Bigint, Mediumint Unsigned | Long |
| Char, Text, Varchar | String |
| Date, Year | Date |
| Datetime, Time, Timestamp | Timestamp |
| Decimal, Double, Float, Numeric | Decimal |
| Enum, Json, Set | String |
| Int, Mediumint, Smallint, Smallint Unsigned, Tinyint, Tinyint Unsigned | Integer |
| Point, Linestring | Binary |