SQL Server JDBC Table
The SQL Server JDBC Table origin reads data from a SQL Server table. The origin can read all of the columns from a table or only the specified columns from a table. To read from one or more tables using a custom query, use the JDBC Query origin.
When you configure the SQL Server JDBC Table origin, you specify database connection information and any additional JDBC configuration properties you want to use. You can also use a connection to configure the origin.
You configure the table to read, and optionally define the columns to read from the table. You specify the offset column and the number of partitions used to read from the database table. The data type allowed for the offset column depends on the specified number of partitions. You can also configure an additional predicate for the query.
You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream batches efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.
You can optionally configure advanced properties such as specifying the fetch size, custom offset queries, and the JDBC driver to bundle with the pipeline.
StreamSets has tested this origin on Microsoft SQL Server 2017 with the SQL Server JDBC 8.4.0 JRE8 driver.
SQL Server JDBC Driver
The SQL Server JDBC Table origin includes the Microsoft SQL Server JDBC driver, so you do not need to install a driver before using the origin.
If you want to use a custom driver, you can install it as an external library for the JDBC stage library.
By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.
Partitioning
Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.
For the SQL Server JDBC Table origin, Transformer determines the partitioning based on the number of partitions that you configure for the origin. Spark creates one connection to the database for each partition.
Spark uses these partitions throughout the pipeline unless a processor causes Spark to shuffle the data. When you need to change the partitioning in the pipeline, use the Repartition processor.
- The size and configuration of the cluster.
- The amount of data being processed.
- The number of concurrent connections that can be made to the database.
If the pipeline fails because the origin encounters an out of memory error, you likely need to increase the number of partitions for the origin.
Offset Column Requirement
The SQL Server JDBC Table origin uses a single offset column. The offset column should contain unique, incremental values, and should not contain null values. The origin does not process records with null offset values. The offset column must also be of a supported data type.
When a table includes a single primary key column, the origin uses it as the offset column, by default.
You can configure the origin to use a different offset column. You might specify an alternate offset column when the table uses a composite key or when the data type of the primary key column is not supported.
- Create partitions
- When creating partitions, the origin determines the data to be processed and
then divides the data into partitions based on ranges of offset values.
For example, say you have rows with integer offsets from 1 to 1000 and you configure the origin to create two partitions. The first partition might include records with offsets from 1-500, and the second partition, the offsets from 501-1000.
- Track processing
- The origin tracks processing using values in the offset column. When reading the last row in a batch, the origin saves the value from the offset column. In the subsequent batch, the origin starts reading from the following row.
Supported Offset Data Types
The supported data types for an offset column differ based on the number of partitions that you want the origin to use when reading the data.
Partitions | Supported Offset Data Type |
---|---|
One partition |
|
One or more partitions |
|
Null Offset Value Handling
By default, the SQL Server JDBC Table origin does not process records with null offset values. You can configure the origin to process those records by enabling the Partition for NULL property on the Advanced tab.
When you enable the Partition for NULL property, the origin queries the table for rows with null offset values, then groups the resulting records into a single partition. As a result, when the table includes null offset values, each batch of data contains a partition of records with null offset values.
Default Offset Queries
The SQL Server JDBC Table origin uses two offset queries to determine the offset values to use when querying the database. The default queries work for most cases. On the rare occasion when you want to produce different results, you can configure custom offset queries to override the default queries.
- Min/max offset query
- This query returns the minimum and maximum values in the offset column. The origin uses these values to process all existing data in the table.
- Max offset query
- This query returns the maximum offset in the offset column. The origin uses this value along with the last-saved offset to process all new data that arrived since processing the last batch.
Custom Offset Queries
Configure custom offset queries to override the default offset queries for the SQL Server JDBC Table origin.
- Custom Min/Max Query
-
Returns the minimum and maximum value to use as offsets when querying the database. Configure this query to override the default min/max query that generates the first batch processed by the pipeline.
- Custom Max Query
- Returns a maximum value to use as an offset when querying the database. Configure this query to override the default max query that generates the max offset to process after the first batch.
Specify a custom max query along with the custom min/max query to define a range of data for the pipeline to process, such as data generated in 2019.
For example, say you want to process only the data with
offsets 1000 to 8000, inclusive. And you want the first batch to process one
thousand records. To do this, you configure the custom min/max query to return
1000
and 2000
. This sets the lower boundary of
the data that is processed and defines the size of the first batch. To set the upper
boundary of the data that is processed, you set the custom max query to
8000
.
In the first batch, the origin processes records with offsets between 1000 and 2000, inclusive. In the second batch, the origin processes any new records with offsets between 2001 and 8000, inclusive. Now, say the last record in the second batch has an offset value of 2500. Then, the third batch processes any new records with offsets between 2501 and 8000, and so on.
SQL Server Data Types
The following table lists the SQL Server data types that the SQL Server JDBC Table origin supports and the Transformer data types they are converted to.
SQL Server Data Type | Transformer Data Type |
---|---|
Bigint | Long |
Binary, Varbinary | Binary |
Bit | Boolean |
Char, Nchar, Ntext, Nvarchar, Text, Varchar | String |
Date | Date |
Datetime, Datetime2, Smalldatetime, Time | Timestamp |
Datetimeoffset | String |
Decimal, Numeric, Money, Smallmoney | Decimal |
Float | Double |
Image, Geography, Geometry | Binary |
Int, Tinyint | Integer |
Real | Float |
Smallint | Short |
XML | String |