SQL Server Change Tracking
Supported pipeline types: Data Collector
Use the SQL Server Change Tracking origin to generate records from change tracking tables. To read data from Microsoft SQL Server change data capture (CDC) tables, use the SQL Server CDC Client origin. For more information about the differences between change tracking and CDC data, see the Microsoft SQL Server documentation.
By default, the origin generates a record with change tracking information and the latest version of each record from the data tables. You can configure it to use only the change tracking information. The origin uses multiple threads to enable parallel processing of data.
The SQL Server Change Tracking origin includes the CRUD operation type in a record header attribute so generated records can be easily processed by CRUD-enabled destinations. For an overview of Data Collector changed data processing and a list of CRUD-enabled destinations, see Processing Changed Data.
You might use this origin to perform database replication. You can use a separate pipeline with the JDBC Query Consumer or JDBC Multitable Consumer origin to read existing data. Then start a pipeline with the SQL Server Change Tracking origin to process subsequent changes.
When you configure the origin, you can define groups of change tracking tables in the same database and any initial offsets to use. When you omit initial offsets, the origin processes only incoming data.
To define how the origin connects to the database, you specify connection information, a query interval, the number of retries, and any custom JDBC configuration properties that you need.
You specify whether you want to include the latest version of the data in generated records or whether to include only change tracking data. You define the number of threads that the origin uses to read from the tables and the strategy that the origin uses to create each batch of data. You also define the initial order that the origin uses to read the tables.
To use a JDBC version older than 4.0, you can specify the driver class name and define a health check query.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Permission Requirements
The user account that the origin uses to connect to the database must have the following permissions (example grant statements appear after this list):
- VIEW CHANGE TRACKING permission on the database.
- When using the default record generation to join change tracking data with the current version of the data, SELECT permission on at least the primary key column of each associated data table. When the origin processes only the change tracking data, the user does not need this permission.
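For reference, the following T-SQL sketch shows one way to grant these permissions, here at the schema and table level. The user sdc_user, schema dbo, table dbo.Customer, and key column CustomerID are placeholder names, not values from this documentation:

```sql
-- Placeholder names: sdc_user, dbo, dbo.Customer, CustomerID.
-- Allow the user to read change tracking data for all tables in the dbo schema.
GRANT VIEW CHANGE TRACKING ON SCHEMA::dbo TO sdc_user;

-- For the default record generation (change tracking data joined with the
-- current row), grant SELECT on at least the primary key column of each
-- tracked table. Granting SELECT on the full table also works.
GRANT SELECT ON OBJECT::dbo.Customer (CustomerID) TO sdc_user;
```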
Installing the JDBC Driver
Before you use the SQL Server Change Tracking origin, install the JDBC driver for the database. You cannot access the database until you install the required driver.
For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.
Multithreaded Processing
The SQL Server Change Tracking origin performs parallel processing and enables the creation of a multithreaded pipeline.
When you start the pipeline, the SQL Server Change Tracking origin retrieves the list of change tracking-enabled tables with a valid minimum change tracking version that are defined in the table configuration. The origin then uses multiple concurrent threads based on the Number of Threads property. Each thread reads data from a single table.
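The valid minimum change tracking version corresponds to the version information that SQL Server tracks per table. As a point of reference, a query along the following lines (dbo.Customer is a placeholder for a change tracking-enabled table) returns the current database change tracking version and the minimum version that is still valid for the table:

```sql
-- dbo.Customer is a placeholder table name.
-- current_version: the database's current change tracking version.
-- min_valid_version: the oldest version for which change information is still
-- retained for the table; NULL if change tracking is not enabled on the table.
SELECT
    CHANGE_TRACKING_CURRENT_VERSION()                            AS current_version,
    CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.Customer')) AS min_valid_version;
```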
As the pipeline runs, each thread connects to the origin system, creates a batch of data, and passes the batch to an available pipeline runner. A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors and destinations in the pipeline and performs all pipeline processing after the origin.
Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order that batches are written to destinations is not ensured.
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
Example
Say you are reading from 10 tables. You set the Number of Threads property to 5 and the Maximum Pool Size property to 6. When you start the pipeline, the origin retrieves the list of tables. The origin then creates five threads to read from the first five tables, and by default Data Collector creates a matching number of pipeline runners. Upon receiving data, each thread passes a batch to an available pipeline runner for processing.
At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
Batch Strategy
Each origin thread creates a batch of data from a single table. You can define one of the following strategies that the threads use to create each batch:
- Process All Available Rows from the Table
- Each thread creates multiple batches of data from one table, until all available rows are read from that table. The thread runs one SQL query for all batches created from the table. Then, the thread switches to the next available table, running another SQL query to read all available rows from that table.
- Switch Tables
- Each thread creates a set of batches from one table based on the Batches from Result Set property, and then switches to the next available table to create the next set of batches. The thread runs an initial SQL query to create the first set of batches from the table. The database caches the remaining rows in a result set for the same thread to access again, and then the thread switches to the next available table. A table is available in the following situations:
- The table does not have an open result set cache. In this case, the thread runs an initial SQL query to create the first batch, caching the remaining rows in a result set in the database.
- The table has an open result set cache created by that same thread. In this case, the thread creates the batch from the result set cache in the database rather than running another SQL query.
Table Configuration
- Schema - The schema where the tables reside.
- Table name pattern - Use SQL LIKE syntax to define a set of tables to process. For example, the table name pattern st% matches tables whose names start with "st". The default pattern, %, matches all tables in the schema. To preview which change tracking-enabled tables a pattern matches, see the query sketch after this list.
For more information about valid patterns for the SQL LIKE syntax, see https://msdn.microsoft.com/en-us/library/ms179859.aspx.
- Table exclusion pattern - When necessary, use a regular expression to exclude tables that match the table name pattern from being read.
For example, say you want to process all change tracking tables in the schema except for those that start with "dept". You can use the default % for the table name pattern, and enter dept.* for the table exclusion pattern.
For more information about using regular expressions with Data Collector, see Regular Expressions Overview.
- Initial offset - The SQL Server Change Tracking origin uses the
SYS_CHANGE_VERSION column as the offset column. To process existing data, define
the offset value to use. The offset is used for all tables included in the table
configuration.
When not set, the origin processes only incoming data.
Important: When processing an offset, the origin starts with the first value greater than the specified offset.
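To preview which change tracking-enabled tables a table name pattern would match, you can run a query such as the following against the database. The schema dbo and the pattern st% are example values; the table exclusion pattern is applied by the origin as a regular expression, so it is not expressed in SQL here:

```sql
-- Example values: schema dbo, table name pattern st%.
-- Lists change tracking-enabled tables in the schema whose names match the
-- SQL LIKE pattern used as the table name pattern.
SELECT t.name AS table_name
FROM sys.change_tracking_tables AS ct
JOIN sys.tables  AS t ON t.object_id = ct.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE s.name = 'dbo'
  AND t.name LIKE 'st%'
ORDER BY t.name;
```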
Initial Table Order Strategy
You can define the initial order that the origin uses to read the tables.
- None
- Reads the tables in the order that they are listed in the database.
- Alphabetical
- Reads the tables in alphabetical order.
The origin uses the table order strategy only for the initial reading of the tables. When threads switch back to previously read tables, they read from the next available table, regardless of the defined order.
Generated Record
- Change tracking and current data
- By default, when the SQL Server Change Tracking origin generates a record, it joins the data from the change tracking table with the current version of the record in the data table, using an outer join (see the query sketch below). The resulting record includes the following:
- Change tracking fields such as SYS_CHANGE_VERSION, SYS_CHANGE_CREATION_VERSION, and SYS_CHANGE_OPERATION.
- The latest version of the record, when available.
Important: Unlike a record generated by a CDC origin, the change tracking record includes the latest version of the record, not the version of the record created by the change.
- Change tracking only
- You can configure the origin to omit the join and produce a record that includes only the change tracking data.
All generated records include change tracking information in record header attributes.
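Conceptually, the default mode corresponds to a SQL Server query along the following lines. This sketch only illustrates the shape of the data, not the exact statement that the origin issues; dbo.Customer, its key column CustomerID, and the offset value 42 are placeholders:

```sql
-- Placeholders: dbo.Customer, CustomerID, and the offset 42 (a SYS_CHANGE_VERSION value).
-- CHANGETABLE returns changes with SYS_CHANGE_VERSION greater than the supplied version,
-- which matches the origin's behavior of starting after the specified initial offset.
-- The outer join picks up the latest version of each changed row; for deleted rows,
-- the data-table columns are NULL.
SELECT
    CT.SYS_CHANGE_VERSION,
    CT.SYS_CHANGE_CREATION_VERSION,
    CT.SYS_CHANGE_OPERATION,
    T.*
FROM CHANGETABLE(CHANGES dbo.Customer, 42) AS CT
LEFT OUTER JOIN dbo.Customer AS T
    ON T.CustomerID = CT.CustomerID
ORDER BY CT.SYS_CHANGE_VERSION;
```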
Record Header Attributes
The SQL Server Change Tracking origin generates JDBC record header attributes that provide additional information about each record, such as the original data type of a field or the source tables for the record.
The origin also includes the sdc.operation.type attribute and information from the SQL Server change tracking table. The SQL Server Change Tracking header attributes are prefixed with "jdbc".
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
Header Attribute Name | Description |
---|---|
sdc.operation.type | The origin uses the following values to represent the operation type: 1 for INSERT, 2 for DELETE, 3 for UPDATE. |
jdbc.tables | Provides a comma-separated list of source tables for the fields in the record. Note: Not all JDBC drivers provide this information. |
jdbc.<column name>.jdbcType | Provides the numeric value of the original SQL data type for each field in the record. See the Java documentation for a list of the data types that correspond to numeric values. |
jdbc.<column name>.jdbc.precision | Provides the original precision for all numeric and decimal fields. |
jdbc.<column name>.jdbc.scale | Provides the original scale for all numeric and decimal fields. |
jdbc.SYS_CHANGE_COLUMNS | Lists the columns that have changed since the last sync. Returns NULL when column change tracking is not enabled, when the operation is an insert or delete, or when all non-primary key columns were updated at once. |
jdbc.SYS_CHANGE_CONTEXT | Provides change context information when available. |
jdbc.SYS_CHANGE_CREATION_VERSION | Provides the version number associated with the last insert operation. |
jdbc.SYS_CHANGE_OPERATION | Indicates the type of change that occurred: I for insert, U for update, D for delete. |
jdbc.SYS_CHANGE_VERSION | Provides the version number of the most recent change to the row. |
For details about the SYS_CHANGE change tracking attributes, see the SQL Server documentation.
CRUD Operation Header Attributes
- sdc.operation.type
- The SQL Server Change Tracking origin writes the operation type to the sdc.operation.type record header attribute.
- jdbc.SYS_CHANGE_OPERATION
- The SQL Server Change Tracking origin also writes the CRUD operation type to the jdbc.SYS_CHANGE_OPERATION record header attribute. Note, however, that CRUD-enabled stages use only the sdc.operation.type header attribute; they do not check the jdbc.SYS_CHANGE_OPERATION attribute.
Event Generation
The SQL Server Change Tracking origin can generate events that you can use in an event stream. When you enable event generation, the origin generates an event when it completes processing the data returned by the specified queries for all tables.
You can use the generated events in any logical way, for example:
- With the Pipeline Finisher executor to
stop the pipeline and transition the pipeline to a Finished state when
the origin completes processing available data.
When you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin.
For an example, see Case Study: Stop the Pipeline.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Record
Event records generated by the SQL Server Change Tracking origin include the following event-related record header attributes:
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses the following type: no-more-data - Generated when the origin completes processing all data returned by the queries for all tables. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The no-more-data event record includes no record fields.