SQL Server Change Tracking
Supported pipeline types: Data Collector
Use the SQL Server Change Tracking origin to generate records from change tracking tables. To read data from Microsoft SQL Server change data capture (CDC) tables, use the SQL Server CDC Client origin. For more information about the differences between change tracking and CDC data, see the Microsoft SQL Server documentation.
By default, the origin generates a record with change tracking information and the latest version of each record from the data tables. You can configure it to use only the change tracking information. The origin uses multiple threads to enable parallel processing of data.
The SQL Server Change Tracking origin includes the CRUD operation type in a record header attribute so generated records can be easily processed by CRUD-enabled destinations. For an overview of Data Collector changed data processing and a list of CRUD-enabled destinations, see Processing Changed Data.
You might use this origin to perform database replication. You can use a separate pipeline with the JDBC Query Consumer or JDBC Multitable Consumer origin to read existing data. Then start a pipeline with the SQL Server Change Tracking origin to process subsequent changes.
When you configure the origin, you can define groups of change tracking tables in the same database and any initial offsets to use. When you omit initial offsets, the origin processes only incoming data.
To define how the origin connects to the database, you specify connection information, a query interval, the number of retries, and any custom JDBC configuration properties that you need.
You specify whether you want to include the latest version of the data in generated records or whether to include only change tracking data. You define the number of threads that the origin uses to read from the tables and the strategy that the origin uses to create each batch of data. You also define the initial order that the origin uses to read the tables.
To use a JDBC version older than 4.0, you can specify the driver class name and define a health check query.
The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Permission Requirements
The user account that the origin uses to connect to the database must have the following permissions (example grant statements appear after this list):
- VIEW CHANGE TRACKING permission on the database.
- When using the default record generation to join change tracking data with the current version of the data, SELECT permission on at least the primary key column of each associated data table. When the origin processes only the change tracking data, the user does not need this permission.
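For reference, the following T-SQL sketch shows one way to grant these permissions, here at the schema and table level. The user sdc_user, schema dbo, table dbo.Customer, and key column CustomerID are placeholder names, not values from this documentation:

```sql
-- Placeholder names: sdc_user, dbo, dbo.Customer, CustomerID.
-- Allow the user to read change tracking data for all tables in the dbo schema.
GRANT VIEW CHANGE TRACKING ON SCHEMA::dbo TO sdc_user;

-- For the default record generation (change tracking data joined with the
-- current row), grant SELECT on at least the primary key column of each
-- tracked table. Granting SELECT on the full table also works.
GRANT SELECT ON OBJECT::dbo.Customer (CustomerID) TO sdc_user;
```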
Installing the JDBC Driver
Before you use the SQL Server Change Tracking origin, install the JDBC driver for the database. You cannot access the database until you install the required driver.
For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.
Multithreaded Processing
The SQL Server Change Tracking origin performs parallel processing and enables the creation of a multithreaded pipeline.
When you start the pipeline, the SQL Server Change Tracking origin retrieves the list of change tracking-enabled tables with a valid minimum change tracking version that are defined in the table configuration. The origin then uses multiple concurrent threads based on the Number of Threads property. Each thread reads data from a single table.
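The valid minimum change tracking version corresponds to the version information that SQL Server tracks per table. As a point of reference, a query along the following lines (dbo.Customer is a placeholder for a change tracking-enabled table) returns the current database change tracking version and the minimum version that is still valid for the table:

```sql
-- dbo.Customer is a placeholder table name.
-- current_version: the database's current change tracking version.
-- min_valid_version: the oldest version for which change information is still
-- retained for the table; NULL if change tracking is not enabled on the table.
SELECT
    CHANGE_TRACKING_CURRENT_VERSION()                            AS current_version,
    CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.Customer')) AS min_valid_version;
```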
As the pipeline runs, each thread connects to the origin system, creates a batch of data, and passes the batch to an available pipeline runner. A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors and destinations in the pipeline and performs all pipeline processing after the origin.
Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread. When the flow of data slows, the pipeline runners wait idly until they are needed, generating an empty batch at regular intervals. You can configure the Runner Idle Time pipeline property to specify the interval or to opt out of empty batch generation.
Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But since batches are processed by different pipeline runners, the order that batches are written to destinations is not ensured.
For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.
Example
Say you are reading from 10 tables. You set the Number of Threads property to 5 and the Maximum Pool Size property to 6. When you start the pipeline, the origin retrieves the list of tables. The origin then creates five threads to read from the first five tables, and by default Data Collector creates a matching number of pipeline runners. Upon receiving data, each thread passes a batch to an available pipeline runner for processing.
At any given moment, the five pipeline runners can each process a batch, so this multithreaded pipeline processes up to five batches at a time. When incoming data slows, the pipeline runners sit idle, available for use as soon as the data flow increases.
Batch Strategy
Each origin thread creates a batch of data from a single table. You can define one of the following strategies that the threads use to create each batch:
- Process All Available Rows from the Table
- Each thread creates multiple batches of data from one table, until all available rows are read from that table. The thread runs one SQL query for all batches created from the table. Then, the thread switches to the next available table, running another SQL query to read all available rows from that table.
- Switch Tables
- Each thread creates a set of batches from one table based on the Batches from Result Set property, and then switches to the next available table to create the next set of batches. The thread runs an initial SQL query to create the first set of batches from the table. The database caches the remaining rows in a result set for the same thread to access again, and then the thread switches to the next available table. A table is available in the following situations:
- The table does not have an open result set cache. In this case, the thread runs an initial SQL query to create the first batch, caching the remaining rows in a result set in the database.
- The table has an open result set cache created by that same thread. In this case, the thread creates the batch from the result set cache in the database rather than running another SQL query.
Table Configuration
- Schema - The schema where the tables reside.
- Table name pattern - Use SQL LIKE syntax to define a set of tables to process. For example, the table name pattern st% matches tables whose names start with "st". The default pattern, %, matches all tables in the schema. To preview which change tracking-enabled tables a pattern matches, see the query sketch after this list.
For more information about valid patterns for the SQL LIKE syntax, see https://msdn.microsoft.com/en-us/library/ms179859.aspx.
- Table exclusion pattern - When necessary, use a regular expression to exclude tables that match the table name pattern from being read.
For example, say you want to process all change tracking tables in the schema except for those that start with "dept". You can use the default % for the table name pattern, and enter dept.* for the table exclusion pattern.
For more information about using regular expressions with Data Collector, see Regular Expressions Overview.
- Initial offset - The SQL Server Change Tracking origin uses the
SYS_CHANGE_VERSION column as the offset column. To process existing data, define
the offset value to use. The offset is used for all tables included in the table
configuration.
When not set, the origin processes only incoming data.
Important: When processing an offset, the origin starts with the first value greater than the specified offset.
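To preview which change tracking-enabled tables a table name pattern would match, you can run a query such as the following against the database. The schema dbo and the pattern st% are example values; the table exclusion pattern is applied by the origin as a regular expression, so it is not expressed in SQL here:

```sql
-- Example values: schema dbo, table name pattern st%.
-- Lists change tracking-enabled tables in the schema whose names match the
-- SQL LIKE pattern used as the table name pattern.
SELECT t.name AS table_name
FROM sys.change_tracking_tables AS ct
JOIN sys.tables  AS t ON t.object_id = ct.object_id
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE s.name = 'dbo'
  AND t.name LIKE 'st%'
ORDER BY t.name;
```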
Initial Table Order Strategy
You can define the initial order that the origin uses to read the tables.
- None
- Reads the tables in the order that they are listed in the database.
- Alphabetical
- Reads the tables in alphabetical order.
The origin uses the table order strategy only for the initial reading of the tables. When threads switch back to previously read tables, they read from the next available table, regardless of the defined order.
Generated Record
- Change tracking and current data
- By default, when the SQL Server Change Tracking origin generates a record, it joins the data from the change tracking table with the current version of the record in the data table, using an outer join (see the query sketch below). The resulting record includes the following:
- Change tracking fields such as SYS_CHANGE_VERSION, SYS_CHANGE_CREATION_VERSION, and SYS_CHANGE_OPERATION.
- The latest version of the record, when available.
Important: Unlike a record generated by a CDC origin, the change tracking record includes the latest version of the record, not the version of the record created by the change.
- Change tracking only
- You can configure the origin to omit the join and produce a record that includes only the change tracking data.
All generated records include change tracking information in record header attributes.
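Conceptually, the default mode corresponds to a SQL Server query along the following lines. This sketch only illustrates the shape of the data, not the exact statement that the origin issues; dbo.Customer, its key column CustomerID, and the offset value 42 are placeholders:

```sql
-- Placeholders: dbo.Customer, CustomerID, and the offset 42 (a SYS_CHANGE_VERSION value).
-- CHANGETABLE returns changes with SYS_CHANGE_VERSION greater than the supplied version,
-- which matches the origin's behavior of starting after the specified initial offset.
-- The outer join picks up the latest version of each changed row; for deleted rows,
-- the data-table columns are NULL.
SELECT
    CT.SYS_CHANGE_VERSION,
    CT.SYS_CHANGE_CREATION_VERSION,
    CT.SYS_CHANGE_OPERATION,
    T.*
FROM CHANGETABLE(CHANGES dbo.Customer, 42) AS CT
LEFT OUTER JOIN dbo.Customer AS T
    ON T.CustomerID = CT.CustomerID
ORDER BY CT.SYS_CHANGE_VERSION;
```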
Record Header Attributes
The SQL Server Change Tracking origin generates JDBC record header attributes that provide additional information about each record, such as the original data type of a field or the source tables for the record.
The origin also includes the sdc.operation.type attribute and information from the SQL Server change tracking table. The SQL Server Change Tracking header attributes are prefixed with "jdbc".
You can use the record:attribute or record:attributeOrDefault functions to access the information in the attributes. For more information about working with record header attributes, see Working with Header Attributes.
Header Attribute Name | Description |
---|---|
sdc.operation.type | The origin uses the following values to represent the operation type: 1 for INSERT, 2 for DELETE, 3 for UPDATE. |
jdbc.tables | Provides a comma-separated list of source tables for the fields in the record. Note: Not all JDBC drivers provide this information. |
jdbc.<column name>.jdbcType | Provides the numeric value of the original SQL data type for each field in the record. See the Java documentation for a list of the data types that correspond to numeric values. |
jdbc.<column name>.jdbc.precision | Provides the original precision for all numeric and decimal fields. |
jdbc.<column name>.jdbc.scale | Provides the original scale for all numeric and decimal fields. |
jdbc.SYS_CHANGE_COLUMNS | Lists the columns that have changed since the last sync. Returns NULL when column change tracking is not enabled, when the operation is an insert or delete, or when all non-primary key columns were updated at once. |
jdbc.SYS_CHANGE_CONTEXT | Provides change context information when available. |
jdbc.SYS_CHANGE_CREATION_VERSION | Provides the version number associated with the last insert operation. |
jdbc.SYS_CHANGE_OPERATION | Indicates the type of change that occurred: I for insert, U for update, D for delete. |
jdbc.SYS_CHANGE_VERSION | Provides the version number of the most recent change to the row. |
For details about the SYS_CHANGE change tracking attributes, see the SQL Server documentation.
CRUD Operation Header Attributes
- sdc.operation.type
- The SQL Server Change Tracking origin writes the operation type to the sdc.operation.type record header attribute.
- jdbc.SYS_CHANGE_OPERATION
- The SQL Server Change Tracking origin also writes the CRUD operation type to the jdbc.SYS_CHANGE_OPERATION record header attribute. Note, however, that CRUD-enabled stages use only the sdc.operation.type header attribute; they do not check the jdbc.SYS_CHANGE_OPERATION attribute.
Event Generation
The SQL Server Change Tracking origin can generate events that you can use in an event stream. When you enable event generation, the origin generates an event when it completes processing the data returned by the specified queries for all tables.
You can use the generated events in any logical way, for example:
- With the Pipeline Finisher executor to
stop the pipeline and transition the pipeline to a Finished state when
the origin completes processing available data.
When you restart a pipeline stopped by the Pipeline Finisher executor, the origin continues processing from the last-saved offset unless you reset the origin.
For an example, see Case Study: Stop the Pipeline.
For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.
Event Record
Event records generated by the SQL Server Change Tracking origin include the following event-related record header attributes:
Record Header Attribute | Description |
---|---|
sdc.event.type | Event type. Uses the following type: no-more-data - Generated when the origin completes processing all data returned by the queries for all tables. |
sdc.event.version | Integer that indicates the version of the event record type. |
sdc.event.creation_timestamp | Epoch timestamp when the stage created the event. |
The no-more-data event record includes no record fields.