The CDC Source offers five different ways to retrieve change data. The format you receive the data in is determined by the “CDC processing mode” you select in the CDC Source Editor UI.
The two primary modes are All Changes, or Net Changes. Net is the best mode to use for most ETL processes, although it does introduce some overhead on the source system – see the descriptions of the modes below for details.
To describe the modes, we’ll use the following scenario.
- CDC is enabled on the AdventureWorksDW DimCustomer table (similar to the example in this post)
- Three new customers (CustomerKey = 30000, 30001, 30002) are added
- The LastName for two of the customers (30001, 30002) is set to SMITH
- One of the new customers is deleted (CustomerKey = 30002)
The SQL statements for these steps looks like this:
-- INSERT - Add new customers
SET IDENTITY_INSERT [DimCustomer_CDC] ON
INSERT INTO [dbo].[DimCustomer_CDC] (CustomerKey, CustomerAlternateKey, FirstName, LastName)
(30000, N'AW00030000', N'Mister30000', N'NotSmith'),
(30001, N'AW00030001', N'Mister30001', N'NotSmith'),
(30002, N'AW00030002', N'Mister30002', N'NotSmith')
SET IDENTITY_INSERT [DimCustomer_CDC] OFF
-- UPDATE - Set last names
SET LastName = N'SMITH'
WHERE CustomerKey IN (30001, 30002)
-- DELETE - Remove recently added customer
DELETE FROM [dbo].[DimCustomer_CDC] WHERE CustomerKey = 30002
In this process mode, a row will be returned for each change applied to the source table. In this example, the CDC Source returns six rows – three INSERTs, two UPDATEs, and one DELETE. The rows will be ordered by LSN (i.e. in the order in which they were applied). This mode is typically used in ETL processes that require heavy auditing (usually for compliance reasons). While this mode has less overhead on the source system (it simply returns the captured rows, and requires no additional roll-up processing), it does complicate the ETL process, as you will need to merge the changes yourself.
The order that the operations took place can be determined by the __$seqval field.
This processing mode is similar to using the cdc.fn_cdc_get_all_changes function.
Note: You will (typically) not want to use the CDC Splitter when processing all changes, as sending your change rows down multiple paths will make it difficult to apply them in the correct order.
All with old values
This processing mode is similar to All, except that you get two rows for each UPDATE statement – a row containing the values Before Update, and one containing the values After Update. Using our example, the CDC Source returns eight rows – three INSERTs, two UPDATEs with Before Update values, two UPDATEs with After Update values, and one DELETE.
To distinguish between old and new values, check the __$operation value. A value of 3 indicates the row has Before Update values, 4 means After Update values.
When using this processing mode, the CDC server will “roll-up” all changes for a key into a single row. In this example, the CDC Source returns two rows – two INSERTs for the new rows. The value for CustomerKey 20001 will already have the updated LastName value. CustomerKey 20002 is not returned at all, since it was deleted in the end. This mode does have a performance impact, as it requires the source server to apply the roll-up logic to figure out the final state of the row. However, it greatly simplifies the ETL logic needed for processing the change data – see the CDC Overview post for an example of how to processes Net changes.
This processing mode is similar to using the cdc.fn_cdc_get_net_changes function.
Note: This mode requires that the source table has a primary key or unique index. It also requires that the @supports_net_changes parameter was set to True when CDC was enabled on the source table. See sys.sp_cdc_enable_table for more details.
Net with update mask
This processing mode is similar to Net, except that there will be an additional boolean column (__$<column-name>__Changed) that indicates whether a column value changed. This may be a useful processing mode if you are only monitoring/processing changes to certain fields, or wish to provide additional processing logic when changes are made to a specific field.
Note: In this example, the __Changed field would be True for all rows, since they are all new (from the perspective of this load). If a change was made to an existing row, unchanged columns would have a False value.
Net with merge
The key things to keep in mind for this processing mode are:
- UPDATE and INSERT rows are grouped together with the same operation value (__$operation = 5)
- Grouping the rows together makes it easier to use the changes with a MERGE statement
From the cdc.fn_cdc_get_net_changes entry in Books Online for the all with merge row_filter_option:
Returns the LSN of the final change to the row in the metadata columns __$start_lsn. The column __$operation will be one of two values: 1 for delete and 5 to indicate that the operation needed to apply the change is either an insert or an update. The column __$update_mask is always NULL.
Because the logic to determine the precise operation for a given change adds to query complexity, this option is designed to improve query performance when it is sufficient to indicate that the operation needed to apply the change data is either an insert or an update, but it is not necessary to explicitly distinguish between the two. This option is most attractive in target environments where a merge operation is available directly, such as a SQL Server 2008 environment.
You’ll use the following data flow pattern when using this change mode:
- Retrieve changes with CDC Source
- Use CDC Splitter to divide rows
- Note: only the DELETE and UPDATE paths will receive rows, as INSERTs are grouped with UPDATEs in this mode
- Stage DELETE rows
- Stage UPDATE rows
- Use an Execute SQL Task after the data flow to apply DELETE changes in a batch operation
- Use an Execute SQL Task to run a MERGE statement to process UPDATE rows in a single batch
This pattern is similar to the approach I described in my previous CDC post, except 1) new rows are not inserted into the final table within the data flow and 2) MERGE is used instead of UPDATE FROM … SELECT.