Processing Modes for the CDC Source

The CDC Source offers five different ways to retrieve change data. The format you receive the data in is determined by the “CDC processing mode” you select in the CDC Source Editor UI.

Cdc_source_ui

The two primary modes are All Changes, or Net Changes. Net is the best mode to use for most ETL processes, although it does introduce some overhead on the source system – see the descriptions of the modes below for details.

To describe the modes, we’ll use the following scenario.

  • CDC is enabled on the AdventureWorksDW DimCustomer table (similar to the example in this post)
  • Three new customers (CustomerKey = 30000, 30001, 30002) are added
  • The LastName for two of the customers (30001, 30002) is set to SMITH
  • One of the new customers is deleted (CustomerKey = 30002)

The SQL statements for these steps looks like this:

All

In this process mode, a row will be returned for each change applied to the source table. In this example, the CDC Source returns six rows – three INSERTs, two UPDATEs, and one DELETE. The rows will be ordered by LSN (i.e. in the order in which they were applied). This mode is typically used in ETL processes that require heavy auditing (usually for compliance reasons). While this mode has less overhead on the source system (it simply returns the captured rows, and requires no additional roll-up processing), it does complicate the ETL process, as you will need to merge the changes yourself.

Cdc_all

The order that the operations took place can be determined by the __$seqval field.

This processing mode is similar to using the cdc.fn_cdc_get_all_changes function.

Note: You will (typically) not want to use the CDC Splitter when processing all changes, as sending your change rows down multiple paths will make it difficult to apply them in the correct order.

All with old values

This processing mode is similar to All, except that you get two rows for each UPDATE statement – a row containing the values Before Update, and one containing the values After Update. Using our example, the CDC Source returns eight rows –  three INSERTs, two UPDATEs with Before Update values, two UPDATEs with After Update values, and one DELETE.

Cdc_all_with_old

To distinguish between old and new values, check the __$operation value. A value of 3 indicates the row has Before Update values, 4 means After Update values.

Net

When using this processing mode, the CDC server will “roll-up” all changes for a key into a single row. In this example, the CDC Source returns two rows – two INSERTs for the new rows. The value for CustomerKey 20001 will already have the updated LastName value. CustomerKey 20002 is not returned at all, since it was deleted in the end. This mode does have a performance impact, as it requires the source server to apply the roll-up logic to figure out the final state of the row. However, it greatly simplifies the ETL logic needed for processing the change data – see the CDC Overview post for an example of how to processes Net changes.

Cdc_net

This processing mode is similar to using the cdc.fn_cdc_get_net_changes function.

Note: This mode requires that the source table has a primary key or unique index. It also requires that the @supports_net_changes parameter was set to True when CDC was enabled on the source table. See sys.sp_cdc_enable_table for more details.

Net with update mask

This processing mode is similar to Net, except that there will be an additional boolean column (__$<column-name>__Changed) that indicates whether a column value changed. This may be a useful processing mode if you are only monitoring/processing changes to certain fields, or wish to provide additional processing logic when changes are made to a specific field.

Cdc_net_update_mask

Note: In this example, the __Changed field would be True for all rows, since they are all new (from the perspective of this load). If a change was made to an existing row, unchanged columns would have a False value.

Net with merge

The key things to keep in mind for this processing mode are:

  1. UPDATE and INSERT rows are grouped together with the same operation value (__$operation = 5)
  2. Grouping the rows together makes it easier to use the changes with a MERGE statement

From the cdc.fn_cdc_get_net_changes entry in Books Online for the all with merge row_filter_option:

Returns the LSN of the final change to the row in the metadata columns __$start_lsn. The column __$operation will be one of two values: 1 for delete and 5 to indicate that the operation needed to apply the change is either an insert or an update. The column __$update_mask is always NULL.

Because the logic to determine the precise operation for a given change adds to query complexity, this option is designed to improve query performance when it is sufficient to indicate that the operation needed to apply the change data is either an insert or an update, but it is not necessary to explicitly distinguish between the two. This option is most attractive in target environments where a merge operation is available directly, such as a SQL Server 2008 environment.

You’ll use the following data flow pattern when using this change mode:

  • Retrieve changes with CDC Source
  • Use CDC Splitter to divide rows
    • Note: only the DELETE and UPDATE paths will receive rows, as INSERTs are grouped with UPDATEs in this mode
  • Stage DELETE rows
  • Stage UPDATE rows
  • Use an Execute SQL Task after the data flow to apply DELETE changes in a batch operation
  • Use an Execute SQL Task to run a MERGE statement to process UPDATE rows in a single batch

This pattern is similar to the approach I described in my previous CDC post, except 1) new rows are not inserted into the final table within the data flow and 2) MERGE is used instead of UPDATE FROM … SELECT.

Cdc_net_with_merge

 

 

 

5 thoughts on “Processing Modes for the CDC Source”

  1. I have a Data Warehouse that we are converting to load incrementally. I have all the packages built from source to OLAP, and it was working fine. We then had to change the source database to pull off of new Stand-alone/Read Only Logshipping servers and for tables with a large number of transactions, it hangs on the CDC Source component. I have it setup for net changes and can't even get preview in the component. If I change it to ALL it loads a preview, but takes longer than it should. The packages were working fine, and the new servers have way more horsepower, so I don't have a clue as to why it's hanging.

    Is it not possible to do Incremental Loads against a Stand-Alone database?

    I had to change my Initial Load packages to use: Set CDC Start -> Get Processing Range -> Set Processed Range. So are we not able to use CDC Initial Load against a Stand-Alone database also?

    I've been looking everywhere for someone who has, or at least has tried to run CDC against a Stand-Alone Database, but haven't found anything. Thank you for any help!

  2. Not sure if my last post submitted.

    I am trying to re-program our EDW to run incrementally. I was able to use the above CDC Source on our LIVE source, but now we have switched to a Log Shipping server that is Read-Only/Stand-Alone and refreshes twice a day.

    I have it setup to get net changes, and the component just hangs. I've reduced the records down to a little over 1000 records and it still takes over 3 minutes to run. Any more than that, it doesn't complete. Is there an issue with the CDC Source Component handling bigger volumes of data? We contacted Attunity and were told that we needed to buy their Enterprise product in order for our code to work. Any suggestions?

    1. hi,
      I am a new user of SSIS+CDC. I had the 'CDC Source' not working problems issue couple days ago. Sometime it appeared running and running and would not go to next step, and sometime I got error about the 'ADO .NET reference is not an object (similar like that)' error.
      I think it is a CDC Source component bug which is related to something else. So I upgraded my SQL Server to 2012 SP2, Upgrade the ASP. NET mvc to latest version, and installed the Visual Studio 2012 Shell (Isolated). Then the problem got fixed. CDC source works good after that.
      I am not very sure with installation fixed the problem, but you can try my solution.
      Good Luck

Leave a Reply

Your email address will not be published. Required fields are marked *