CDC in SSIS for SQL Server 2012

SQL Server 2012 introduces new components that make it easier to do Change Data Capture (CDC) using SSIS. This blog post provides a quick walkthrough of how to use them.

Note: The documentation for the CDC components is not in Books Online yet. It will be appearing in the RTM refresh.

New Task and Components

CDC Control Task

The CDC Control task is used to control the life cycle of change data capture (CDC) packages. It handles CDC package synchronization with the initial load package, the management of Log Sequence Number (LSN) ranges that are processed in a run of a CDC package. In addition, the CDC Control task deals with error scenarios and recovery.

CDC Source

The CDC source reads a range of change data from CDC change tables and delivers the changes downstream to other SSIS components.

CDC Splitter

The CDC splitter splits a single flow of change rows from a CDC Source component into different data flows for Insert, Update and Delete operations. It is essentially a “smart” Conditional Split transform that automatically handles the standard values of the __$operation column.

Walkthrough

Database Setup

For sample data, we will create a new database (CDCTest), and select a subset of rows from the AdventureWorksDW DimCustomer table into a sample table (DimCustomer_CDC). This will become the Source table for this demo.

We then enable CDC on the database, and create a capture instance for the DimCustomer_CDC table.

We can see that a number of tables have been added under the cdc schema, and that SQL agent jobs have been created to capture changes being made to this table.

CDC tablesCDC SQL Agent jobs

For the Destination, we’ll create a separate table – DimCustomer_Destination – with the same structure as the Source.

In real life this would be in a separate database, and usually on a completely different server (otherwise, why are you mirroring the changes?), but for the purposes of this walkthrough, we’ll keep it all together.

We’re ready to start consuming changes with SSIS.

SSIS Packages

Our processing logic will be split into two packages – an Initial Load package that will read all of the data in the source table, and an Incremental Load package that will process change data on subsequent runs.

Initial Load

This package will only be run once, and handles the initial load of data in the source table (DimCustomer_CDC). The package uses the following logic:

  1. Use the CDC Control Task to mark the initial load start LSN
  2. Transfer all of the data from the source table into our destination table
  3. Use the CDC Control Task to mark the initial load end LSN

Package creation steps:

Create a new SSIS package

Add a CDC Control Task. Double click the Control Task to bring up the editor.

  • Add a new ADO.NET connection manager for the Source database
  • Set CDC Control Operation to Mark initial load start
  • Create a new package variable (CDC_State) to hold the CDC state information.
  • Set the connection manager for the Destination database
  • Create a table for storing the state ([cdc_states]). This table will be used to track the CDC load information, so that you only pick up new changes each time the incremental load package is run. It will be created in the Destination database.
  • Set the state name (CDC_State). This value acts as a key for the CDC state information. Packages that are accessing the same CDC data should be using a common CDC state name.

CDC Control Task Editor

Add a Data Flow Task, and connect it to the CDC Control Task

  • Configure the Data Flow task to transfer all of the data from the Source to the Destination

Add a second CDC Control Task. Connect the success constraint of the Data Flow Task to it.

  • Configure the second CDC Control Task with the same settings as the first one, except the CDC Control Operation should be set to Mark initial load end.

CDC Control Task Editor

The package will now look like this:

image

When we run the package, all of the data currently in the Source table will be transferred to the Destination, and the initial CDC state markers will be created. If we select from the cdc_states table, we can see that there is now a “CDC_State” entry. Note, the state entry is an encoded string that is used by the CDC components – you should not have to edit or deal with it directly.

image

Incremental Load

This package will be run every time we want to grab the latest changes from our Source table. It will store the CDC state every time it runs, ensuring that we only pick up new changes every time we run the package. It will use the following logic:

  1. Create staging tables for updated and deleted rows (so we can process the changes in a batch – more about that below)
  2. Use a CDC Control Task to retrieve the CDC state from the destination table
  3. Use a CDC Source to retrieve our change data
  4. Use a CDC Splitter transform to redirect the rows based on their operation (New, Updated, and Deleted)
  5. Insert the new rows into the Destination table
  6. Stage the Updated and Deleted rows
  7. Process the Updated and Deleted rows using Execute SQL Tasks
  8. Use a CDC Control Task to update the CDC state

Package creation steps:

Add an Execute SQL Task to create staging tables

  • Create a connection manager for the Destination database (set the ConnectionType to ADO.NET to reuse the same connection manager used by the CDC Control Task)
  • Enter the SQL statements to create two staging tables that match the Destination table. For example:

Add a CDC Control Task. Connect the Execute SQL task to the CDC Control Task

  • Create a connection manager for the Source database
  • Set the CDC Control Operation to Get processing range
  • Create a new CDC state variable (CDC_state)
  • Create a connection manager for the Destination database
  • Select the state table (this was created by the Initial Load package) – [dbo].[cdc_states]
  • Set the State name – this must match what was used in the Initial Load package (CDC_State)

image

Add a Data Flow Task. Connected it to the CDC Control Task.

Add a CDC Source component

  • Set the Connection Manager to the Source database
  • Select the source table (DimCustomer_CDC)
  • Set the CDC processing mode to Net
  • Select the CDC_State variable
  • Click the Columns tab to make sure we’re pulling back all of the right information, then click OK.

image

Add a CDC Splitter transform

Add an ADO.NET Destination – rename it to “New rows”

  • Connect the InsertOutput of the CDC Splitter to the “New rows” destination
  • Double click the “New rows” destination to bring up its editor
  • Set the Destination connection manager, and select the main destination table (DimCustomer_Destination)
  • Click the Mappings tab. The columns should automatically match by name. The CDC columns (the ones starting with __$) can be ignored

Add two more ADO.NET Destinations, mapping the DeleteOutput to the stg_DimCustomer_DELETES table, and UpdateOutput to stg_DimCustomer_UPDATES. We will update the final Destination table using batch SQL statements after this data flow. An alternative design here would be to use an OLE DB Command transform to perform the updates and deletes. The OLE DB Command approach has some performance problems though, as the transform operates on a row by row basic (i.e. it issues one query per row).

image

Back in the Control Flow, add two Execute SQL tasks. These tasks will perform the batch update/delete using the data we loaded into the staging tables. The queries look like this (note, I took columns out of the update statement to keep things short – normally you’d include all of the columns here):

Add a CDC Control Task. It should have the same settings as the first CDC Control Task in the package, except the CDC control operation is Mark processed range.

Finally, add an Execute SQL Task to drop the staging tables. Alternatively, you can leave the staging tables in place, just truncate them.

Your package should look like this:

image

Running the Incremental Load Package

If we run the Incremental Load package at this point, it should run successfully, but not transfer any rows. That’s because we haven’t made any changes yet to the Source table. Let’s do that now by running the following script against the Source table:

If we enable a Data Viewer in the Incremental Load package and run it, we’ll see that the CDC Source picks up all of the rows we’ve changed. We can see that some of the rows are __$operation = 4 (update), while the rest are 2 (new rows).

image

When the package completes, we see that the data flow moved a total of 17,995 rows (11 updates, and the rest are inserts).

image

Because the CDC Control Task updated LSN values stored in the CDC state table, if we run the package a second time (without making any changes to the source table), we see that no rows get transferred in the data flow.

image

Wrap Up

I hope you found this walkthrough of the new CDC components in SQL Server 2012 helpful. I will continue posting more information about the CDC Components in follow up posts. Please let me know if you have any specific questions / topics you’d like me to describe further.

52 thoughts on “CDC in SSIS for SQL Server 2012”

  1. Hi Uzzie,

    Could you provide more details about what you mean? What part of the "incremental load" isn't being done?

    Since these components make use of SQL Server CDC, you should make sure that it is correctly configured. You might want to run queries directly against the CDC tables to make sure that change rows are being picked up. There are steps to follow in Books Online – http://msdn.microsoft.com/en-us/library/bb522489….

    1. Hi Matt,
      Like Uzzie, I too am having the same incremental load issue. The initial load works fine, but when I run the incremental load, nothing gets picked up. I checked the value stored in the cdc_state table and the lsn and timestamp values don't seem to correspond to the lsn or timestamp in the cdc.lsn_time_mapping table. The lsn in the cdc_table also do not correspond to the lsn in the cdc table of the cdc enable table. For example (using numbers instead of lsn to illustrate), the last "lsn" in the cdc table would be 10001, but the "lsn" in the cdc_state table would be 20001. So when I do an update, the new "lsn" capture would be 10002, so when I run the incremental load, the range starts with 20001 which doesn't include the "lsn" of the most recent update. I hope I'm explaining it clearly. Thanks.

  2. No – this feature only works with Change Data Capture.
    Change Tracking is a synchronous event (i.e. you have to deal with the change as it happens) – you don't usually integrate CT with an ETL process directly. Instead, you'd redirect the changes to a separate table, and read them with an SSIS data flow later on. CT doesn't give you the same level of tracking or flexibility that CDC provides (but it can be a much lighter weight process).

    1. Hi Ron,
      The CDC components come with a windows service that you'll use to configure CDC against Oracle. As part of the configuration, the service will setup some empty CDC tables in a SQL Server instance to mimic your Oracle data. When you configure CDC in SSIS, you'll point the components at the SQL instance (this essentially tricks SSIS into thinking it's talking to SQL Server, and not Oracle… by using the same CDC tables/API, we can keep the functionality in sync).

      I'm working on getting some posts about CDC with Oracle.

      Thanks!

      1. Is there any posts on how to setup this with oracle I have been strugling with it for some time and begging to pull my hair out. ! If anyone has any articles on this it would be much appreciated as I can setup the first steps fine and have the tables but must be doing something wrong with the flow controls.

        Cheers

  3. Hey Matt,
    Do you have any links or blogs that I can go to for Oracle CDC? I have everything setup and its seems to be working properly but I am not seeing any data landing in the cdc mirror tables. I am making changes in Oracle and I can see the 'Status' move from IDLE to PROCESSING but I am not seeing any changed data.

    Thanks…

    1. I know what the issue is. If the table is partitioned in Oracle it does not work. If I recreate the table as a non partitioned table it works. Maybe there is something else I need to do for a partitioned table but just FYI for everyone.

  4. It may be because the update sql is slightly off.

    WHERE CustomerKey < 500 should be WHERE CustomerKey > 11500

    and

    WHERE CustomerKey <= 11000 AND CustomerKey >= 11010 should be WHERE CustomerKey >= 11000 AND CustomerKey < 11010

    At least that was true for me and my copy of AW.

    1. Hi Cory,

      Your SSIS package can write the changes anywhere. You'd use the CDC components to detect/read the change, and then use the Oracle Destination to write them to Oracle. You can find the 2012 Oracle connectors here.

  5. It would depend on how you add the new column. To avoid having to do a full load, you could:

    1. Make the schema change
    – update the table
    – update the CDC capture information
    – update your SSIS package

    2. Insert the new column values

    3. Run your incremental CDC package to pick up the new column values.

    If you insert the new values before you update your CDC capture/SSIS package, then you'd need to use another way to get all of the data over (since those column changes wouldn't be picked up by CDC). Resetting your CDC process and running another full load would do it, but you could also script something out that just moves the values for the new column.

  6. The CDC components in SQL 2012 work with SQL Server, and Oracle.
    Attunity has a separate solution that works with DB2 (and other systems), but they are not part of the SQL 2012 license.

  7. I am a bit confused with the CDC for Oracle and how the architecture is supposed to look like in order to use packages like the ones described above. When I configure the CDC for Oracle, what happens on the SQL Server side? What more steps is that I need to take to have "everything" set up?

  8. IN SSIS I have implemented change data capture. I have made initial load and incremental load for the entire table which is related to data warehouse. Also made two master packages for initial and incremental. Initial consider as full load which would be loaded once and incremental loaded as per business require mental.
    I need some information related to risk and error handling or any other related information maintained of packages, how to send email notification do any have script???

  9. Hi Matt, must say very nice walk through! I am newbie to SSIS so i had a little difficulty in understanding the two SQL tasks "Update & delete", whats the purpose of these tasks?

    1. Each type of CDC change (New row, updated row, deleted row) needs to be processed separately, since they result in different SQL operations (INSERT, UPDATE, DELETE)

  10. Hi Matt –

    I was walking through this and I get an error in the incremental part out of the CDCSource and get the following error.

    Error: 0x4002F304 at Certificate Partner, CDC Source [2]: "Problems when trying to get changed records from dbo_certificate_partner. Reason – Incorrect syntax near the keyword 'CONVERT'..".

    This is the command the source is running.

    select [__$start_lsn],[__$operation],[__$update_mask],[recno],[cobrandrecno],[typeofpurchaserecno],certtyperecno,[activationcoderecno],[status],[editedby],[editeddate],[customerclassrecno] from [cdc].[fn_cdc_get_net_changes_dbo_certificate_partner]
    (
    CONVERT(BINARY(10), 0x0023C1B900001DD00003, 1),
    CONVERT(BINARY(10), 0x0023C1B900001E540002, 1),
    'all'
    )

    What I found is that it is trying to run against a DB that has compatibility mode of 80. I changed to 100 and it worked.Should I assume that it will not ever work against 80?

    Thanks Scott

    1. I believe the CDC functionality was first introduced in SQL Server 2008 (DB version 100), so yes, I'd assume that that the SQL statements it generates assume a minimum level of compatibility.

  11. Hi Matt,
    Does Backup routine affect CDC. Also does the DBA need to truncate the CDC schema tables as house keeping or does the Cleanup routine do this?
    Thanks
    raj.

  12. Hello,
    We have got this going and its great, so thanks for your walk through. Now we want to wrap it up so that it is either complete success or complete failure with the incremental package so we can correct the error then run it again. When we set the package Transaction property to Required it fails on the end CDC control task in the package with no errors elsewhere.

    Cheers

    Hugh

  13. Hello
    This is very helpful to me.
    But I have question regarding if I want to do extract data from multiple tables and put into one table ,Is this possible?

    Thanks,
    Shreyans

    1. Sure – the only difference would that I think you'd want to track the state for each table separately (separate variables, separate sources, separate control tasks). You can control where the data ends up in your package logic.

  14. Hi Matt,
    This is very informative, thanks for posting the same. I have tried all the steps and it is executing without any error. I have couple of questions:
    1. After the initial load , I made an update for a single row in the source table.When I run the Incremental Load, there is no error shown , the package runs successfully without updating the destination table.
    2. If I want to restart the process once again. I mean load the destination table using the initial load. Can I truncate the CDC_states table and start over again.

    Thanks

    1. The first time you run the incremental load package, it will check for any changes that were made while your "initial load" phase was done. If you insert rows after the initial load, you should see them appear the second time you run your incremental load package.

      Check the CDC API – I believe they have stored procedures which let you pause/reset CDC tracking on the table. Another option is to drop/recreate the CDC tables and run your initial load package again.

  15. Hi, is there any way to update CDC_State – CS value ? since it is keeping the the same value if i run the same package again , the newly inserted rows is getting inserted again in the destination.

  16. Hi Matt,

    This is a great intro into the topic. I was wondering if there is any reason that using a single staging table with a CDCUpdateType field ( with values like 'I','D','U') would be a bad idea. Would you run into blocking issues? Because I can see that with the CDC splitter you need to have three separate output streams. Or do you think having three staging tables is better for the data transform parts of the ETL after the staging tables have been loaded?

  17. Matt,

    We have CDC enabled on our production Stage environment which receives replication rows from our source tables. We also have a 2012 SSIS CDC package to mine the updates, inserts and deletes from the *_CT tables. We take these records and version them to our Data Store using the [cdc].[lsn_time_mapping] table to determine the row change order.

    One issue we have not been able to resolve..

    We have some user applications the business users use that retrieve and delete many rows. When exiting the app, it writes these rows back to the sever, even if this data did not change. Our CDC processing mode is set to NET so we see the _$operation 1's and 2's in the CDC table and SSIS processes these as a _$operation 5. How can cancel out the deletes and inserts when the insert matches what already in the table?

  18. Hi Matt,_We are using CDC for incremental load in SQL 2012. Everything was fine until one doomed morning when ETL job took more than 4 hours to finish and then daily data accuracy monitoring job raised an alarm. While checking with DBA we learnt that he executed a simple update statement to load another table using trigger. As far as I know that SQLServer don’t log any entry in CT table if there is no genuine change. This statement was something like “Update ABC set Amount = Amount”. CT table logged every single record with two operations, Delete first and Insert in same transaction. Yes a “Halloween” issue. First thing that it shouldn’t had logged a single entry in CT table since it wasn’t a true change and second is that CT logged two operations (Delete and Insert ) for update Action. ETL CDC Splitter separated insert and delete action and expired all deleted records in destination table since we have SCD type2 nature. Unfortunately it took us few days to fix ETL as well as data. Is there any other simple way to deal with Halloween issue in CDC or in CDC ETL tasks rather than implementing it in script? _

  19. HI Matt, Great article. I implemented the CDC as per your direction using SSIS 2012. I came across a small issue. When I use ADO.net destination for UpdateOutput , it failed because I have a unique key constraint on of the columns.When I removed the constraint, the UpdateOutput dataflow inserted a new column in the destination table insted of updating the existing column in the destination. Please direct me how I can correct this issue.

  20. Can we select multiple tables in CDC SOurce.?
    COnsider if i have 1000 tables, do i need to create this package for all tables or can i configure multiple tables in single CDC source of a package

  21. We've been using CDC for 6 months now. We just migrated to a new data center and now code that was working is failing intermittently. Obviously there is something related to the move going on here, but because of the nature of the failure and the messages involved I'm having trouble tracking down the issue. Here is the call stack from SSIS, to read in order go from bottom to top, (1-4) I've cleaned it up a little for readability:

    (4) CDC Data Flow:Error: SSIS Error Code DTS_E_PRIMEOUTPUTFAILED.
    The PrimeOutput method on CDC Source – Get Changes Since Last Sync LSN returned error code 0xC0047020.
    The component returned a failure code when the pipeline engine called PrimeOutput().
    The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing.
    There may be error messages posted before this with more information about the failure.

    Load ODS Changes into Logic using CDC:Error:
    System.Runtime.InteropServices.COMException (0xC0047020):
    Exception from HRESULT: 0xC0047020
    at Microsoft.SqlServer.Dts.Pipeline.Wrapper.IDTSBuffer100.SetEndOfRowset()
    at Attunity.SqlServer.CDCSrc.CdcSourceComponent.TransferRowsToOutput
    (
    PipelineBuffer defaultBuffer
    ,PipelineBuffer errorBuffer
    ,IDTSOutput100 defaultOutput
    ,String stateVariableValue
    ,Boolean handleInitialProcessingColumn
    ,CdcProcessingMode processingMode
    ,Int32 errorOutputId
    )
    at Attunity.SqlServer.CDCSrc.CdcSourceComponent.PrimeOutput(Int32 outputs, Int32[] outputIDs, PipelineBuffer[] buffers)
    at Microsoft.SqlServer.Dts.Pipeline.ManagedComponentHost.HostPrimeOutput
    (
    IDTSManagedComponentWrapper100 wrapper
    ,Int32 outputs
    ,Int32 [] outputIDs
    ,IDTSBuffer100 [] buffers
    ,IntPtr ppBufferWirePacket
    )

    (3) CDC Data Flow:Error: "Problems when trying to get changed records from [TableName].
    Reason – Exception from HRESULT: 0xC0047020.".

    (2) CDC Data Flow:Error: Unspecified error

    (1) CDC Data Flow:Error: Unspecified error

  22. Hi Matt,
    Can we use these components in case when the source is SQL serer 2008 R2, if i devlope the solution in 2012 as the source and then migrate
    to 2008 as source will i be able to use the same components.
    Thanks
    Garry

  23. Hi MATT,
    I have implemented the CDC on test enivornment,it is working as per expectation. but when I have implemented the same on DEV enivornment,intial load is working fine. but when i have started with incremental load CDC source task is not capturing the changes. I have checked source CT table ,it is capturing changes. also CDC state variable is also updating when i am running incremental packages.
    please suggest on this.

  24. Hi matt, you seem to have errors in your data loading script lines 25 and 35? what is CustomerKey &lt; 500 supposed to mean?

  25. hey matt,

    i think the last script for loading remaining records and updates needs to be… USE [CDCTest]
    GO

    — Transfer the remaining customer rows
    SET IDENTITY_INSERT DimCustomer_CDC ON

    INSERT INTO DimCustomer_CDC
    ( CustomerKey ,
    GeographyKey ,
    CustomerAlternateKey ,
    Title ,
    FirstName ,
    MiddleName ,
    LastName ,
    NameStyle ,
    BirthDate ,
    MaritalStatus ,
    Suffix ,
    Gender ,
    EmailAddress ,
    YearlyIncome ,
    TotalChildren ,
    NumberChildrenAtHome ,
    EnglishEducation ,
    SpanishEducation ,
    FrenchEducation ,
    EnglishOccupation ,
    SpanishOccupation ,
    FrenchOccupation ,
    HouseOwnerFlag ,
    NumberCarsOwned ,
    AddressLine1 ,
    AddressLine2 ,
    Phone ,
    DateFirstPurchase ,
    CommuteDistance
    )
    SELECT CustomerKey ,
    GeographyKey ,
    CustomerAlternateKey ,
    Title ,
    FirstName ,
    MiddleName ,
    LastName ,
    NameStyle ,
    BirthDate ,
    MaritalStatus ,
    Suffix ,
    Gender ,
    EmailAddress ,
    YearlyIncome ,
    TotalChildren ,
    NumberChildrenAtHome ,
    EnglishEducation ,
    SpanishEducation ,
    FrenchEducation ,
    EnglishOccupation ,
    SpanishOccupation ,
    FrenchOccupation ,
    HouseOwnerFlag ,
    NumberCarsOwned ,
    AddressLine1 ,
    AddressLine2 ,
    Phone ,
    DateFirstPurchase ,
    CommuteDistance
    FROM [AdventureWorksDW2012].[dbo].[DimCustomer]
    WHERE CustomerKey > 11500

    SET IDENTITY_INSERT DimCustomer_CDC OFF
    GO

    — give 10 people a raise
    UPDATE DimCustomer_CDC
    SET YearlyIncome = YearlyIncome + 10
    WHERE CustomerKey BETWEEN 11000 AND 11010
    GO

  26. Hi Matt,
    I tried to implement incremental ETL using CDC as explained by you.
    I added one row to source column it worked as expected.But when i m trying to run the same pkg again without making any change to source table.The row i added last time is again flowing through.since my target table is having primary key constraint.its failed.
    Ideally only rows changed from last run should flow through.am i missing any settings?can you please suggest what changes need to be done to handle this situation.

  27. Hi Matt,
    This was literally the most useful and detailed tutorial I've ever followed. Everything is explained and I understand the SSIS CDC tasks a lot better now. I'm having one problem though…The second time I run the package it should have 0 rows but it does show the same rows as the first time. What am I missing? The CDC task flows are the same as in your tutorial (first is Get processing range, second is Mark processing range). Can you or anyone else here help me?

  28. IS there a way to use CDC for partial table/ Selected Columns? For example I have a table that has 50 columns. But i want to apply CDC for 10 columns which i will be needing for my destination table.

Leave a Reply

Your email address will not be published. Required fields are marked *