CDC in SSIS for SQL Server 2012

SQL Server 2012 introduces new components that make it easier to do Change Data Capture (CDC) using SSIS. This blog post provides a quick walkthrough of how to use them.

Note: The documentation for the CDC components is not in Books Online yet. It will be appearing in the RTM refresh.

New Task and Components

CDC Control Task

The CDC Control task is used to control the life cycle of change data capture (CDC) packages. It handles the synchronization of the CDC package with the initial load package, as well as the management of the Log Sequence Number (LSN) ranges that are processed in each run of a CDC package. In addition, the CDC Control task deals with error scenarios and recovery.

CDC Source

The CDC source reads a range of change data from CDC change tables and delivers the changes downstream to other SSIS components.

CDC Splitter

The CDC splitter splits a single flow of change rows from a CDC Source component into different data flows for Insert, Update and Delete operations. It is essentially a “smart” Conditional Split transform that automatically handles the standard values of the __$operation column.

Walkthrough

Database Setup

For sample data, we will create a new database (CDCTest), and select a subset of rows from the AdventureWorksDW DimCustomer table into a sample table (DimCustomer_CDC). This will become the Source table for this demo.

We then enable CDC on the database, and create a capture instance for the DimCustomer_CDC table.
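
A minimal sketch of this setup in T-SQL might look like the following. The AdventureWorksDW database and table names come from the walkthrough; the size of the subset (TOP 500), the constraint name, and the CROSS JOIN trick used to stop SELECT INTO from copying the IDENTITY property are illustrative choices.

    CREATE DATABASE CDCTest;
    GO

    USE CDCTest;
    GO

    -- Copy a subset of rows into the sample source table. The CROSS JOIN against a
    -- one-row derived table stops SELECT INTO from inheriting the IDENTITY property
    -- of CustomerKey, which keeps the later insert scripts in this walkthrough simple.
    SELECT TOP (500) c.*
    INTO dbo.DimCustomer_CDC
    FROM AdventureWorksDW.dbo.DimCustomer AS c
         CROSS JOIN (SELECT 1 AS dummy) AS d
    ORDER BY c.CustomerKey;
    GO

    -- Net change tracking requires a primary key (or a named unique index)
    ALTER TABLE dbo.DimCustomer_CDC
        ADD CONSTRAINT PK_DimCustomer_CDC PRIMARY KEY (CustomerKey);
    GO

    -- Enable CDC on the database, then create a capture instance for the table
    EXEC sys.sp_cdc_enable_db;
    GO

    EXEC sys.sp_cdc_enable_table
        @source_schema        = N'dbo',
        @source_name          = N'DimCustomer_CDC',
        @role_name            = NULL,
        @supports_net_changes = 1;
    GO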

We can see that a number of tables have been added under the cdc schema, and that SQL agent jobs have been created to capture changes being made to this table.

[Images: CDC tables, CDC SQL Agent jobs]

For the Destination, we’ll create a separate table – DimCustomer_Destination – with the same structure as the Source.
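
One quick way to do this is to copy the schema of the source table without any rows – a sketch (note that SELECT INTO does not copy the primary key, so add one if you need it on the destination):

    -- Create an empty copy of the source table to act as the destination
    SELECT *
    INTO dbo.DimCustomer_Destination
    FROM dbo.DimCustomer_CDC
    WHERE 1 = 0;   -- copies the structure only, no rows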

In real life this would be in a separate database, and usually on a completely different server (otherwise, why are you mirroring the changes?), but for the purposes of this walkthrough, we’ll keep it all together.

We’re ready to start consuming changes with SSIS.

SSIS Packages

Our processing logic will be split into two packages – an Initial Load package that will read all of the data in the source table, and an Incremental Load package that will process change data on subsequent runs.

Initial Load

This package will only be run once, and handles the initial load of data in the source table (DimCustomer_CDC). The package uses the following logic:

  1. Use the CDC Control Task to mark the initial load start LSN
  2. Transfer all of the data from the source table into our destination table
  3. Use the CDC Control Task to mark the initial load end LSN

Package creation steps:

Create a new SSIS package

Add a CDC Control Task. Double click the Control Task to bring up the editor.

  • Add a new ADO.NET connection manager for the Source database
  • Set CDC Control Operation to Mark initial load start
  • Create a new package variable (CDC_State) to hold the CDC state information.
  • Set the connection manager for the Destination database
  • Create a table for storing the state ([cdc_states]). This table will be used to track the CDC load information, so that you only pick up new changes each time the incremental load package is run. It will be created in the Destination database.
  • Set the state name (CDC_State). This value acts as a key for the CDC state information. Packages that are accessing the same CDC data should be using a common CDC state name.

CDC Control Task Editor

Add a Data Flow Task, and connect it to the CDC Control Task

  • Configure the Data Flow task to transfer all of the data from the Source to the Destination

Add a second CDC Control Task. Connect the success constraint of the Data Flow Task to it.

  • Configure the second CDC Control Task with the same settings as the first one, except the CDC Control Operation should be set to Mark initial load end.

CDC Control Task Editor

The package will now look like this:

[Image: Initial Load package control flow]

When we run the package, all of the data currently in the Source table will be transferred to the Destination, and the initial CDC state markers will be created. If we select from the cdc_states table, we can see that there is now a “CDC_State” entry. Note, the state entry is an encoded string that is used by the CDC components – you should not have to edit or deal with it directly.

[Image: contents of the cdc_states table]

Incremental Load

This package will be run every time we want to grab the latest changes from our Source table. It stores the CDC state on each run, ensuring that only new changes are picked up the next time the package runs. It will use the following logic:

  1. Create staging tables for updated and deleted rows (so we can process the changes in a batch – more about that below)
  2. Use a CDC Control Task to retrieve the CDC state from the destination table
  3. Use a CDC Source to retrieve our change data
  4. Use a CDC Splitter transform to redirect the rows based on their operation (New, Updated, and Deleted)
  5. Insert the new rows into the Destination table
  6. Stage the Updated and Deleted rows
  7. Process the Updated and Deleted rows using Execute SQL Tasks
  8. Use a CDC Control Task to update the CDC state

Package creation steps:

Add an Execute SQL Task to create staging tables

  • Create a connection manager for the Destination database (set the ConnectionType to ADO.NET to reuse the same connection manager used by the CDC Control Task)
  • Enter the SQL statements to create two staging tables that match the Destination table. For example:
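
A sketch of statements that create empty copies of the Destination table (the staging table names match the ones used later in the data flow):

    -- Staging table for rows the CDC Splitter flags as updates
    SELECT *
    INTO dbo.stg_DimCustomer_UPDATES
    FROM dbo.DimCustomer_Destination
    WHERE 1 = 0;

    -- Staging table for rows the CDC Splitter flags as deletes
    SELECT *
    INTO dbo.stg_DimCustomer_DELETES
    FROM dbo.DimCustomer_Destination
    WHERE 1 = 0;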

Add a CDC Control Task. Connect the Execute SQL task to the CDC Control Task

  • Create a connection manager for the Source database
  • Set the CDC Control Operation to Get processing range
  • Create a new CDC state variable (CDC_State)
  • Create a connection manager for the Destination database
  • Select the state table (this was created by the Initial Load package) – [dbo].[cdc_states]
  • Set the State name – this must match what was used in the Initial Load package (CDC_State)

[Image: CDC Control Task Editor]

Add a Data Flow Task. Connect it to the CDC Control Task.

Add a CDC Source component

  • Set the Connection Manager to the Source database
  • Select the source table (DimCustomer_CDC)
  • Set the CDC processing mode to Net
  • Select the CDC_State variable
  • Click the Columns tab to make sure we’re pulling back all of the right information, then click OK.

[Image: CDC Source Editor]
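
In Net mode, the CDC source reads its data through the net changes table-valued function that was generated when the capture instance was created. If you want to inspect the pending changes yourself, a query along these lines can be run against the Source database (dbo_DimCustomer_CDC is the default capture instance name generated from the schema and table name):

    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_DimCustomer_CDC');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    -- Net changes between the two LSNs, including the __$operation column
    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_DimCustomer_CDC(@from_lsn, @to_lsn, N'all');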

Add a CDC Splitter transform

Add an ADO.NET Destination – rename it to “New rows”

  • Connect the InsertOutput of the CDC Splitter to the “New rows” destination
  • Double click the “New rows” destination to bring up its editor
  • Set the Destination connection manager, and select the main destination table (DimCustomer_Destination)
  • Click the Mappings tab. The columns should automatically match by name. The CDC columns (the ones starting with __$) can be ignored

Add two more ADO.NET Destinations, mapping the DeleteOutput to the stg_DimCustomer_DELETES table, and the UpdateOutput to stg_DimCustomer_UPDATES. We will update the final Destination table using batch SQL statements after this data flow. An alternative design here would be to use an OLE DB Command transform to perform the updates and deletes. The OLE DB Command approach has some performance problems, though, as the transform operates on a row by row basis (i.e. it issues one query per row).

[Image: Incremental Load data flow]

Back in the Control Flow, add two Execute SQL tasks. These tasks will perform the batch update/delete using the data we loaded into the staging tables. The queries look like this (note, I took columns out of the update statement to keep things short – normally you’d include all of the columns here):
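
A sketch of the two statements, assuming CustomerKey is the join key between the staging tables and the destination, and showing only a couple of columns in the UPDATE:

    -- Apply staged updates to the destination in a single batch
    UPDATE dest
    SET    dest.EmailAddress = stg.EmailAddress,
           dest.YearlyIncome = stg.YearlyIncome
    FROM   dbo.DimCustomer_Destination AS dest
           INNER JOIN dbo.stg_DimCustomer_UPDATES AS stg
               ON stg.CustomerKey = dest.CustomerKey;

    -- Remove rows that were staged as deletes
    DELETE dest
    FROM   dbo.DimCustomer_Destination AS dest
           INNER JOIN dbo.stg_DimCustomer_DELETES AS stg
               ON stg.CustomerKey = dest.CustomerKey;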

Add a CDC Control Task. It should have the same settings as the first CDC Control Task in the package, except the CDC control operation is Mark processed range.

Finally, add an Execute SQL Task to drop the staging tables. Alternatively, you can leave the staging tables in place and simply truncate them.

Your package should look like this:

[Image: Incremental Load package control flow]

Running the Incremental Load Package

If we run the Incremental Load package at this point, it should run successfully, but not transfer any rows. That’s because we haven’t made any changes yet to the Source table. Let’s do that now by running the following script against the Source table:
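
Something along these lines works – a handful of updates plus a batch of inserts pulled from the AdventureWorksDW rows that were left out of the initial subset (the key range and the column being updated are arbitrary choices):

    USE CDCTest;
    GO

    -- Update a few existing customers
    UPDATE dbo.DimCustomer_CDC
    SET    NumberCarsOwned = NumberCarsOwned + 1
    WHERE  CustomerKey BETWEEN 11000 AND 11010;

    -- Insert the AdventureWorksDW customers that were not part of the initial load
    INSERT INTO dbo.DimCustomer_CDC
    SELECT src.*
    FROM   AdventureWorksDW.dbo.DimCustomer AS src
    WHERE  NOT EXISTS (SELECT 1
                       FROM  dbo.DimCustomer_CDC AS tgt
                       WHERE tgt.CustomerKey = src.CustomerKey);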

If we enable a Data Viewer in the Incremental Load package and run it, we’ll see that the CDC Source picks up all of the rows we’ve changed. We can see that some of the rows are __$operation = 4 (update), while the rest are 2 (new rows).

[Image: Data Viewer showing the changed rows]

When the package completes, we see that the data flow moved a total of 17,995 rows (11 updates, and the rest are inserts).

[Image: data flow execution results]

Because the CDC Control Task updated the LSN values stored in the CDC state table, if we run the package a second time (without making any changes to the source table), we see that no rows get transferred in the data flow.

[Image: data flow run with no rows transferred]

Wrap Up

I hope you found this walkthrough of the new CDC components in SQL Server 2012 helpful. I will continue posting more information about the CDC Components in follow up posts. Please let me know if you have any specific questions / topics you’d like me to describe further.

Design Pattern: Avoiding Transactions

SSIS has built-in support for transactions, but their use isn’t ideal for all scenarios. This post covers how SSIS transactions work at a high level, some of the challenges associated with them, and provides some design alternatives.

Disclaimer: Many SSIS packages use transactions without issue. Using the built-in functionality is the easiest approach, and (typically) doesn’t require a redesign of your package logic to implement. If your packages are currently using transactions, please do not feel the need to go in and change anything! If, however, you are looking to improve the performance of your packages, or are encountering the issues I describe below, consider using the design patterns presented here.

How SSIS Transactions Work

SSIS transactions are based on the Microsoft Distributed Transaction Coordinator (MSDTC). They are enabled at the Task/Container level by setting the TransactionOption property to Required. For a transaction to work at runtime, MSDTC must be enabled (and configured) on all machines involved in the transaction (i.e. the source machine, the destination, and the machine running the SSIS package… and of course, in some environments, this might all be the same machine).

How SSIS uses MSDTC

Challenges

Configuration

Configuring DTC to communicate across machine and network boundaries can be tricky. Timeouts and unexpected termination of connections on long running queries are another common source of problems. Finally, the server(s) you are accessing must support DTC – this isn’t always possible when your database is running on a non-Windows platform.

Performance

Performance is probably the biggest challenge that customers face with SSIS transactions. Using any sort of transaction will impact the performance of your packages, but DTC transactions can be especially “heavy-weight”, as they can require coordination between multiple hosts.

DB Provider

To enlist in a transaction, the underlying data provider used by your connection manager must also support DTC transactions. Microsoft providers (like SQL Native Client and SqlClient) typically support DTC, but the same may not be true for third party providers.

Transaction Support

My biggest complaint (and perhaps it’s something we can address in the future) is that there is no way to tell which Tasks actually support transactions. Take the following control flow:

Which Tasks support Transactions?

We’re missing some information from the UI:

  • We can’t tell if transactions are enabled
  • We can’t tell where the transaction starts
  • We can’t tell if any of the tasks have opted out of the transaction (TransactionOption = NotSupported)

To get the information we need, we have to rely on good documentation (i.e. annotations), or take a look at the TransactionOption property for each of the tasks/containers in the package.

Even if we can assume that transactions are enabled at the package level, and that none of the tasks have opted out, not every Task is able to roll back its work when a transaction fails. Using the example above:

Supports Transactions (rollback)

  • Execute SQL Task
  • Analysis Services Execute DDL Task
  • Data Flow Task

Does Not Support Transactions (rollback)

  • XML Task
  • File System Task

The Script Task may or may not support transactions, depending on how it was written.

Design Alternatives

Manual Transactions

A common alternative to using the built-in transactions is to create them yourself. This is done by setting the Connection Manager’s RetainSameConnection property to True, and creating/committing/rolling back the transaction using Execute SQL Tasks. Jamie Thomson blogged about this back in 2005, and it is a very common design pattern amongst SSIS developers today. When RetainSameConnection is set to True, the connection manager will attempt to reuse an existing connection when a task asks for one (instead of creating a new connection each time).

Using RetainSameConnection
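
The Execute SQL Tasks in this pattern typically contain nothing more than plain transaction statements, all run against the connection manager that has RetainSameConnection set to True – a sketch, with the task names only as suggestions:

    -- "Begin Transaction" Execute SQL Task (runs first)
    BEGIN TRANSACTION;

    -- ... the real work of the package (Data Flows, other Execute SQL Tasks)
    -- executes here, all re-using the same connection ...

    -- "Commit Transaction" Execute SQL Task (wired to the success path)
    COMMIT TRANSACTION;

    -- "Rollback Transaction" Execute SQL Task (wired to the failure path or an OnError handler)
    ROLLBACK TRANSACTION;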

The advantage here is that your transactions become “light-weight” and perform better, as they no longer need to be coordinated across multiple machines using DTC. However, this approach is not a silver bullet, and there are some limitations to what RetainSameConnection can do.

The behavior of RetainSameConnection is determined by the underlying provider – there is very little logic in the Connection Manager itself to handle connection reuse. Not all providers will be able to support this functionality (although it should work fine with the SQL Server providers). Also note that not all Connection Managers support RetainSameConnection. File Connection Managers, for example, simply return the path of the file or directory they point to – they don’t actually open up a “connection object”.

RetainSameConnection might not work if a connection manager is being used multiple times in parallel. You usually encounter this issue if you are trying to SELECT and INSERT at the same time. Some common examples:

  1. Running multiple Execute SQL Tasks in parallel using the same connection manager
    • Solution: Use separate connection managers, or run the tasks serially
  2. Both the Source and Destination component in your Data Flow are using the same Connection Manager
    • Solution: Use a separate connection manager for the source component
  3. The Destination and another transform (usually Lookup, or Script) are using the same Connection Manager
    • Solution: Use a Full Cache Lookup or staging tables

The last thing to note is that if your data flow is writing to multiple destinations, you will need to use multiple connection managers. Be sure to start a separate transaction for each one.

Note: If you are using this pattern and encountering issues, try using SQL Profiler to detect if your queries are in fact using the same SPID.

Design for Restartability

Instead of using transactions, consider designing your packages for restartability. This doesn’t necessarily mean the use of Checkpoints (although you can use them if you are so inclined) – it means that your packages take any corrective actions they need to ensure that they can be safely run multiple times. The type of actions you take will depend on your package’s logic. For example, a package that loads multiple flat files from a directory could move/rename a file once it has been successfully processed to avoid loading it a second time if the package needs to be restarted. This approach requires you to put more effort into the design of your package, but will give you a much more robust ETL solution.

There will be scenarios where the use of transactions is required, but it is usually possible to limit their scope. This typically involves splitting up your ETL logic across multiple tasks (or packages), staging your data in between each step, and enabling transactions only when absolutely necessary. For example, you may create a Data Flow that pulls data from a source, performs the required transformations, and then loads the data into a staging table. You can then use another Data Flow (or an Execute SQL Task if the staging area is in the destination database) to insert/merge the data into the final destination. With this approach, you will only need to use a transaction for the final insertion.
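
For example, the final step might be a single Execute SQL Task whose statement merges the staged rows into the destination inside one local transaction – a sketch, with illustrative table and column names (stg_DimCustomer, DimCustomer_Destination, CustomerKey, EmailAddress, YearlyIncome):

    BEGIN TRANSACTION;

    -- Merge the staged rows into the final destination table
    MERGE dbo.DimCustomer_Destination AS dest
    USING dbo.stg_DimCustomer AS stg
          ON dest.CustomerKey = stg.CustomerKey
    WHEN MATCHED THEN
        UPDATE SET dest.EmailAddress = stg.EmailAddress,
                   dest.YearlyIncome = stg.YearlyIncome
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerKey, EmailAddress, YearlyIncome)
        VALUES (stg.CustomerKey, stg.EmailAddress, stg.YearlyIncome);

    COMMIT TRANSACTION;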

Handling Failed Inserts when using Fast Load

A useful pattern when designing for restartability is to isolate the specific rows that fail to insert into your destination. If you are using Bulk / Fast Load at your destination (which is recommended if you are at all concerned with performance) and one of the rows fails to insert, it may fail the entire batch (the logic depends on the provider and destination you are using). To isolate the failed row, you can add a second destination to the error output that does row by row insertions (i.e. does not use Bulk / Fast Load). You can then redirect the error output of the second destination to a flat file / staging table so it can be manually inspected and corrected.

[Image: Handling failed inserts when using Fast Load]

Summary

SSIS has built-in support for transactions, but to get the best performance, you might want to consider an alternative design. I highly recommend designing your packages for restartability, and limiting the use and scope of transactions in your packages.

I hope you found this blog post useful. I’d love to hear about specific scenarios where you are using transactions, and alternative package designs you might have used to implement your own form of restartability.

Setting a Default Logging Level in the SSIS Catalog

One of the powerful features of the SSIS Catalog is that it can automatically capture logs for packages run on the server. The logging level is set to Basic by default, and can be changed on the Advanced tab of the run UI when the package is executed (as shown below); the same option is available when scheduling the package through SQL Agent.

Logging level on Execute Package UI

You can set a server-wide default logging level on the Catalog properties page (a T-SQL alternative is shown below).

  • Connect to the SQL Server instance using SSMS
  • Expand the Integration Services Catalogs node
  • Right click on the SSISDB node, select Properties
  • Change the Server-wide Default Logging Level setting (in the Operations Log category)

Catalog Properties
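
If you prefer to script the change, the same setting can be changed with T-SQL via the catalog.configure_catalog stored procedure in SSISDB – a sketch (SERVER_LOGGING_LEVEL accepts 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose):

    -- Change the server-wide default logging level without going through SSMS
    EXEC SSISDB.catalog.configure_catalog
         @property_name  = N'SERVER_LOGGING_LEVEL',
         @property_value = 1;   -- Basic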

See What Events are Included in the SSIS Catalog Log Levels for more information.

What Events are Included in the SSIS Catalog Log Levels

Logging Levels

Packages run through the SSIS Catalog can automatically have their logs captured in SQL Server 2012. The SSIS Catalog provides four different log levels – None, Basic, Performance, Verbose. The default log level is Basic.

The list below shows which SSIS events are captured in the [catalog].[operation_messages] view at each of the four logging levels.

  • None – Events: none. Captures enough information to say whether the package succeeded or failed, but does not log any messages to the [operation_messages] view.
  • Basic – Events: OnPreValidate, OnPostValidate, OnPreExecute, OnPostExecute, OnInformation, OnWarning, OnError. Captures similar information to what is displayed on the console by default when a package is run with dtexec.
  • Performance – Events: OnWarning, OnError. This level is required to track performance information for the run (how long it took to run each task / component, etc.), but it does not log all of the events captured by the Basic level.
  • Verbose – Events: all events, including performance and diagnostic events. This logging level can introduce some overhead on performance – see details below.

Note that additional information is captured and stored in other tables / views in the SSIS Catalog. For an overview of the Views related to package execution, see Jamie Thomson’s blog post – SSIS Logging in Denali.

When to Use Each Logging Level

None

While the None logging level provides slightly better performance than Basic, I don’t expect that it will be used too often in production environments. This logging level does not capture error and status messages, which makes diagnosing problems difficult. However, if you have Packages Never Fail ™, this logging level might be for you! (Hey, you can always re-run the package with a higher level if something goes wrong…)

Basic

This is the recommended logging level for every day use. It captures important events (like errors and warnings), as well as enough progress information to display what the currently active task is, and how long each task has taken (internally this information is stored in the [catalog].[executable_statistics] view).
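
For example, a quick way to see task-level timings for a given execution is to query that view directly – a sketch (substitute the execution ID you are interested in):

    -- Task-level durations captured at the Basic logging level
    SELECT execution_path,
           start_time,
           end_time,
           execution_duration,   -- duration in milliseconds
           execution_result
    FROM   SSISDB.catalog.executable_statistics
    WHERE  execution_id = 12345  -- the execution you are investigating
    ORDER BY start_time;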

Performance

The Performance log level should be used when you are doing benchmarking and performance tuning for your packages. While it actually logs fewer messages to the [catalog].[operation_messages] view than Basic, it captures a lot more events internally to analyze the performance of the data flow components. As a result, there is a bit more overhead during execution – packages run with Basic will actually run a little faster than with Performance (in this case Performance means “give me all of the performance details”, not “run my packages as fast as you can”).

Verbose

The Verbose logging level captures a lot of events (especially when using OLE DB connections, as they generate many DIAGNOSTIC events). Verbose should only be used when trying to debug or diagnose package failures.

Custom Log Events

The only log level that captures custom log events (i.e. special events that are raised with custom / 3rd party SSIS extensions or scripts) is Verbose. Due to the overhead that the Verbose level introduces, I recommend creating your own custom reports if you need to capture these custom events.
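
A simple starting point for such a report is to query the raw event data for an execution and filter down to the non-standard event names – a sketch, assuming your custom events can be distinguished by their event_name:

    -- Custom / third-party events for one execution (only captured at the Verbose level)
    SELECT message_time,
           event_name,
           message_source_name,
           message
    FROM   SSISDB.catalog.event_messages
    WHERE  operation_id = 12345   -- the execution / operation ID of interest
           AND event_name NOT IN (N'OnPreValidate', N'OnPostValidate',
                                  N'OnPreExecute', N'OnPostExecute',
                                  N'OnInformation', N'OnWarning', N'OnError',
                                  N'Diagnostic')
    ORDER BY message_time;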

Upcoming Book – SSIS Design Patterns

Over the past few months I’ve been working on a book for SQL Server 2012 entitled – SSIS Design Patterns. I’m co-authoring the book with a fantastic bunch of SQL Server MVPs – Tim Mitchell, Jessica Moss, Michelle Ufford, and Andy Leonard. Details are now available on the Apress website and Amazon.

SSIS Design Patterns

SSIS Design Patterns is a book of recipes for SQL Server Integration Services (SSIS). Design patterns in the book show how to solve common problems encountered when developing data integration solutions. Because you do not have to build the code from scratch each time, using design patterns improves your efficiency as an SSIS developer. In SSIS Design Patterns, we take you through several of these patterns in detail, providing the technical details of the resolution.
SSIS Design Patterns does not focus on the problems to be solved; instead, the book delves into why particular problems should be solved in certain ways. You’ll learn more about SSIS as a result, and you’ll learn by practical example. Where appropriate, SSIS Design Patterns provides examples of alternative patterns and discusses when and where they should be used. Highlights of the book include sections on ETL Instrumentation, SSIS Frameworks, and Dependency Services.

  • Takes you through solutions to several common data integration challenges
  • Demonstrates new features in SQL Server 2012 Integration Services
  • Teaches SSIS using practical examples

What you’ll learn

  • Load data from flat file formats
  • Explore patterns for executing SSIS packages
  • Discover a pattern for loading XML data
  • Migrate SSIS packages through your application lifecycle without editing connections
  • Take advantage of SSIS 2012 Dependency Services
  • Build an SSIS Framework to support your application needs

Who this book is for

SSIS Design Patterns is for the data integration developer who is ready to take their SQL Server Integration Services (SSIS) skills to a more efficient level. It’s for the developer interested in locating a previously-tested solution quickly. SSIS Design Patterns is a great book for ETL (extract, transform, and load) specialists and those seeking practical uses for new features in SQL Server 2012 Integration Services. It’s an excellent choice for business intelligence and data warehouse developers.

Welcome!

I started an SSIS blog (“SSIS Stuff”) shortly after joining the SQL Server Integration Services product team back in 2006. The blog was eventually rebranded as the SSIS Team Blog, and is now one of the more popular blogs on MSDN. There was only one problem – only one person could post new articles to the site. As a result, I always treated it as “my blog”. Now that the platform allows multiple authors, I decided to move “my blog” to a site I can truly make my own. I’ll still be occasionally posting to the SSIS Team Blog, but most of the content will be coming from other members of the team.

I’ll be using this site to provide SSIS information and tutorials, as well as a place where I can post content for the SSIS sessions I present (which are becoming more and more frequent).

Enjoy!