Learn more about SSIS Design Patterns at the 2012 PASS Summit

We managed to bring together the entire author team for the SSIS Design Patterns book to run a pre-conference event at the 2012 PASS Summit. I posted about this when it was first announced, but since our book came out today, I thought it would be a good time to post a reminder.

We’ve been meeting over the last couple of weeks, trying to decide what content to include in the session. It’s a lot harder than I thought it would be – there is just so much we could talk about. We have all of the content from the book, of course, and there’s always the desire to include brand new material as well. On top of that, we could draw on content from the pre-con that Andy, Tim, and I did last year, the Performance Design Patterns talks I put together, Andy’s framework courses, Tim’s upcoming Real World SSIS session (SQL Saturday 163), and the great stuff from Jessica and Michelle’s blogs… You see our dilemma. I have a feeling the eight-hour session is going to fly by.

I’ve always felt that having too much content is better than not having enough, and that is definitely the case here. I think the audience will be happy with what we select, and we’ll make sure that any topics that don’t make the cut are available online (in one form or another).

The event will take place on Monday, November 5th, 2012. If you’d like to join us for the day, you can register here.

Thanks!

 

SSIS Performance Design Patterns

Here are the slides (all 177 of them) for the SSIS Performance Design Patterns talk that I’ve delivered at three different pre-conference events over the past month. Each presentation was slightly different, but the core content remained the same. The version posted here is from the SQL Bits event. The recording from that event will be posted online in the next few weeks – be sure to check the site!

 

Too Many Sources in a Data Flow

This was a recent customer issue that I wanted to share.

The customer was transferring a large amount of data between two tables, without performing any transformations on it. They wanted to run multiple transfer streams in parallel, and created a way to split up their source data into six partitions. In their original design, they had six Source -> Destination combinations in a single Data Flow Task.

six source components in a data flow

This package worked fine in their development environment, but would result in a deadlock when they ran it in production. Looking at the SSIS logs, they could see that the package was hanging in the Pre-Execute phase of the Data Flow. They ran a Profiler trace and found that SQL Server was showing ASYNC_NETWORK_IO waits for five of the sessions, but the sixth was waiting for memory resources (wait type of RESOURCE_SEMAPHORE). Looking at sys.dm_exec_query_resource_semaphores (shown below) confirmed this.

Waits
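For reference, a query along these lines (not the exact one used during the customer engagement) shows which sessions are waiting on a memory grant and the state of the resource semaphore:

    -- Sessions that are still waiting on a workspace memory grant have
    -- grant_time = NULL and a growing wait_time_ms.
    SELECT
        mg.session_id,
        mg.requested_memory_kb,
        mg.grant_time,            -- NULL while the grant is pending
        mg.wait_time_ms,
        rs.target_memory_kb,
        rs.available_memory_kb,
        rs.grantee_count,
        rs.waiter_count
    FROM sys.dm_exec_query_memory_grants AS mg
    JOIN sys.dm_exec_query_resource_semaphores AS rs
        ON  mg.resource_semaphore_id = rs.resource_semaphore_id
        AND mg.pool_id = rs.pool_id
    ORDER BY mg.session_id;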

There were two factors contributing to the deadlock:

  1. The source server did not have enough memory to run six of these queries in parallel
  2. In SSIS, the Data Flow does not start executing until all components have completed the Pre-Execute phase

When the data flow runs, each source component prepares its query against the source server during the pre-execute phase. The server was able to grant enough memory for the first five queries (session_id 57 to 61), but the sixth query (session_id = 63) was told to wait. This prevented the pre-execute phase of the sixth OLE DB Source component from completing. The data flow won’t start running until the pre-execute phase completes for every component, and the source server couldn’t grant the remaining memory until the other queries completed – the result was a deadlocked package.

To resolve this issue, the customer changed the package design pattern to have a single Source -> Destination combination per data flow, with six parallel Data Flow Tasks. With this design, they still saw occasional RESOURCE_SEMAPHORE waits, but the waits didn’t last forever, as the other Data Flow Tasks were able to complete their queries independently. (They were eventually able to remove the waits altogether by optimizing their source query.)

An alternate design (which would be my preference) would have been to have a single Data Flow per package, parameterize the source query, and run six instances of the package in parallel.
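As a rough sketch of that approach (the table and column names here are hypothetical), the OLE DB Source query in the single Data Flow would be parameterized on the partition, and each of the six package executions would pass in a different value:

    -- Hypothetical source query for the parameterized package; the ?
    -- marker would be mapped to a package parameter/variable holding
    -- the partition number (1 through 6) for that execution.
    SELECT *
    FROM dbo.SourceTable
    WHERE PartitionId = ?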

Summary

Beware of doing too much in a single Data Flow! Unlike Tasks in the control flow, which run independently, the Data Flow can’t start until every component has finished its initialization in the Pre-Execute phase. This is especially important for components that can spend a long time in the Pre-Execute phase, such as a Lookup Transform in Full Cache mode. A Data Flow should only have a single source, unless the data flow is combining data from multiple sources. If your data flow has multiple data paths that do not overlap, consider using separate Data Flow Tasks (if they need to run “together”, put them in the same Sequence Container).

 

Design Pattern: Avoiding Transactions

SSIS has built-in support for transactions, but their use isn’t ideal for all scenarios. This post covers how SSIS transactions work at a high level, some of the challenges associated with them, and provides some design alternatives.

Disclaimer: Many SSIS packages use transactions without issue. Using the built-in functionality is the easiest approach, and (typically) doesn’t require a redesign of your package logic to implement. If your packages are currently using transactions, please do not feel the need to go in and change anything! If, however, you are looking to improve the performance of your packages, or are encountering the issues I describe below, consider using the design patterns presented here.

How SSIS Transactions Work

SSIS transactions are based on the Microsoft Distributed Transaction Coordinator (MSDTC). They are enabled at the Task/Container level by setting the TransactionOption property to Required. For a transaction to work at runtime, MSDTC must be enabled (and configured) on all machines involved in the transaction (i.e. the source machine, the destination, and the machine running the SSIS package… and of course, in some environments, this might all be the same machine).

How SSIS uses MSDTC

Challenges

Configuration

Configuring DTC to communicate across hosts and machine boundaries can be tricky. Timeouts and unexpected termination of connections on long-running queries are another common source of problems. Finally, the server(s) you are accessing must support DTC – this isn’t always possible when your database is running on a non-Windows platform.

Performance

Performance is probably the biggest challenge that customers face with SSIS transactions. Using any sort of transaction will impact the performance of your packages, but DTC transactions can be especially “heavy-weight”, as they can require coordination between multiple hosts.

DB Provider

To enlist in a transaction, the underlying data provider used by your connection manager must also support DTC transactions. Microsoft providers (like SQL Server Native Client and SqlClient) typically support DTC, but the same may not be true for third-party providers.

Transaction Support

My biggest complaint (and perhaps it’s something we can address in the future) is that there is no way to tell which Tasks actually support transactions. Take the following control flow:

Which Tasks support Transactions?

We’re missing some information from the UI:

  • We can’t tell if transactions are enabled
  • We can’t tell where the transaction starts
  • We can’t tell if any of the tasks have opted out of the transaction (TransactionOption = NotSupported)

To get the information we need, we have to rely on good documentation (i.e. annotations), or take a look at the TransactionOption property for each of the tasks/containers in the package.

Even if we can assume that transactions are enabled at the package level, and that none of the tasks have opted out, not every Task is able to roll back its work when a transaction fails. Using the example above:

Supports Transactions (rollback)

  • Execute SQL Task
  • Analysis Services Execute DDL Task
  • Data Flow Task

Does Not Support Transactions (rollback)

  • XML Task
  • File System Task

The Script Task may or may not support transactions, depending on how it was written.

Design Alternatives

Manual Transactions

A common alternative to using the built-in transactions is to create them yourself. This is done by setting the Connection Manager’s RetainSameConnection property to True, and creating/committing/rolling back the transaction using Execute SQL Tasks. Jamie Thomson blogged about this back in 2005, and it is a very common design pattern amongst SSIS developers today. When RetainSameConnection is set to True, the connection manager will attempt to reuse an existing connection when a task asks for one (instead of creating a new connection each time).

Using RetainSameConnection
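The pattern itself is roughly this: with RetainSameConnection set to True on the connection manager, one Execute SQL Task begins the transaction, the work runs in between, and another Execute SQL Task commits (with a rollback on the failure path). The statements are plain T-SQL:

    -- Execute SQL Task #1 (before the work)
    BEGIN TRANSACTION;

    -- ... the Data Flow Task(s) / Execute SQL Task(s) doing the actual
    --     work go here, all using the same connection manager ...

    -- Execute SQL Task #2 (success path)
    COMMIT TRANSACTION;

    -- Execute SQL Task #3 (failure path, e.g. reached via a failure
    -- precedence constraint or an OnError event handler)
    ROLLBACK TRANSACTION;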

The advantage here is that your transactions become “light-weight” and perform better, as they no longer need to be coordinated across multiple machines using DTC. However, this approach is not a silver bullet, and there are some limitations to what RetainSameConnection can do.

The behavior of RetainSameConnection is determined by the underlying provider – there is very little logic in the Connection Manager itself to handle connection reuse. Not all providers support this functionality (although it works fine with the SQL Server providers). Also note that not all Connection Managers support RetainSameConnection. File Connection Managers, for example, simply return the path of the file or directory they point to – they don’t actually open a “connection object”.

RetainSameConnection might not work if a connection manager is being used multiple times in parallel. You usually encounter this issue when you are trying to SELECT and INSERT at the same time. Some common examples:

  1. Running multiple Execute SQL Tasks in parallel using the same connection manager
    • Solution: Use separate connection managers, or run the tasks serially
  2. Both the Source and Destination component in your Data Flow are using the same Connection Manager
    • Solution: Use a separate connection manager for the source component
  3. The Destination and another transform (usually Lookup, or Script) are using the same Connection Manager
    • Solution: Use a Full Cache Lookup or staging tables

The last thing to note is that if your data flow is writing to multiple destinations, you will need to use multiple connection managers. Be sure to start a separate transaction for each one.

Note: If you are using this pattern and encountering issues, try using SQL Profiler to detect if your queries are in fact using the same SPID.
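If you don’t have Profiler handy, one simple check (a suggestion, not something from the original troubleshooting steps) is to log the session id from each task that shares the connection manager and confirm the values match:

    -- Run this from each Execute SQL Task (or source/lookup query) that
    -- shares the connection manager; if RetainSameConnection is working,
    -- every task should report the same session id.
    SELECT @@SPID AS current_session_id;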

Design for Restartability

Instead of using transactions, consider designing your packages for restartability. This doesn’t necessarily mean the use of Checkpoints (although you can use them if you are so inclined) – it means that your packages take any corrective actions they need to ensure that they can be safely run multiple times. The type of actions you take will depend on your package’s logic. For example, a package that loads multiple flat files from a directory could move/rename a file once it has been successfully processed to avoid loading it a second time if the package needs to be restarted. This approach requires you to put more effort into the design of your package, but will give you a much more robust ETL solution.

There will be scenarios where the use of transactions is required, but it is usually possible to limit their scope. This typically involves splitting up your ETL logic across multiple tasks (or packages), staging your data in between each step, and enabling transactions only when absolutely necessary. For example, you may create a Data Flow that pulls data from a source, performs the required transformations, and then loads the data into a staging table. You can then use another Data Flow (or an Execute SQL Task, if the staging table is in the destination database) to insert/merge the data into the final destination. With this approach, you only need to use a transaction for the final insertion.
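As a sketch of that final step (all table and column names here are made up), the insert/merge from the staging table is the only piece that needs to run inside a transaction:

    -- Final step only: move the already-staged rows into the destination.
    BEGIN TRANSACTION;

    MERGE dbo.DestinationTable AS dst        -- hypothetical destination
    USING dbo.StagingTable AS src            -- hypothetical staging table
        ON dst.BusinessKey = src.BusinessKey
    WHEN MATCHED THEN
        UPDATE SET dst.SomeColumn = src.SomeColumn
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, SomeColumn)
        VALUES (src.BusinessKey, src.SomeColumn);

    COMMIT TRANSACTION;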

Handling Failed Inserts when using Fast Load

A useful pattern when designing for restartability is to isolate the specific rows that fail to insert into your destination. If you are using Bulk / Fast Load at your destination (which is recommended if you are at all concerned with performance) and one of the rows fails to insert, it may fail the entire batch (the exact behavior depends on the provider and destination you are using). To isolate the failed row, you can add a second destination to the error output that does row-by-row insertions (i.e. does not use Bulk / Fast Load). You can then redirect the error output of the second destination to a flat file / staging table so it can be manually inspected and corrected.

Handling Failed Inserts when using Fast Load

Summary

SSIS has built-in support for transactions, but to get the best performance, you might want to consider an alternative design. I highly recommend designing your packages for restartability, and limiting the use and scope of transactions in your packages.

I hope you found this blog post useful. I’d love to hear about specific scenarios where you are using transactions, and alternative package designs you might have used to implement your own form of restartability.