SSIS has built-in support for transactions, but their use isn’t ideal for all scenarios. This post covers how SSIS transactions work at a high level, some of the challenges associated with them, and provides some design alternatives.
Disclaimer: Many SSIS packages use transactions without issue. Using the built-in functionality is the easiest approach, and (typically) doesn’t require a redesign of your package logic to implement. If your packages are currently using transactions, please do not feel the need to go in and change anything! If, however, you are looking to improve the performance of your packages, or are encountering the issues I describe below, consider using the design patterns presented here.
How SSIS Transactions Work
SSIS transactions are based on the Microsoft Distributed Transaction Coordinator (MSDTC). They are enabled at the Task/Container level by setting the TransactionOption property to Required. For a transaction to work at runtime, MSDTC must be enabled (and configured) on all machines involved in the transaction (i.e. the source machine, the destination, and the machine running the SSIS package… and of course, in some environments, this might all be the same machine).
Configuring DTC to communicate across hosts and machine boundaries can be tricky. Timeouts and unexpected termination of connections on long running queries are another common source of problems. Finally, the server(s) you are accessing must support DTC – this isn’t always possible when your database is using a non-Windows platform.
Performance is probably the biggest challenge that customers face with SSIS transactions. Using any sort of transaction will impact the performance of your packages, but DTC transactions can be especially “heavy-weight”, as they can require coordination between multiple hosts.
To enlist in a transaction, the underlying data provider used by your connection manager must also support DTC transactions. Microsoft providers (like SQL Native Client and SqlClient) typically support DTC, but the same may not be true for third party providers.
My biggest complaint (and perhaps it’s something we can address in the future) is that there is no way to tell which Tasks actually support transactions. Take the following control flow:
We’re missing some information from the UI:
- We can’t tell if transactions are enabled
- We can’t tell where the transaction starts
- We can’t tell if any of the tasks have opted out of the transaction (TransactionOption = NotSupported)
To get the information we need, we have to rely on good documentation (i.e. annotations), or take a look at the TransactionOption property for each of the tasks/containers in the package.
Even if we can assume that transactions are enabled at the package level, and none of the tasks have opted out, not every Task is able to rollback their work when a transaction fails. Using the example above:
Supports Transactions (rollback)
- Execute SQL Task
- Analysis Services Execute DDL Task
- Data Flow Task
Does Not Support Transactions (rollback)
- XML Task
- File System Task
The Script Task may or may not support transactions, depending on how it was written.
A common alternative to using the built-in transactions is to create them yourself. This is done by setting the Connection Manager’s RetainSameConnection property to True, and creating/committing/rolling back the transaction using Execute SQL Tasks. Jamie Thomson blogged about this back in 2005, and is a very common design pattern amongst SSIS developers today. When RetainSameConnection is set to True, the connection manager will attempt to reuse an existing connection when a task asks for one (instead of creating a new connection each time).
The advantage here is that your transactions become “light-weight” and perform better, as they no longer need to be coordinated across multiple machines using DTC. However, this approach is not a silver bullet, and there are some limitations to what RetainSameConnection can do.
The behavior of RetainSameConnection is determined by the underlying provider – there is very little logic in the Connection Manager itself to handle connection reuse. Not all providers will be able to support this functionality (although it should work fine with the SQL Server providers). Also note that not all Connection Managers support RetainSameConnection, either. File Connection Managers, for example, simply return a string to the file or directory they are pointing to – they don’t actually open up a “connection object”.
RetainSameConnection might not work if the a connection manager is being used multiple times in parallel. You usually encounter this issue if you are trying to SELECT and INSERT at the same time. Some common examples:
- Running multiple Execute SQL Tasks in parallel using the same connection manager
- Solution: Use separate connection managers, or run the tasks serially
- Solution: Use a separate connection manager for the source component
- Solution: Use a Full Cache Lookup or staging tables
The last thing to note is that if your data flow is writing to multiple destinations, you will need to use multiple connection managers. Be sure to start a separate transaction for each one.
Note: If you are using this pattern and encountering issues, try using SQL Profiler to detect if your queries are in fact using the same SPID.
Design for Restartability
Instead of using transactions, consider designing your packages for restartability. This doesn’t necessarily mean the use of Checkpoints (although you can use them if you are so inclined) – it means that your packages take any corrective actions they need to ensure that they can be safely run multiple times. The type of actions you take will depend on your package’s logic. For example, a package that loads multiple flat files from a directory could move/rename a file once it has been successfully processed to avoid loading it a second time if the package needs to be restarted. This approach requires you to put more effort into the design of your package, but will give you a much more robust ETL solution.
There will be scenarios where the use of transactions is required, but it is usually possible to limit their scope. This typically involves splitting up your ETL logic across multiple tasks (or packages), staging your data in between each step, and enabling transactions only when absolute necessary. For example, you may create a Data Flow that pulls data from a source, performs the required transformations, and then loads the data into a staging table. You can then use another Data Flow (or an Execute SQL Task if the staging table is area in the destination database) to insert/merge the data into the final destination. With this approach, you will only need to use a transaction for the final insertion.
Handling Failed Inserts when using Fast Load
A useful pattern when designing for restartability is to isolate the specific rows that fail to insert into your destination. If you are using Bulk / Fast Load at your destination (which is recommended if you are at all concerned with performance) and one of the rows fails to insert, it may fail the entire batch (the logic depends on the provider and destination you are using). To isolate the failed row, you can add a second destination to the error output that does row by row insertions (i.e. does not use Bulk / Fast Load). You can then redirect the error output of the second destination to a flat file / staging table so it can be manually inspected and corrected.
SSIS has built-in support for transactions, but to get the best performance, you might want to consider an alternative design. I highly recommend designing your packages for restartability, and limiting the use and scope of transactions in your packages.
I hope you found this blog post useful. I’d love to hear about specific scenarios where you are using transactions, and alternative package designs you might have used to implement your own form of restartability.