Consistent Design Patterns and SSIS Frameworks

Question: How do I make sure everyone at my company is using the same SSIS framework and design patterns?


SSIS Design Patterns and frameworks are one of my favorite things to talk (and write) about. A recent search on SSIS frameworks highlighted just how many different frameworks there are out there, and making sure that everyone at your company is following what you consider to be best practices can be a challenge.

In an ideal scenario, before you do any SSIS development, your team has an experienced SSIS consultant/trainer come in. They provide you with a framework that works for your environment and establishes a set of best practices that your team sticks to for all future package development, until the end of time.

This rarely happens.

It’s Too Easy

SSIS has a learning curve, but is fairly easy to get started with. You can throw together a simple workflow & ETL running on your local machine in a short amount of time. It starts to get a bit more complicated when you need to deploy that solution to another machine, and even more so when you start adding multiple developers to the mix. This is where the best practices start to be required.

The downside to SSIS being easy enough for people to learn on their own is many people come up with their own way of doing things. Given that there are usually multiple ways to accomplish common ETL tasks, your company can end up with a mix of “best practices”, “frameworks” and ways of using SSIS.

Did you know there are at least 5 different SSIS frameworks used internally at Microsoft? Each one has its own take on how to do configuration, dynamic package execution/generation, and logging. I wouldn’t be surprised if there are a number of others that I’m not aware of.

We improved this in SQL 2012 with the addition of the SSIS Catalog (SSISDB). In 2005/2008, the biggest differences between SSIS projects would typically be the way they do deployment & configuration. The Catalog was meant to provide a “default” best practice, and remove the need to rely on their own custom frameworks. There’s still more we could do here, but given the adoption rates we’re seeing, it was definitely a step in the right direction.

When we asked our MVPs to create whitepapers for SQL 2008, we ended up with four configuration related articles. Each one does things in a slightly different way.

So the question becomes, how do I make sure that the SSIS developers in my company are following best practices and using the framework we have in place?

Training plays a big role, here. If you show your developers that there is a better way of doing things, they should be willing to adopt it. That said, many times you’ll find consistency is more important than doing things in a slightly better way. I know this sounds ironic coming from someone who worked on the product (*COUGH*three separate data type systems*COUGH*), but I give this advice to other people, speaking from experience.

Tools to Help Apply Best Practices

Tools can really help here, as well. There are two third party tools that comes to mind:

BI xPress from Pragmatic Works has many different features that help here. The Auditing Framework helps with consistent logging, and their Templates, Snippets and Data Flow Nuggets functionality lets you apply and reuse patterns across your packages. They also recently added a Best Practices Analyzer that helps you maintain some consistency.

If you’re looking to apply design patterns and/or you require dynamic package generation, then Biml from Varigence is a great solution. It’s a powerful markup language/compiler that makes it really easy to create your SSIS packages in a consistent way. There is a free version of Biml integrated into BIDS Helper, and a more advanced version (and separate designer – Mist) available from Varigence. There is also a Biml repository site that is great for examples.

Slides from SSIS Performance Design Patterns | TechDays Hong Kong 2013

The slides from my 3rd presentation at the TechDays Hong Kong event are now available. As I mentioned during the talk, design patterns are one of my favorite topics, and something I really enjoyed writing about. This presentation had a subset of the SSIS Performance Design patterns in the full day training sessions I’ve done in the past. See this post for a link to the full deck (over 100 slides).

Thanks for attending!

Dynamic Execution of Child Packages with Different Parameters

SSIS 2012 introduces the concept of Parameters – a more explicit way of configuring your SSIS packages that you can use with the Project deployment model. The Execute Package Task was updated to support binding variables in the parent package to parameters in the child package (an alternative to the Parent Variable Configurations in SSIS 2005/2008).

image

Clicking the Add button on the Parameter bindings page will automatically populate the binding list with the next un-bound parameter for the child package you have selected to run. Note, however, that you can manually enter a parameter name in the Child package parameter field, even if it doesn’t actually exist on the child package.

image

At runtime, any parameters that don’t exist on the child package don’t cause any errors or warnings at runtime. There is a reason for this…

Dynamic Package Execution

You can change the which child package the Execute Package Task runs using an expression on the PackageName property.

image

Since not all packages will have the same set of parameters, the Execute Package Task allows you to define a superset of parameter bindings at design time. Parameters that don’t exist for the current child package will simply be ignored.

For example:

  • ChildPackage1 requires ParameterA and ParameterB
  • ChildPackage2 requires ParameterC
  • ParentPackage has logic that will dynamically call either ChildPackage1 or ChildPackage2 at runtime

To do this, you would add parameter bindings for all three parameters (ParameterA, ParameterB, ParameterC) in the Execute Package Task in your parent package.

image


More details and parent-child package execution design patterns can be found in Chapter 16 of the SQL Server 2012 Integration Services Design Patterns book available from Apress.

PASS Summit 2012 – SSIS Pre-Conference Event

Well, I guess Andy, Tim & I did a good job at PASS last year, as we’ve been invited back to present another SSIS pre-conference event at the 2012 PASS Summit. This time we’ll be joined by the co-authors of our SSIS Design Patterns book, Jessica Moss and Michelle Ufford.

 

SSIS Design Patterns
Speaker(s): Andy Leonard, Jessica Moss, Matt Masson, Michelle Ufford, Tim Mitchell
In this full-day session, the “SSIS Design Patterns” (Apress, 2012) author team will describe and demonstrate patterns for package execution, package logging, loading flat file and XML sources, loading the cloud, dynamic package generation, SSIS Frameworks, data warehouse ETL, and data flow performance.

 

We had a great audience and turnout last year (around 120 people if I recall correctly…). This year should be even better…

SSIS Performance Design Patterns

Here are the slides (all 177 of them) for the SSIS Performance Design Patterns talk that I’ve delivered at three different pre-conference events over the past month. Each presentation was slightly different, but the core content remained the same. These events included:

The version posted here is from the SQL Bits event.The recording from this event be posted online in the next few weeks – be sure to check the site!

 

Speaking at SQL Saturday #105 in Dublin, Ireland

I’m honored to be presenting at SQL Saturday #105 in Dublin, Ireland on March 24th. They recently added a 4th session track to keep up with demand, and I was more than happy to fill in one of the timeslots with a session about the developer experience enhancements in SSIS 2012.

I’ll also be covering SSIS Performance Design Patterns in a full day training session the day before the event. The content will be similar to the pre-con session I’m doing the following week at SQL Bits X, with a couple of changes – I’ll be presenting solo, and will spend the last portion of the day covering the enhancements we made to SSIS in SQL Server 2012.

Registration information and details are available here.

SQL Saturday #105

Design Pattern: Avoiding Transactions

SSIS has built-in support for transactions, but their use isn’t ideal for all scenarios. This post covers how SSIS transactions work at a high level, some of the challenges associated with them, and provides some design alternatives.

Disclaimer: Many SSIS packages use transactions without issue. Using the built-in functionality is the easiest approach, and (typically) doesn’t require a redesign of your package logic to implement. If your packages are currently using transactions, please do not feel the need to go in and change anything! If, however, you are looking to improve the performance of your packages, or are encountering the issues I describe below, consider using the design patterns presented here.

How SSIS Transactions Work

SSIS transactions are based on the Microsoft Distributed Transaction Coordinator (MSDTC). They are enabled at the Task/Container level by setting the TransactionOption property to Required. For a transaction to work at runtime, MSDTC must be enabled (and configured) on all machines involved in the transaction (i.e. the source machine, the destination, and the machine running the SSIS package… and of course, in some environments, this might all be the same machine).

How SSIS uses MSDTC

Challenges

Configuration

Configuring DTC to communicate across hosts and machine boundaries can be tricky. Timeouts and unexpected termination of connections on long running queries are another common source of problems. Finally, the server(s) you are accessing must support DTC – this isn’t always possible when your database is using a non-Windows platform.

Performance

Performance is probably the biggest challenge that customers face with SSIS transactions. Using any sort of transaction will impact the performance of your packages, but DTC transactions can be especially “heavy-weight”, as they can require coordination between multiple hosts.

DB Provider

To enlist in a transaction, the underlying data provider used by your connection manager must also support DTC transactions. Microsoft providers (like SQL Native Client and SqlClient) typically support DTC, but the same may not be true for third party providers.

Transaction Support

My biggest complaint (and perhaps it’s something we can address in the future) is that there is no way to tell which Tasks actually support transactions. Take the following control flow:

Which Tasks support Transactions?

We’re missing some information from the UI:

  • We can’t tell if transactions are enabled
  • We can’t tell where the transaction starts
  • We can’t tell if any of the tasks have opted out of the transaction (TransactionOption = NotSupported)

To get the information we need, we have to rely on good documentation (i.e. annotations), or take a look at the TransactionOption property for each of the tasks/containers in the package.

Even if we can assume that transactions are enabled at the package level, and none of the tasks have opted out, not every Task is able to rollback their work when a transaction fails. Using the example above:

Supports Transactions (rollback)

  • Execute SQL Task
  • Analysis Services Execute DDL Task
  • Data Flow Task

Does Not Support Transactions (rollback)

  • XML Task
  • File System Task

The Script Task may or may not support transactions, depending on how it was written.

Design Alternatives

Manual Transactions

A common alternative to using the built-in transactions is to create them yourself. This is done by setting the Connection Manager’s RetainSameConnection property to True, and creating/committing/rolling back the transaction using Execute SQL Tasks. Jamie Thomson blogged about this back in 2005, and is a very common design pattern amongst SSIS developers today. When RetainSameConnection is set to True, the connection manager will attempt to reuse an existing connection when a task asks for one (instead of creating a new connection each time).

Using RetainSameConnection

The advantage here is that your transactions become “light-weight” and perform better, as they no longer need to be coordinated across multiple machines using DTC. However, this approach is not a silver bullet, and there are some limitations to what RetainSameConnection can do.

The behavior of RetainSameConnection is determined by the underlying provider – there is very little logic in the Connection Manager itself to handle connection reuse. Not all providers will be able to support this functionality (although it should work fine with the SQL Server providers). Also note that not all Connection Managers support RetainSameConnection, either. File Connection Managers, for example, simply return a string to the file or directory they are pointing to –  they don’t actually open up a “connection object”.

RetainSameConnection might not work if the a connection manager is being used multiple times in parallel. You usually encounter this issue if you are trying to SELECT and INSERT at the same time. Some common examples:

  1. Running multiple Execute SQL Tasks in parallel using the same connection manager
    • Solution: Use separate connection managers, or run the tasks serially
  2. Both the Source and Destination component in your Data Flow are using the same Connection Manager
    • Solution: Use a separate connection manager for the source component
  3. The Destination and another transform (usually Lookup, or Script) are using the same Connection Manager
    • Solution: Use a Full Cache Lookup or staging tables

The last thing to note is that if your data flow is writing to multiple destinations, you will need to use multiple connection managers. Be sure to start a separate transaction for each one.

Note: If you are using this pattern and encountering issues, try using SQL Profiler to detect if your queries are in fact using the same SPID.

Design for Restartability

Instead of using transactions, consider designing your packages for restartability. This doesn’t necessarily mean the use of Checkpoints (although you can use them if you are so inclined) – it means that your packages take any corrective actions they need to ensure that they can be safely run multiple times. The type of actions you take will depend on your package’s logic. For example, a package that loads multiple flat files from a directory could move/rename a file once it has been successfully processed to avoid loading it a second time if the package needs to be restarted. This approach requires you to put more effort into the design of your package, but will give you a much more robust ETL solution.

There will be scenarios where the use of transactions is required, but it is usually possible to limit their scope. This typically involves splitting up your ETL logic across multiple tasks (or packages), staging your data in between each step, and enabling transactions only when absolute necessary. For example, you may create a Data Flow that pulls data from a source, performs the required transformations, and then loads the data into a staging table. You can then use another Data Flow (or an Execute SQL Task if the staging table is area in the destination database) to insert/merge the data into the final destination. With this approach, you will only need to use a transaction for the final insertion.

Handling Failed Inserts when using Fast Load

A useful pattern when designing for restartability is to isolate the specific rows that fail to insert into your destination. If you are using Bulk / Fast Load at your destination (which is recommended if you are at all concerned with performance) and one of the rows fails to insert, it may fail the entire batch (the logic depends on the provider and destination you are using). To isolate the failed row, you can add a second destination to the error output that does row by row insertions (i.e. does not use Bulk / Fast Load). You can then redirect the error output of the second destination to a flat file / staging table so it can be manually inspected and corrected.

Handling Failed Inserts when using Fast Load

Summary

SSIS has built-in support for transactions, but to get the best performance, you might want to consider an alternative design. I highly recommend designing your packages for restartability, and limiting the use and scope of transactions in your packages.

I hope you found this blog post useful. I’d love to hear about specific scenarios where you are using transactions, and alternative package designs you might have used to implement your own form of restartability.

Upcoming Book – SSIS Design Patterns

Over the past few months I’ve been working on a book for SQL Server 2012 entitled – SSIS Design Patterns. I’m co-authoring the book with a fantastic bunch of SQL Server MVPs – Tim Mitchell, Jessica Moss, Michelle Ufford, and Andy Leonard. Details are now available on the Apress website and Amazon.

SSIS Design Patterns

SSIS Design Patterns is a book of recipes for SQL Server Integration Services (SSIS). Design patterns in the book show how to solve common problems encountered when developing data integration solutions. Because you do not have to build the code from scratch each time, using design patterns improves your efficiency as an SSIS developer. In SSIS Design Patterns, we take you through several of these snippets in detail, providing the technical details of the resolution.
SSIS Design Patterns does not focus on the problems to be solved; instead, the book delves into why particular problems should be solved in certain ways. You’ll learn more about SSIS as a result, and you’ll learn by practical example. Where appropriate, SSIS Design Patterns provides examples of alternative patterns and discusses when and where they should be used. Highlights of the book include sections on ETL Instrumentation, SSIS Frameworks, and Dependency Services.

  • Takes you through solutions to several common data integration challenges
  • Demonstrates new features in SQL Server 2012 Integration Services
  • Teaches SSIS using practical examples
What you’ll learn
  • Load data from flat file formats
  • Explore patterns for executing SSIS packages
  • Discover a pattern for loading XML data
  • Migrate SSIS packages through your application lifecycle without editing connections
  • Take advantage of SSIS 2012 Dependency Services
  • Build an SSIS Framework to support your application needs
Who this book is for

SSIS Design Patterns is for the data integration developer who is ready to take their SQL Server Integration Services (SSIS) skills to a more efficient level. It’s for the developer interested in locating a previously-tested solution quickly. SSIS Design Patterns is a great book for ETL (extract, transform, and load) specialists and those seeking practical uses for new features in SQL Server 2012 Integration Services. It’s an excellent choice for business intelligence and data warehouse developers.