New SSIS Whitepapers from the SQLCAT Team

Two new SSIS SQL Server Technical Articles written by the SQLCAT team have been posted to the MSDN Library. They both focus on hybrid/cloud data movement, but they also contain design patterns and guidance that apply to on-premises data flows. The SSIS Operational and Tuning Guide has a great section on designing a package for restartability and performance tuning that all SSIS users will find useful.

SSIS Operational and Tuning Guide

SQL Server Integration Services (SSIS) can be used effectively as a tool for moving data to and from Windows Azure (WA) SQL Database, both as part of the total extract, transform, and load (ETL) solution and as part of the data movement solution. SSIS can be used effectively to move data between sources and destinations in the cloud, and in a hybrid scenario between the cloud and on-premises systems. This paper outlines SSIS best practices for cloud sources and destinations, discusses project planning for SSIS projects whether the project is all in the cloud or involves hybrid data moves, and walks through an example of maximizing performance on a hybrid move by scaling out the data movement.


SSIS for Azure and Hybrid Data Movement

SQL Server Integration Services (SSIS) can be used effectively as a tool for moving data to and from Windows Azure SQL Database, both as part of the total extract, transform, and load (ETL) solution and as part of the data movement solution. SSIS can be used effectively to move data between sources and destinations in the cloud, and in a hybrid scenario between the cloud and on-premises systems. This paper outlines best practices for using SSIS for cloud sources and destinations and for project planning for SSIS projects to be used with Azure or hybrid data moves, and gives an example of maximizing performance on a hybrid move by scaling out the data movement.


Did you know that the DQS team also published a whitepaper on using DQS with AlwaysOn? It’s another interesting read.

SSIS Performance Design Patterns

Here are the slides (all 177 of them) for the SSIS Performance Design Patterns talk that I’ve delivered at three different pre-conference events over the past month. Each presentation was slightly different, but the core content remained the same. These events included:

  1. SQL Saturday #105 (Dublin)
  2. SQL Bits X (London)
  3. SQL Saturday #111 (Atlanta)

The version posted here is from the SQL Bits event. The recording from this event will be posted online in the next few weeks – be sure to check the site!


Resources for SSIS Performance Best Practices

Simple post today.

Here are a number of links to resources that I reference in my Performance Design Patterns presentation.



You can see me for SSIS Performance Design Pattern training at SQL Saturday #105 in Dublin, SQL Bits X in London, and SQL Saturday #111 in Atlanta.

Queries for Data Flow Component Performance in SQL 2012

The built-in SSIS Catalog reports in SSMS show a breakdown of task and component performance of a package execution when you run with a logging level of “Performance”.

[Image: SSIS Catalog package performance report]

For a more detailed or customized view, you can query the catalog tables directly. The key table to get data flow performance information is [catalog].[execution_component_phases]. Here are two sample queries – the first returns the total time spent on each component (similar to what is displayed in the performance report), and the second returns the times broken down by phase.
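Something along these lines should work (a sketch against the SQL Server 2012 SSISDB catalog schema; the times are summed from the start_time/end_time pairs that the catalog records for each phase):

    USE SSISDB;
    GO

    -- Use the most recent execution in the catalog
    DECLARE @execution_id BIGINT = (SELECT MAX(execution_id)
                                    FROM [catalog].[executions]);

    -- Query 1: total active time per data flow component
    SELECT package_name, task_name, subcomponent_name,
           SUM(DATEDIFF(ms, start_time, end_time)) AS active_time_ms
    FROM [catalog].[execution_component_phases]
    WHERE execution_id = @execution_id
    GROUP BY package_name, task_name, subcomponent_name
    ORDER BY active_time_ms DESC;

    -- Query 2: the same times, broken down by phase
    -- (Validate, PreExecute, ProcessInput, and so on)
    SELECT package_name, task_name, subcomponent_name, phase,
           SUM(DATEDIFF(ms, start_time, end_time)) AS phase_time_ms
    FROM [catalog].[execution_component_phases]
    WHERE execution_id = @execution_id
    GROUP BY package_name, task_name, subcomponent_name, phase
    ORDER BY package_name, task_name, subcomponent_name, phase_time_ms DESC;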


Running these queries returns something like this:

[Image: sample query results]

Some notes:

  1. You need to run the package with the Performance (or Verbose) logging level to capture this data
  2. You can set a default logging level for the Catalog (see the snippet after this list)
  3. The query above determines the execution_id for the latest run – you will need to change this value to look at the results for specific package executions
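For example, to make Performance the server-wide default (logging levels are 0 = None, 1 = Basic, 2 = Performance, 3 = Verbose):

    EXEC [SSISDB].[catalog].[configure_catalog]
        @property_name  = N'SERVER_LOGGING_LEVEL',
        @property_value = 2;  -- Performance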

How Can I Improve the Performance of My Destination Component?

Question: I’ve created a custom destination component, but I’m not happy with the performance. I tried using multiple threads, but the pipeline buffer doesn’t allow free traversal. What can I do?


I’d start by profiling the component to make sure you identify the correct bottlenecks. The most common source of problems is accessing the COM interfaces at runtime – make sure that you keep a reference to any object retrieved through IDTSxxx100 members to avoid repeated COM interop calls. It’s a best practice to cache everything you need in the pre-execute phase.
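As a minimal sketch of that caching pattern (assuming a managed PipelineComponent; the class and field names here are illustrative, not from any real sample):

    using Microsoft.SqlServer.Dts.Pipeline;
    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

    public class CachingDestination : PipelineComponent
    {
        // Buffer column indexes, resolved once so that ProcessInput
        // never needs to call back into the COM metadata objects
        private int[] columnBufferIndexes;

        public override void PreExecute()
        {
            base.PreExecute();

            IDTSInput100 input = ComponentMetaData.InputCollection[0];
            int columnCount = input.InputColumnCollection.Count;
            columnBufferIndexes = new int[columnCount];

            for (int i = 0; i < columnCount; i++)
            {
                // Each of these lookups is a COM interop call - do them
                // here, once, instead of once per row in ProcessInput
                columnBufferIndexes[i] = BufferManager.FindColumnByLineageID(
                    input.Buffer, input.InputColumnCollection[i].LineageID);
            }
        }
    }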

If you’ve determined that the bottleneck is at the destination, and that multiple threads will improve performance, then you’ll need to make a copy of the data in the buffer before passing it to your destination. I did something similar in the old Data Services Destination sample on CodePlex (main/DataServices/Src/Destination/Component.cs). The logic was:

  1. Loop through the buffer
  2. Copy row data into local object
  3. Add the batch of data as a work item to a thread pool for insertion (or use the new Task Parallel Library)
  4. When loop completes, wait for all threads/work items to complete before returning from ProcessInput

In the sample I linked to above, a “batch” was a single row (as the destination only supported RBAR at the time), but you can configure this to be whatever batch size makes sense for your destination.
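Here is a rough sketch of that flow using the Task Parallel Library (the batch size and the InsertBatch method are placeholders for whatever your destination needs; error handling omitted):

    using System.Collections.Generic;
    using System.Threading.Tasks;
    using Microsoft.SqlServer.Dts.Pipeline;

    public class ThreadedDestination : PipelineComponent
    {
        private readonly List<Task> pendingWork = new List<Task>();
        private const int BatchSize = 1000; // tune for your destination

        public override void ProcessInput(int inputID, PipelineBuffer buffer)
        {
            var batch = new List<object[]>(BatchSize);

            while (buffer.NextRow())
            {
                // Steps 1-2: copy the row out of the buffer - real copies,
                // not references (BLOB columns would also need GetBlobData)
                var row = new object[buffer.ColumnCount];
                for (int i = 0; i < buffer.ColumnCount; i++)
                {
                    row[i] = buffer[i];
                }
                batch.Add(row);

                // Step 3: hand a full batch off to a background task
                if (batch.Count == BatchSize)
                {
                    var work = batch;
                    pendingWork.Add(Task.Factory.StartNew(() => InsertBatch(work)));
                    batch = new List<object[]>(BatchSize);
                }
            }

            if (batch.Count > 0)
            {
                var work = batch;
                pendingWork.Add(Task.Factory.StartNew(() => InsertBatch(work)));
            }

            // Step 4: wait before returning, since the buffer can be reused
            // once ProcessInput completes (and we may still need to redirect
            // error rows)
            Task.WaitAll(pendingWork.ToArray());
            pendingWork.Clear();
        }

        private void InsertBatch(List<object[]> rows)
        {
            // Placeholder: send the batch to the destination here
        }
    }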

Note that step #4 is important, as the buffer object that is passed into ProcessInput can be destroyed/modified once the function completes. Waiting allows you to redirect error rows if insertions fail at the destination. The downside to waiting is that your destination component can’t start processing a new buffer until you’re “done” with the current one (i.e. have returned from ProcessInput). You can choose not to wait and return right after you’ve read all of the rows, but you must ensure that:

  1. You make a copy (a real copy – not just an object reference) of any data you need from the PipelineBuffer object
  2. When you receive EndOfRowset, make sure you complete all outstanding work items before returning from ProcessInput

The Data Services sample also contains a really useful class called ThreadPoolWait, which makes it easier to wait for threads to complete.

Hope that helps!

Too Many Sources in a Data Flow

This was a recent customer issue that I wanted to share.

The customer was transferring a large amount of data between two tables, without performing any transformations on it. They wanted to run multiple transfer streams in parallel, and created a way to split up their source data into six partitions. In their original design, they had six Source -> Destination combinations in a single Data Flow Task.

[Image: six source components in a single Data Flow Task]

This package worked fine in their development environment, but would result in a deadlock when they ran it in production. Looking at the SSIS logs, they could see that the package was hanging in the Pre-Execute phase of the Data Flow. They ran a Profiler trace and found that SQL Server was reporting ASYNC_NETWORK_IO waits for five sessions, but the sixth was waiting for memory resources (a wait type of RESOURCE_SEMAPHORE). Looking at sys.dm_exec_query_resource_semaphores (shown below) confirmed this.

[Image: sys.dm_exec_query_resource_semaphores output showing the waits]
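To check for this yourself, a query along these lines (a sketch) shows whether anything is queued behind a memory grant; the interesting signal is a non-zero waiter_count:

    -- Memory grant semaphores on the source server; waiter_count > 0
    -- means queries are queued (RESOURCE_SEMAPHORE waits)
    SELECT resource_semaphore_id,
           target_memory_kb,
           available_memory_kb,
           grantee_count,
           waiter_count
    FROM sys.dm_exec_query_resource_semaphores;

sys.dm_exec_query_memory_grants gives the per-session view of the same thing (a NULL grant_time means that session is still waiting for its memory).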

There were two factors contributing to the deadlock:

  1. The source server did not have enough memory to run six of these queries in parallel
  2. In SSIS, the Data Flow does not start executing until all components have completed the Pre-Execute phase

When the data flow runs, each source component prepares its query with the source server in the pre-execute phase. The server was able to grant enough memory for the first five queries (session_id 57 to 61), but the sixth query (session_id = 63) was told to wait. This prevents the pre-execute phase of the sixth OLE DB Source component from completing. The data flow won’t start running until the pre-execute phase completes for each component, and the source server can’t grant memory until the other queries complete, which results in a deadlocked package.

To resolve this issue, the customer changed the package design pattern to have a single Source -> Destination combination per data flow, with six parallel Data Flow Tasks. With this design, they were still occasionally getting RESOURCE_SEMAPHORE waits, but the waits didn’t last forever, as the other Data Flow Tasks were able to complete their queries independently. (They were eventually able to remove the waits altogether by optimizing their source query.)

An alternate design (which would be my preference) would have been to have a single Data Flow per package, parameterize the source query, and run six instances of the package in parallel.
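A sketch of that pattern with the SQL Server 2012 catalog stored procedures (the folder, project, package, and parameter names here are made up; assume a package parameter that drives the partitioning of the source query):

    DECLARE @partition INT = 1;
    DECLARE @execution_id BIGINT;

    WHILE @partition <= 6
    BEGIN
        EXEC [SSISDB].[catalog].[create_execution]
            @folder_name     = N'ETL',
            @project_name    = N'DataTransfer',
            @package_name    = N'TransferPartition.dtsx',
            @use32bitruntime = 0,
            @reference_id    = NULL,
            @execution_id    = @execution_id OUTPUT;

        -- Object type 30 = package parameter
        EXEC [SSISDB].[catalog].[set_execution_parameter_value]
            @execution_id,
            @object_type     = 30,
            @parameter_name  = N'PartitionId',
            @parameter_value = @partition;

        -- start_execution is asynchronous by default, so all six
        -- package instances end up running in parallel
        EXEC [SSISDB].[catalog].[start_execution] @execution_id;

        SET @partition += 1;
    END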

Summary

Beware of doing too much in a single Data Flow! Unlike Tasks in the control flow, which run independently, the Data Flow can’t start until each component has finished its initialization in the Pre-Execute phase. This is especially important for components that can spend a long time in the Pre-Execute phase, such as a Lookup Transform with Full Cache mode. A Data Flow should only have a single source — unless the data flow is combining data from multiple sources. If your data flow has multiple data paths that do not overlap, then you should consider using separate Data Flow tasks (if they need to be run “together”, put them in the same Sequence Container).