Too Many Sources in a Data Flow

This was a recent customer issue that I wanted to share.

The customer was transferring a large amount of data between two tables, without performing any transformations on it. They wanted to run multiple transfer streams in parallel, and created a way to split up their source data into six partitions. In their original design, they had six Source -> Destination combinations in a single Data Flow Task.

six source components in a data flow

This package worked fine in their development environment, but would result in a deadlock when they ran it in production. Looking at the SSIS logs, they could see that the package was hanging in the Pre-Execute phase of the Data Flow. They ran a profiler trace and found that SQL was showing Async Network IO for five sessions, but the sixth was waiting for memory resources (wait type of RESOURCE_SEMAPHORE). Looking at sys.dm_exec_query_resource_semaphores  (shown below) confirmed this.

Waits

There were two factors contributing to the deadlock:

  1. The source server did not have enough memory to run six of these queries in parallel
  2. In SSIS, the Data Flow does not start executing until all components have completed the Pre-Execute phase

When the data flow runs, each source component prepares its query with the source server in the pre-execute phase. The server was able to grant enough memory for the first five queries (session_id 57 to 61), but the sixth query (session_id = 63) was told to wait. This prevents the pre-execute phase of the sixth OLE DB Source component from completing. The data flow won’t start running until the pre-execute phase completes for each component, and the source server can’t grant memory until the other queries complete, which results in a deadlocked package.

To resolve this issue, the customer changed the package design pattern to have a single Source -> Destination combination per data flow, with six parallel Data Flow Tasks. With this design, they were still occasionally getting RESOURCE_SEMAPHORE waits, but the waits didn’t last forever, as the other Data Flow tasks were able to complete their queries independently. (They were eventually able to remove the waits all together by optimizing their source query.)

An alternate design (which would be my preference) would have been to have a single Data Flow per package, parameterize the source query, and run six instances of the package in parallel.

Summary

Beware of doing too much in a single Data Flow! Unlike Tasks in the control flow, which run independently, the Data Flow can’t start until each component has finished its initialization in the Pre-Execute phase. This is especially important for components that can spend a long time in the Pre-Execute phase, such as a Lookup Transform with Full Cache mode. A Data Flow should only have a single source — unless the data flow is combining data from multiple sources. If your data flow has multiple data paths that do not overlap, then you should consider using separate Data Flow tasks (if they need to be run “together”, put them in the same Sequence Container).

 

4 thoughts on “Too Many Sources in a Data Flow

    1. No sir, it would not. For two reasons:

      1. The hang is happening in the pre-execute phase (which is the root of the problem). The data flow is deadlocked because the 6th source is waiting on the 1st source to finish it's query, and the 1st source query never completes because the 6th source is preventing the Execute phase from starting.

      2. This isn't how EngineThreads works. EngineThreads is just a suggestion to the Data Flow. If the Data Flow engine determines the value is below the minimum number of threads (M#T), it will be ignored. The calculation to determine M#T is: # execution trees + # sources + 2. A data flow is guaranteed to always have a thread assigned per source.

      1. Can you provide a link that says the EngineThreads is "just a suggestion". If I set it to 2, as also identified in the link I pasted above, only 2 run. Try it out.

        1. The Books Online entry that talks about Data Flow Performance Features mentions that it is just a suggestion. http://msdn.microsoft.com/en-us/library/ms141031….

          What version of SSIS are you using? I believe the calculation for minimum number of threads was changed in SQL 2008. I'm running the package example that Jamie put on his blog with SSIS 2012, and i see all paths running in parallel, regardless of the engine threads setting.

Comments are closed.