Generate an Empty Raw File Without Running your SSIS Package

A new feature was added to the Raw File Destination in SQL Server 2012 which allows you to output an empty raw file from the editor UI.

  1. Add a Raw File Destination to your data flow, and connect it to the previous transform
  2. Double click to open the Raw File Destination Editor
  3. Configure the file name
  4. Click on the Columns tab, and select the columns you want included
  5. Click on the Connection Manager tab
  6. Click the Generate initial raw file … button
  7. A dialog confirming the output columns will pop up. Click OK to create your file.




This functionality is handy when you are using raw files to stage data between data flows. You can use this empty raw file in the Raw File Source to configure the metadata it needs. In previous versions, you would have had to run the package to produce the initial file.

Another note – Raw Files were improved in SQL Server 2012 to include sort information, so now the IsSorted flag is automatically set in the Raw File Source.



Using Hive ODBC from SSIS

Microsoft announced the general availability of Windows Azure HDInsight last week, and released an updated version of the Microsoft Hive ODBC Driver. This post describes how to use this driver to pull data from your HDInsight cluster with an SSIS Data Flow Task.

Installing and Configuring the Hive ODBC Driver

The download page for the Hive ODBC driver provides a link to an article on how to configure the Hive ODBC driver to connect to HDInsight using Excel. The screenshots in the article are a bit out of date, but the procedure is essentially the same when using the driver from SSIS. Be sure to install both the 32bit and 64bit versions of the driver – you’ll need to install them separately. If you only install the 64bit driver, you’ll get errors in Visual Studio (which is a 32bit process) when trying to configure your connection manager. If you only install the 32bit driver, you might get an error when you try to run the package in the designer (as the designer runs the package in 64bit mode by default).

Once the driver is installed, there are two ways to configure it – using a Data Source Name (DSN), or using a connection string. I recommend taking the connection string approach, but you might find using the DSN approach more convenient for your environment.

Data Source Name (DSN)

Using a DSN requires you to preregister the data source (at the file/user/system level) using the Windows ODBC Data Source Administrator. You’ll then be able to reference this DSN entry by name from an SSIS Connection Manager. This essentially creates an alias for your data source – you can change where the DSN is pointing, and your packages will continue to work. The downside to this approach is that you’ll need to make sure the DSN exists on all machines that will be running the package.

There are two versions of the ODBC Data Source Administrator UI – one for 32bit (%windir%\SysWOW64\odbcad32.exe) and one for 64bit (%windir%\System32\odbcad32.exe). You’ll likely want to create both 32bit and 64bit DSNs – just make sure that the same name is used for both versions. At minimum, you’ll need to register a 32bit DSN to use when creating your SSIS package in the designer.

When creating the DSN, select the Microsoft Hive ODBC Driver.


On the configuration page, you’ll need to enter:

  • Host: This is the full domain name of your HDInsight cluster (*.azurehdinsight.net)
  • Port: 443 is the default
  • Database: default
  • Hive Server Type: Hive Server 2
  • Authentication Mechanism: Select Windows Azure HDInsight Service – this authenticates with the username/password you set when configuring your cluster
  • User Name & Password: the user you configured for your cluster


Another setting of interest on the Advanced Options page is the Default string column length value. By default, this will be set to 32767, which is larger than the SSIS maximum length for a DT_STR/DT_WSTR column (4000). This means that SSIS will treat all Hive STRING columns as DT_TEXT/DT_NTEXT, which has negative performance implications. If you know that your data values will be less than 4000 characters in length, I recommend lowering this value to 4000 (or less).


Once the values are configured, click the Test button, and you’re good to go.


The DSN will then appear in your ODBC Administrator UI. As noted above, you’ll want to create both 32bit and 64bit DSNs.


You will now be able to see your newly created DSN when you create an ODBC Connection Manager in SSIS.


Using a Connection String

The advantage to using a connection string is that you don’t have to pre-create the DSN on the systems that will be running your SSIS package. Knowing what to enter can be a little tricky (see the tip below), but I prefer this approach as it removes the external (DSN) dependency. Also note that the same connection string works for both 32bit and 64bit execution modes – you just need to ensure that both versions of the ODBC driver are installed.

You’ll need to set the following properties:

  • Driver: the name of the driver – {Microsoft Hive ODBC Driver}
  • Host: the DNS hostname of your cluster
  • Port: the connection port – the HDInsight default is 443
  • Schema: the default database schema
  • RowsFetchedPerBlock: the number of rows fetched per block – the recommendation is to keep it at 10,000
  • HiveServerType: the HDInsight default is 2
  • AuthMech: the authentication mechanism – you’ll want a value of 6, which maps to using the username and password you specified when the cluster was created
  • DefaultStringColumnLength: the default length for STRING columns – see the note above



DRIVER={Microsoft Hive ODBC Driver};Host=<cluster name>;Port=443;Schema=default;RowsFetchedPerBlock=10000;HiveServerType=2;AuthMech=6;DefaultStringColumnLength=4000

When configuring your Connection Manager in SSIS, select the “Use connection string” option and paste in your connection string. Note that the Username and Password values can be specified as part of the connection string (uid and pwd fields), or set separately in the connection manager’s UI.


Tip – To determine all of the fields that can be set for the provider (or at least the ones exposed by the UI), you can go through the DSN creation step, and then save it as a File DSN. This will create a text file in .ini format. Open it with any text editor, and you can see each of the name/value pairs that are being set. You just need to put them together into a single, semi-colon delimited string.
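For example, saving the HDInsight DSN from the previous section as a File DSN produces something like this (the host name is illustrative):

[ODBC]
DRIVER=Microsoft Hive ODBC Driver
Host=mycluster.azurehdinsight.net
Port=443
Schema=default
HiveServerType=2
AuthMech=6
RowsFetchedPerBlock=10000
DefaultStringColumnLength=4000

Joining those name/value pairs with semi-colons (and wrapping the driver name in braces) gives you the connection string shown above.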

Configuring Your Data Flow

In an SSIS Data Flow, you can use the Hive ODBC Driver from:

  1. ODBC Source – introduced as a new component in SSIS 2012, although there are third party options available for previous versions
  2. ADO.NET Source – using the .NET ODBC Data Provider

Note that in the past the Hive ODBC driver had some compatibility issues with the SSIS ODBC Source, leading to a recommendation to use the ADO.NET Source. I didn’t notice any problems when using the latest Microsoft driver – you’ll see that the example screenshots below use the ODBC Source.

After configuring your connection manager, you can select a table or enter your query (similar to interacting with any other source in SSIS). I found that using the table drop down can take a while (15-30 seconds) – querying metadata from Hive isn’t an immediate operation.


Clicking the Preview button takes a while (as the Hive job is executed on your cluster), but it is a good way to make sure that your connection manager is correctly configured.


Clicking the Columns tab may have a delay (again – fetching metadata isn’t instantaneous), but will come back with the columns as you’d expect.


Once you click OK, you may notice another delay as the component is validated. Since Hive metadata operations can take a while to complete, I suggest setting the ValidateExternalMetadata value of your source component to False. This will speed up validation quite a bit – the metadata will only be refreshed when you open the Source UI. When using the ODBC Source, you should also consider setting the BatchSize property (default 1000) to match the RowsFetchedPerBlock setting of your connection manager (default is 10,000).

After you complete the rest of your Data Flow, you should be able to run your package. In my experience, the Validate and PreExecute phases of the data flow finish (relatively) quickly, but once the Execute phase starts, your source will sit with its icon spinning and no rows coming out while it waits for your Hive job to complete on your cluster. Once the job completes, you’ll see the data streaming back nicely.



I found that running a package which reads from the hivesampletable table that comes with HDInsight consistently takes about 2 minutes. After applying the tips mentioned throughout this post (and summarized below), the average execution time went down to 1 minute, 20 seconds. I also found that the ADO.NET Source was consistently (slightly) faster than the ODBC Source (by 5-10 seconds), but I expect this will vary between environments (and data sets).


  • Install both the 32bit and 64bit versions of the Hive ODBC Driver
  • Consider using a connection string instead of a DSN in your connection manager
  • Change the Default String Column Length to 4000 (or less) if you are sure your String columns are less than 4000 characters in length
  • Set ValidateExternalMetadata to False on your Source component
  • When using the ODBC Source, increase the BatchSize (to 10000)


If Preview works for you in the ODBC Source (or ADO.NET Source), but your package fails when you run in the designer, make sure you have both the 32bit and 64bit versions of the Hive provider installed, and that you’ve created DSNs for both platforms. The error looks something like this:


[ODBC Source [2]] Error: The AcquireConnection method call to the connection manager <XXX> failed with error code 0xC0014009. There may be error messages posted before this with more information on why the AcquireConnection method call failed.

[SSIS.Pipeline] Error: ODBC Source failed validation and returned error code 0x80004005.

[Connection manager “<XXX>”] Error: An ODBC error -1 has occurred.

[Connection manager “<XXX>”] Error: There was an error trying to establish an Open Database Connectivity (ODBC) connection with the database server.


If for some reason you are unable to install the 64bit provider, you can force your SSIS package to run in 32bit mode in Visual Studio by right clicking on the project, selecting Properties, going to the Configuration Properties | Debugging page, and setting the Run64BitRuntime value to False.


Data Flow Optimization Checklist


I put this together for an SSIS performance presentation I did a while ago, and thought it warranted its own post.

In addition to looking for appropriate Design Patterns, I have a checklist of things to look for when trying to optimize an SSIS Data Flow. This list is by no means exhaustive, but it contains some of the more common causes of performance issues in the packages I’ve worked with.

  • OLE DB Command Transform
    • The OLE DB Command operates on a row-by-row basis. Stage the data and perform the operation in batch mode using an Execute SQL Task instead (see the MERGE sketch after this list).
    • Learn MERGE, and love it.
  • Sort Transforms
    • Do you really need them? Sometimes you do. Most of the time you don’t. If you do (i.e. you are loading from a flat file, and need to do a merge), consider maxing out the DefaultBufferMaxRows setting so you can sort more data in memory (the Sort will keep at most 5 active buffers in memory before it starts swapping to disk).

  • Lookup Transforms
    • Can the lookup be avoided?
    • Are you using the right cache mode?
    • Should you use a Cache Connection Manager?

Lots of related lookup posts can be found here.
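To make the OLE DB Command point concrete, here is a minimal sketch of the staged-batch pattern (the table and column names are made up for illustration). The data flow bulk loads into a staging table, and an Execute SQL Task then applies all of the changes in a single set-based statement:

-- Apply staged rows in one batch instead of row-by-row OLE DB Commands
MERGE INTO dbo.Customer AS target
USING dbo.StagingCustomer AS source
    ON target.CustomerID = source.CustomerID
WHEN MATCHED THEN
    UPDATE SET target.Name  = source.Name,
               target.Email = source.Email
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Name, Email)
    VALUES (source.CustomerID, source.Name, source.Email);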


Have fun optimizing!

Generate Data with a Script Component

I use the following code to generate data with a script component. I’ve found this useful in the past to demonstrate how buffers are handled in the data flow, as well as for perf/load testing a destination component. This script will automatically generate data based on the columns you’ve defined for the component – no additional code modifications are needed. For optimal performance, it uses static values, but it could be updated to support randomly generated values as well (it does support random ints – look at the code for details).

To use it, do the following:

  • Add a new Script Component, and make it a Source
  • Bring up the component UI
  • On Inputs and Outputs, define your columns
  • When you are done, click OK to save the component. I know, you haven’t entered the script yet, but it is a good idea to save your component at this point. If for some reason you cancel out of the dialog, or have an issue with the VSTA editor, you could lose all of the columns you just spent time defining.



  • Open the UI again
  • (Optional) If you want to be able to control the number of rows that get generated, define an Integer variable called MaxRowCount. Add it as a ReadOnly variable
  • Click the Edit Script… button
  • When the editor opens, replace your main.cs file with the following code

Download the script file here:
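Here is a minimal sketch along the same lines (a reconstruction, not the original download). It reflects over the properties of the generated Output0Buffer class, so it automatically adapts to whatever columns you define – static values per type, random values for ints, and the MaxRowCount variable from the optional step above:

using System;
using System.Linq;
using System.Reflection;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private static readonly Random Rand = new Random();

    public override void CreateNewOutputRows()
    {
        // Assumes the optional ReadOnly MaxRowCount variable from the steps
        // above – hard code a row count here if you skipped that step.
        int rowCount = Variables.MaxRowCount;

        // The columns you defined show up as properties on the generated
        // Output0Buffer class (this assumes the default output name, Output 0).
        // DeclaredOnly skips base class members; we also skip the generated
        // <column>_IsNull setters.
        PropertyInfo[] columns = typeof(Output0Buffer)
            .GetProperties(BindingFlags.Public | BindingFlags.Instance | BindingFlags.DeclaredOnly)
            .Where(p => p.CanWrite && !p.Name.EndsWith("_IsNull"))
            .ToArray();

        for (int i = 0; i < rowCount; i++)
        {
            Output0Buffer.AddRow();
            foreach (PropertyInfo column in columns)
            {
                column.SetValue(Output0Buffer, ValueFor(column.PropertyType), null);
            }
        }
    }

    // Static value per type for speed – extend with additional types (or
    // more randomness) as needed.
    private static object ValueFor(Type type)
    {
        if (type == typeof(string)) return "AAAAAAAAAA";
        if (type == typeof(int)) return Rand.Next(); // random ints
        if (type == typeof(long)) return 1L;
        if (type == typeof(DateTime)) return DateTime.Now;
        if (type == typeof(decimal)) return 1.0m;
        if (type == typeof(double)) return 1.0;
        if (type == typeof(bool)) return true;
        throw new NotSupportedException("Add a case for " + type.Name);
    }
}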


Automatically Select Input Columns in a Custom Data Flow Component

The following code snippet can be used in a custom data flow component to automatically select all input columns when you attach a path.
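A sketch of how this can be done (the class name is illustrative) – override OnInputPathAttached on your PipelineComponent and mark every virtual input column as used:

using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

public class MyComponent : PipelineComponent
{
    public override void OnInputPathAttached(int inputID)
    {
        base.OnInputPathAttached(inputID);

        IDTSInput100 input = ComponentMetaData.InputCollection.GetObjectByID(inputID);
        IDTSVirtualInput100 virtualInput = input.GetVirtualInput();

        // Select every upstream column. Use UT_READWRITE instead if your
        // component needs to modify the column values in place.
        foreach (IDTSVirtualInputColumn100 column in virtualInput.VirtualInputColumnCollection)
        {
            SetUsageType(inputID, virtualInput, column.LineageID, DTSUsageType.UT_READONLY);
        }
    }
}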

This is equivalent to clicking the Select All box in the Advanced Editor.


Queries for Data Flow Component Performance in SQL 2012

The built-in SSIS Catalog reports in SSMS show a breakdown of task and component performance of a package execution when you run with a logging level of “Performance”.

SSIS Catalog package performance report

For a more detailed or customized view, you can query the catalog tables directly. The key table to get data flow performance information is [catalog].[execution_component_phases]. Here are two sample queries – the first returns the total time spent on each component (similar to what is displayed in the performance report), and the second returns the times broken down by phase.
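A sketch of what these queries can look like – the first statement grabs the execution_id of the most recent run, the second sums up the time spent per component, and the third breaks it down by phase:

USE SSISDB;

DECLARE @execution_id BIGINT = (
    SELECT MAX(execution_id) FROM [catalog].[execution_component_phases]);

-- 1. Total time spent on each component (similar to the performance report)
SELECT task_name, subcomponent_name,
       SUM(DATEDIFF(ms, start_time, end_time)) AS active_time_ms
FROM [catalog].[execution_component_phases]
WHERE execution_id = @execution_id
GROUP BY task_name, subcomponent_name
ORDER BY active_time_ms DESC;

-- 2. Time per component, broken down by phase
SELECT task_name, subcomponent_name, phase,
       SUM(DATEDIFF(ms, start_time, end_time)) AS phase_time_ms
FROM [catalog].[execution_component_phases]
WHERE execution_id = @execution_id
GROUP BY task_name, subcomponent_name, phase
ORDER BY task_name, subcomponent_name, phase_time_ms DESC;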


Running these queries returns something like this:


Some notes:

  1. You need to run the package with a Performance logging level to capture this data
  2. You can set a default logging level for the Catalog (see the sketch after this list)
  3. The query above determines the execution_id for the latest run – you will need to change this value to look at the results for specific package executions
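For note #2, the Catalog-wide default can be changed in the Catalog Properties dialog in SSMS, or with a query like this sketch (logging level 2 maps to Performance):

EXEC SSISDB.[catalog].[configure_catalog]
    @property_name = N'SERVER_LOGGING_LEVEL',
    @property_value = 2;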

How Can I Improve the Performance of my Destination Component?

Question: I’ve created a custom destination component, but I’m not happy with the performance. I tried using multiple threads, but the pipeline buffer doesn’t allow free traversal. What can I do?


I’d start by profiling the component to make sure you identify the correct bottlenecks. The most common source of problems is accessing the COM interfaces at runtime – make sure that you keep a reference to any object retrieved through IDTSxxx100 members to avoid repeated COM interop calls. It’s a best practice to cache everything you need in the pre-execute phase.
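As an illustration of the caching pattern (the names here are made up, not from any particular sample): resolve the buffer column indexes once in PreExecute, so the per-row loop in ProcessInput never touches COM.

using Microsoft.SqlServer.Dts.Pipeline;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

public class MyDestination : PipelineComponent
{
    private int[] columnIndexes;

    public override void PreExecute()
    {
        base.PreExecute();

        // One-time COM interop: map each input column to its buffer index.
        IDTSInput100 input = ComponentMetaData.InputCollection[0];
        columnIndexes = new int[input.InputColumnCollection.Count];
        for (int i = 0; i < columnIndexes.Length; i++)
        {
            columnIndexes[i] = BufferManager.FindColumnByLineageID(
                input.Buffer, input.InputColumnCollection[i].LineageID);
        }
    }

    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        while (buffer.NextRow())
        {
            for (int i = 0; i < columnIndexes.Length; i++)
            {
                object value = buffer[columnIndexes[i]]; // no COM calls per row
                // ... hand the value off to your destination ...
            }
        }
    }
}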

If you’ve determined that the bottleneck is at the destination, and that multiple threads will improve performance, then you’ll need to make a copy of the data in the buffer before passing it to your destination. I did something similar in the old Data Services Destination sample on CodePlex (main/DataServices/Src/Destination/Component.cs). The logic was:

  1. Loop through the buffer
  2. Copy row data into local object
  3. Add batch of data to a work item threadpool for insertion (or use the new Task Parallel Library)
  4. When loop completes, wait for all threads/work items to complete before returning from ProcessInput

In the sample I linked to above, a “batch” was a single row (as the destination only supported RBAR at the time), but you can configure this to be whatever batch size makes sense for your destination.
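A condensed sketch of those four steps using the Task Parallel Library – this version of ProcessInput would replace the one in the earlier sketch, where columnIndexes was cached in PreExecute; BatchSize and InsertBatch are illustrative members of the same class (add using System.Collections.Generic and System.Threading.Tasks):

public override void ProcessInput(int inputID, PipelineBuffer buffer)
{
    var tasks = new List<Task>();
    var batch = new List<object[]>(BatchSize);

    // 1. Loop through the buffer
    while (buffer.NextRow())
    {
        // 2. Copy the row into a local object – a real copy, because the
        //    buffer can be destroyed/reused once ProcessInput returns
        var row = new object[columnIndexes.Length];
        for (int i = 0; i < row.Length; i++)
        {
            row[i] = buffer[columnIndexes[i]];
        }
        batch.Add(row);

        // 3. Queue each full batch for insertion on a worker thread
        if (batch.Count == BatchSize)
        {
            List<object[]> work = batch;
            batch = new List<object[]>(BatchSize);
            tasks.Add(Task.Factory.StartNew(() => InsertBatch(work)));
        }
    }

    if (batch.Count > 0)
    {
        tasks.Add(Task.Factory.StartNew(() => InsertBatch(batch)));
    }

    // 4. Wait for all outstanding work items before returning,
    //    so error rows can still be redirected
    Task.WaitAll(tasks.ToArray());
}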

Note that step #4 is important, as the buffer object that is passed into ProcessInput can be destroyed/modified once the function completes. Waiting allows you to redirect error rows if insertions fail at the destination. The downside to waiting is that your destination component can’t start processing a new buffer until you’re “done” with the current one (i.e. have returned from ProcessInput). You can choose not to wait and return right after you’ve read all of the rows, but you must ensure that:

  1. You make a copy (a real copy – not just an object reference) of any data you need from the PipelineBuffer object
  2. When you receive EndOfRowset, make sure you complete all outstanding work items before returning from ProcessInput

The Data Services sample also contains a really useful class called ThreadPoolWait, which makes it easier to wait for threads to complete.

Hope that helps!

Too Many Sources in a Data Flow

This was a recent customer issue that I wanted to share.

The customer was transferring a large amount of data between two tables, without performing any transformations on it. They wanted to run multiple transfer streams in parallel, and created a way to split up their source data into six partitions. In their original design, they had six Source -> Destination combinations in a single Data Flow Task.

six source components in a data flow

This package worked fine in their development environment, but would result in a deadlock when they ran it in production. Looking at the SSIS logs, they could see that the package was hanging in the Pre-Execute phase of the Data Flow. They ran a profiler trace and found that SQL Server was showing Async Network IO waits for five sessions, but the sixth was waiting for memory resources (wait type of RESOURCE_SEMAPHORE). Looking at sys.dm_exec_query_resource_semaphores (shown below) confirmed this.


There were two factors contributing to the deadlock:

  1. The source server did not have enough memory to run six of these queries in parallel
  2. In SSIS, the Data Flow does not start executing until all components have completed the Pre-Execute phase

When the data flow runs, each source component prepares its query with the source server in the pre-execute phase. The server was able to grant enough memory for the first five queries (session_id 57 to 61), but the sixth query (session_id = 63) was told to wait. This prevents the pre-execute phase of the sixth OLE DB Source component from completing. The data flow won’t start running until the pre-execute phase completes for each component, and the source server can’t grant memory until the other queries complete, which results in a deadlocked package.

To resolve this issue, the customer changed the package design pattern to have a single Source -> Destination combination per data flow, with six parallel Data Flow Tasks. With this design, they were still occasionally getting RESOURCE_SEMAPHORE waits, but the waits didn’t last forever, as the other Data Flow Tasks were able to complete their queries independently. (They were eventually able to remove the waits altogether by optimizing their source query.)

An alternate design (which would be my preference) would have been to have a single Data Flow per package, parameterize the source query, and run six instances of the package in parallel.


Beware of doing too much in a single Data Flow! Unlike Tasks in the control flow, which run independently, the Data Flow can’t start until each component has finished its initialization in the Pre-Execute phase. This is especially important for components that can spend a long time in the Pre-Execute phase, such as a Lookup Transform with Full Cache mode. A Data Flow should only have a single source – unless the data flow is combining data from multiple sources. If your data flow has multiple data paths that do not overlap, then you should consider using separate Data Flow Tasks (if they need to be run “together”, put them in the same Sequence Container).