Webcast: Developing Extensions for SSIS | 2013-05-22

Andy Leonard will be joining me to present a Developing Extensions for SSIS webcast on May 22nd. It’s been a long time since I’ve had the chance to present a code heavy session like this, and I’m really looking forward to it. I’ll be using this session as a dry run for a talk I’ll be giving at the DevTeach conference in Toronto the following week, so feedback will be much appreciated.

Andy has a great blog series about creating custom SSIS Tasks, and his company (Linchpin People) recently released an SSIS Event Task. I plan on expanding on the content in Andy’s blog posts a little bit, and talk about some of the key things you need to know when building an SSIS Data Flow component.

The abstract for the talk is available below, and the registration page can be found here.

 

Join Matt Masson and Andy Leonard for a discussion and demonstrations on extending SSIS with custom tasks and data flow components. This code-heavy session walks you through the creation of a custom SQL Server Integration Services (SSIS) Task and a custom Data Flow component. If you’ve ever thought of customizing SSIS functionality beyond what you can do with scripting, this talk is for you! All examples will be coded in C# and made available following the session.

Using SSIS Term Extraction to Process Search Terms

This blog uses WordPress, and I recently started using the Jetpack plugin. One of the features that Jetpack provides is a list of search terms that people used to arrive at your site. It doesn’t account for all views for your site (or even the majority of them), but it is interesting to see what people are looking for when they come to your site.

Yes, according to the site stats, most people are searching for my name when they arrive to my site. I’m not sure how I should feel about that.

From these full search queries, I wanted to see what the most common terms were. This blog post describes how I did this using the SSIS Term Extraction transform.

Extracting the Data

The first step was to actually get the list of search terms from my blog’s dashboard. When dealing with data on a webpage, I typically turn to Excel, and figured I could make use of the new Data Explorer add-in. Unfortunately, since my blog requires you to be logged in, Data Explorer ends up getting (silently) redirected to the login page, with no way to proceed. I had hit this issue before, so I tried to use the regular “From Web” functionality in Excel. This let me log into the site, but unfortunately, Excel didn’t recognize the data on the page as a table, which prevented me from creating a refreshable web query. Alas, I had to resort to the low tech approach of Copy & Paste, but it did the job.

I copied and pasted the data from the web page into Excel, and saved it as CSV.


Why CSV and not .xls/.xlsx? Simplicity. One advantage of reading from Excel is that you get some metadata and limited query support from the Access Database Engine provider (Jet/Ace). However, the intricacies of working with Excel typically make me prefer a straight CSV document. Unless I need to use a refreshable query, CSV is my preferred format for all data imports (it’s faster, too).

Processing with SSIS

I created a new SSIS package, added a flat file connection manager, and pointed it at my recently created CSV file. I specified the Text qualifier (Excel uses " (double quote) by default), and marked that the first row contains headers.


On the Advanced tab, I set the data types (DT_WSTR(500) and DT_I4).

Note: even though the text values in the CSV are DT_STR (ANSI strings – no special characters), I used DT_WSTR because that is what the Term Extraction transform supports. Alternatively, we could have used a Data Conversion or Derived Column transform to perform the conversion, but I’d rather do it directly in the Source component.

I then added a Data Flow Task containing a Flat File Source (using my new connection manager) and a Term Extraction transform. There’s not much configuration involved with the Term Extraction – you select the column you want to analyze, and (optionally) provide names for the output columns.


On the Advanced tab, I changed the Frequency threshold to 1 and the Maximum length of term to 20. Changing these values can affect the performance of the component, but since we’re processing a limited number of rows (about 500 total), I figured it was safe to do.


The Term Extraction transform redirects error rows by default, so you’ll get a warning if you don’t configure the error output path. Since we’re not expecting any errors in our list of simple terms, I configured the error output to Fail Component instead.


I then closed off the data flow to see the results with a data viewer.


Running the package gave me a breakdown of the common terms.


The more often a term appears in the input, the higher its score will be. This is really useful, but unfortunately it’s not completely representative of the actual search terms, since the values had already been aggregated (the “Views” value from the original data set). To get a more accurate number, I had to expand the results before processing them.

Expanding the Search Results with a Script Component

By default, Script Components are synchronous – among other things, this means they have the same number of rows going in as they have coming out. To expand the view counts correctly, we’ll need to add more rows to the data flow, which means making the Script Component asynchronous.

I added a Script Component (transform) right after my Flat File Source, and opened the editor.

I selected both input columns.


I went to the Inputs and Outputs tab, selected the single output, and set the SynchronousInputID value to None.

I added a column for the Term (DT_WSTR(500)).


I then opened the script editor to start entering the code.

The code is pretty simple. Using an asynchronous Script Component isn’t that different from a synchronous one – you just need to remember that you’re adding rows to a new buffer, rather than changing values in the current one. The only method I needed to worry about is the one that processes the incoming rows – Input0_ProcessInputRow.

For each incoming row, I want to look at the term (Search) and the number of times it appears (Views). I then programmatically add a row which contains the term to the output buffer for each time it was viewed.
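A minimal sketch of that logic, assuming the Search and Views input columns and the Term output column defined above (with the default Input0/Output0 buffer names):

```csharp
// Asynchronous Script Component: expand each aggregated (Search, Views)
// row into one output row per view.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Add one row to the output buffer for each time the term was
    // viewed, so the Term Extraction transform sees the original
    // (un-aggregated) search volume.
    for (int i = 0; i < Row.Views; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.Term = Row.Search;
    }
}
```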

Running the package after this change, I see that I have a lot more rows going into the Term Extraction transform than I did before (1043 vs. 498), and more accurate scores as a result.


That’s it – thanks for reading!

Consistent Design Patterns and SSIS Frameworks

Question: How do I make sure everyone at my company is using the same SSIS framework and design patterns?


SSIS Design Patterns and frameworks are one of my favorite things to talk (and write) about. A recent search on SSIS frameworks highlighted just how many different frameworks there are out there, and making sure that everyone at your company is following what you consider to be best practices can be a challenge.

In an ideal scenario, before you do any SSIS development, your team has an experienced SSIS consultant/trainer come in. They provide you with a framework that works for your environment and establishes a set of best practices that your team sticks to for all future package development, until the end of time.

This rarely happens.

It’s Too Easy

SSIS has a learning curve, but it’s fairly easy to get started with. You can throw together a simple workflow and ETL process running on your local machine in a short amount of time. It starts to get a bit more complicated when you need to deploy that solution to another machine, and even more so when you start adding multiple developers to the mix. This is where best practices become necessary.

The downside to SSIS being easy enough for people to learn on their own is that many people come up with their own way of doing things. Given that there are usually multiple ways to accomplish common ETL tasks, your company can end up with a mix of “best practices”, “frameworks”, and ways of using SSIS.

Did you know there are at least 5 different SSIS frameworks used internally at Microsoft? Each one has its own take on how to do configuration, dynamic package execution/generation, and logging. I wouldn’t be surprised if there are a number of others that I’m not aware of.

We improved this in SQL 2012 with the addition of the SSIS Catalog (SSISDB). In 2005/2008, the biggest differences between SSIS projects were typically in how they handled deployment and configuration. The Catalog was meant to provide a “default” best practice, and remove the need for teams to rely on their own custom frameworks. There’s still more we could do here, but given the adoption rates we’re seeing, it was definitely a step in the right direction.

When we asked our MVPs to create whitepapers for SQL 2008, we ended up with four configuration-related articles. Each one does things in a slightly different way.

So the question becomes, how do I make sure that the SSIS developers in my company are following best practices and using the framework we have in place?

Training plays a big role here. If you show your developers that there is a better way of doing things, they should be willing to adopt it. That said, you’ll often find that consistency is more important than doing things in a slightly better way. I know this sounds ironic coming from someone who worked on the product (*COUGH*three separate data type systems*COUGH*), but I give this advice to other people, speaking from experience.

Tools to Help Apply Best Practices

Tools can really help here, as well. There are two third party tools that come to mind:

BI xPress from Pragmatic Works has many different features that help here. The Auditing Framework helps with consistent logging, and their Templates, Snippets and Data Flow Nuggets functionality lets you apply and reuse patterns across your packages. They also recently added a Best Practices Analyzer that helps you maintain some consistency.

If you’re looking to apply design patterns and/or you require dynamic package generation, then Biml from Varigence is a great solution. It’s a powerful markup language/compiler that makes it really easy to create your SSIS packages in a consistent way. There is a free version of Biml integrated into BIDS Helper, and a more advanced version (and separate designer – Mist) available from Varigence. There is also a Biml repository site that is great for examples.

Error: The Microsoft Access database engine cannot open or write to the file

Issue – you run an SSIS package that uses an Excel Source or Excel Destination and get the following error:

[Connection manager "Excel Connection Manager"] Error: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80004005.

An OLE DB record is available. Source: "Microsoft Access Database Engine" Hresult: 0x80004005 Description: "The Microsoft Access database engine cannot open or write to the file ''. It is already opened exclusively by another user, or you need permission to view and write its data.".

Steps to resolve:

  1. Verify that the file isn’t actually open (check Task Manager for stray EXCEL.EXE processes)
  2. Verify that there are no other SSIS packages (or other background processes) that might be opening this file at the same time
  3. If you are in Visual Studio when this error happens, restart VS and try again

The third issue typically happens if you click the Preview button in the component editor. This loads the Excel file through the Access database engine, which can keep a reference to the file open until the designer is restarted.


Monitoring SSIS package executions

Question: When we run SSIS in development (from SSDT/BIDS), we can see the progress of the job, with all of the steps highlighted in yellow, red or green. Is there a way to display this same status information when the packages run in production?


You have a number of options when it comes to monitoring SSIS packages at runtime. In 2005/2008, you needed to create your own solution (based on SSIS logging, or custom logging approaches). In 2012, you have built-in reporting and monitoring from the SSIS Catalog (although the previous methods still work as well). There are a number of third party and open source logging frameworks and monitoring solutions, and of course you can always roll your own using the SSIS APIs.

A quick search turned up a number of options (some of which I wasn’t previously aware of):

SSIS Catalog (2012)
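As a quick illustration of this option: the Catalog exposes execution history through public views in the SSISDB database, which you can query from any client. A minimal sketch in C# (the connection string and TOP 10 filter are assumptions – adjust for your environment):

```csharp
using System;
using System.Data.SqlClient;

class CatalogMonitor
{
    static void Main()
    {
        // Assumed connection string – point this at your SSISDB instance
        const string connStr =
            "Data Source=localhost;Initial Catalog=SSISDB;Integrated Security=SSPI;";

        using (var conn = new SqlConnection(connStr))
        {
            conn.Open();

            // catalog.executions tracks every package execution;
            // status: 1=created, 2=running, 3=canceled, 4=failed,
            // 5=pending, 7=succeeded, 9=completed
            var cmd = new SqlCommand(
                @"SELECT TOP 10 execution_id, package_name, status, start_time
                  FROM catalog.executions
                  ORDER BY start_time DESC", conn);

            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    Console.WriteLine("{0}: {1} (status {2})",
                        reader["execution_id"],
                        reader["package_name"],
                        reader["status"]);
                }
            }
        }
    }
}
```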

All versions

Installing SSIS for Visual Studio 2012

Did you hear that SQL Server Data Tools – Business Intelligence (formerly known as BIDS) has been released for Visual Studio 2012? It was announced last month, and we’ve seen a lot of downloads so far. The download adds support for the SQL Server 2012 BI projects (Reporting Services, Analysis Services, and Integration Services) to VS 2012, or installs a standalone VS shell if you don’t already have VS 2012 installed.

Also note that the component was renamed – we added “Business Intelligence” to the end to distinguish it from the SQL Server Data Tools (Juneau). We now refer to it as SSDTBI, rather than “SSDT, no, not that one, the other one – you know, the one that comes with the SQL installation media, not the one you download”.

Installation

Download the installer from http://www.microsoft.com/en-us/download/details.aspx?id=36843.

When you run the installer, the SQL Server 2012 setup application will show up. Select the Perform a new installation of SQL Server 2012 option, and click Next.


On the next page, select the SQL Server Data Tools – Business Intelligence for Visual Studio 2012 option.


Complete the installation process, and then launch Visual Studio 2012. You’ll now be able to create new SSIS projects in all of VS 2012’s monochrome goodness.


Refresh an Excel Workbook with a Script Task

Are you loading data from an Excel workbook with external queries using SSIS? Do you want to make sure those queries are refreshed before the load takes place? Have I got the post for you! I recently had to do this myself, and after some googling with Bing, I found an MSDN forum post that describes how to refresh all of the queries within a workbook using an SSIS Script Task. I’ve adapted the code from the post, and you can find it below.

For the code to compile, you’ll need to add a reference to Microsoft.Office.Interop.Excel within your script task’s project.


Next, select the reference in the project explorer, and open the Properties window. You’ll need to disable the Embed Interop Types option. (This assumes you’re using SQL 2012, which supports .NET 4 … if you are using 2008 with .NET 2/3.5, you won’t have to do this.)


Include the following code in your Script Task. It iterates over all of the Excel connection managers within your package, loads each workbook, and calls RefreshAll() to refresh each of the queries in the workbook. The task then sleeps for 20 seconds (to give the query refresh time to complete) before exiting.
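A minimal sketch of that approach, assuming a SQL 2012 Script Task (Excel connection managers expose a CreationName of "EXCEL" and an ExcelFilePath property at runtime):

```csharp
using System.Threading;
using Microsoft.SqlServer.Dts.Runtime;
using Excel = Microsoft.Office.Interop.Excel;

public void Main()
{
    foreach (ConnectionManager cm in Dts.Connections)
    {
        // Only process Excel connection managers
        if (cm.CreationName == "EXCEL")
        {
            // The workbook path is stored in the ExcelFilePath property
            string filePath = cm.Properties["ExcelFilePath"].GetValue(cm).ToString();

            var excel = new Excel.Application();
            Excel.Workbook workbook = excel.Workbooks.Open(filePath);

            // Kick off a refresh of every external query in the workbook
            workbook.RefreshAll();

            // RefreshAll runs asynchronously by default, so give the
            // queries time to complete before saving and closing
            Thread.Sleep(20000);

            workbook.Save();
            workbook.Close();
            excel.Quit();
        }
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}
```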

I kept the code simple for this example, but there are a few different optimizations you can make here:

  • Add error handling
  • Pass in the Excel file name through a variable instead of iterating over each connection manager
  • Refresh a specific query within the workbook, rather than calling RefreshAll
  • Find your query within the workbook and set the BackgroundQuery property to False to avoid the asynchronous refresh call and the need to sleep (see the sketch after this list)
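For that last option, a sketch of the synchronous variant (replacing the RefreshAll/sleep portion above; this refreshes every query table rather than a specific one):

```csharp
// Disable BackgroundQuery so each refresh completes before returning,
// removing the need to sleep.
foreach (Excel.Worksheet sheet in workbook.Worksheets)
{
    foreach (Excel.QueryTable queryTable in sheet.QueryTables)
    {
        queryTable.BackgroundQuery = false;
        queryTable.Refresh(false); // false = refresh in the foreground
    }
}
```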

Using OUTPUT Parameters with the OLE DB Source

Did you know that support for OUTPUT parameters was added to the OLE DB Source in SSIS 2012? So many features were added in 2012 that this one is often overlooked. (I had completely forgotten about it until someone asked the other day…) This post has a quick walkthrough of how to use them.

For this example, we’ll be using the Production.uspGetList example from Books Online. It looks like this:
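The Books Online sample defines the procedure along these lines (note the two OUTPUT parameters):

```sql
CREATE PROCEDURE Production.uspGetList
    @Product varchar(40),
    @MaxPrice money,
    @ComparePrice money OUTPUT,
    @ListPrice money OUTPUT
AS
    SET NOCOUNT ON;
    SELECT p.[Name] AS Product, p.ListPrice AS 'List Price'
    FROM Production.Product AS p
    JOIN Production.ProductSubcategory AS s
        ON p.ProductSubcategoryID = s.ProductSubcategoryID
    WHERE s.[Name] LIKE @Product AND p.ListPrice < @MaxPrice;

    -- Populate the OUTPUT parameters
    SET @ListPrice = (SELECT MAX(p.ListPrice)
        FROM Production.Product AS p
        JOIN Production.ProductSubcategory AS s
            ON p.ProductSubcategoryID = s.ProductSubcategoryID
        WHERE s.[Name] LIKE @Product AND p.ListPrice < @MaxPrice);
    SET @ComparePrice = @MaxPrice;
GO
```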

We see there are four parameters total:

  1. Product
  2. MaxPrice
  3. ComparePrice (OUTPUT)
  4. ListPrice (OUTPUT)

In the OLE DB Source, we create a connection manager and set the Data access mode to SQL command. In the SQL statement, we can mark parameters with a question mark (?), or hard code values by entering them directly. For this example, we’ll provide a value for MaxPrice, and use variables for the others.
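The SQL command text would look something like this – 700 is the hard-coded MaxPrice value, and each question mark is a mapped parameter:

```sql
EXEC [Production].[uspGetList] ?, 700, ? OUTPUT, ? OUTPUT
```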

 


We can map the parameters by clicking the Parameters… button.


Some important caveats:

  • You must use the parameter name, rather than the ordinal, when mapping the output parameters.
  • All OUTPUT parameter variables should be of type Object. This is because the provider will likely attempt to set the variable to NULL (DbNull) prior to executing the stored procedure, and only SSIS Object variables support DbNull values. This means you’ll need to cast the variable value to the appropriate type before you use it (see the sketch after this list).
  • The OUTPUT value will be written to the variable when the data flow completes (i.e. in the PostExecute phase). This means you can use the variables in a control flow task after your Data Flow, but not within the same Data Flow.
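For that cast, a minimal sketch in a downstream Script Task (the User::ComparePrice variable name is an assumption for illustration):

```csharp
// Read the Object-typed variable holding the OUTPUT parameter value
// and convert it to its real type before use.
object raw = Dts.Variables["User::ComparePrice"].Value;
decimal comparePrice = Convert.ToDecimal(raw);
```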

 


Slides from SSIS with Oracle talk | TechDays Hong Kong 2013

The slides from the Using SQL Server Integration Services with Oracle talk at TechDays Hong Kong 2013 can be found below (use the embedded control to download the PowerPoint presentation directly from SkyDrive). Thank you to everyone who attended!

 

Generate Data with a Script Component

I use the following code to generate data with a Script Component. I’ve found this useful in the past to demonstrate how buffers are handled in the data flow, as well as for perf/load testing a destination component. The script automatically generates data based on the columns you’ve defined for the component – no additional code modifications are needed. For optimal performance it uses static values, but it could be updated to support randomly generated values as well (it does support random ints – look at the code for details).

To use it, do the following:

  • Add a new Script Component, and make it a Source
  • Bring up the component UI
  • On Inputs and Outputs, define your columns
  • When you are done, click OK to save the component. I know, you haven’t entered the script yet, but it’s a good idea to save your component at this point. If for some reason you cancel out of the dialog, or hit an issue with the VSTA editor, you could lose all of the columns you just spent time defining.


 

  • Open the UI again
  • (Optional) If you want to be able to control the number of rows that get generated, define an Integer variable called MaxRowCount and add it as a ReadOnly variable
  • Click the Edit Script… button
  • When the editor opens, replace your main.cs file with the following code
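The full script is in the download below; as a simplified sketch of its shape (the real version discovers your output columns automatically, while the Name and Value columns here are assumptions for illustration):

```csharp
// Simplified source Script Component: generate MaxRowCount rows of
// mostly static data (plus a random int column) into Output0.
public override void CreateNewOutputRows()
{
    // Use the optional read-only variable described above
    int maxRows = Variables.MaxRowCount > 0 ? Variables.MaxRowCount : 10000;
    var random = new Random();

    for (int i = 0; i < maxRows; i++)
    {
        Output0Buffer.AddRow();
        Output0Buffer.Name = "static value"; // assumed DT_WSTR column
        Output0Buffer.Value = random.Next(); // assumed DT_I4 column
    }
}
```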

Download the script file here: main.cs.zip