SSIS, DQS and MDS Training at SQL Saturday #229 | Dublin, Ireland

I’m honored to be invited back to this year’s SQL Saturday Dublin event. I must have done a decent job last year on my SSIS Design Patterns pre-conference session, because I’ve been asked to do another one this time around as part of their Training Day (June 21st).

This year I’ll be doing a full-day training session on Enterprise Information Management, which greatly expands upon one of my more popular talks about combining SSIS, DQS and MDS. The session will include some advanced SSIS topics (such as automation and dynamic package generation), and some of the main SSIS design patterns for data warehousing. The abstract is included below:

Enterprise Information Management (EIM) is an industry term for technologies to manage your data for integration, quality, and governance. This full day training session will show you how Integration Services (SSIS), Data Quality Services (DQS), and Master Data Services (MDS) work together to provide a comprehensive EIM solution in SQL Server 2012. Focusing on Data Warehousing scenarios, we’ll explore the key steps needed to build such a solution, from the beginning (planning, benchmarking), to data curation (cleansing, matching, managing dimensions), to data loading using SSIS. We’ll also touch on some advanced data warehousing ETL design patterns, such as change data capture (CDC), slowly changing dimension processing, and automation.

 

Course Modules

  • Data cleansing and matching
  • Reference data management
  • SSIS design patterns
  • Incremental data loading
  • SSIS deployment, management, and monitoring
  • Automation

Registration details can be found on the Prodata site.

The session schedule hasn’t been posted yet, but a number of people submitted interesting SSIS sessions to the conference. I see a lot of big names from the SQL community, and I expect there will be an awesome turnout (just like last time).

I submitted a brand new session entitled Cats, Facebook, and Online Dating with Microsoft BI that I hope will be accepted. I’ll show the SSIS demos I’ve done for the BI Power Hour (2011 | 2012), how I built them, and the lessons that I learned throughout. I’ll try not to make it too egotistical (“Hey look at me and how I made people laugh that one time!”), and try to provide content that attendees will find useful.

Oh, who am I kidding? I’ll totally be showing off Mr. Wiggles.

Hope to see you there!

Using SSIS Term Extraction to Process Search Terms

This blog uses WordPress, and I recently started using the Jetpack plugin. One of the features that Jetpack provides is a list of the search terms people used to arrive at your site. It doesn’t account for all views of your site (or even the majority of them), but it is interesting to see what people are looking for when they come to your site.

Yes, according to the site stats, most people are searching for my name when they arrive at my site. I’m not sure how I should feel about that.

From these full search queries, I wanted to see what the most common terms were. This blog post describes how I did this using the SSIS Term Extraction transform.

Extracting the Data

The first step was to actually get the list of search terms from my blog’s dashboard. When dealing with data on a webpage, I typically turn to Excel, and figured I could make use of the new Data Explorer add-in. Unfortunately, since my blog requires you to be logged in, Data Explorer ends up getting (silently) redirected to the login page, with no way to proceed. I had hit this issue before, so I tried to use the regular “From Web” functionality in Excel. This let me log into the site, but unfortunately, Excel didn’t recognize the data on the page as a table, which prevented me from creating a refreshable web query. Alas, I had to resort to the low tech approach of copy & paste, but it did the job.

I copied and pasted the data from the web page into Excel, and saved it as CSV.


Why CSV and not .xls/.xlsx? Simplicity. One advantage of reading from Excel is that you get some metadata and limited query support from the Access Database Engine provider (Jet/Ace). However, the intricacies of working with Excel typically make me prefer a straight CSV document. Unless I need to use a refreshable query, CSV is my preferred format for all data imports (it’s faster, too).

Processing with SSIS

I created a new SSIS package, added a flat file connection manager, and pointed it at my recently created CSV file. I specified the text qualifier (Excel uses a double quote (") by default), and marked that the first row contains headers.


On the Advanced tab, I set the data types (DT_WSTR(500) and DT_I4).

Note that even though the text values in the CSV are DT_STR (ANSI strings – no special characters), I used DT_WSTR because that is what the Term Extraction transform supports. Alternatively, we could have used a Data Conversion or Derived Column transform to perform the conversion, but I’d rather do it directly in the Source component.

I then added a Data Flow Task containing a Flat File Source (using my new connection manager) and a Term Extraction transform. There’s not much configuration involved with the term extraction – you select the column you want to analyze, and (optionally) provide names for the output columns.


On the Advanced tab, I changed the Frequency threshold to 1 and the Maximum length of term to 20. Changing these values can affect the performance of the component, but since we’re processing a limited number of rows (about 500 total), I figured it was safe to do.


The Term Extraction transform redirects error rows by default, so you’ll get a warning if you don’t configure the output path. Instead, I configured the error output to Fail Component, since we’re not expecting any errors in our list of simple terms.


I then closed off the data flow to see the results with a data viewer.


Running the package gave me a breakdown of the common terms.


The more often a term appears in the input, the higher its score will be. This is really useful, but unfortunately it’s not completely representative of the actual search terms, since the values had already been aggregated (the “Views” value from the original data set). To get a more accurate number, I had to expand the results before processing them.

Expanding the Search Results with a Script Component

By default, Script Components are synchronous – among other things, this means they have the same number of rows coming out as they have going in. To expand the view counts correctly, we’ll need to add more rows to the data flow. This means I’ll need to make the Script Component asynchronous.

I added a Script Component (transform) right after my Flat File Source, and opened the editor.

I selected both input columns.


I went to the Inputs and Outputs tab, selected the single output, and set the SynchronousInputID property to None.

I added a column for the Term (DT_WSTR(500)).


I then opened the script editor to start entering the code.

The code is pretty simple. Using an asynchronous Script Component isn’t that different from a synchronous one – you just need to remember that you’re adding rows to a new buffer, rather than changing values in the current one. The only method I needed to worry about is the one that processes the incoming rows – Input0_ProcessInputRow.

For each incoming row, I want to look at the term (Search) and the number of times it appears (Views). I then programmatically add a row containing the term to the output buffer for each time it was viewed.
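Here’s roughly what that method ends up looking like – a minimal sketch, assuming the default buffer name (Output0Buffer) and the column names from my file (Search, Views, Term); yours may differ:

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Emit one output row per view, so each search term appears in the
        // data flow as many times as it was actually searched for.
        for (int i = 0; i < Row.Views; i++)
        {
            Output0Buffer.AddRow();
            Output0Buffer.Term = Row.Search;
        }
    }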

Running the package after this change, I see that I have a lot more rows going into the Term Extraction transform than I did before (1043 vs. 498), and more accurate scores as a result.


That’s it – thanks for reading!

Consistent Design Patterns and SSIS Frameworks

Question: How do I make sure everyone at my company is using the same SSIS framework and design patterns?


SSIS Design Patterns and frameworks are one of my favorite things to talk (and write) about. A recent search on SSIS frameworks highlighted just how many different frameworks there are out there, and making sure that everyone at your company is following what you consider to be best practices can be a challenge.

In an ideal scenario, before you do any SSIS development, your team has an experienced SSIS consultant/trainer come in. They provide you with a framework that works for your environment and establish a set of best practices that your team sticks to for all future package development, until the end of time.

This rarely happens.

It’s Too Easy

SSIS has a learning curve, but is fairly easy to get started with. You can throw together a simple workflow and ETL process running on your local machine in a short amount of time. It starts to get a bit more complicated when you need to deploy that solution to another machine, and even more so when you start adding multiple developers to the mix. This is where best practices start to become necessary.

The downside to SSIS being easy enough for people to learn on their own is that many people come up with their own way of doing things. Given that there are usually multiple ways to accomplish common ETL tasks, your company can end up with a mix of “best practices”, “frameworks”, and ways of using SSIS.

Did you know there are at least 5 different SSIS frameworks used internally at Microsoft? Each one has its own take on how to do configuration, dynamic package execution/generation, and logging. I wouldn’t be surprised if there are a number of others that I’m not aware of.

We improved this in SQL 2012 with the addition of the SSIS Catalog (SSISDB). In 2005/2008, the biggest differences between SSIS projects were typically the way they handled deployment and configuration. The Catalog was meant to provide a “default” best practice, and remove the need for teams to rely on their own custom frameworks. There’s still more we could do here, but given the adoption rates we’re seeing, it was definitely a step in the right direction.

When we asked our MVPs to create whitepapers for SQL 2008, we ended up with four configuration-related articles. Each one does things in a slightly different way.

So the question becomes, how do I make sure that the SSIS developers in my company are following best practices and using the framework we have in place?

Training plays a big role here. If you show your developers that there is a better way of doing things, they should be willing to adopt it. That said, you’ll often find that consistency is more important than doing things in a slightly better way. I know this sounds ironic coming from someone who worked on the product (*COUGH* three separate data type systems *COUGH*), but I give this advice to other people, speaking from experience.

Tools to Help Apply Best Practices

Tools can really help here, as well. There are two third-party tools that come to mind:

BI xPress from Pragmatic Works has many different features that help here. The Auditing Framework helps with consistent logging, and their Templates, Snippets and Data Flow Nuggets functionality lets you apply and reuse patterns across your packages. They also recently added a Best Practices Analyzer that helps you maintain some consistency.

If you’re looking to apply design patterns and/or you require dynamic package generation, then Biml from Varigence is a great solution. It’s a powerful markup language/compiler that makes it really easy to create your SSIS packages in a consistent way. There is a free version of Biml integrated into BIDS Helper, and a more advanced version (and separate designer – Mist) available from Varigence. There is also a Biml repository site that is great for examples.

Presenting to the .NET Montreal community group | May 27th

The fine folks over at Communauté .NET Montréal have invited me to speak at their meeting on May 27th, 2013. I’ll be using this as a practice run for my sessions at the DevTeach Toronto event later that week. Come join us at the Microsoft Montreal office if you’re interested in finding out more about:

Data Quality Services (DQS) – End to End
Microsoft’s SQL Server Data Quality Services (DQS) is a unique solution that is based on the creation and maintenance of Data Quality Knowledge Bases (DQKB) and the ability to use them efficiently for a variety of Data Quality improvements. In this session we’ll walk through the creation of a DQS solution, discuss the main concepts behind the creation of the DQKB, and how to use DQS in various scenarios and activities.

 

Developing Extensions for SSIS
This code-heavy session walks you through the creation of a custom SQL Server Integration Services (SSIS) Task, and a custom Data Flow component. If you’ve ever thought of customizing SSIS functionality beyond what you can do with scripting, this talk is for you! All examples will be coded in C#, and made available following the session.
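To give a taste of what the extensibility API involves, here’s a minimal custom Task skeleton – a sketch only (the class name and messages are illustrative, not the actual session examples), built on the base class and method signatures from Microsoft.SqlServer.Dts.Runtime:

    using Microsoft.SqlServer.Dts.Runtime;

    // Minimal custom task sketch. A real task also needs to be signed and
    // installed where SSIS can find it (the GAC and the DTS Tasks folder).
    [DtsTask(DisplayName = "Hello World Task", Description = "A minimal custom SSIS task.")]
    public class HelloWorldTask : Task
    {
        public override DTSExecResult Validate(Connections connections,
            VariableDispenser variableDispenser, IDTSComponentEvents componentEvents, IDTSLogging log)
        {
            // Nothing to validate in this sketch.
            return DTSExecResult.Success;
        }

        public override DTSExecResult Execute(Connections connections,
            VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
            IDTSLogging log, object transaction)
        {
            bool fireAgain = true;
            componentEvents.FireInformation(0, "HelloWorldTask",
                "Hello from a custom task!", string.Empty, 0, ref fireAgain);
            return DTSExecResult.Success;
        }
    }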

 

Registration details can be found here.

Using Data Explorer with SSIS

Data Explorer is a great new add-in for Excel. It provides a simple interface over a powerful data transformation engine (powered by M), and it has become my tool of choice any time I need to get data into Excel. In this post I show how you can use Data Explorer to do some initial data shaping, and then automate the data loading process through SSIS.

In a previous post, I showed how I used Data Explorer to grab the list of upcoming SQL Saturday events. When the Data Explorer query is executed, the data is downloaded, transformed, and stored in the Excel sheet.


We can read this data in SSIS using a Data Flow Task with an Excel Source component.

After we add the Excel Source, double-click it to bring up the component UI. We can create a new connection manager, and point it at the saved Excel file (.xlsx).


Note that even if you use Office 2010 or 2013 to create the file, you should select the “Microsoft Excel 2007” version (which was when the .xlsx format was introduced, replacing the .xls binary format).

On the Excel Source, we can now select “Table or view” as the data access mode. The drop-down list will contain all of the sheets within our workbook. We can either select the name of the sheet (in this case, Data), or the data query (Data$ExternalData_1). Since the result of the query is the only thing on this sheet, both come out to the same thing in this instance.


On the Mapping tab, we can see that the four columns from the workbook have been read.


We can click OK to save the settings, and then set up the rest of the data flow. Running the package should show us that all the rows have been read.


(Yes, that is a custom made Trash destination. I am pretty fancy.)

As we can see, we can grab the data from the workbook using SSIS. The problem is that this data is a static copy – reading it doesn’t automatically re-run the Data Explorer query that we used in the workbook. Thankfully, we can work around that using a Script Task – see my previous post about dynamically refreshing a workbook. Go ahead and read that post real quick – I’ll wait.

The refresh code in my previous post updates the Excel file in place. If we watch Task Manager, we can see an EXCEL.EXE process running in the background.
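For reference, the refresh logic boils down to something like the following: a minimal sketch using the Excel interop API from a Script Task (see the earlier post for the full version; the method name and the CalculateUntilAsyncQueriesDone call are my own simplifications):

    using Excel = Microsoft.Office.Interop.Excel;

    // Minimal sketch: open a workbook, refresh all of its queries, and save it.
    // Assumes Excel (and the Data Explorer add-in) is installed on the machine
    // running the package.
    public void RefreshWorkbook(string path)
    {
        var excel = new Excel.Application();
        excel.DisplayAlerts = false;

        Excel.Workbook workbook = excel.Workbooks.Open(path);
        try
        {
            workbook.RefreshAll();                   // kick off all query refreshes
            excel.CalculateUntilAsyncQueriesDone();  // wait for async queries to finish
            workbook.Save();
        }
        finally
        {
            workbook.Close(false);
            excel.Quit();
        }
    }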


Important – if you run the package with the refresh code added and receive a COMException (“Class not registered”) error, you’re probably missing the Data Explorer add-in. Data Explorer uses a custom OLE DB provider internally, which requires you to install the add-in on the machine that is running your SSIS package.

If we open the file after the package runs, we should see the updated values from the refreshed query.


Closing Thoughts

Anyone familiar with SSIS will see some overlap between the transformations you can do in Data Explorer and what you can do in the SSIS Data Flow. At this point, Data Explorer is a tool for getting data into Excel, not what most people would consider a full-fledged ETL platform. However, underneath the fancy, easy-to-use interface is a powerful transformation engine (driven by the M language), and I can see a lot of potential applications for it. While the approach I describe above does get the products working together, I can imagine that closer integration between the products will be coming at some point in the future – but that’s pure speculation on my part.

What do you think? How would you like to see the products integrated?

Error: The Microsoft Access database engine cannot open or write to the file

Issue – you run an SSIS package that uses an Excel Source or Excel Destination and get the following error:

[Connection manager "Excel Connection Manager"] Error: SSIS Error Code DTS_E_OLEDBERROR. An OLE DB error has occurred. Error code: 0x80004005.

An OLE DB record is available. Source: "Microsoft Access Database Engine" Hresult: 0x80004005 Description: "The Microsoft Access database engine cannot open or write to the file ”. It is already opened exclusively by another user, or you need permission to view and write its data.".

Steps to resolve:

  1. Verify that the file isn’t actually open (check Task Manager for suspicious EXCEL.EXE processes)
  2. Verify that there are no other SSIS packages (or other background processes) that might be opening this file at the same time
  3. If you are in Visual Studio when this error happens, restart VS and try again

The third issue typically happens if you click the Preview button on the component editor. This loads the Excel file through the Access database engine, and can keep a reference to the file open until the designer is restarted.


Monitoring SSIS package executions

Question: When we run SSIS in development (from SSDT/BIDS), we can see the progress of the job, with all of the steps highlighted in yellow, red or green. Is there a way to display this same status information when the packages run in production?


You have a number of options when it comes to monitoring SSIS packages at runtime. In 2005/2008, you needed to create your own solution (based on SSIS logging, or custom logging approaches). In 2012, you have built-in reporting and monitoring from the SSIS Catalog (although the previous methods still work as well). There are a number of third-party and open source logging frameworks and monitoring solutions, and of course you can always roll your own using the SSIS APIs.
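If you’re on 2012 and want to roll your own status display, the catalog.executions view in SSISDB exposes the state of each execution. Here’s a minimal sketch (the connection string is a placeholder, and you should verify the status code mapping against your own SSISDB):

    using System;
    using System.Data.SqlClient;

    // Minimal sketch: list the most recent package executions from the SSIS Catalog.
    class CatalogMonitor
    {
        static void Main()
        {
            // Placeholder connection string - point it at your SSISDB instance.
            const string connectionString =
                "Data Source=localhost;Initial Catalog=SSISDB;Integrated Security=SSPI;";
            const string query = @"
                SELECT TOP (20) execution_id, folder_name, project_name, package_name, status
                FROM catalog.executions
                ORDER BY execution_id DESC;";

            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(query, connection))
            {
                connection.Open();
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // status: 1=created, 2=running, 3=canceled, 4=failed, 5=pending,
                        // 6=ended unexpectedly, 7=succeeded, 8=stopping, 9=completed
                        Console.WriteLine("{0} {1}\\{2}\\{3} status={4}",
                            reader["execution_id"], reader["folder_name"],
                            reader["project_name"], reader["package_name"], reader["status"]);
                    }
                }
            }
        }
    }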

A quick search turned up a number of options (some of which I wasn’t previously aware of):

SSIS Catalog (2012)

All versions

Installing SSIS for Visual Studio 2012

Did you hear that SQL Server Data Tools – Business Intelligence (formerly known as BIDS) has been released for Visual Studio 2012? It was announced last month, and we’ve seen a lot of downloads so far. The download adds support for the SQL Server 2012 BI projects (Reporting Services, Analysis Services, and Integration Services) to VS 2012, or installs a standalone VS shell if you don’t already have VS 2012 installed.

Also note that the component was renamed – we added “Business Intelligence” to the end to distinguish it from the SQL Server Data Tools (Juneau). We now refer to it as SSDTBI, rather than “SSDT, no, not that one, the other one – you know, the one that comes with the SQL installation media, not the one you download”.

Installation

Download the installer from http://www.microsoft.com/en-us/download/details.aspx?id=36843.

When you run the installer, the SQL Server 2012 setup application will show up. Select the Perform a new installation of SQL Server 2012 option, and click Next.


On the next page, select the SQL Server Data Tools – Business Intelligence for Visual Studio 2012 option.


Complete the installation process, and then launch Visual Studio 2012. You’ll now be able to create new SSIS projects in all of VS 2012’s monochrome goodness.
