PASS Summit 2013

Wow, has it really been two months since my last blog post? I guess I’ve been busy!

I’ve been hard at work on SSIS and some components for Power BI. I hope to blog about all of the fun projects I’ve been working on this past year, but in case I don’t get a chance, you can come hear me talk all about it at this year’s PASS Summit!

Yes boys and girls, I’ve once again tricked the conference organizers into letting me present at the Summit. No SSIS pre-cons this year, but I’ll be doing a total of 4 sessions that span SSIS and Power BI topics.

My favorite session – the BI Power Hour – will once again be educating audiences with a set of carefully selected, thought-provoking demos that will in no way revolve around cats or tapping things. We’re honored to be a Spotlight session this year, and we have a killer lineup of presenters. We’re hoping to have enough seats for everyone this time (rather than having people sitting on the floor and in the aisles like last year) – but you might want to get there early just in case.

I’ll be presenting an SSIS Roadmap session, showing off some of the cool new things we’ve been working on.

I’ll also be co-presenting a couple of Power BI sessions – an all-up, end-to-end session (Power BI – This is How We Do It) with many of my co-presenters from the BI Power Hour, and a session that takes a deeper look at the components that allow corporate data source access from SharePoint Online (Power BI and the Data Management Gateway).

I’m really looking forward to another great Summit – hope to see you in Charlotte!

Filtering in Data Explorer

I spent last week manning the Data Discovery & Visualization and Data Warehousing booths at TechEd Europe, and we saw lots of excitement over Data Explorer. One of the questions I got was about how Data Explorer does its filtering, and I realized there hasn’t been much shared about this yet. The general assumption seems to be that Data Explorer pulls all rows into Excel and then performs its filtering in memory (as that’s how you build your steps in the UI), but in fact it’s a lot smarter than this – it will automatically push filters directly to the source query. The team calls this technique “Query Folding”. It is an extremely powerful feature, especially in a “self service ETL” tool where many users aren’t thinking about query performance. Unfortunately, it’s not immediately obvious that this feature exists unless you monitor the underlying queries it runs – so let’s do that now.

Filters

From Data Explorer, we’ll connect to a SQL Server instance:

image


We’ll read from the DimGeography dimension of the AdventureWorksDW sample:

image


Take the first five columns:

  • GeographyKey
  • City
  • StateProvinceCode
  • StateProvinceName
  • CountryRegionCode


image


Click Done to load the data into Excel. We’ll then launch SQL Profiler and connect it to our SQL instance. Once Profiler is running, we can click the Refresh button in Excel and see the queries that get executed:

image

We can see a (surprisingly large) number of queries against SQL Server. The first “set” of queries fetches metadata about data types, columns, and indexes (you’ll see lots of SELECT statements from sys.objects and sys.indexes). We’ll ignore these for now (but they’d make a great topic for a future post).

After retrieving metadata information, we see a query against the DimGeography table.

Already we can see that Data Explorer is smarter than the average query tool. Even though we selected the full table in the UI before hiding certain columns, the source query contains only the columns we want.
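The generated statement looks something like this (I’m paraphrasing slightly – the exact aliasing and formatting vary between builds):

select [GeographyKey], [City], [StateProvinceCode], [StateProvinceName], [CountryRegionCode]
from [dbo].[DimGeography]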

Let’s open the query again in Excel by clicking the Filter & Shape button.

With the query open, we’ll add a filter on the CountryRegionCode field. Click on the field title, select Text Filters | Equals …

image

We’ll filter on CA.

image

Leaving us with our records from Canada.

image

We can save the query by clicking Done, clear the current Profiler trace, and refresh the workbook to see the updated SQL query.

We see the query has gotten a bit more complicated, but it now contains the ‘CA’ filter we specified in the UI.
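Paraphrasing again, the folded query looks roughly like this – the original column selection gets wrapped in a derived table, and our filter becomes a WHERE clause that runs on the server:

select [_].[GeographyKey], [_].[City], [_].[StateProvinceCode], [_].[StateProvinceName], [_].[CountryRegionCode]
from
(
    select [GeographyKey], [City], [StateProvinceCode], [StateProvinceName], [CountryRegionCode]
    from [dbo].[DimGeography]
) as [_]
where [_].[CountryRegionCode] = 'CA'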

Sources and Other Types of Folding

Data Explorer isn’t able to do query folding for every source (i.e. there is no “query” to fold into when reading from a flat file), but it does so where it can. Here is an (unofficial) list of supported sources from the Data Explorer team:

  • SQL Databases
  • OData and OData based sources, such as the Windows Azure Marketplace and SharePoint Lists
  • Active Directory
  • HDFS.Files, Folder.Files, and Folder.Contents (for basic operations on paths)

I should also note that “filters” aren’t the only type of query folding that Data Explorer can do. If you watch the queries, you’ll see that other operations, such as column removal, renaming, joins, and type conversions, are pushed as close to the source as possible. (I’ll explore this in a future post.)


Preview warning – this information is current for the June 2013 build of the Data Explorer preview (Version: 1.2.3263.4). The information contained in this post may change prior to RTM.

Server Execution Parameters with DTEXEC

The SSIS team blogged about executing packages deployed to the SSIS Catalog using DTEXEC a while ago. The post mentions the $ServerOption::SYNCHRONIZED parameter as a way to control whether the execution is synchronous or asynchronous, but there are some other server options you can set as well. Phil Brammer actually blogged about the options last year. You can also see the full list of options when you view the SSIS Catalog Execution report (note the Parameters Used section in the screenshot below).

image

These options can be specified on the command line when you run a catalog package with DTExec. For example, to change the logging level to VERBOSE (3) for a specific execution, you’d add the following to your dtexec command:

/Par "$ServerOption::LOGGING_LEVEL(Int32)";3
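Put together with the rest of the command, a full invocation might look like this (the package path and server name below are placeholders – substitute your own):

dtexec /ISServer "\SSISDB\MyFolder\MyProject\MyPackage.dtsx" /Server "localhost" /Par "$ServerOption::LOGGING_LEVEL(Int32)";3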

More information on logging levels can be found here.

SQL Saturday Dublin – this weekend!

Less than a week to go before the next SQL Saturday Dublin event!

I’ll be presenting a fun session at the main event (Cats, Facebook, and Online Dating with Microsoft BI), and doing a full day training session this Friday (6/21). If you are interested in learning more about SSIS, DQS, and MDS, please see the Prodata site for details.

Hope to see you there!



Quick Tip: How can I tell when my DQS knowledge base has been updated?

You can improve data quality in your SSIS data flows by using the DQS Cleansing transform (new in SQL 2012). When you’re using DQS, the data is cleansed according to the rules you’ve built up in the DQS knowledge base. As you update the knowledge base with better rules, your overall data quality goes up. Therefore, you might find that you want to re-process data that’s already been through DQS anytime you make a significant update to the knowledge base.

Think of your data as “water”, and your knowledge base as a “water filter”… as your filter improves in quality (you improve your knowledge base), the water (data) you run through it gets cleaner.

I usually recommend storing the date value of when the data was processed with DQS. This could be a column value, if you want to track things at the row level, or an entry in a separate file/table, if you’re tracking it at the table/DB level. (I describe this pattern in detail in the SSIS Design Patterns book.)

While there is no official notification system or API to know when a DQS knowledge base has been updated, the DQS database does contain a PUBLISH_DATE field that you can query. This won’t tell you which individual domains were touched, or what was actually modified, but given that you probably won’t be publishing a new version of your knowledge base without a good reason, it should be good enough for most automated data cleansing scenarios.

Query the Publish Date for a DQS Knowledge Base

We’re interested in the [dbo].[A_KNOWLEDGEBASE] table in the [DQS_MAIN] database. This table lists all of the knowledge bases in your system, as well as any active projects. You can use the following query to get the last time the knowledge base was updated (replace the [Name] clause with the name of your KB).
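A minimal version of that query looks like this (I’m using a made up knowledge base name here – substitute your own):

SELECT [NAME], [PUBLISH_DATE]
FROM [DQS_MAIN].[dbo].[A_KNOWLEDGEBASE]
WHERE [NAME] = N'My Knowledge Base'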

Bulk Loading into MDS using SSIS

Each entity in SQL Server 2012 Master Data Services (MDS) has its own staging table (stg.<name>_Leaf). Using this staging table, you can create, update, deactivate and delete leaf members in bulk. This post describes how to bulk load into an entity staging table and trigger the stored procedure to start the batch import process.

Staging Tables and Stored Procedures

The new entity-based staging tables are an excellent feature in MDS 2012, and make it very easy to bulk load into MDS from SSIS. If you take a look at the SQL database used by your MDS instance, you’ll see at least one table in the stg schema for each entity. For this example I’ve created a Suppliers entity, and I see a matching table called [stg].[Suppliers_Leaf]. If your entity is using hierarchies, you will have three staging tables (see BOL for details). If we expand the columns, we’ll see that all of the attributes have their own columns, as well as some system columns that every staging table will have.

image

Each staging table will also have a stored procedure that is used to tell MDS that new data is ready to load. Details of the arguments can be found in BOL.

image

Import Columns

To load into this table from SSIS, our data flow will need to do the following:

  • Set a value for ImportType (see below)
  • Set a value for BatchTag
  • Map the column values in the data flow to the appropriate attribute columns

See the Leaf Member Staging Table BOL entry for details on the remaining system columns. If your Code value isn’t set to be generated automatically, then you’d also need to specify it in your data flow. Otherwise, the default fields can be safely ignored when we’re bulk importing.

The BatchTag column is used as an identifier in the UI – it can be any string value, as long as it’s unique (and under 50 characters).

MDS uses the same staging table for creating, updating and deleting members. The ImportType column indicates which action you want to perform. The possible values are listed below.


  • 0 – Create new members. Replace existing MDS data with staged data, but only if the staged data is not NULL. NULL values are ignored. To change a string attribute value to NULL, set it to ~NULL~. To change a number attribute value to NULL, set it to -98765432101234567890. To change a datetime attribute value to NULL, set it to 5555-11-22T12:34:56.
  • 1 – Create new members only. Any updates to existing MDS data fail.
  • 2 – Create new members. Replace existing MDS data with staged data. If you import NULL values, they will overwrite existing MDS values.
  • 3 – Deactivate the member, based on the Code value. All attributes, hierarchy and collection memberships, and transactions are maintained but no longer available in the UI. If the member is used as a domain-based attribute value of another member, the deactivation will fail. See ImportType 5 for an alternative.
  • 4 – Permanently delete the member, based on the Code value. All attributes, hierarchy and collection memberships, and transactions are permanently deleted. If the member is used as a domain-based attribute value of another member, the deletion will fail. See ImportType 6 for an alternative.
  • 5 – Deactivate the member, based on the Code value. All attributes, hierarchy and collection memberships, and transactions are maintained but no longer available in the UI. If the member is used as a domain-based attribute value of other members, the related values will be set to NULL. ImportType 5 is for leaf members only.
  • 6 – Permanently delete the member, based on the Code value. All attributes, hierarchy and collection memberships, and transactions are permanently deleted. If the member is used as a domain-based attribute value of other members, the related values will be set to NULL. ImportType 6 is for leaf members only.

When you are bulk loading data into MDS, you’ll use 0, 1 or 2 as the ImportType. To summarize the different modes:

  • Use 0 or 2 when you are adding new members and/or updating existing ones (i.e. doing a merge)
    • The difference between 0 and 2 is the way they handle NULLs when updating an existing member. With 0, NULL values are ignored (and require special handling if you actually want to set a NULL value). With 2, all values are replaced, even when the values are NULL.
  • Use 1 when you are only inserting new members. If you are specifying a code, then a duplicate value will cause the import to fail.
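To make the shape of a staged row concrete, here’s what a single hand-written row for my Suppliers entity might look like in T-SQL (in the package, these values come from the data flow instead – the Code, Name and BatchTag values here are just examples):

INSERT INTO [stg].[Suppliers_Leaf] ([ImportType], [BatchTag], [Code], [Name])
VALUES (2, N'Suppliers_20130601', N'SUP001', N'Contoso Ltd');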

Package Design

Your control flow will have at least two tasks:

  1. A Data Flow Task that loads your incoming data into the MDS staging table for your entity
  2. An Execute SQL Task that runs the staging table’s stored procedure, which tells MDS to start processing the batch

image

Your data flow will have (at least) three steps:

  1. Read the values you want to load into MDS
  2. Add the BatchTag and ImportType column values (using a derived column transform)
  3. Load into the MDS staging table

image

As noted above, in your OLE DB Destination you’ll need to map your data flow columns to your member attributes (including Code if it’s not auto-generated), the BatchTag value (which can be automatically generated via expression), and the ImportType.

image

After the Data Flow, you’ll run the staging table stored procedure.

The first three parameters are required:

  1. The version name (i.e. VERSION_1)
  2. Whether this operation should be logged as an MDS transaction (i.e. do you want to record the change history, and make the change reversible?)
  3. The BatchTag value that you specified in your data flow
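Putting that together for my Suppliers entity, the Execute SQL Task statement would look something like this (the procedure follows the stg.udp_<EntityName>_Leaf naming convention, and the BatchTag must match the value set in the data flow):

EXEC [stg].[udp_Suppliers_Leaf]
    @VersionName = N'VERSION_1',
    @LogFlag = 1,                      -- log the changes as MDS transactions
    @BatchTag = N'Suppliers_20130601'; -- same BatchTag the data flow loaded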


Additional resources:

Advanced SSIS Catalog presentation from TechEd North America 2013

Matthew Roche (blog | twitter) and I teamed up once again to present an advanced SSIS Catalog session at TechEd North America 2013 – Deep Inside the Microsoft SQL Server Integration Services Server. The video and slide deck are now available on the Channel9 site. The slide deck actually contains 10 additional slides that we didn’t have time to cover during the regular session (with some further details about the security model).

I want to extend a big thank you to everyone who attended, and for all the great feedback we got. It can be tough doing a 400 level SQL session at TechEd, and while I could see some people’s heads spinning, it sounded like most people were able to learn something new.

The TechEd team picked an excellent preview picture for the session (below). It comes from Matthew’s intro – you’ll have to watch the video to see how he worked a picture of kittens into a 400 level SSIS session.

image

If you didn’t already know, Channel 9 has many TechEd presentations available online. You can see recordings of my previous events on my speaker page, and Matthew’s as well.

Presentations Galore!

I’m mentally preparing myself for a whole slew of upcoming conferences / speaking engagements over the next month and a half. I love talking about SQL Server technologies, and I’m honored that I’ll have so many chances to do it in the next few weeks!

Please come see me at one (or more) of the following events:

Can I Automate SSIS Project Deployment?

Yes, yes you can. Scripted or automated deployment can be done in a number of ways in SQL Server 2012.

Integration Services Deployment Wizard

Yes, this is the primary UI tool for SSIS project deployment, but it can also be run silently on the command line. When you run through the wizard, the Review page will actually list all of the parameters you need to do the same deployment from the command line.

image

Run ISDeploymentWizard.exe /? from a command prompt, and you’ll see the full list of arguments it supports.

SSIS Catalog T-SQL API

The SSISDB [catalog] schema has a number of public stored procedures, including one that can be used for deployment. We even provide samples on how to use it in Books Online.
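As a rough sketch (the folder, project, and file names below are placeholders), a deployment through the T-SQL API looks like this:

DECLARE @ProjectBinary varbinary(max), @OperationId bigint;

-- read the .ispac produced by building the SSIS project
SELECT @ProjectBinary = CAST(BulkColumn AS varbinary(max))
FROM OPENROWSET(BULK 'C:\Deploy\MyProject.ispac', SINGLE_BLOB) AS ispac;

EXEC [SSISDB].[catalog].[deploy_project]
    @folder_name = N'MyFolder',
    @project_name = N'MyProject',
    @project_stream = @ProjectBinary,
    @operation_id = @OperationId OUTPUT;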

PowerShell

All SSIS Catalog operations can be automated through PowerShell. I previously blogged about a deployment script I use to setup my demos.

Custom Utility

The SSIS Catalog management object model (MOM) exposes a set of SMO classes you can use to code your own catalog utilities. You’ll want to use the CatalogFolder.DeployProject method to do the actual deployment. If SMO or .NET isn’t your thing, you can also code a custom utility which interacts directly with the T-SQL API.

Data Flow Optimization Checklist

image

I put this together for an SSIS performance presentation I did a while ago, and thought it warranted its own post.

In addition to looking for appropriate Design Patterns, I have a checklist of things to look for when trying to optimize an SSIS Data Flow. This list is by no means exhaustive, but it contains some of the more common causes of performance issues in the packages I’ve worked with.

The OLE DB Command transform operates on a row by row basis. Stage the data and perform the operation in batch mode using an Execute SQL Task instead.

Learn MERGE, and love it.
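For example, after landing the rows in a staging table with a regular OLE DB Destination, an Execute SQL Task can apply them in a single set-based statement – a minimal sketch, using made up staging and dimension table names:

MERGE [dbo].[DimCustomer] AS tgt
USING [stg].[Customer] AS src
    ON tgt.[BusinessKey] = src.[BusinessKey]
WHEN MATCHED THEN
    UPDATE SET tgt.[Name] = src.[Name], tgt.[City] = src.[City]
WHEN NOT MATCHED BY TARGET THEN
    INSERT ([BusinessKey], [Name], [City])
    VALUES (src.[BusinessKey], src.[Name], src.[City]);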

Do you really need that Sort transform? Sometimes you do. Most of the time you don’t. If you do need it (i.e. you are loading from a flat file, and need to do a merge), consider maxing out the DefaultBufferMaxRows setting so you can sort more data in memory (the sort will keep at most 5 active buffers in memory before it starts swapping to disk).

  • Lookup Transforms
    • Can the lookup be avoided?
    • Are you using the right cache mode?
    • Should you use a Cache Connection Manager?

Lots of related lookup posts can be found here.


Have fun optimizing!