Creating a Date Dimension with a Power Query Script

Chris Webb has previously blogged a similar approach, but I thought I’d share my own version of a Power Query function that generates a table of dates for use as a Date Dimension. I first saw Faisal Mohamood do this in our end-to-end Power BI talk at the PASS Summit, and I’ve been using it as a quick way to show off the versatility of the Power Query formula language.

The function is included below:

How do I use this?

From the Power Query ribbon, select From Other Data Sources -> Blank Query.

In the query editor, click on the View tab, then click the Advanced Editor button on the ribbon.

Replace the default script with the CreateDateTable script, and click Done.

The Query Editor will now show the function definition, with an Invoke button.

Clicking the button will bring up a dialog that lets you enter values for each of the parameters. These are:

  • StartDate – the first date in your table
  • EndDate – the last date in your table
  • Culture – the culture of the table (optional). This gets passed to the Date.ToText function to display the month / weekday names. See the National Language Support (NLS) API for details.

After you Apply & Close, you’ll have a neatly formatted Date Dimension that can be imported directly into Power Pivot.
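The CreateDateTable script itself isn’t reproduced in this excerpt, but the shape of the table it generates is easy to describe. As a rough sketch of the equivalent logic in Python (the column names here are illustrative, not necessarily the exact ones the M function emits):

```python
from datetime import date, timedelta

def create_date_table(start_date, end_date):
    """Generate one date-dimension row per day from start_date to
    end_date inclusive -- an illustrative sketch of what a typical
    CreateDateTable Power Query function produces."""
    rows = []
    current = start_date
    while current <= end_date:
        rows.append({
            "Date": current,
            "Year": current.year,
            "QuarterOfYear": (current.month - 1) // 3 + 1,
            "MonthOfYear": current.month,
            "DayOfMonth": current.day,
            # strftime gives locale-dependent names; the M function uses
            # the optional Culture parameter (via Date.ToText) instead.
            "MonthName": current.strftime("%B"),
            "DayOfWeekName": current.strftime("%A"),
        })
        current += timedelta(days=1)
    return rows

table = create_date_table(date(2014, 1, 1), date(2014, 12, 31))
```

The M version does the same enumeration with List.Dates and then adds the derived columns with Table.AddColumn.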

Generate an Empty Raw File Without Running your SSIS Package

A new feature was added to the Raw File Destination in SQL Server 2012 which allows you to output an empty raw file from the editor UI.

  1. Add a Raw File Destination to your data flow, and connect it to the previous transform
  2. Double click to open the Raw File Destination Editor
  3. Configure the file name
  4. Click on the Columns tab, and select the columns you want included
  5. Click on the Connection Manager tab
  6. Click the Generate initial raw file … button
  7. A dialog confirming the output columns will pop up. Click OK to create your file.

 


This functionality is handy when you are using raw files to stage data between data flows. You can use this empty raw file in the Raw File Source to configure the metadata it needs. In previous versions, you would have had to run the package to produce the initial file.

Another note – Raw Files were improved in SQL Server 2012 to include sort information, so now the IsSorted flag is automatically set in the Raw File Source.

 


Using Hive ODBC from SSIS

Microsoft announced the general availability of Windows Azure HDInsight last week, and released an updated version of the Microsoft Hive ODBC Driver. This post describes how to use this driver to pull data from your HDInsight cluster with an SSIS Data Flow Task.


Installing and Configuring the Hive ODBC Driver

The download page for the Hive ODBC driver provides a link to an article on how to configure the Hive ODBC driver to connect to HDInsight using Excel. The screenshots in the article are a bit out of date, but the procedure is essentially the same when using the driver from SSIS. Be sure to install both the 32bit and 64bit versions of the driver – they must be installed separately. If you only install the 64bit driver, you’ll get errors in Visual Studio (which is a 32bit process) when trying to configure your connection manager. If you only install the 32bit driver, you might get an error when you try to run the package in the designer (as the designer runs packages in 64bit mode by default).

Once the driver is installed, there are two ways to configure it – using a Data Source Name (DSN), or using a connection string. I recommend taking the connection string approach, but you might find using the DSN approach more convenient for your environment.

Data Source Name (DSN)

Using a DSN requires you to pre-register the data source (at the file/user/system level) using the Windows ODBC Data Source Administrator. You’ll then be able to reference this DSN entry by name from an SSIS Connection Manager. This essentially creates an alias for your data source – you can change where the DSN is pointing, and your packages continue to work. The downside to this approach is that you’ll need to make sure the DSN exists on every machine that will run the package.

There are two versions of the ODBC Data Source Administrator UI – one for 32bit (%windir%\SysWOW64\odbcad32.exe) and one for 64bit (%windir%\System32\odbcad32.exe). You’ll likely want to create both 32bit and 64bit DSNs – just make sure that the same name is used for both versions. At minimum, you’ll need to register a 32bit DSN to use when creating your SSIS package in the designer.

When creating the DSN, select the Microsoft Hive ODBC Driver.


On the configuration page, you’ll need to enter:

  • Host: This is the full domain name of your HDInsight cluster (*.azurehdinsight.net)
  • Port: 443 is the default
  • Database: default
  • Hive Server Type: Hive Server 2
  • Authentication Mechanism: Select Windows Azure HDInsight Service – this will be the UN/PW you used when configuring your cluster
  • User Name & Password: the user you configured for your cluster


Another setting of interest on the Advanced Options page is the Default string column length value. By default, this will be set to 32767, which is larger than the SSIS maximum length for a DT_STR/DT_WSTR column (4000). This means that SSIS will treat all Hive STRING columns as DT_TEXT/DT_NTEXT, which has negative performance implications. If you know that your data values will be less than 4000 characters in length, I recommend lowering this value to 4000 (or less).
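The mapping described above can be sketched as a simple rule (the function name is illustrative; the 4000-character threshold is the DT_STR/DT_WSTR limit mentioned above):

```python
def ssis_string_type(default_string_column_length, unicode=True):
    """Map a Hive STRING column to an SSIS data flow type based on the
    driver's Default string column length setting. Columns longer than
    4000 characters fall back to the BLOB-style DT_TEXT/DT_NTEXT types,
    which perform worse than the in-row DT_STR/DT_WSTR types."""
    if default_string_column_length <= 4000:
        return "DT_WSTR" if unicode else "DT_STR"
    return "DT_NTEXT" if unicode else "DT_TEXT"
```

With the driver default of 32767 you end up with DT_NTEXT columns; lowering the setting to 4000 keeps them as DT_WSTR.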


Once the values are configured, click the Test button, and you’re good to go.


The DSN will then appear in your ODBC Administrator UI. As noted above, you’ll want to create both 32bit and 64bit DSNs.


You will now be able to see your newly created DSN when you create an ODBC Connection Manager in SSIS.


Using a Connection String

The advantage to using a connection string is that you don’t have to pre-create the DSN on the systems that will be running your SSIS package. Knowing what to enter can be a little tricky (see the tip below), but I prefer this approach as it removes the external (DSN) dependency. Also note that the same connection string works for both 32bit and 64bit execution modes – you just need to ensure that both versions of the ODBC driver are installed.

The table below lists properties you’ll need to set.

  • Driver – Name of the driver: {Microsoft Hive ODBC Driver}
  • Host – DNS hostname of your cluster
  • Port – Connection port; the HDInsight default is 443
  • Schema – Default database schema
  • RowsFetchedPerBlock – Number of rows fetched per block; the recommendation is to keep it at 10,000
  • HiveServerType – The HDInsight default is 2
  • AuthMech – Authentication mechanism; you’ll want a value of 6, which maps to using the username and password you specified when the cluster was created
  • DefaultStringColumnLength – The default length for STRING columns; see the note above

 

Example:

DRIVER={Microsoft Hive ODBC Driver};Host=<cluster name>.azurehdinsight.net;Port=443;Schema=default;RowsFetchedPerBlock=10000;HiveServerType=2;AuthMech=6;DefaultStringColumnLength=4000
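The string is just the properties from the table above joined as key=value pairs with semicolons; a small Python helper (purely illustrative, with a hypothetical cluster name) makes that explicit:

```python
def build_connection_string(**props):
    """Join ODBC key=value pairs into a semicolon-delimited
    connection string."""
    return ";".join(f"{key}={value}" for key, value in props.items())

conn_str = build_connection_string(
    DRIVER="{Microsoft Hive ODBC Driver}",
    Host="mycluster.azurehdinsight.net",  # hypothetical cluster name
    Port=443,
    Schema="default",
    RowsFetchedPerBlock=10000,
    HiveServerType=2,
    AuthMech=6,
    DefaultStringColumnLength=4000,
)
```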

When configuring your Connection Manager in SSIS, select the “Use connection string” option and paste in your connection string. Note, the Username and Password values can be specified as part of the connection string (uid and pwd fields), or can be set separately in the connection manager’s UI.


Tip – To determine all of the fields that can be set for the provider (or at least the ones exposed by the UI), you can go through the DSN creation step, and then save it as a File DSN. This will create a text file in .ini format. Open it with any text editor, and you can see each of the name/value pairs that are being set. You just need to put them together into a single, semi-colon delimited string.
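A File DSN is a standard .ini file with the name/value pairs under an [ODBC] section, so flattening it can even be scripted. A sketch of that, assuming a saved File DSN path:

```python
import configparser

def dsn_to_connection_string(path):
    """Read a File DSN (.ini format) and flatten its name/value pairs
    into a single semicolon-delimited ODBC connection string."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # preserve the case of the ODBC key names
    parser.read(path)
    # File DSNs keep their name/value pairs under an [ODBC] section.
    return ";".join(f"{key}={value}"
                    for key, value in parser["ODBC"].items())
```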

Configuring Your Data Flow

In an SSIS Data Flow, you can use the Hive ODBC Driver from:

  1. ODBC Source – introduced as a new component in SSIS 2012, although there are third party options available for previous versions
  2. ADO.NET Source – using the .NET ODBC Data Provider

Note that in the past the Hive ODBC driver had some compatibility issues with the SSIS ODBC Source, leading to a recommendation to use the ADO.NET Source. I didn’t notice any problems when using the latest Microsoft driver – you’ll see that the example screenshots below use the ODBC Source.

After configuring your connection manager, you can select a table or enter your query (similar to interacting with any other source in SSIS). I found that using the table drop down can take a while (15-30 seconds) – querying metadata from Hive isn’t an immediate operation.


Clicking the Preview button takes a while (as the Hive job is executed on your cluster), but it is a good way to make sure that your connection manager is correctly configured.


Clicking the Columns tab may have a delay (again – fetching metadata isn’t instantaneous), but will come back with the columns as you’d expect.


Once you click OK, you may notice another delay as the component is validated. Since Hive metadata operations can take a while to complete, I suggest setting the ValidateExternalMetadata value of your source component to False. This will speed up validation quite a bit – the metadata will only be refreshed when you open the Source UI. When using the ODBC Source, you should also consider setting the BatchSize property (default 1000) to match the RowsFetchedPerBlock setting of your connection manager (default is 10,000).

After you complete the rest of your Data Flow, you should be able to run your package. In my experience, the Validate and PreExecute phases of the data flow finish (relatively) quickly, but once the Execute phase starts, your source will sit with the icon spinning and no rows coming out as it waits for your Hive job to complete on your cluster. Once the job completes, you’ll see the data streaming back nicely.


I found that running a package which reads from the hivesampletable table that comes with HDInsight consistently takes about 2 minutes. After applying the tips mentioned throughout this post (and summarized below), the average execution time went down to 1 minute, 20 seconds. I also found that the ADO.NET Source was consistently (slightly) faster than the ODBC Source (5-10 seconds), but I expect this will vary between environments (and data sets).

Tips

  • Install both the 32bit and 64bit versions of the Hive ODBC Driver
  • Consider using a connection string instead of a DSN in your connection manager
  • Change the Default String Column Length to 4000 (or less) if you are sure your String columns are less than 4000 characters in length
  • Set ValidateExternalMetadata to False on your Source component
  • When using the ODBC Source, increase the BatchSize (to 10000)

Troubleshooting

If Preview works for you in the ODBC Source (or ADO.NET Source), but your package fails when you run in the designer, make sure you have both the 32bit and 64bit versions of the Hive provider installed, and that you’ve created DSNs for both platforms. The error looks something like this:


[ODBC Source [2]] Error: The AcquireConnection method call to the connection manager <XXX> failed with error code 0xC0014009. There may be error messages posted before this with more information on why the AcquireConnection method call failed.

[SSIS.Pipeline] Error: ODBC Source failed validation and returned error code 0x80004005.

[Connection manager “<XXX>”] Error: An ODBC error -1 has occurred.

[Connection manager “<XXX>”] Error: There was an error trying to establish an Open Database Connectivity (ODBC) connection with the database server.

----

If for some reason you are unable to install the 64bit provider, you can force your SSIS package to run in 32bit mode in Visual Studio by right clicking on the project, selecting Properties, going to the Configuration Properties | Debugging page, and setting the Run64BitRuntime value to False.


Links from the SSIS Roadmap Session

Here are some of the resources I mentioned in the SSIS Roadmap session at the PASS Summit.

SSIS Reporting Pack from Jamie Thomson


DQS Matching Transform from OH22 Data


DQS Domain Value Import Destination from OH22 Data


There is also a great series of blog posts (part 1, part 2, part 3) on using these transforms on the DQS team blog.

Data Feed Publishing Components


SSIS Content from PASS Summit 2013 on PASS TV

You can now stream sessions from the 2013 PASS Summit on-demand from the PASStv site. There’s a lot of amazing content on there, including the Day 1 and Day 2 keynotes. If you’re interested in SSIS (and if you’re reading this, there is a good chance you are), there are two SSIS-focused sessions available:


SQL Server Data Tools – Business Intelligence downloads

(update 2014/11/14: added link to the VS 2013 release which came out in April)

Looking for the SSIS development tools? Formerly known as Business Intelligence Development Studio (BIDS), the designer is now called SQL Server Data Tools – Business Intelligence (SSDT-BI) (not to be confused with the other SQL Server Data Tools). The version you want will depend on two things: your version of SQL Server, and the version of Visual Studio you want to work with.

SQL Server 2012

SQL Server 2014


Some notes:

  • The SQL Server 2014 CTP2 version of SSDT-BI should not be installed on the same machine as SSDT-BI for SQL Server 2012
  • There is no SQL Server 2014 SSDT-BI for Visual Studio 2010, however, SSIS packages developed in SQL Server 2012 can be automatically upgraded to SQL Server 2014

Skylanders Power View Workbook

I had a lot of requests for the Skylanders BI Application I showed off at the BI Power Hour, so I’ve made it available on SkyDrive. For those that missed it, the “app” is an Excel workbook which uses Power View (on top of Power Pivot) that I created to help my kids make better decisions about which toys they should buy.


The content in the workbook comes from the Spyro wiki at Wikia and is licensed under the Creative Commons Attribution-Share Alike License.

BI Power Hour at PASS Summit 2013

A big thank you to the PASS Summit 2013 attendees who made our BI Power Hour a great success! I have to admit I was a little worried when we lost all power 15 minutes before the session started (highly ironic that we lost power right before the BI Power Hour devoted to Power BI), but I was encouraged when the audience started planning a hostile takeover of Paul Randal’s session room next door. Thankfully, everything came back up just in time. We packed the ballroom, showed off some great functionality, and only had one injury from all of the flying swag. Reading through all of the #BIPH tweets after the session was very entertaining – we really appreciate all of the audience participation. This might have been our best Power Hour yet!

PASS 2013 BI Power Hour Audience #biph

SELECT * From SSIS.DataFlow

If you’ve been looking through the documentation for the Power BI preview, you might have noticed a section on Publishing SSIS Packages as OData Feeds. This functionality lets you create a T-SQL View over an SSIS data flow using a new SSIS add-on called the Data Feed Publishing Components. This add-on works with SQL Server 2012, and is a free download from the Microsoft download center. While the components are useful for a number of Power BI scenarios, the components don’t require a Power BI subscription – all you need is a SQL Server/SSIS 2012 installation.

The Data Feed Publishing Components are made up of three pieces:

  1. Data Streaming Destination – A new Data Flow Destination that creates an “endpoint”, letting you stream data back to the caller (similar in concept to the DataReader Destination)
  2. OLE DB Provider for SSIS – A special OLE DB provider that allows SQL to treat an SSIS package as a Linked Server
  3. Data Feed Publishing Wizard – A wizard that deploys a project/package containing a Data Streaming Destination and creates a T-SQL View (and linked server) that kicks off the SSIS package when accessed

Publish SSIS Packages - Conceptual Diagram

If this sounds interesting to you, be sure to check out the step-by-step guide in the Power BI documentation. Note – if you’re currently not in the Power BI preview, you can stop at step #3.

Scenarios

I expect I’ll be blogging more about this in the coming months (as well as talking about it at PASS), but I wanted to briefly mention some of the main scenarios we’ve been working with customers on.

Invoking an SSIS Package from a Report

You’d do this in the case where a simple query isn’t enough – there are work flow steps (e.g. FTPing files from a remote server), you’re merging/transforming disparate data sources, you require .NET scripting logic, or your data source requires a custom connector. Internally we’ve been referring to this scenario as “Complex Data Feeds”.

While it is possible to configure Reporting Services to read from an SSIS package, the approach has some limitations as to the account the package will execute as (and is actually removed from the default configuration file in SQL 2012). The Data Feed components provide a similar approach, but also let you leverage the logging and configuration provided by the SSIS Catalog.

On-Demand Execution of an SSIS Package

SELECT’ing from the View created by the Publishing Wizard dynamically invokes the SSIS package with a data flow, and streams back the results. While the majority of SSIS packages would run on a schedule, or write data to a fixed destination, there are cases where dynamic invocation and streaming results are preferred.

One customer we worked with had 500+ “Data Feeds” – data sets that were more than just simple queries. These data sets were typically small and used for ad hoc reporting purposes. These feeds weren’t accessed regularly – some would not be used for months, and then be used heavily for a day or two (perhaps at the end of a quarter). Unfortunately, the access patterns weren’t predictable. Because the data had to be there when it was needed, the customer ended up with a very large infrastructure keeping every feed up to date. What they needed was something that could be run on demand (perhaps with built-in caching), so the data could be refreshed only when it was needed.

Another customer was looking for a way to do dynamic auditing in their environments using SSIS. They had a set of packages with custom auditing logic that they’d deploy and run in various environments, getting a real-time snapshot of their systems.

Alternative to Linked Servers

Want to use a linked server, but don’t have an OLE DB provider for your data source? Want to enforce custom logic, or do dynamic transformations on the results of the data? No problem – use an SSIS package!

Limitations

Just like Power BI, the current release of these components is in preview, and might not have all of the functionality you’re looking for (just yet). One thing to note is that the SSIS OLE DB provider currently does not support Distributed Query optimizations. Therefore, it currently doesn’t provide statistics or push down filters like other OLE DB providers used for Linked Servers. This functionality is best suited for one time executions of an SSIS package – if you find it’s something you’re accessing over and over, then you should probably be running your package on a schedule.

Links

For more information, see one of the following:

PASS Summit 2013

Wow, has it really been two months since my last blog post? I guess I’ve been busy!

I’ve been hard at work on SSIS and some components for Power BI. I hope to blog about all of the fun projects I’ve been working on this past year, but in case I don’t get a chance, you can come hear me talk all about it at this year’s PASS Summit!

Yes boys and girls, I’ve once again tricked the conference organizers into letting me present at the Summit. No SSIS pre-cons this year, but I’ll be doing a total of 4 sessions that span SSIS and Power BI topics.

My favorite session – the BI Power Hour – will once again be educating audiences with a set of carefully selected, thought-provoking demos that will in no way revolve around cats or tapping things. We’re honored to be a Spotlight session this year, and have a killer lineup of presenters. We’re hoping to have enough seats for everyone this year (rather than having people sitting on the floor and in the aisles like last year) – but you might want to get there early just in case.

I’ll be presenting an SSIS Roadmap session, showing off some of the cool new things we’ve been working on.

I’ll also be co-presenting a couple of Power BI sessions – an all up, end to end session (Power BI – This is How We Do It) with many of my co-presenters from the BI Power Hour, and a session which takes a deeper look at the components which allow corporate data source access from SharePoint Online (Power BI and the Data Management Gateway).

I’m really looking forward to another great Summit – hope to see you in Charlotte!