Parsing Binary data using Power Query formulas

The Power Query formula language (M) contains a number of library functions that allow you to parse binary data. You can use these functions to build queries which parse custom binary file formats. This post contains two samples – a simple query which parses X & Y coordinate values from a binary field, and a more advanced query which reads PNG file headers.

Parsing binary data

The following query defines some binary data (Source), a record format (PointFormat), and a parsing format definition (FileFormat).

Let’s break this down.

The #binary function lets you pass in a list of byte values. Here we have specified a total of 12 bytes.

PointFormat defines a record format with two BinaryFormat.SignedInteger16 fields (2 bytes each), x and y, for a total of 4 bytes.

The first argument to the BinaryFormat.Choice function indicates that we should read the first 4 bytes (the size of an BinaryFormat.UnsignedInteger32). This value is read and passed in as the count parameter to the BinaryFormat.List function. BinaryFormat.List ends up reading the remaining 8 bytes of the binary (count * sizeof(PointFormat)), and outputs a list of records.

let
Source = #binary(
{0x00, 0x00, 0x00, 0x02,
0x00, 0x03, 0x00, 0x04,
0x00, 0x05, 0x00, 0x06}),

PointFormat = BinaryFormat.Record([
x = BinaryFormat.SignedInteger16,
y = BinaryFormat.SignedInteger16]),

FileFormat = BinaryFormat.Choice(BinaryFormat.UnsignedInteger32,
(count) => BinaryFormat.List(PointFormat, count))
in
FileFormat(Source)

If we put this M code into Power Query, convert the list of records to a table, and then expand, we get something like this:

image

Query

Parsing PNG Headers

This sample is provided as an exercise to the reader to understand. To test it out, replace the path value in the call to File.Contents.

Iterating over an unknown number of pages in Power Query

My previous blog post showed how to iterate over a set of web pages in Power Query using a parameterized function. The post contained two queries – the GetData function, and a query to invoke it over a set number of pages.

GetData function

Query to invoke it

This approach uses a pre-generated list of page numbers (Source = {1..7}), which works well if you know the range of pages you want to access. But what do you do if you don’t know the range upfront?

The Power Query Formula Language (M) is (partially) lazy – some steps won’t be fully evaluated until the data they reference is needed. We’ll use this capability to define a query that iterates over a large number of pages (10,000), but dynamically stops itself once the first error is hit.

Let’s break this down:

Line 2 (PageRange) defines a large range of page numbers (1 to 10,000).

Line 3 (Source) uses List.Transform to invoke a function (GetData) over each value in the list. It uses a try…otherwise statement, which will catch errors thrown by GetData. If an error occurs, the otherwise statement  returns null.

Line 4 (First) uses List.FirstN, and passes in a condition (each _ <> null) that essentially says to take all rows until the first null is reached.

Line 5 (Table) converts the list to a table, and then Line 6 (Expanded) fully expands the table to get at the date.

The key to this working is that using Table.ExpandTableColumn causes List.Transform to be lazily evaluated – the function which goes out and grabs the data from the page (GetData) won’t actually be called until the table is expanded. Since the query specifies that we only want rows up until we get our first error/null, the Power Query engine will stop making calls to GetData once it gets back a null value. In this example, we have 7 pages of data – page 8 returns an error page with no table, which causes our query to fail and return null.

Important note: if you try to paste this code into the Power Query editor, and click on any of the steps before the last one (Expanded), List.Transform will not be lazily evaluated … if you watch the requests being made (with Fiddler, for example), you’ll see Power Query trying to evaluate and access all 10,000 pages.

image

 

Iterating over multiple pages of web data using Power Query

This is one of my go-to demos for showing off the Power Query Formula Language (M). This approach works well with websites that spread data over multiple pages, and have one or more query parameters to specify which page or range of values to load. It works best when all pages have the same structure. This post walks through the end to end process, which includes:

  1. Creating the initial query to access a single page of data
  2. Turning the query into a parameterized function
  3. Invoking the function for each page of data you want to retrieve

This example uses the yearly box office results provided by BoxOfficeMojo.com.

You can find a video of this demo in the recording Theresa Palmer and I did at TechEd North America earlier this year.

Creating the initial query

Access one of the pages of data on the site using Power Query’s From Web functionality.

image

From Web actually generates two separate M functions – Web.Contents to access the URL that you enter, and then another function based on the content-type of the URL. In this case we are accessing a web page, so Web.Contents gets wrapped by a call to Web.Page. This function brings up a Navigator that lets you select one of the tables found on the page. In this case, we’re interested in the second table on the page (labeled Table 1). Selecting it and clicking Edit will bring up the Query Editor.

image

From here, we can filter and shape the data as we want it. Once we’re happy with the way it looks, we will convert it to a function.

Turning a Query into a parameterized Function

Open the Advanced Editor to bring up the M code behind your query.

image

For the sake of the demo, I’ve kept the query simple – I’m just accessing the data, and I’ve removed the “Changed Type” step that Power Query automatically inserted for me. The only shaping I did was to remove the bottom 3 summary rows on the page. My code now looks this:

Note that the url value in the call to Web.Contents contains a query parameter (page) that specifies the page of data we want to access.

To turn this query into a parameterized function, we’ll add the following line before the let statement.

(page as number) as table =>

The two as statements specify the expected data types for the page parameter (number) and the return value of the function (table). They are optional, but I like specifying types whenever I can.

We’ve now turned our query into a function, and have a parameter we can use within the code. We are going to dynamically build up the query string, replacing the existing page value in the URL with the page parameter. Since we’ve indicated that page is a number, we will need to convert the value to text using the Number.ToText function. The updated code looks like this:

Clicking Done on the advanced editor brings us back to the query editor. We now have a function expecting a parameter.

image

You can click on the the Invoke button and enter a page value to test it out.

image

image

Be sure to delete the Invoked Function step, then give the function a meaningful name (like GetData). Once the function has been given a good name, click Close & Load to save the query.

Invoking the function for each page of data you want to retrieve

Now that we have a function that can get the data, we’ll want to invoke it for each page we want to retrieve. M doesn’t have any concept of Loops – to perform an action multiple times, we’ll need to generate a List (or Table) of values we want to act on.

From the Power Query ribbon, select From Other Sources –> Blank Query. This brings up an empty editor page. In the formula bar, type the following formula:

= {1..7}

This gives us a list of numbers from 1 to 7.

image

Convert this to a table by clicking the To Table button, and click OK on the prompt.

Rename the column to something more meaningful (i.e. “Page”).

Go to the Add Column menu, and click Add Custom Column.

We can invoke our function (GetData) for each page with the following formula:

GetData([Page])

Click OK to the return to the editor. We now have a new column (Custom) with Table values. Note – clicking the whitespace next to the “Table” text (and not “Table” itself) will bring up a preview window in the bottom of the editor.

image

Click the Expand Columns button to expand the table inline.

image

The full query now looks like this

Clicking Close & Load brings us back to the workbook. After the query executes, we can scroll to the bottom of the sheet to see that we’ve pulled in 7 pages of data

 

image

Converting a Query to a Function in Power Query

You can turn a query into a function by adding a single line before your M query:

() =>

That is a pair of parenthesis ( () ), followed by a goes-to symbol ( => ). Any parameters for your function would go in between the parenthesis.

Example

In the Query Editor, click on the View page, and then click Advanced Editor.

image

The Advanced Editor will display your M query. Add () => before the starting let statement.

image

Click Done to return to the Query Editor. You should now see a single Applied Step, and the option to Invoke your function.

image

Clicking Invoke simply runs your previous query, which isn’t all that interesting. Delete the Invoked step (if you clicked the button), and go back to the Advanced Editor to add a parameter to your function.

Note that when you bring up the editor, it looks a little different …

image

The code will now look something like this:

Power Query has automatically prepended some code based on the name of your query. You can remove the outer let statement, or simply ignore it for now. Add a parameter in-between the parenthesis (for example, “page”).

Click Done to return to the editor. Notice that the editor now displays the parameter your entered. It is typed any because we didn’t explicitly specify a type on the parameter.

image

Clicking Invoke will prompt us for the parameter.

image

Invoking the query with the parameter doesn’t change our results at all, since we never actually referenced the parameter within our M code… I’ll explain that process in an upcoming post.

Check out the links before for more information on the Power Query Formula Language (“M”):

Notepad++ language file for the Power Query formula language (“M”)

Power Query doesn’t have syntax highlighting in its advanced editor (yet?). In the meantime, here is a language file for Notepad++. It’s not perfect, but I find it useful when writing M queries. It lists all of the current library functions as keywords (as of the PQ October update), recognizes comments, strings, literals, and language keywords.

You can find it on my One Drive share – here. You can find instructions on how to install a user defined language file in Notepad++ on their wiki site.

Enjoy!

image

For more information on writing queries using the Power Query formula language, you can check out the deep dive session that Theresa Palmer and I gave at TechEd North America earlier this year.

Defining Configurable Settings for your Queries

(The information in this blog post is current as of the March 2014 release of Power Query.)

The Problem

A colleague sent me a copy of the Excel workbook they used for a blog post about using Power Query with Project Online (see the fantastic article: Creating burndown charts for Project using Power Pivot and Power Query). The workbook has a query which uses the Project OData API to pull down Task information to produce a “burndown” chart. The query was configured to pull down the last 90 days worth of data, and I wanted to make it a little more configurable. The original query looked like this:

The highlighted line shows the filter that limits the data to 90 days (#duration(90, 0, 0, 0)). To change the value, we’d need to modify the formula in the Power Query editor (or open up the Advanced Editor and change it there). Making the change isn’t hard, but I wanted to make the workbook a bit more flexible so that anyone I shared it with could set the value without having to understand the script.

Using a Configuration Query

A Power Query query can reference other queries within the current workbook. We’re going to use this functionality to create a query that returns the settings we want, and then reference it from other queries to avoid hard coding anything. The steps will be:

  1. Create an Excel table that contains the settings we want to reference
  2. Create a query (named Settings) that reads the table
  3. Replace the hardcoded values with a reference to the Settings query

The Excel table is simple – a header row and a single row of data. The table should have a column for each setting we want to reference by name. In this case we have a single column the number of days we want the Project Online query to grab – we’ll call it DaysToKeep.

image

Next we create a query using the From Table button on the Power Query ribbon.

image

We’ll apply the following transformation steps:

  • Set the right data types
  • Convert the table to Records using the Table.ToRecords function
  • Select the first (and only) record
  • Disable loading the query (i.e. deselect “Load to Worksheet”)
  • Name the query something easy to remember (ex. Settings)

The query looks like this:

This gives us a query that returns a Record, which lets us refer to each field by name.

image

After saving the query, we’ll be able to use Settings[DaysToKeep] in any other query within our workbook, and it will evaluate to “30” at runtime. We can test this out by created a new query (Power Query –> From Other Data Sources –> Blank Query), and typing it into the formula bar.

image

Going back to the original burndown query, we can now replace the hard coded 90 value with the reference to the DaysToKeep setting.

After we change the query to use the Settings value, we’ll receive a prompt from Power Query’s Formula Firewall.

image

This prompt occurs because we are using the DaysToKeep value in a filter statement, which ends up getting folded into the query sent to the OData feed. The firewall prompt is meant to prevent unintentional data disclosure. In this case we’re not sending private information to the other data source (i.e. the number 30), but it would be a concern if we were joining a private data set (i.e. a customer list) with a public web service. Clicking Continue brings up the Privacy levels dialog.

image

From here we can specify that the OData feed has a privacy level of Organizational, and the settings from the Current Workbook can be considered Public (since they aren’t sensitive in any way). This satisfies the security requirements for the Power Query formula firewall, and lets the engine fold the values into the query as expected. Note, if the privacy levels didn’t allow folding (i.e. Current Workbook is set to Private, and the OData feed is Public), Power Query would do the filtering locally (in memory), rather than including the filter in the query. Once the privacy levels have been set, our query runs successfully.

image

After clicking apply, the query refreshes and brings a total of 110 rows into the data model.

image

To pull more data, we can simply change the value in the settings table, and refresh the query.

image

 

Creating a Date Dimension with a Power Query Script

Chris Webb has previously blogged a similar approach, but I thought I’d shared my own version of a Power Query function that generates a table of dates that could be used to create a Date Dimension. I first saw Faisal Mohamood do this in our end to end Power BI talk at the PASS Summit, and have been using it as a quick way to show off the versatility of the Power Query formula language.

The function is included below:

How do I use this?

From the Power Query ribbon, select From Other Data Sources -> Blank Query.

In the query editor, click on the View tab, then click the Advanced Editor button on the ribbon.

Replace the default script with the CreateDateTable script, and click Done.

The Query Editor will now show the function definition, with an Invoke button.

Clicking the button will bring up a dialog that lets you enter values for each of the parameters. These are:

  • StartDate – the first date in your table
  • EndDate – the last date in your table
  • Culture – the culture of the table (optional). This gets passed to the Date.ToText function to display the month / weekday names. See the National Language Support (NLS) API for details.

After you Apply & Close, you’ll have a neatly formatted Date Dimension that can be imported directly into Power Pivot.

Filtering in Data Explorer

I was been manning the Data Discovery & Visualization and Data Warehousing booths at TechEd Europe last week, and we saw lots of excitement over Data Explorer. One of the questions I got was about how Data Explorer does its filtering, and I realized there hasn’t been much shared about this yet. It seems like the general assumption is that Data Explorer would pull all rows into Excel, and then perform its filtering in memory (as that’s how you’d build your steps in the UI), but in fact, it’s a lot smarter that this – it will automatically push filters directly to the source query. The team calls this technique “Query Folding”. It is an extremely powerful feature, especially in a “self service ETL” tool where many users aren’t thinking about query performance. Unfortunately, it’s not immediately obvious that this feature exists unless you monitor the underlying queries it uses – so let’s do that now.

Filters

From Data Explorer, we’ll connect to a SQL Server instance:

image

 

We’ll read from the DimGeography dimension of the AdventureWorksDW sample:

image

 

Take the first five columns:

  • GeographyKey
  • City
  • StateProvinceCode
  • StateProvinceName
  • CountryRegionCode

 

image

 

Click Done to load the data into Excel. We’ll then launch SQL Profiler, and connect it to our SQL instance. Once profiler is running, we can click the Refresh button in Excel, and see the queries that get executed:

image

We can see a (surprising) number of queries against SQL Server. We can see that the first “set” are fetching metadata about data types, columns, and indexes (you see lots of SELECT statements from sys.objects and sys.indexes). We’ll ignore these for now (but they’d make a great topic for a future post).

After retrieving metadata information, we see a query against the DimGeography table.

Already we can see the Data Explorer is smarter than the average query tool. Even though we selected the full table in the UI before hiding the columns certain columns, we can see the source query only contains the columns we want.

Let’s open the query again in Excel by clicking the Filter & Shape button.

With the query open, we’ll add a filter on the CountryRegionCode field. Click on the field title, select Text Filters | Equals …

image

We’ll filter on CA.

image

Leaving us with our records from Canada.

image

We can save the query by clicking Done, clear our current profiler trace and refresh the workbook to see the updated SQL query.

We see the query has gotten a bit more complicated, but it now contains the ‘CA’ filter we specified in the UI.

Sources and Other Types of Folding

Data Explorer isn’t able to do query folding for every source (i.e. there is no “query” when reading from a flat file), but it does it where it can. Here is an (unofficial) list of supported sources from the Data Explorer team:

  • SQL Databases
  • OData and OData based sources, such as the Windows Azure Marketplace and SharePoint Lists
  • Active Directory
  • HDFS.Files, Folder.Files, and Folder.Contents (for basic operations on paths)

I should also note that “filters” aren’t the only type of query folding that Data Explorer can do. If you watch the queries, you’ll see that other operations, such as column removal, renaming, joins, and type conversions are pushed as close to the source as possible. (I’ll explore this in a future post).

 

Preview warning – this information is current for the June 2013 build of the Data Explorer preview (Version: 1.2.3263.4). The information contained in this post may change prior to RTM.

Using Data Explorer with SSIS

Data Explorer is a great new add-on for Excel. It provides a simple interface over a power data transformation engine (powered by M), and it has become my tool of choice anytime I need to get data into Excel. In this post I show how you can use Data Explorer to do some initial data shaping, and then automate the data loading process through SSIS.

In a previous post, I showed how I used Data Explorer to grab the list of upcoming SQL Saturday events. When the Data Explorer query is executed, the data is downloaded, transformed, and stored in the Excel sheet.

image

We can read this data in SSIS using a Data Flow Task with an Excel Source component.

After we add the Excel Source, double click to bring up the component UI. We can create a new connection manager, and point it at the saved Excel file (.xlsx).

image

Note that even if you use Office 2010 or 2013 to create the file, you should select the “Microsoft Excel 2007” version (which was when the .xlsx format was introduced, replacing the .xls binary format).

On the Excel Source, we can now select “Table or view” as the data access mode. The drop down list will contain all of the sheets within our workbook. We can either select the name of the Sheet (in this case, Data), or the data query (Data$ExternalData_1). Since the result of the query is the only thing on this sheet, it will come out to the same thing in this instance.

image

On the Mapping tab, we can see the four columns from the workbook have been read.

image

We can click OK to save the settings, and then we can setup the rest of the data flow. Running the package should show us that all the rows have been read.

image

(Yes, that is a custom made Trash destination. I am pretty fancy.)

As we see, we can grab the data from the workbook using SSIS. The problem is that this data is a static copy – reading the data doesn’t automatically re-run the Data Explorer query that we used in the workbook. Thankfully, we can workaround that using a Script Task – see my previous post about dynamically refreshing a workbook. Go ahead and ready that post real quick – I’ll wait.

The refresh code in my previous post updates the excel file in place. If we watch the Task Manager, we can see the EXCEL.EXE process runs in the background.

image

Important – if you run the package with the refresh code added and receive a COMException – {“Class not registered”} error, you’re probably missing the Data Explorer add-in. Data Explorer uses a custom OLEDB provider internally, which requires you to install the add-in on the machine that is running your SSIS package.

If we open the file after the package runs, we should see the updated values from the refreshed query.

image

Closing Thoughts

Anyone familiar with SSIS will see some overlap between the transformations you can do in Data Explorer, and what you can do in the SSIS Data Flow. At this point Data Explorer is tool for getting data into Excel, and not what most people would consider a full fledged ETL platform. However, underneath the fancy, easy to use interface, is a powerful transformation engine (driven by the M language), and I can see a lot of potential applications for it. While the approach I describe above does get the products working together, I can imagine that closer integration between the products would be coming at some point in the future – but that’s pure speculation on my part.

What do you think? How would you like to see the products integrated?

Consuming CANSIM Datasets using Data Explorer

I recently stumbled upon CANSIM – Canada’s equivalent of Data.Gov. The site has a lot of really interesting datasets that come in handy if you’re someone who frequently creates data focused demos. I wanted to play around a bit more with Data Visualizations in Excel (specifically, the Scatter Chart over Time feature in Power View), and thought that Canadian population statistics would be a good place to start.

The CANSIM site provides a default view for each dataset, with a few links that let us customize the view. If this view of the dataset is all we’re looking for, we could use the From Web functionality in Excel to import the data into our workbook.

image

Data Explorer provides a similar From Web feature. Unfortunately, this doesn’t currently work with the CANSIM site because the site redirects us to a survey page the first time you access it. Excel’s "From Web” feature brings up a web browser, allowing us to click through to the dataset we want. Data Explorer simply displays the raw data, so we end up with an empty table and no way to access the view that we want. Boo.

image

Thankfully, the CANSIM site also allows us to download the raw data behind each dataset. This is actually better for what we want to do, as Excel has all of the processing functionality we need to slice and visualize the data that way we want it. Clicking the Download tab on the dataset page gives us a couple of different format options. CSV is the easiest data format to work with in Excel, so that’s what we’ll download.

image

Loading the Data into Excel

After downloading and extracting the dataset (it comes zipped), we’ll load and format it using Data Explorer’s From File | From CSV option.

image

This dataset is pretty simple, with rows for the reference date (Ref_Date), geography (GEO), and population (Value). We’ll only need to do a couple of things to get the data in the format that we want.

image

First we click the table icon and select Use First Row and Headers to specify the column names.

image

Next, we’ll filter out the summary rows (total population for Canada), because we can calculate those directly. We can do this by right clicking on a value in the GEO column and selecting Text Filers | Does Not Equal from the context menu.

image

We’re left with a pretty simple query:

 

We can give the query a meaningful name (click on the “Query1” title at the top of the New Query page), and we’re good to go.

image

Loading the Data into PowerPivot

imageAfter clicking Done, the query is created and the data is imported into our workbook. Data Explorer gives us two main options on how we use this data – we can keep it on the worksheet, or load it into a data model. In this case, we want to perform some calculations on the data with PowerPivot before visualizing it with Power View, so it makes sense to load it directly into the data model. To do this, we turn off the Load to worksheet slider, and then click the Load to data model link. Now when the query is refreshed, the data gets loaded directly into PowerPivot. We can see the model by going to the PowerPivot ribbon menu, and clicking the Manage button.

We should do a few things before we start manipulating the data:

  1. Set the correct data types – We’ll need to set the data types for the Value (Whole Number) and Ref_Date (date) columns. PowerPivot will use whatever data types were set in the Data Explorer query. To set the Value column to a Number (rather than text), we could either do it in the DE query, or at the PowerPivot level.
  2. Set the date format – This is more preference than requirement, but since the original date values didn’t include a time value, we can remove it with formatting.
  3. Rename the table – By default this will be called “Query”. We’d usually want to rename this to something more meaningful, but for this example we’ll keep it as is.

Now that our columns are ready, we can add two calculated columns to make visualization easier.

Total Population

This column will represent the total population of Canada (SUM([Value])) at the given date ([Ref_Date]). The calculation is:

% of Total

This column calculates the % of the total population that this province has. The calculation is simple:

After adding the calculated columns (and applying a % formatting), we end up with a model that looks like this:

image

Scatter Plot in Power View

What I want to do is visualize the population growth of Canada since the start of our data set (1951). I want to see which provinces grew faster than others, and if the population distribution has remained consistent over the past 60+ years. The Power View Scatter Plot graph (with time series functionality) is a great way to do this.

We start by closing the PowerPivot model, and adding a Power View report from the INSERT | Power View ribbon.

On the Power View sheet, we have a Power View Fields pane on the right. Click ALL, and select the columns from our PowerPivot table (Query).

image

Select the data table on the main page, and convert it into a scatter chart by clicking on DESIGN | Other Chart | Scatter Chart.

image

There’s a few different ways we can visualize this, depending on what we’re trying to show.

Option 1 – Focus on % of Total

 

X Value % of Total
Y Value Total Population
Size Value
Details GEO
Play Axis Ref_Date

This option has the % of Total across the X Axis, highlighting which provinces have made up the bulk of Canada’s population, and how it has changed over time.

1951

image

1985

image

2011

image

You can click on a bubble to see its progression over time (you can also ctrl-click to see more than one).

image

This visualization quickly shows how Quebec’s population has been making up less and less of Canada’s total population over time, while Ontario, British Columbia and Albert have been gaining. Plus, it looks like snakes which is pretty cool.

Option 2 – Growth over Time

X Value Value
Y Value Value
Size % of Total
Details GEO
Play Axis Ref_Date

This option puts the population value on both the X and Y axis. The animation of the time series shows the relative growth speed of each province.

(Note, to add a column value twice, right click on the column and select one of the Add as options).

1951

image

1985

image

2011

image

If we play the animation we can see that Quebec and Ontario’s populations steadily increase at about the same rate until the 1970s. Then Quebec’s growth rate visibly shows, while Ontario’s continues to grow at an even faster rate. Interesting! 

Having both the X and Y Axis use the same value gives you a “linear growth” chart. We could have also made the bubbles “bounce around” a bit by adding another dimension (average population age?), but I’ll leave that as an exercise for the reader.

Links