Consuming CANSIM Datasets using Data Explorer

I recently stumbled upon CANSIM – Canada’s equivalent of Data.gov. The site has a lot of really interesting datasets that come in handy if you’re someone who frequently creates data-focused demos. I wanted to play around a bit more with data visualizations in Excel (specifically, the Scatter Chart over Time feature in Power View), and thought that Canadian population statistics would be a good place to start.

The CANSIM site provides a default view for each dataset, with a few links that let us customize the view. If this view of the dataset is all we’re looking for, we could use the From Web functionality in Excel to import the data into our workbook.

Data Explorer provides a similar From Web feature. Unfortunately, this doesn’t currently work with the CANSIM site, because the site redirects us to a survey page the first time we access it. Excel’s “From Web” feature brings up a web browser, allowing us to click through to the dataset we want. Data Explorer simply displays the raw data, so we end up with an empty table and no way to access the view that we want. Boo.

Thankfully, the CANSIM site also allows us to download the raw data behind each dataset. This is actually better for what we want to do, as Excel has all of the processing functionality we need to slice and visualize the data the way we want it. Clicking the Download tab on the dataset page gives us a couple of different format options. CSV is the easiest format to work with in Excel, so that’s what we’ll download.

Loading the Data into Excel

After downloading and extracting the dataset (it comes zipped), we’ll load and format it using Data Explorer’s From File | From CSV option.

This dataset is pretty simple, with rows for the reference date (Ref_Date), geography (GEO), and population (Value). We’ll only need to do a couple of things to get the data in the format that we want.

First we click the table icon and select Use First Row as Headers to specify the column names.
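
Behind the scenes, this should generate a step based on the Table.PromoteHeaders function – something along these lines, with the step names being whatever Data Explorer assigned:

= Table.PromoteHeaders(Source)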

Next, we’ll filter out the summary rows (total population for Canada), because we can calculate those directly. We can do this by right-clicking on a value in the GEO column and selecting Text Filters | Does Not Equal from the context menu.
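
The generated filter step should look roughly like this (the previous step name and the "Canada" value are my assumptions about this dataset):

= Table.SelectRows(FirstRowAsHeader, each [GEO] <> "Canada")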

We’re left with a pretty simple query:

 
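Reconstructing it from the steps above (the file path here is a placeholder, and the step names are assumed defaults), the full query would look roughly like this:

Source = Csv.Document(File.Contents("C:\Data\population.csv"))
FirstRowAsHeader = Table.PromoteHeaders(Source)
FilteredRows = Table.SelectRows(FirstRowAsHeader, each [GEO] <> "Canada")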

We can give the query a meaningful name (click on the “Query1” title at the top of the New Query page), and we’re good to go.

Loading the Data into PowerPivot

After clicking Done, the query is created and the data is imported into our workbook. Data Explorer gives us two main options for how we use this data – we can keep it on the worksheet, or load it into a data model. In this case, we want to perform some calculations on the data with PowerPivot before visualizing it with Power View, so it makes sense to load it directly into the data model. To do this, we turn off the Load to worksheet slider, and then click the Load to data model link. Now when the query is refreshed, the data gets loaded directly into PowerPivot. We can see the model by going to the PowerPivot ribbon menu and clicking the Manage button.

We should do a few things before we start manipulating the data:

  1. Set the correct data types – We’ll need to set the data types for the Value (Whole Number) and Ref_Date (Date) columns. PowerPivot will use whatever data types were set in the Data Explorer query. To set the Value column to a number (rather than text), we could either do it in the DE query (see the sketch after this list), or at the PowerPivot level.
  2. Set the date format – This is more preference than requirement, but since the original date values didn’t include a time value, we can remove it with formatting.
  3. Rename the table – By default this will be called “Query”. We’d usually want to rename this to something more meaningful, but for this example we’ll keep it as is.
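
If we go the DE query route, the generated step would look something like this (the FilteredRows step name is assumed from the query we built above):

= Table.TransformColumnTypes(FilteredRows, {{"Value", type number}, {"Ref_Date", type date}})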

Now that our columns are ready, we can add two calculated columns to make visualization easier.

Total Population

This column will represent the total population of Canada (SUM([Value])) at the given date ([Ref_Date]). The calculation is:
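
Assuming the table kept its default name of Query, a DAX calculated column along these lines does the job:

= CALCULATE(SUM(Query[Value]), ALLEXCEPT(Query, Query[Ref_Date]))

For each row, the CALCULATE/ALLEXCEPT combination sums Value across every row that shares the same Ref_Date.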

% of Total

This column calculates the % of the total population that the province makes up. The calculation is simple:
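
Something along these lines should do it, dividing the row’s value by the first calculated column:

= [Value] / [Total Population]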

After adding the calculated columns (and applying % formatting), we end up with a model that looks like this:

Scatter Plot in Power View

What I want to do is visualize the population growth of Canada since the start of our data set (1951). I want to see which provinces grew faster than others, and if the population distribution has remained consistent over the past 60+ years. The Power View Scatter Plot graph (with time series functionality) is a great way to do this.

We start by closing the PowerPivot model, and adding a Power View report from the INSERT | Power View ribbon.

On the Power View sheet, we have a Power View Fields pane on the right. Click ALL, and select the columns from our PowerPivot table (Query).

Select the data table on the main page, and convert it into a scatter chart by clicking on DESIGN | Other Chart | Scatter Chart.

There are a few different ways we can visualize this, depending on what we’re trying to show.

Option 1 – Focus on % of Total

 

X Value – % of Total
Y Value – Total Population
Size – Value
Details – GEO
Play Axis – Ref_Date

This option has the % of Total across the X Axis, highlighting which provinces have made up the bulk of Canada’s population, and how it has changed over time.

(Chart snapshots at 1951, 1985, and 2011.)

You can click on a bubble to see its progression over time (you can also ctrl-click to see more than one).

This visualization quickly shows how Quebec’s population has been making up less and less of Canada’s total population over time, while Ontario, British Columbia and Alberta have been gaining. Plus, it looks like snakes, which is pretty cool.

Option 2 – Growth over Time

X Value – Value
Y Value – Value
Size – % of Total
Details – GEO
Play Axis – Ref_Date

This option puts the population value on both the X and Y axis. The animation of the time series shows the relative growth speed of each province.

(Note: to add a column value twice, right-click on the column and select one of the Add as options.)

(Chart snapshots at 1951, 1985, and 2011.)

If we play the animation, we can see that Quebec and Ontario’s populations steadily increase at about the same rate until the 1970s. Then Quebec’s growth rate visibly slows, while Ontario’s continues to grow at an even faster rate. Interesting!

Having both the X and Y Axis use the same value gives you a “linear growth” chart. We could have also made the bubbles “bounce around” a bit by adding another dimension (average population age?), but I’ll leave that as an exercise for the reader.

Using Data Explorer with Power View

In my first Data Explorer post, I pulled a list of event data from the SQL Saturday website. I then showed how to do dynamic lookups against an address verification service to pull back location data about the event’s venue. In this post I’ll complete my experiment and plot out each location on a map so I can easily see how far away each event is.

I’ll start off with the data I had from my previous post. Within the data, we have Latitude and Longitude values, which we’ll use to create our location map with Power View.

The first thing we’ll do is create a Data Model in Excel. We do this by selecting our data, and clicking on the Add to Data Model button on the POWERPIVOT tab.

This brings up the PowerPivot window, with all of the columns from the Data Explorer query.

The Address.Latitude and Address.Longitude columns are currently set as General/Text fields. We’re going to change this to Decimal Number by selecting the columns, and changing the Data Type field.

On the Advanced tab, we’ll set the Data Category values to Latitude and Longitude. This allows the columns to be automatically classified when we’re plotting our data.

Some other settings that will make visualization easier would be:

  • Setting the appropriate category values for City, State and Country
  • Fixing the “Summarize By” setting for Confidence, Latitude and Longitude

Once we’re done updating our data model, we can close the PowerPivot window and insert a Power View report.

Select the fields we want to use in our report.

Select the table that was placed in our report, and click the Map button to convert it.

The default categorization of the columns didn’t really make sense, but we can easily change that by dragging the fields around to where they belong.

Voila! In just a few minutes, we have a chart that maps our events across the US.

Dynamic Lookups With Data Explorer

In my previous post, I used Data Explorer to grab the list of upcoming events from the SQL Saturday website. Excel has had the ability to grab data from HTML pages for a while, but using Data Explorer provided a richer set of options, and allowed me to apply a series of transforms to get the data exactly how I wanted it.

Using this events data, I want to do two things:

  • Enrich the data with some more details about the event venue’s location
  • Plot the events on a map/chart so I can see which ones are close to me

This post will cover the first step, and lets me show off some of the more advanced Data Explorer functionality.

Address Lookups

There are a number of web services that allow you to validate an address, or fetch additional details (such as latitude and longitude). I just happen to have a subscription to Melissa Data’s Address Check service, provided through the Windows Azure Marketplace. While the service was designed to work with Data Quality Services, it can also be invoked programmatically, and returns a whole bunch of additional information for the address you provide. It will even return suggestions when an exact match for the address isn’t found.

I previously covered how to retrieve data from the Windows Azure Marketplace, but unfortunately that method won’t help us here because we’re not dealing with a pre-defined set of data (I’m not sure how long it would take to pull all possible US and Canada address combinations into Excel, but it’s probably not something I’d want to try on my laptop). The Address Check service I want to use is an OData Service (in fact, all data market offerings have OData interfaces), and it exposes a single function with three parameters:

  • Address
  • MaximumSuggestions
  • MinimumConfidence

I found all of the details (including the service URL) on the service description page in the marketplace.
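
Putting those details together, the function’s shape should be roughly the following (the parameter types here are my assumption – the function name comes up again below):

SuggestAddress(Address as text, MaximumSuggestions as number, MinimumConfidence as number) as table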

Dynamic Lookups

I’ll use the list of events I retrieved in my previous post as a starting point. To keep the original query intact, I’ll Duplicate my query using the option on the QUERY tab.

Next, I click on the Filter & Shape button to bring up my query.

I want to invoke the address lookup function on my OData Service for each row of my data set, passing in the value of the Location field, but I couldn’t find an option that would allow this in the query editor UI. Instead, we can turn to the Data Explorer formula language, and start adding some custom transformations.

First, we need to add a new transformation step by clicking the formula button.

By default, the step we just created displays the same data as the previous step. We can change this by modifying the value in the formula bar. Note: at the moment there is no IntelliSense or syntax highlighting for the formulas, but I’m hoping we’ll see this in a future update. The formula to invoke the external OData Service took some trial and error, but I eventually got it working.

First we define the lookup as an auxiliary query. We’ll name the step Lookup to make it easy to reference.

= OData.Feed("https://api.datamarket.azure.com/MelissaData/AddressCheck/v1/"){0}[Data]

This formula is accessing the OData Service at the given URL, and returning the data ([Data]) at the first member ({0}), which in this case is the Melissa Data SuggestAddress function that we’re looking for.

When the function is returned, Data Explorer provides a UI that shows us the arguments, and indicates that the result will come back in a table format. Pressing the Invoke button lets us enter static values (shown below for demonstration purposes), but as we’re interested in dynamic lookups using the data from our existing query, we’ll skip this for now.

We then add a second step which invokes the Lookup function we’ve defined for each row in our data set. We can click the formula button again to create the new step, and use this value:

= Table.AddColumn(RenamedColumns, "Address", each Lookup([Location], 1, 0))

Let’s break down what this formula is doing:

  • Add a new column to the result of RenamedColumns step
  • Name the column “Address”
  • For each row in the data set, call the function defined by the Lookup step
  • When calling the function, pass in the [Location] field (with two hard-coded values – 1, 0 – for the remaining arguments)

After entering the formula, we get a notification that credentials are required to access this data source.

Clicking Edit credentials brings up the Data Explorer credential manager. The Windows Azure Marketplace uses the concept of a Feed Key (similar to an “API key” used by many web/REST based APIs). Clicking the link on the page brings you directly to the Marketplace page which displays your key – you can copy & paste it from there. (Be careful with the copy & paste! On my first couple of tries, I included extra whitespace before and after the key, which ended up giving me a permission error when I tried to connect)

After entering the key, we can Save and go back to our query. We immediately see another error:

Formula.Firewall: Query 'SqlSaturdayEvents' is accessing data sources which can not be used together.

 

By default, it appears that Data Explorer doesn’t allow you to pass data between services, as it could result in a “leak”. For example, in this case I’m sending my original Location data (coming from the SQL Saturday website) to the OData Service (run by Melissa Data). Thankfully, there is a way to allow this behavior using functionality called “Fast Combine”.

We can click Done to save the query we have at this point (even though it’s in an error state), and go back to Excel. On the DATA EXPLORER tab, we can see the Fast Combine option.

Clicking it brings up a warning dialog explaining that enabling Fast Combine may improve query performance, but may introduce privacy issues.

Click Enable, go back to Filter & Shape, and now the data is displayed.

The results of the OData Service come back as a table, which we can expand inline by clicking the small icon next to the column title.

After selecting the columns we want, we can see the new values in our data set.

The Melissa Data service we’re using is for US and Canadian addresses, so we can see that the non-US locations aren’t getting Latitude and Longitude results. Addresses that aren’t matched should come back with a confidence level of 0, so we can use this to filter out these unmatched locations from our list. Our filter will only include rows that have a Confidence value greater than 0.
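
The step that gets generated matches the FilteredRows entry in the reference listing at the end of this post:

= Table.SelectRows(#"Expand Address", each [Address.Confidence] > 0)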

The remaining results look pretty good. We can click Done to save the query and get the results into Excel.

Summary

Wow – I’m impressed with Data Explorer so far. I really like the idea of mashing up data sources like this, and browsing through the formula language gave me an idea of what’s possible.

In my next post, I’ll take a look at importing the data into PowerPivot, and then doing some visualizing with Power View.

For reference, here are the queries I used at each step:

 

Source = Table.TransformColumns(Data0,{},Value.FromText)
HiddenColumns = Table.RemoveColumns(ChangedType,{"Column1", "Column3"})
SplitColumnDelimiter = Table.SplitColumn(HiddenColumns,"Column4",Splitter.SplitTextByEachDelimiter({"-"}, null, false),{"Column4.1", "Column4.2"})
SplitColumnPos = Table.SplitColumn(SplitColumnDelimiter,"Column4.1",Splitter.SplitTextByPositions({0, 13}, false),{"Column4.1.1", "Column4.1.2"})
HiddenColumns1 = Table.RemoveColumns(SplitColumnPos,{"Column4.1.1"})
RenamedColumns = Table.RenameColumns(HiddenColumns1,{{"Column2", "Date"}, {"Column4.1.2", "Number"}, {"Column4.2", "Name"}, {"Column5", "Location"}})

Lookup = OData.Feed("https://api.datamarket.azure.com/MelissaData/AddressCheck/v1/"){0}[Data]
AddressLookup = Table.AddColumn(RenamedColumns, "Address", each Lookup([Location], 1, 0))
#"Expand Address" = Table.ExpandTableColumn(AddressLookup, "Address", {"Confidence", "Latitude", "Longitude", "City", "State", "Country"}, {"Address.Confidence", "Address.Latitude", "Address.Longitude", "Address.City", "Address.State", "Address.Country"})
FilteredRows = Table.SelectRows(#"Expand Address", each [Address.Confidence] > 0)

Access the Windows Azure Marketplace from Data Explorer

One of the data sources that Data Explorer supports is the Windows Azure Marketplace. The marketplace (formerly known as “data market”) has a number of free and paid data sets that you can subscribe to. In this post I’ll walk through the steps of pulling data from the marketplace.

On the DATA EXPLORER tab in Excel, you’ll find Windows Azure Marketplace as one of the options under Other Sources.

Selecting this option pops up the Data Explorer credential manager. It has built in support for Microsoft accounts, which makes it easy to interact with Microsoft services (such as Windows Azure Marketplace, and Office 365).

Clicking on Sign in pops up a dialog asking you to confirm that Data Explorer is allowed to access your data market subscriptions.

Once these permissions have been granted, you can click Save and proceed to defining your query. On the query page, the Navigator pane will list all of your available marketplace subscriptions. For this demo, I’ll select the USA 2011 Car crash data set (which is a bit morbid, I know, but I’m sure it can lead to all sorts of interesting analysis and insight).

(Yes, I know I have a lot of marketplace subscriptions – have I mentioned how much I like this service?)

Once you select the source, Data Explorer displays the initial data set.

From here I can apply filters and transformations, and hide the columns I’m not interested in. I can also limit the amount of data that comes back by clicking on the table icon in the top right corner of the grid (directly above the row numbers).

Being able to limit a data set in this way is very useful (in this case, there are over 72,000 rows in the data set, and I might not be interested in using a sample this big). To reduce the number of rows I’m working with, I can select Keep Top Rows …, and enter a maximum row count.
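
Behind the scenes this generates a simple formula step – something along these lines (the step name and row count here are just examples):

= Table.FirstN(Source, 100)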

Note that when you’re building the query, Data Explorer will only retrieve a small number of rows for you to work with. You can retrieve more data by scrolling down the grid, but you’ll eventually hit a size limit. I really like this functionality, as it keeps Data Explorer responsive, even when you’re working with large data sets.

Clicking Done brings you back to Excel, and imports all of the data to a new page.

Data Explorer makes it very easy to pull and explore data from the Windows Azure Marketplace. The Marketplace team previously released an Excel Add-in, but I suspect that Data Explorer will replace it entirely in the future.

Exploring Data Explorer

This week the Data Explorer team announced that a preview of their Excel add-in is now available! I’ve been meaning to try it out for a while, so I went ahead and downloaded the Office 2013 version, then walked through some of the tutorials on the Office Help site.

Once the add-in is installed, it shows up under the DATA EXPLORER tab. The first set of buttons allows you to grab data from different sources. The options are similar to what you find under the DATA tab, but there are some new additions as well.

I was recently going through the list of upcoming SQL Saturday events, and trying to decide which ones I might be able to attend. I thought this “data set” would make a good example for this post.

Clicking the From Web button in the ribbon brings up a box to enter your URL.

Once Data Explorer has loaded the page, it parses out any potential sources of data (i.e. tables) it finds on the page. The SQL Saturday Events page has two tables – one that shows upcoming events, and one that shows past events. Selecting the table shows you the data you’re working with.

You can remove columns you don’t want/need by clicking on the column header to select it, right-clicking, and selecting Hide Columns.
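
In my case this generated the HiddenColumns step shown in the reference listing at the end of this post:

= Table.RemoveColumns(ChangedType,{"Column1", "Column3"})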

Parsing/splitting field values is pretty straightforward as well. Here I wanted to split the name of the event into two fields – the event number, and the event title (city, year). To do this, right-click on the field and select Split Column.

In this case, our field is consistently formatted, so we can simply split on a dash (-) character.
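
This produces the SplitColumnDelimiter step from the reference listing below:

= Table.SplitColumn(HiddenColumns,"Column4",Splitter.SplitTextByEachDelimiter({"-"}, null, false),{"Column4.1", "Column4.2"})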

We now have two separate fields.

I didn’t want the “SQLSaturday #” prefix, so I used another split to give me the event number.
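
This time the split is by position – “SQLSaturday #” is a fixed 13 characters – which produces the SplitColumnPos step from the listing below:

= Table.SplitColumn(SplitColumnDelimiter,"Column4.1",Splitter.SplitTextByPositions({0, 13}, false),{"Column4.1.1", "Column4.1.2"})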

I then removed the first split column, as it was no longer needed.

 

Data Explorer generates default column names (Column1, Column2), and you also have the option to use the first row as headers. My data didn’t have any headers in it, so I manually renamed the columns.
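
The rename shows up as a single RenamedColumns step in the listing below:

= Table.RenameColumns(HiddenColumns1,{{"Column2", "Date"}, {"Column4.1.2", "Number"}, {"Column4.2", "Name"}, {"Column5", "Location"}})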

Each time you apply a transform in Data Explorer, it adds a “Step” to your process. You can view them all by expanding the Steps tab on the left-hand side of the window. Clicking on the various steps shows you what the data looks like at each stage, and you can view the formula being applied as well. I’m sure advanced users will work directly in the formula bar, but most of the options you’ll need are exposed through the UI.

After renaming all the columns, my data looks like this:

When I click the Done button, a new sheet is added in Excel with the query I just built. I can provide a friendly name for the query, as well as modify it and refresh the data.

For review, my queries at each step were:

 

Source = Table.TransformColumns(Data0,{},Value.FromText)
HiddenColumns = Table.RemoveColumns(ChangedType,{"Column1", "Column3"})
SplitColumnDelimiter = Table.SplitColumn(HiddenColumns,"Column4",Splitter.SplitTextByEachDelimiter({"-"}, null, false),{"Column4.1", "Column4.2"})
SplitColumnPos = Table.SplitColumn(SplitColumnDelimiter,"Column4.1",Splitter.SplitTextByPositions({0, 13}, false),{"Column4.1.1", "Column4.1.2"})
HiddenColumns1 = Table.RemoveColumns(SplitColumnPos,{"Column4.1.1"})
RenamedColumns = Table.RenameColumns(HiddenColumns1,{{"Column2", "Date"}, {"Column4.1.2", "Number"}, {"Column4.2", "Name"}, {"Column5", "Location"}})

Overall, I’m really liking the Data Explorer experience so far. I think the team has done an excellent job providing a UI that is easy and intuitive. Looking at the formula language, I can see that there’s a lot of capability behind it as well.

In my next post, I’ll explore some of the more advanced functionality.