Using SSIS Term Extraction to Process Search Terms

This blog uses WordPress, and I recently started using the Jetpack plugin. One of the features that Jetpack provides is a list of search terms that people used to arrive at your site. It doesn’t account for all of the views of your site (or even the majority of them), but it is interesting to see what people are looking for when they come to your site.

Yes, according to the site stats, most people are searching for my name when they arrive at my site. I’m not sure how I should feel about that.

From these full search queries, I wanted to see what the most common terms were. This blog post describes how I did this using the SSIS Term Extraction transform.

Extracting the Data

The first step was to actually get the list of search terms from my blog’s dashboard. When dealing with data on a webpage, I typically turn to Excel, and figured I could make use of the new Data Explorer add-in. Unfortunately, since my blog requires you to be logged in, Data Explorer ends up getting (silently) redirected to the login page, with no way to proceed. I had hit this issue before, so I tried the regular “From Web” functionality in Excel. This let me log into the site, but unfortunately, Excel didn’t recognize the data on the page as a table, which prevented me from creating a refreshable web query. Alas, I had to resort to the low-tech approach of Copy & Paste, but it did the job.

I copied and pasted the data from the web page into Excel, and saved it as CSV.
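The file ends up with two columns – the search phrase and its view count. It looks something like this (the terms and numbers here are made up for illustration):

    "Search","Views"
    "ssis term extraction",12
    "flat file connection manager",5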

Why CSV and not .xls/.xlsx? Simplicity. One advantage of reading from Excel is that you get some metadata and limited query support from the Access Database Engine provider (Jet/ACE). However, the intricacies of working with Excel typically make me prefer a plain CSV file. Unless I need a refreshable query, CSV is my preferred format for all data imports (it’s faster, too).

Processing with SSIS

I created a new SSIS package, added a flat file connection manager, and pointed it at my recently created CSV file. I specified the Text qualifier (Excel uses " (double quote) by default), and marked that the first row contains headers.

On the Advanced tab, I set the data types (DT_WSTR(500) for the search terms, DT_I4 for the view counts).

Note that even though the text values in the CSV are DT_STR (ANSI strings – no special characters), I used DT_WSTR because that is what the Term Extraction transform supports. Alternatively, we could have used a Data Conversion or Derived Column transform to perform the conversion, but I’d rather do it directly in the Source component.
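For example, if you went the Derived Column route, an expression along these lines would perform the cast (assuming the input column is named Search):

    (DT_WSTR,500)[Search]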

I then added a Data Flow Task containing a Flat File Source (which used my new connection manager) and a Term Extraction transform. There’s not much configuration involved with the term extraction – you select the column you want to analyze, and (optionally) provide names for the output columns.

On the Advanced tab, I changed the Frequency threshold to 1 and the Maximum length of term to 20. Changing these values can affect the performance of the component, but since we’re processing a limited number of rows (about 500 total), I figured it was safe to do.

The Term Extraction transform redirects error rows by default, so you’ll get a warning if you don’t configure the error output path. Since we’re not expecting any errors in our list of simple terms, I configured the error output to Fail Component instead.

I then closed off the data flow to see the results with a data viewer.

Running the package gave me a breakdown of the common terms.

The more often a term appears in the input, the higher its score. This is really useful, but unfortunately it’s not completely representative of the actual searches, since the values had already been aggregated – each query appears only once in the input, regardless of its “Views” count from the original data set. To get more accurate numbers, I had to expand the results before processing them.

Expanding the Search Results with a Script Component

By default, Script Components are synchronous – among other things, this means they have the same number of rows coming out as going in. To expand the view counts correctly, we need to add more rows to the data flow, which means making the Script Component asynchronous.

I added a Script Component (transform) right after my Flat File Source, and opened the editor.

I selected both input columns.

I went to the Inputs and Outputs tab, selected the single output, and set the SynchronousInputID value to None.

I added a column for the Term (DT_WSTR(500)).

I then opened the script editor to start entering the code.

The code is pretty simple. Using an asynchronous Script Component isn’t that different from a synchronous one – you just need to remember that you’re adding rows to a new output buffer, rather than changing values in the current one. The only method I needed to worry about is the one that processes the incoming rows – Input0_ProcessInputRow.

For each incoming row, I look at the term (Search) and the number of times it appears (Views), and then programmatically add a row containing the term to the output buffer once for each time it was viewed.
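Here’s a minimal sketch of that method, assuming the default Input0Buffer/Output0Buffer names generated by the designer, and the Search, Views, and Term columns described above:

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Emit one output row per view, so a term that was searched
        // for 10 times shows up 10 times in the data flow.
        for (int i = 0; i < Row.Views; i++)
        {
            Output0Buffer.AddRow();
            Output0Buffer.Term = Row.Search;
        }
    }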

Running the package after this change, I see a lot more rows going into the Term Extraction transform than I did before (1043 vs. 498), and more accurate scores as a result.

That’s it – thanks for reading!
