Speaking at SQL Saturday #111 | Atlanta

Wow, what a lineup for SQL Saturday #111. I’m still amazed and honored anytime I’m on a speaker list of this caliber!

I’ll be co-presenting a pre-con event (full day training) with Andy Leonard on April 13th, and talking about SSIS in SQL Server 2012 as part of the main event. Rumor has it that I might even get to co-present with Patrick LeBlanc, who has been working hard on High Availability Solutions for SSIS in SQL 2012. It’s all very exciting – I hope to see you there!


If you can’t make it to Atlanta, you can also see me for SSIS Performance Design Pattern training at SQL Saturday #105 in Dublin, and SQL Bits X in London.

Exit Codes, DTEXEC, and SSIS Catalog

DTEXEC has two execution modes when running packages in the SSIS Catalog.

By default, the process runs asynchronously – the package execution is scheduled on the server, and DTEXEC returns immediately without waiting for the execution to complete. In this case, %ERRORLEVEL% indicates whether the package execution was created successfully; it does not reflect the result of the execution.

To get the result of the package execution, you can set the $ServerOption::SYNCHRONIZED property to True. In this mode, DTEXEC will wait for the package execution to complete. If the execution was successful, the %ERRORLEVEL% will be set to 0. If it failed, it will be set to 1.

See the examples below – the first execution runs a package (2-RowCounts.dtsx) on my local instance, and completes successfully. The second execution runs a package (6-ErrorContext.dtsx) that will fail.
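Here is a sketch of what those command lines look like (the SSISDB folder and project names below are placeholders – substitute your own deployment path):

  REM Run synchronously so %ERRORLEVEL% reflects the package result
  dtexec /ISSERVER "\SSISDB\Demo\ExecutionDemo\2-RowCounts.dtsx" /SERVER "localhost" /Par "$ServerOption::SYNCHRONIZED(Boolean)";True
  echo %ERRORLEVEL%
  REM prints 0 - the package succeeded

  dtexec /ISSERVER "\SSISDB\Demo\ExecutionDemo\6-ErrorContext.dtsx" /SERVER "localhost" /Par "$ServerOption::SYNCHRONIZED(Boolean)";True
  echo %ERRORLEVEL%
  REM prints 1 - the package failed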

 

For more details on the new DTEXEC command line parameters for the SSIS Catalog, see the dtexec Utility entry in Books Online.

Queries for Data Flow Component Performance in SQL 2012

The built-in SSIS Catalog reports in SSMS show a breakdown of task and component performance of a package execution when you run with a logging level of “Performance”.

SSIS Catalog package performance report

For a more detailed or customized view, you can query the catalog views directly. The key view for data flow performance information is [catalog].[execution_component_phases]. Here are two sample queries – the first returns the total time spent on each component (similar to what is displayed in the performance report), and the second returns the times broken down by phase.
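A sketch of what those queries can look like (this picks the most recent execution in the catalog):

  DECLARE @execution_id BIGINT = (SELECT MAX(execution_id) FROM [SSISDB].[catalog].[executions]);

  -- 1. Total active time per data flow component
  SELECT package_name, task_name, subcomponent_name,
         SUM(DATEDIFF(MILLISECOND, start_time, end_time)) AS active_time_ms
  FROM [SSISDB].[catalog].[execution_component_phases]
  WHERE execution_id = @execution_id
  GROUP BY package_name, task_name, subcomponent_name
  ORDER BY active_time_ms DESC;

  -- 2. Time broken down by phase (Validate, PreExecute, ProcessInput, etc.)
  SELECT package_name, task_name, subcomponent_name, phase,
         SUM(DATEDIFF(MILLISECOND, start_time, end_time)) AS phase_time_ms
  FROM [SSISDB].[catalog].[execution_component_phases]
  WHERE execution_id = @execution_id
  GROUP BY package_name, task_name, subcomponent_name, phase
  ORDER BY package_name, task_name, subcomponent_name, phase_time_ms DESC;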

 

Running these queries returns something like this:

image

Some notes:

  1. You need to run the package with a Performance logging level to capture this data
  2. You can set a default logging level for the Catalog
  3. The queries above determine the execution_id for the latest run – you will need to change this value to look at the results of a specific package execution

Retrieving the Value of a Sensitive Parameter in a Script

SQL Server 2012 introduces Sensitive Parameters. These parameter values are masked in the UI, protected according to the ProtectionLevel of your package/project when saved to disk, and encrypted when deployed to the SSIS Catalog. When using SSIS Expressions, sensitive parameters can only be assigned to sensitive properties (such as a connection manager’s Password property).

You may wish to use a sensitive parameter in a script (for example, if you’re doing FTP tasks using a Script Task). Retrieving a sensitive parameter’s value like any other variable/parameter will result in an error at runtime.

For example, consider the following Script Task code, which retrieves the value of the “FtpPassword” parameter like any other variable:
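A sketch of that code is shown below – it assumes FtpPassword is a package-scoped parameter listed in the task’s ReadOnlyVariables:

  public void Main()
  {
      // Reading .Value directly, as you would for a normal variable or parameter
      string ftpPassword = Dts.Variables["$Package::FtpPassword"].Value.ToString();

      Dts.TaskResult = (int)ScriptResults.Success;
  }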

This results in the following exception at runtime:

 

Error: Accessing value of the parameter variable for the sensitive parameter “<XXX>” is not allowed. Verify that the variable is used properly and that it protects the sensitive information.

The correct way to retrieve the value is by using the GetSensitiveValue() method. The following Script Task code will return the value successfully.
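A sketch with the same assumptions as above:

  public void Main()
  {
      // GetSensitiveValue() returns the decrypted value as an object
      string ftpPassword = Dts.Variables["$Package::FtpPassword"].GetSensitiveValue().ToString();

      Dts.TaskResult = (int)ScriptResults.Success;
  }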

Note: Once you retrieve the value of a sensitive parameter, you are responsible for ensuring it is handled securely. Be careful not to expose the actual value unintentionally through logging. See the Handling Passwords entry in Books Online for more information.

SSIS Quick Tip: Copy & Paste Packages into a Visual Studio Project

The Add Existing Package option for SSIS projects is very helpful, but it has one major limitation: the dialog only allows you to add a single package at a time. When you’re working with a lot of packages, this can be a time-consuming process.

image

Thankfully, there is a shortcut. You can add multiple packages to a project using Copy & Paste.

  1. Select the packages you want to add to your project in Windows Explorer, and Copy (Ctrl-C)
  2. Switch to Visual Studio
  3. Select the Project node
  4. Paste (Ctrl-V)

The packages will be added to the SSIS project (and copied to the project directory).

image

Note: This works in all versions of SSIS.

SQL Bits X Schedule has been Posted

The SQL Bits X schedule has been posted and I am very impressed with the lineup! Most conferences will have 3-4 tracks running in parallel – Bits will have 8 sessions to choose from in most timeslots, offering you a wide variety of options.

Friday schedule
Saturday (Community Day) schedule

More details about the pre-con training event I’m doing can be found here.

Here are some of the SSIS-related sessions from the conference:

Speaking at SQL Saturday #105 in Dublin, Ireland

I’m honored to be presenting at SQL Saturday #105 in Dublin, Ireland on March 24th. They recently added a 4th session track to keep up with demand, and I was more than happy to fill in one of the timeslots with a session about the developer experience enhancements in SSIS 2012.

I’ll also be covering SSIS Performance Design Patterns in a full day training session the day before the event. The content will be similar to the pre-con session I’m doing the following week at SQL Bits X, with a couple of changes – I’ll be presenting solo, and will spend the last portion of the day covering the enhancements we made to SSIS in SQL Server 2012.

Registration information and details are available here.

SQL Saturday #105

SSIS Quick Tip: Required Parameters

SQL Server 2012 introduces the new Parameter model. Setting a parameter’s Required property to True means that a value must be supplied at runtime by the caller of the package (dtexec, SSMS, SQL Agent, the Execute Package Task, etc.). This means that although the parameter has a default value at design time, that default will never be used once the project is deployed.

Example

I have a package which loads a single flat file into a destination table. I’ve defined an InputFile parameter for the path of the file I want to load, and use it to set the ConnectionString of my Flat File Connection Manager. For my package to validate, I need to give it a default value (which points to a sample data file I can use for testing).

image

Setting Required to True for this parameter ensures that I won’t accidentally use the sample value after deploying the project to my test and production environments, as the package won’t run until I specify a new value.
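For example, if you start the package from T-SQL (a sketch – the folder, project, and path names below are placeholders), the execution won’t start until the caller provides a value for InputFile:

  DECLARE @execution_id BIGINT;
  EXEC [SSISDB].[catalog].[create_execution]
       @folder_name = N'Demo', @project_name = N'FlatFileLoad', @package_name = N'LoadCustomers.dtsx',
       @use32bitruntime = False, @reference_id = NULL, @execution_id = @execution_id OUTPUT;

  -- Required parameters must be given a value before the execution can start
  EXEC [SSISDB].[catalog].[set_execution_parameter_value] @execution_id,
       @object_type = 30,  -- 30 = package parameter
       @parameter_name = N'InputFile',
       @parameter_value = N'\\etlserver\drop\customers_20120301.txt';

  EXEC [SSISDB].[catalog].[start_execution] @execution_id;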

image

This will hopefully prevent some of the painful-yet-amusing-because-they-didn’t-happen-to-me type stories I’ve heard over the years, like:

  • Having your ETL processes running for a week before you realize that you forgot to update a connection string, and have been updating the test data warehouse with real customer data
  • Loading sample data files that developers use for testing into your production data warehouse, causing such products as “YOUR MOM” and “I HATE THIS JOB” to show up on the company wide weekly sales report

How Can I Improve the Performance of my Destination Component?

Question: I’ve created a custom destination component, but I’m not happy with the performance. I tried using multiple threads, but the pipeline buffer doesn’t allow free traversal. What can I do?

 

I’d start by profiling the component to make sure you identify the correct bottlenecks. The most common source of problems is accessing the COM interfaces at runtime – make sure that you keep a reference to any object retrieved through IDTSxxx100 members to avoid repeated COM interop calls. It’s a best practice to cache everything you need in the pre-execute phase.

If you’ve determined that the bottleneck is at the destination, and that multiple threads will improve performance, then you’ll need to make a copy of the data in the buffer before passing it to your destination. I did something similar in the old Data Services Destination sample on CodePlex (main/DataServices/Src/Destination/Component.cs). The logic was:

  1. Loop through the buffer
  2. Copy row data into local object
  3. Add batch of data to a work item threadpool for insertion (or use the new Task Parallel Library)
  4. When loop completes, wait for all threads/work items to complete before returning from ProcessInput

In the sample I linked to above, a “batch” was a single row (as the destination only supported RBAR at the time), but you can configure this to be whatever batch size makes sense for your destination.
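Here is a rough sketch of what that ProcessInput logic can look like – WriteBatch and _batchSize are illustrative members of your component (not part of the SSIS API), error-row handling is omitted, and the code assumes the System.Collections.Generic, System.Threading.Tasks, and Microsoft.SqlServer.Dts.Pipeline namespaces:

  public override void ProcessInput(int inputID, PipelineBuffer buffer)
  {
      var workItems = new List<Task>();
      var batch = new List<object[]>();

      // 1. Loop through the buffer
      while (buffer.NextRow())
      {
          // 2. Copy the row data into a local object (a real copy, not a buffer reference)
          var row = new object[buffer.ColumnCount];
          for (int i = 0; i < buffer.ColumnCount; i++)
          {
              row[i] = buffer[i];
          }
          batch.Add(row);

          // 3. Queue a batch for insertion once it reaches the configured size
          if (batch.Count >= _batchSize)
          {
              var work = batch;
              batch = new List<object[]>();
              workItems.Add(Task.Factory.StartNew(() => WriteBatch(work)));
          }
      }

      if (batch.Count > 0)
      {
          var work = batch;
          workItems.Add(Task.Factory.StartNew(() => WriteBatch(work)));
      }

      // 4. Wait for all outstanding work before returning, since the buffer
      //    can be destroyed/modified once ProcessInput completes
      Task.WaitAll(workItems.ToArray());
  }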

Note that step #4 is important, as the buffer object that is passed into ProcessInput can be destroyed/modified once the function completes. Waiting allows you to redirect error rows if insertions fail at the destination. The downside to waiting is that your destination component can’t start processing a new buffer until you’re “done” with the current one (i.e. have returned from ProcessInput). You can choose not to wait and return right after you’ve read all of the rows, but you must ensure that:

  1. You make a copy (a real copy – not just an object reference) of any data you need from the PipelineBuffer object
  2. When you receive EndOfRowset, make sure you complete all outstanding work items before returning from ProcessInput

The Data Services sample also contains a really useful class called ThreadPoolWait, which makes it easier to wait for threads to complete.

Hope that helps!

Processing Modes for the CDC Source

The CDC Source offers five different ways to retrieve change data. The format you receive the data in is determined by the “CDC processing mode” you select in the CDC Source Editor UI.

Cdc_source_ui

The two primary modes are All Changes and Net Changes. Net is the best mode to use for most ETL processes, although it does introduce some overhead on the source system – see the descriptions of the modes below for details.

To describe the modes, we’ll use the following scenario.

  • CDC is enabled on the AdventureWorksDW DimCustomer table (similar to the example in this post)
  • Three new customers (CustomerKey = 30000, 30001, 30002) are added
  • The LastName for two of the customers (30001, 30002) is set to SMITH
  • One of the new customers is deleted (CustomerKey = 30002)

The SQL statements for these steps look like this:
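A sketch of those statements (the column list is abbreviated – DimCustomer has additional columns, and the name values are just illustrative):

  USE AdventureWorksDW;

  -- Add three new customers (CustomerKey is an identity column in DimCustomer)
  SET IDENTITY_INSERT dbo.DimCustomer ON;
  INSERT INTO dbo.DimCustomer (CustomerKey, CustomerAlternateKey, FirstName, LastName)
  VALUES (30000, 'AW00030000', 'John', 'DOE'),
         (30001, 'AW00030001', 'Jane', 'DOE'),
         (30002, 'AW00030002', 'Sam',  'DOE');
  SET IDENTITY_INSERT dbo.DimCustomer OFF;

  -- Change the LastName for two of the new customers
  UPDATE dbo.DimCustomer SET LastName = 'SMITH' WHERE CustomerKey IN (30001, 30002);

  -- Delete one of the new customers
  DELETE FROM dbo.DimCustomer WHERE CustomerKey = 30002;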

All

In this processing mode, a row is returned for each change applied to the source table. In this example, the CDC Source returns six rows – three INSERTs, two UPDATEs, and one DELETE. The rows will be ordered by LSN (i.e. in the order in which they were applied). This mode is typically used in ETL processes that require heavy auditing (usually for compliance reasons). While this mode has less overhead on the source system (it simply returns the captured rows, and requires no additional roll-up processing), it does complicate the ETL process, as you will need to merge the changes yourself.

Cdc_all

The order in which the operations took place can be determined from the __$seqval field.

This processing mode is similar to using the cdc.fn_cdc_get_all_changes function.
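For example (assuming the capture instance is named dbo_DimCustomer):

  DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_DimCustomer');
  DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

  SELECT *
  FROM cdc.fn_cdc_get_all_changes_dbo_DimCustomer(@from_lsn, @to_lsn, N'all')
  ORDER BY __$start_lsn, __$seqval;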

Note: You will (typically) not want to use the CDC Splitter when processing all changes, as sending your change rows down multiple paths will make it difficult to apply them in the correct order.

All with old values

This processing mode is similar to All, except that you get two rows for each UPDATE statement – a row containing the values Before Update, and one containing the values After Update. Using our example, the CDC Source returns eight rows –  three INSERTs, two UPDATEs with Before Update values, two UPDATEs with After Update values, and one DELETE.

Cdc_all_with_old

To distinguish between old and new values, check the __$operation value. A value of 3 indicates the row has Before Update values, 4 means After Update values.

Net

When using this processing mode, the CDC server will “roll up” all changes for a key into a single row. In this example, the CDC Source returns two rows – two INSERTs for the new rows. The row for CustomerKey 30001 will already have the updated LastName value. CustomerKey 30002 is not returned at all, since it was deleted at the end. This mode does have a performance impact, as it requires the source server to apply the roll-up logic to figure out the final state of each row. However, it greatly simplifies the ETL logic needed for processing the change data – see the CDC Overview post for an example of how to process Net changes.

Cdc_net

This processing mode is similar to using the cdc.fn_cdc_get_net_changes function.

Note: This mode requires that the source table has a primary key or unique index. It also requires that the @supports_net_changes parameter was set to True when CDC was enabled on the source table. See sys.sp_cdc_enable_table for more details.
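For example (again assuming a dbo_DimCustomer capture instance):

  DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_DimCustomer');
  DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

  SELECT *
  FROM cdc.fn_cdc_get_net_changes_dbo_DimCustomer(@from_lsn, @to_lsn, N'all');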

Net with update mask

This processing mode is similar to Net, except that there will be an additional boolean column (__$<column-name>__Changed) that indicates whether a column value changed. This may be a useful processing mode if you are only monitoring/processing changes to certain fields, or wish to provide additional processing logic when changes are made to a specific field.

Cdc_net_update_mask

Note: In this example, the __Changed field would be True for all rows, since they are all new (from the perspective of this load). If a change was made to an existing row, unchanged columns would have a False value.

Net with merge

The key things to keep in mind for this processing mode are:

  1. UPDATE and INSERT rows are grouped together with the same operation value (__$operation = 5)
  2. Grouping the rows together makes it easier to use the changes with a MERGE statement

From the cdc.fn_cdc_get_net_changes entry in Books Online for the all with merge row_filter_option:

Returns the LSN of the final change to the row in the metadata columns __$start_lsn. The column __$operation will be one of two values: 1 for delete and 5 to indicate that the operation needed to apply the change is either an insert or an update. The column __$update_mask is always NULL.

Because the logic to determine the precise operation for a given change adds to query complexity, this option is designed to improve query performance when it is sufficient to indicate that the operation needed to apply the change data is either an insert or an update, but it is not necessary to explicitly distinguish between the two. This option is most attractive in target environments where a merge operation is available directly, such as a SQL Server 2008 environment.

You’ll use the following data flow pattern when using this change mode:

  • Retrieve changes with CDC Source
  • Use CDC Splitter to divide rows
    • Note: only the DELETE and UPDATE paths will receive rows, as INSERTs are grouped with UPDATEs in this mode
  • Stage DELETE rows
  • Stage UPDATE rows
  • Use an Execute SQL Task after the data flow to apply DELETE changes in a batch operation
  • Use an Execute SQL Task to run a MERGE statement to process UPDATE rows in a single batch

This pattern is similar to the approach I described in my previous CDC post, except 1) new rows are not inserted into the final table within the data flow and 2) MERGE is used instead of UPDATE FROM … SELECT.
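A sketch of the two Execute SQL Task statements at the end of this pattern (the staging table names and column list are illustrative):

  -- Apply staged DELETE rows in a single batch
  DELETE tgt
  FROM dbo.DimCustomer AS tgt
  INNER JOIN stg.DimCustomer_Deletes AS del ON tgt.CustomerKey = del.CustomerKey;

  -- Apply staged INSERT/UPDATE rows (__$operation = 5) with a single MERGE
  -- (if the target key is an identity column, either enable IDENTITY_INSERT or
  --  generate the surrogate key here instead of copying it from the source)
  MERGE dbo.DimCustomer AS tgt
  USING stg.DimCustomer_Updates AS src ON tgt.CustomerKey = src.CustomerKey
  WHEN MATCHED THEN
      UPDATE SET tgt.FirstName = src.FirstName,
                 tgt.LastName  = src.LastName
  WHEN NOT MATCHED BY TARGET THEN
      INSERT (CustomerKey, CustomerAlternateKey, FirstName, LastName)
      VALUES (src.CustomerKey, src.CustomerAlternateKey, src.FirstName, src.LastName);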

Cdc_net_with_merge