Retrieving the Value of a Sensitive Parameter in a Script

SQL Server 2012 introduces Sensitive Parameters. These parameter values are masked in the UI, protected according to the ProtectionLevel of your package/project when saved to disk, and encrypted when deployed to the SSIS Catalog. When using SSIS Expressions, sensitive parameters can only be assigned to sensitive properties (such as a connection manager’s Password property).

You may wish to use a sensitive parameter in a script (for example, if you’re doing FTP tasks using a Script Task). Retrieving a sensitive parameter’s value like any other variable/parameter will result in an error at runtime.

For example, take the following Script Task code, which reads the value of the “FtpPassword” parameter like a regular variable:
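
(A sketch, assuming a package-scoped parameter that has been added to the Script Task’s ReadOnlyVariables list.)

    // Reading the parameter like a normal variable - this fails because FtpPassword is marked Sensitive
    string ftpPassword = Dts.Variables["$Package::FtpPassword"].Value.ToString();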

This results in the following exception at runtime:

Error: Accessing value of the parameter variable for the sensitive parameter “<XXX>” is not allowed. Verify that the variable is used properly and that it protects the sensitive information.

The correct way to retrieve the value is to use the GetSensitiveValue() method. The following Script Task code returns the value successfully:
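
(Again a sketch, assuming the same package-scoped parameter.)

    // GetSensitiveValue() returns the decrypted value as an object
    string ftpPassword = Dts.Variables["$Package::FtpPassword"].GetSensitiveValue().ToString();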

Note: Once you retrieve the value out of the sensitive parameter, you are responsible for ensuring it is handled securely. Be careful not to expose the actual value unintentionally through logging. See the Handling Passwords entry in Books Online for more information.

SSIS Quick Tip: Copy & Paste Packages into a Visual Studio Project

The Add Existing Package option for SSIS projects is very helpful, but it has one major limitation: the dialog only allows you to add a single package at a time. When working with a lot of packages, this can be a time-consuming process.


Thankfully, there is a shortcut. You can add multiple packages to a project using Copy & Paste.

  1. In Windows Explorer, select the packages you want to add to your project, and copy them (Ctrl-C)
  2. Switch to Visual Studio
  3. Select the Project node
  4. Paste (Ctrl-V)

The packages will be added to the SSIS project (and copied to the project directory).


Note: This works in all versions of SSIS.

SQL Bits X Schedule has been Posted

The SQL Bits X schedule has been posted and I am very impressed with the lineup! Most conferences will have 3-4 tracks running in parallel – Bits will have 8 sessions to choose from in most timeslots, offering you a wide variety of options.

  • Friday schedule
  • Saturday (Community Day) schedule

More details about the pre-con training event I’m doing can be found here.

Here are some of the SSIS-related sessions from the conference:

Speaking at SQL Saturday #105 in Dublin, Ireland

I’m honored to be presenting at SQL Saturday #105 in Dublin, Ireland on March 24th. They recently added a 4th session track to keep up with demand, and I was more than happy to fill in one of the timeslots with a session about the developer experience enhancements in SSIS 2012.

I’ll also be covering SSIS Performance Design Patterns in a full day training session the day before the event. The content will be similar to the pre-con session I’m doing the following week at SQL Bits X, with a couple of changes – I’ll be presenting solo, and will spend the last portion of the day covering the enhancements we made to SSIS in SQL Server 2012.

Registration information and details are available here.


SSIS Quick Tip: Required Parameters

SQL Server 2012 introduces the new Parameter model. Setting a parameter’s Required property to True means that a value must be supplied at runtime by the caller of the package (dtexec, SSMS, SQL Agent, the Execute Package Task, etc.). This means that although the parameter has a default value at design time, that default will never be used once the project is deployed.

Example

I have a package which loads a single flat file into a destination table. I’ve defined an InputFile parameter for the path of the file I want to load, and use it to set the ConnectionString of my Flat File Connection Manager. For my package to validate, I need to give it a default value (which points to a sample data file I can use for testing).


Setting Required to True for this parameter ensures that the sample value won’t accidentally be used after the project is deployed to my test and production environments, as the package won’t run until I supply a new value.
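
For example, one way the caller supplies the value is through the SSIS Catalog’s T-SQL API. A minimal sketch (the folder, project, package, and file path names below are made up):

    DECLARE @execution_id bigint;

    EXEC SSISDB.catalog.create_execution
        @folder_name = N'ETL', @project_name = N'FlatFileLoads',
        @package_name = N'LoadCustomers.dtsx',
        @execution_id = @execution_id OUTPUT;

    -- object_type 30 = package parameter
    EXEC SSISDB.catalog.set_execution_parameter_value
        @execution_id = @execution_id, @object_type = 30,
        @parameter_name = N'InputFile',
        @parameter_value = N'\\fileserver\drops\customers.txt';

    EXEC SSISDB.catalog.start_execution @execution_id;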


This will hopefully prevent some of the painful-yet-amusing-because-they-didn’t-happen-to-me type stories I’ve heard over the years, like:

  • Having your ETL processes running for a week before you realize that you forgot to update a connection string, and have been updating the test data warehouse with real customer data
  • Loading sample data files that developers use for testing into your production data warehouse, causing such products as “YOUR MOM” and “I HATE THIS JOB” to show up on the company wide weekly sales report

How Can I Improve the Performance of my Destination Component?

Question: I’ve created a custom destination component, but I’m not happy with the performance. I tried using multiple threads, but the pipeline buffer doesn’t allow free traversal. What can I do?

I’d start by profiling the component to make sure you identify the correct bottlenecks. The most common source of problems is accessing the COM interfaces at runtime – make sure that you keep a reference to any object retrieved through IDTSxxx100 members to avoid repeated COM interop calls. It’s a best practice to cache everything you need in the pre-execute phase.
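
For example, a custom destination based on the PipelineComponent class might resolve its buffer column indexes once in PreExecute, so that ProcessInput never has to touch the COM objects. A rough sketch (columnIndexes is a hypothetical field on the component):

    private int[] columnIndexes;

    public override void PreExecute()
    {
        base.PreExecute();

        IDTSInput100 input = ComponentMetaData.InputCollection[0];
        columnIndexes = new int[input.InputColumnCollection.Count];

        // Resolve each column's position in the pipeline buffer exactly once
        for (int i = 0; i < input.InputColumnCollection.Count; i++)
        {
            columnIndexes[i] = BufferManager.FindColumnByLineageID(
                input.Buffer, input.InputColumnCollection[i].LineageID);
        }
    }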

If you’ve determined that the bottleneck is at the destination, and that multiple threads will improve performance, then you’ll need to make a copy of the data in the buffer before passing it to your destination. I did something similar in the old Data Services Destination sample on CodePlex (main/DataServices/Src/Destination/Component.cs). The logic was as follows (a rough sketch appears after the list):

  1. Loop through the buffer
  2. Copy row data into local object
  3. Add batch of data to a work item threadpool for insertion (or use the new Task Parallel Library)
  4. When loop completes, wait for all threads/work items to complete before returning from ProcessInput
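
A rough C# sketch of that ProcessInput pattern using the Task Parallel Library (InsertBatch, BatchSize, and the columnIndexes field are hypothetical members of the component):

    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        var tasks = new List<Task>();
        var batch = new List<object[]>();

        while (buffer.NextRow())
        {
            // Copy the values we need - the PipelineBuffer must not be used from other threads
            var row = new object[columnIndexes.Length];
            for (int i = 0; i < columnIndexes.Length; i++)
            {
                row[i] = buffer[columnIndexes[i]];
            }
            batch.Add(row);

            if (batch.Count >= BatchSize)
            {
                List<object[]> rowsToInsert = batch;
                tasks.Add(Task.Factory.StartNew(() => InsertBatch(rowsToInsert)));
                batch = new List<object[]>();
            }
        }

        if (batch.Count > 0)
        {
            List<object[]> rowsToInsert = batch;
            tasks.Add(Task.Factory.StartNew(() => InsertBatch(rowsToInsert)));
        }

        // Wait before returning - the buffer can be reused once ProcessInput completes,
        // and waiting lets us redirect error rows if an insert fails
        Task.WaitAll(tasks.ToArray());
    }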

In the sample I linked to above, a “batch” was a single row (as the destination only supported RBAR at the time), but you can configure this to be whatever batch size makes sense for your destination.

Note that step #4 is important, as the buffer object that is passed into ProcessInput can be destroyed/modified once the function completes. Waiting allows you to redirect error rows if insertions fail at the destination. The downside to waiting is that your destination component can’t start processing a new buffer until you’re “done” with the current one (i.e. have returned from ProcessInput). You can choose not to wait and return right after you’ve read all of the rows, but you must ensure that:

  1. You make a copy (a real copy – not just an object reference) of any data you need from the PipelineBuffer object
  2. When you receive EndOfRowset, make sure you complete all outstanding work items before returning from ProcessInput

The Data Services sample also contains a really useful class called ThreadPoolWait, which makes it easier to wait for threads to complete.

Hope that helps!

Processing Modes for the CDC Source

The CDC Source offers five different ways to retrieve change data. The format you receive the data in is determined by the “CDC processing mode” you select in the CDC Source Editor UI.

[Screenshot: CDC Source Editor UI]

The two primary modes are All and Net. Net is the best mode to use for most ETL processes, although it does introduce some overhead on the source system – see the descriptions of the modes below for details.

To describe the modes, we’ll use the following scenario.

  • CDC is enabled on the AdventureWorksDW DimCustomer table (similar to the example in this post)
  • Three new customers (CustomerKey = 30000, 30001, 30002) are added
  • The LastName for two of the customers (30001, 30002) is set to SMITH
  • One of the new customers is deleted (CustomerKey = 30002)

The SQL statements for these steps look roughly like this:
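
(A trimmed-down sketch – the real DimCustomer table has additional required columns, and the initial LastName values are made up for illustration.)

    -- Illustrative only: additional required DimCustomer columns are omitted
    INSERT INTO dbo.DimCustomer (CustomerKey, LastName) VALUES (30000, N'ANDERSON');
    INSERT INTO dbo.DimCustomer (CustomerKey, LastName) VALUES (30001, N'JONES');
    INSERT INTO dbo.DimCustomer (CustomerKey, LastName) VALUES (30002, N'BROWN');

    UPDATE dbo.DimCustomer SET LastName = N'SMITH' WHERE CustomerKey IN (30001, 30002);

    DELETE FROM dbo.DimCustomer WHERE CustomerKey = 30002;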

All

In this processing mode, a row will be returned for each change applied to the source table. In this example, the CDC Source returns six rows – three INSERTs, two UPDATEs, and one DELETE. The rows will be ordered by LSN (i.e. in the order in which they were applied). This mode is typically used in ETL processes that require heavy auditing (usually for compliance reasons). While this mode has less overhead on the source system (it simply returns the captured rows, and requires no additional roll-up processing), it does complicate the ETL process, as you will need to merge the changes yourself.

[Screenshot: rows returned in All mode]

The order in which the operations took place can be determined from the __$seqval field.

This processing mode is similar to using the cdc.fn_cdc_get_all_changes function.
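
Outside of SSIS, the equivalent query looks something like this (assuming the default capture instance name, dbo_DimCustomer):

    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_DimCustomer');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_all_changes_dbo_DimCustomer(@from_lsn, @to_lsn, N'all')
    ORDER BY __$seqval;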

Note: You will (typically) not want to use the CDC Splitter when processing all changes, as sending your change rows down multiple paths will make it difficult to apply them in the correct order.

All with old values

This processing mode is similar to All, except that you get two rows for each UPDATE statement – a row containing the values Before Update, and one containing the values After Update. Using our example, the CDC Source returns eight rows –  three INSERTs, two UPDATEs with Before Update values, two UPDATEs with After Update values, and one DELETE.

[Screenshot: rows returned in All with old values mode]

To distinguish between old and new values, check the __$operation value. A value of 3 indicates the row has Before Update values, 4 means After Update values.

Net

When using this processing mode, the source server will “roll up” all changes for a key into a single row. In this example, the CDC Source returns two rows – two INSERTs for the new rows. The row for CustomerKey 30001 will already have the updated LastName value, and CustomerKey 30002 is not returned at all, since it was deleted in the end. This mode does have a performance impact, as it requires the source server to apply the roll-up logic to figure out the final state of each row. However, it greatly simplifies the ETL logic needed for processing the change data – see the CDC Overview post for an example of how to process Net changes.

[Screenshot: rows returned in Net mode]

This processing mode is similar to using the cdc.fn_cdc_get_net_changes function.
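
Again, the equivalent query outside of SSIS (same capture instance assumption as above):

    DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_DimCustomer');
    DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

    SELECT *
    FROM cdc.fn_cdc_get_net_changes_dbo_DimCustomer(@from_lsn, @to_lsn, N'all');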

Note: This mode requires that the source table has a primary key or unique index. It also requires that the @supports_net_changes parameter was set to True when CDC was enabled on the source table. See sys.sp_cdc_enable_table for more details.

Net with update mask

This processing mode is similar to Net, except that there will be an additional boolean column (__$<column-name>__Changed) that indicates whether a column value changed. This may be a useful processing mode if you are only monitoring/processing changes to certain fields, or wish to provide additional processing logic when changes are made to a specific field.

[Screenshot: rows returned in Net with update mask mode]

Note: In this example, the __Changed field would be True for all rows, since they are all new (from the perspective of this load). If a change was made to an existing row, unchanged columns would have a False value.

Net with merge

The key things to keep in mind for this processing mode are:

  1. UPDATE and INSERT rows are grouped together with the same operation value (__$operation = 5)
  2. Grouping the rows together makes it easier to use the changes with a MERGE statement

From the cdc.fn_cdc_get_net_changes entry in Books Online for the all with merge row_filter_option:

Returns the LSN of the final change to the row in the metadata columns __$start_lsn. The column __$operation will be one of two values: 1 for delete and 5 to indicate that the operation needed to apply the change is either an insert or an update. The column __$update_mask is always NULL.

Because the logic to determine the precise operation for a given change adds to query complexity, this option is designed to improve query performance when it is sufficient to indicate that the operation needed to apply the change data is either an insert or an update, but it is not necessary to explicitly distinguish between the two. This option is most attractive in target environments where a merge operation is available directly, such as a SQL Server 2008 environment.

You’ll use the following data flow pattern when using this change mode:

  • Retrieve changes with CDC Source
  • Use CDC Splitter to divide rows
    • Note: only the DELETE and UPDATE paths will receive rows, as INSERTs are grouped with UPDATEs in this mode
  • Stage DELETE rows
  • Stage UPDATE rows
  • Use an Execute SQL Task after the data flow to apply DELETE changes in a batch operation
  • Use an Execute SQL Task to run a MERGE statement to process UPDATE rows in a single batch

This pattern is similar to the approach I described in my previous CDC post, except 1) new rows are not inserted into the final table within the data flow and 2) MERGE is used instead of UPDATE FROM … SELECT.
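
A sketch of the final MERGE statement (the staging table name is hypothetical, and a real statement would cover all of the columns you track):

    MERGE dbo.DimCustomer AS tgt
    USING dbo.Stage_DimCustomer_Changes AS src
        ON (tgt.CustomerKey = src.CustomerKey)
    WHEN MATCHED THEN
        UPDATE SET tgt.LastName = src.LastName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerKey, LastName)
        VALUES (src.CustomerKey, src.LastName);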

[Screenshot: rows returned in Net with merge mode]

SSIS Training at SQL Bits X in London

I am really excited to be invited back to SQL Bits to do a full day of SSIS training. Once again, I’ll be covering my favorite topic – SSIS Performance Design Patterns. The content will be similar to the talk I gave at SQL Bits 9, with a couple of big changes. First, I’ll be co-presenting with John Welch, SQL MVP and SSIS Community Hero. John will add his own years of SSIS experience to the talk, and I expect he’ll do a great job of keeping me honest. Second, we’ve updated the talk with more Data Warehousing specific design patterns, and additional SQL Server 2012 content.


SQL Bits X takes place March 29th – 31st at the Novotel London West. Check the site for registration details.

I’ll also be delivering two additional sessions relating to SSIS in SQL Server 2012 during the Friday and Saturday events.

SSIS in SQL Server 2012, Part 1: Developer Enhancements

The SQL Server 2012 release of SQL Server Integration Services (SSIS) includes a broad range of new features and enhancements designed to ease adoption for new users and increase productivity for experienced SSIS developers. In this demo-intensive session, Matt Masson – Senior Software Development Engineer on the SSIS team – guides you through the new developer-focused functionality in SQL Server 2012. We’ll look at the productivity enhancements in SQL Server Data Tools (formerly the Business Intelligence Development Studio), team development enhancements, and the new Change Data Capture (CDC) features that ship with the product. If you haven’t seen the new SSIS features in SQL Server 2012 yet, you won’t want to miss this session!

SSIS in SQL Server 2012, Part 2: Deployment and Management

The SQL Server 2012 release of SQL Server Integration Services (SSIS) includes a broad range of new capabilities designed to improve the deployment, management and monitoring of SSIS solutions. The core of these capabilities is the new SSIS Catalog. In this session, Matt Masson – Senior Software Development Engineer on the SSIS team – will take a deep look at the SSIS server, including its architecture, implementation and use.

Hope to see you there!

Too Many Sources in a Data Flow

This was a recent customer issue that I wanted to share.

The customer was transferring a large amount of data between two tables, without performing any transformations on it. They wanted to run multiple transfer streams in parallel, and created a way to split up their source data into six partitions. In their original design, they had six Source -> Destination combinations in a single Data Flow Task.

[Screenshot: six source components in a single data flow]

This package worked fine in their development environment, but would result in a deadlock when they ran it in production. Looking at the SSIS logs, they could see that the package was hanging in the Pre-Execute phase of the Data Flow. They ran a profiler trace and found that SQL Server was showing ASYNC_NETWORK_IO waits for five sessions, but the sixth was waiting for memory resources (wait type of RESOURCE_SEMAPHORE). Looking at sys.dm_exec_query_resource_semaphores (shown below) confirmed this.

[Screenshot: sys.dm_exec_query_resource_semaphores output showing the waiting session]
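
For reference, queries along these lines show the overall grant pressure and which sessions are stuck waiting for a memory grant:

    -- Overall memory grant pressure per resource semaphore
    SELECT resource_semaphore_id, grantee_count, waiter_count,
           granted_memory_kb, available_memory_kb
    FROM sys.dm_exec_query_resource_semaphores;

    -- Per-session view of pending and granted memory grants
    SELECT session_id, requested_memory_kb, granted_memory_kb, wait_time_ms
    FROM sys.dm_exec_query_memory_grants;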

There were two factors contributing to the deadlock:

  1. The source server did not have enough memory to run six of these queries in parallel
  2. In SSIS, the Data Flow does not start executing until all components have completed the Pre-Execute phase

When the data flow runs, each source component prepares its query with the source server in the pre-execute phase. The server was able to grant enough memory for the first five queries (session_id 57 to 61), but the sixth query (session_id = 63) was told to wait. This prevents the pre-execute phase of the sixth OLE DB Source component from completing. The data flow won’t start running until the pre-execute phase completes for each component, and the source server can’t grant memory until the other queries complete, which results in a deadlocked package.

To resolve this issue, the customer changed the package design pattern to have a single Source -> Destination combination per data flow, with six parallel Data Flow Tasks. With this design, they were still occasionally getting RESOURCE_SEMAPHORE waits, but the waits didn’t last forever, as the other Data Flow Tasks were able to complete their queries independently. (They were eventually able to remove the waits altogether by optimizing their source query.)

An alternate design (which would be my preference) would have been to have a single Data Flow per package, parameterize the source query, and run six instances of the package in parallel.

Summary

Beware of doing too much in a single Data Flow! Unlike Tasks in the control flow, which run independently, the Data Flow can’t start until each component has finished its initialization in the Pre-Execute phase. This is especially important for components that can spend a long time in the Pre-Execute phase, such as a Lookup Transform with Full Cache mode. A Data Flow should only have a single source — unless the data flow is combining data from multiple sources. If your data flow has multiple data paths that do not overlap, then you should consider using separate Data Flow tasks (if they need to be run “together”, put them in the same Sequence Container).