Resources for SSIS Performance Best Practices

Simple post today.

Here are a number of links to resources that I reference in my Performance Design Patterns presentation.

 


You can see me for SSIS Performance Design Pattern training at SQL Saturday #105 in Dublin, SQL Bits X in London, and SQL Saturday #111 in Atlanta.

CDC Control Task Operations

The CDC Control Task is used for a number of different operations. This post lists each of those operations, and briefly describes when to use them.

CDC Control Task Editor

Mark initial load start

This operation is used when executing an initial load from an active database without a snapshot. It is invoked at the beginning of an initial-load package to record the current LSN in the source database before the initial-load package starts reading the source tables. A walkthrough of how this process works can be found in my CDC in SSIS for SQL Server 2012 post.

Mark initial load end

This operation is used when executing an initial load from an active database without a snapshot. It is invoked at the end of an initial-load package to record the current LSN in the source database after the initial-load package finished reading the source tables. This LSN is determined by recording the current time when this operation occurred and then querying the cdc.lsn_time_mapping table in the CDC database looking for a change that occurred after that time

Mark CDC start

This operation is used when then the initial load is made from a snapshot database database or from a quiescence database. It is invoked at any point within the initial load package. The operation accepts a parameter that can be a snapshot LSN, a name of a snapshot database (from which the snapshot LSN will be derived automatically) or it can be left empty, in which case the current database LSN is used as the start LSN for the change processing package. This operation is used as an alternative to the Mark Initial Load Start/End operations.

Get processing range

This operation is used in a change processing package before invoking the data flow that uses the CDC Source data flow. It establishes a range of LSNs that the CDC Source data flow reads when invoked. The range is stored in an SSIS package variable (StateVariable property) that is used by the CDC Source during data flow processing.

Mark processed range

This operation is used in a change processing package at the end of a CDC run (after the CDC data flow is completed successfully) to record the last LSN that was fully processed in the CDC run. The next time Get processing range is used, this position determines the start of the next processing range.

Reset CDC state

This operation is used to reset the persistent CDC state associated with the current CDC context. After this operation is run, the current maximum LSN from the LSN-timestamp sys.fn_cdc_get_max_lsn table becomes the start of the range for the next processing range. An example of when this operation is used is when you want to process only the newly created change records and ignore all old change records.

 

Notes

  • The following operations open a connection to the source system
    • Mark initial load start
    • Mark initial load end
    • Mark CDC start

Thoughts on Branching Strategies for SSIS Projects

A question came in recently from a customer about branching strategies for SSIS projects. Their development team has been using SSIS for years, but have not been using any form of source control software up until now. They felt that the package format improvements in SQL Server 2012, which were focused on making it possible to diff/merge package files, were a good motivation for them to adopt this best practice.

I gave the following high level guidance:

  • Treat your SSIS packages as if they were source files from a compiled language (i.e. C#)
  • Think of your SSIS project as an “Application” (I frequently refer to SSIS projects as “Data Integration Applications” in my talks)
  • There is a lot of guidance out there on software development lifecycle best practices, most of which applies to SSIS development
    • It may take some experimenting to determine what methodology works best for your project and team

I’ve always liked Jamie Thomson’s posts about his experiences with SSIS Team Development (as well as general best practices and naming conventions that apply here as well). The changes we made in SQL 2012 alleviate the need for some of these best practices, but many of them still apply.

In regards to specific branching strategies, it’s common to have three core branches.

Development

This is the branch developers work out of. Developers are responsible for testing changes, and making sure the project is fully functional in this branch. If a developer is making significant changes that would be disruptive to everyone else, or cause the project to be broken for any amount of time, then you should create a sub branch for them to work out of. This branch should be regularly merged (weekly?) to the Integration branch so that issues not caught during developer testing can be identified early.

Integration

The builds from this branch go up to your test environment. This is where you’d do signoffs before the project is moved to production. Most changes coming into this branch should be coming from Development – the only time you’d directly check something in here is during the stabilization period (and this should be rare). Once QA have signed off on the release, you’d merge this to your Main branch.

Main

This is what gets released to production. You should use build tags to identify specific versions/releases. Nothing should be checked directly into this branch without first going through Integration testing.

Note, this branching strategy assumes a single team/project. If you have multiple teams/projects, you need to decide if you are sharing a common Main, or if everyone goes into the same Integration branch (and whether there are multiple levels of Integration). Within the SQL org, for example, we’d have a “Release” branch past Main that we actually do the external builds from (it also allows us to do servicing builds post-RTM).

(The actual SQL branching strategy is a lot more complex than that, but this is the basic idea!)

Are you currently working with SSIS in team environment? Are you using source control? I’d love to hear any recommendations you might have.

Speaking at SQL Saturday #111 | Atlanta

Wow, what a lineup for SQL Saturday #111. I’m still amazed and honored anytime I’m on a speaker list of this caliber!

I’ll be co-presenting a pre-con event (full day training) with Andy Leonard on April 13th, and talking about SSIS in SQL Server 2012 as part of the main event. Rumor has it that I might even get to co-present with Patrick LeBlanc, who has been working hard on High Availability Solutions for SSIS in SQL 2012. It’s all very exciting – I hope to see you there!


If you can’t make it to Atlanta, you can also see me for SSIS Performance Design Pattern training at SQL Saturday #105 in Dublin, and SQL Bits X in London.

Exit Codes, DTEXEC, and SSIS Catalog

DTEXEC has two execution modes when running packages in the SSIS Catalog.

By default, the process runs asynchronously – the package execution is scheduled on the server, and DTEXEC returns immediately. DTEXEC does not wait for the package execution to complete. In this case, %ERRORLEVEL% would represent whether the package execution was created successfully, and will not reflect the result of the execution.

To get the result of the package execution, you can set the $ServerOption::SYNCHRONIZED property to True. In this mode, DTEXEC will wait for the package execution to complete. If the execution was successful, the %ERRORLEVEL% will be set to 0. If it failed, it will be set to 1.

See the examples below – the first execution runs a package (2-RowCounts.dtsx) on my local instance, and completes successfully. The second execution runs a package (6-ErrorContext.dtsx) that will fail.

 

For more details on the new DTEXEC command line parameters for the SSIS Catalog, see:

Queries for Data Flow Component Performance in SQL 2012

The built-in SSIS Catalog reports in SSMS show a breakdown of task and component performance of a package execution when you run with a logging level of “Performance”.

SSIS Catalog package performance report

For a more detailed or customized view, you can query the catalog tables directly. The key table to get data flow performance information is [catalog].[execution_component_phases]. Here are two sample queries – the first returns the total time spent on each component (similar to what is displayed in the performance report), and the second returns the times broken down by phase.

 

Running these queries returns something like this:

image

Some notes:

  1. You need to run the package with a Performance logging level to capture this data
  2. You can set a default logging level for the Catalog
  3. The query above determines the execution_id for the latest run – you will need to change this value to look at the results for specific package executions