In any Business Intelligence project you intend on taking on it is a good idea to build a set of best practices either for yourself or for your company to follow to ensure a successful project.
Having a list of best practices does several things for you
· Makes you a better developer by producing a better end product
· Allows you to standardize development therefore saving time
· Easily explains why you did what you did, especially useful for consultant handing off a project
Why do I need a best practices list?
The benefits of having a list seem obvious so why do people still not do it? Lack of time is usually the reason most people give for not having a list or maybe they do have a list just not in writing. Instead of thinking there is not enough time I want you to think about how much time this will be saving you in the long run. Spending even a half day a day of work on this list could save you hundreds of hours in your career. Hopefully after reading this I will have convinced you it is essential to not only have a list but to put it in writing so you have a quick reference guide.
Best Practices
You may have SSIS best practices that I haven’t included in my list and I would love to hear from you if you do.
1. Have a good naming conventions
Tasks, Transforms, Packages, Connection Managers and Variables should all be given names different from the default. You may already do this but be consistent with it.
Tasks and Transforms
Jamie Thomson has a great best practices blog that includes a list of acronyms for all tasks and transforms. With his method for each of the tasks and transform you would use the task acronym and then a small description of what the task does. For example, I have a Data Flow Task that loads my employee table so I would name the task DFT – Load Employees.
The main goal is to make it easier on yourself long term. If the package you develop fails and you have not changed the tasks names you could get an error that says Script Task 3 failed. This can be incredibly painful to debug especially if you are taking over the work of another developer.
Packages
You can go many different ways with this one. My preferred method is:
{Stage | Process | Load}_{ Dim | Fact}_{StageTableName | DimensionTableName | FactTableName | ObjectName}
Ex. Load_Dim_Product.dtsx or Process_SalesCube
Again consistently is everything with this one. I’ve seen others use a namespace approach (Ex. CompanyName.DataWarehouse.LoadDimProduct.dtsx). However you decide to name your packages make it the same across your entire solution.
Connection Managers
If you use the same connection in all of your packages then make sure to consistently name it across all those packages. Here’s my Connection Manager naming convention method:
{ConnectionType}_{DatabaseName | FileName}_[FileType]
Ex. OLEDB_FinanceDW or FLATFILE_EmployeeExtract_TXT
Variables
Variables are one of the most important objects to have good naming conventions with. I prefer to use Hungarian notation. This method identifies the variable data type as part of the name. The reason it is so important to have this kind of naming convention with variables is because knowing variable data types is essential when working throughout an SSIS package. For example, if you are building an expression using variables you will likely need to know the data type for the expression to evaluate correctly. Hungarian notation is formatted like this:
[DataType][VariableName]
Ex. strFullName
2. Create an auditing framework for your packages
Auditing your SSIS packages is a great way to keep track of information about your package executions. SSIS has a method of natively logging but unfortunately the native logging is limited to what it provides and the possible storage methods are not optimal for reporting. The ultimate goal in the end is to be able to report on your package executions. That’s why I suggest creating your own or buying a tool that provides does it for you. Here are some of the most important things you should track in your auditing framework.
Minimum of Things to Track
Whether the package completes with errors, warnings or neither (success)
How long the package took to run
How many rows were loaded
When packages complete with errors list the errors
Luckily this information is easy to find in SSIS. System variables and row counts can handle most of these. For example, error messages are stored in the system variable ErrorDescription, which is only available in the EventHandlers.
Each package run should be tracked separately so you can compare the performance between runs. I suggest storing the auditing data in a database by itself so your SSIS package is not writing to another database that is essential to your company.
The last step is to develop a set of reports to display the information you’ve just collected. Here you should make it easy to get a understanding on how your environment is performing and if you have any errors to address.
I make it sound easy describing it in a couple paragraphs but this can actually involve some very complex logic. Even a fairly simple Auditing Framework could take up to a week to perfect but once it is complete you have that framework you can use on any project.
3. Use Annotations
Annotations are probably one of the simplest best practices to follow (Right there with naming conventions). It is a simple method of documenting what is going on in the package you have developed. For some reason it is also one of the most frequently ignored.
Imagine you developed a package 6 months ago and haven’t touched it since. You never bothered to add annotations to your package. Now the package fails and you have no idea what the package does anymore. Spending just a few moments to add this simple documentation could have been a lifesaver. Even if you know your time is limited at a company it is only good karma to do this not only for yourself but for others.
4. Utilize Configuration Files on Connections
Configuration files allow your package to dynamically change package properties from outside the package at runtime. The property values are stored in an XML file and can easily be edited to manipulate how the package runs each time. They can be used on just about any object in a package but my suggestion is at the very least on each data source.
The easiest way to show a benefit to using configuration files is how it helps in moving through the development cycle. For example, you have 50 SSIS packages and you’re ready to move from your development environment to a testing phase. Instead of opening all 50 of your packages to change data sources to point to your new testing databases you can make the change just once in the configuration file and your done. When the package runs it will refer to the settings in the file instead of what is in the package.
Many developers like to use one configuration file for all packages instead of one for each data source. I discourage doing this because that if you place all your data sources in a single configuration file then regardless of whether a package needs the connection or not it the file now mandates that it is present in the package. If you do not include them all then you will receive the following error:
For more details on using configuration files download a whitepaper I coauthored on Managing and Configuring SSIS
5. Use Checkpoints to restart packages
Checkpoints are a great tool in SSIS that many developers go years without even testing. With Checkpoints enabled on a package it will save the state of the package as it moves through each step or task and place it in a XML file upon failure of the package. If your package does fail you can correct the problem in your package and rerun from the point of the tasks that did not successfully run the first time. Once the package completes successfully the file is no longer needed and automatically discarded.
Just imagine your package is loading a table with 10 million records. Your package passes the Data Flow that performs this huge load without any problem (Other than the fact that it took two hours to load). The next task in your package is a Send Mail Task and for some reason fails.
You correct the problem in the Send Mail Task, but without using Checkpoints your package would still have to run that Data Flow that loads the 10 million records again (taking another two hours) even though you’ve already done it once. If you had enable Checkpoints on this package you could simply correct the problem in the Send Mail Task and then run the package again starting at the Send Mail Task. Sounds great right?
For step by step directions on enabling checkpoints read my blog on Enabling Checkpoints in SSIS.