Document Template for an ETL Project
This article is a requirements document template for an integration (also known as Extract-Transform-Load (or ETL) project, based on my development experience as an SQL Server Information Services (SSIS) developer over the years.
For a Requirements Document Template for a Reporting Project see my article here.
After enough trial and error from the best and worst clients, business analysts, executive sponsors, and my own shining and less-than-shining moments I have seen many developers confronted with poor requirements turn into ... the DEV Nazi!!
So to make sure that doesn't happen to you, here's a template for your ETL projects.
In scope for this article
- A requirements document template designed for business analysts to cover most ETL projects.
- Witty advice
Out of scope for this article
- Requirements Elicitation, the practice of collecting project requirements of a system from users, customers and other stakeholders
- Requirements documents specific to other types of projects, such as reporting and Data Warehousing
- Any words of wisdom regarding data security
- Design Documents, and issues that typically come up in design.
So here we go..
A 'who changed what when' chronology of all changes, either using Word change tracking or lines like '8/1/15 Bob's changes per mutual agreement'. Keeps everyone honest when there are lots of changes, and if you’re ever in any situations where changes are coming fast and furious, this is invaluable in managing an approved set of requirments.
I’ve known many a business analyst and developer who became quickly overwhelmed in conflict between others on requirment demands.
Also known as project objective, business goals, business problem statement, and various other terms. A simple 'Here's why we're doing this' paragraph. The target audience being those that are likely to only read this paragraph, but this also gives the developer some design decision guidance.
In Scope / Out of Scope
Everybody LOVES this section! Okay, developers LOVE this section. In Scope is a summary of what's in the requirements. Out of Scope is usually a Top 10 list of things that are close but not in, and answers the often asked question 'Are we also getting this too?' This is a developer’s best line of defense against scope creep by false or unstated expectations.
And yes, just because person x told person y a month ago that it’s in requirements, or this email two months ago said it’s in, or was assumed in an elevator conversation last week, or was mentioned on the golf course last year during preliminary negotiations means that it’s in.
If it was discussed and approved in a requirements meeting then it's in, otherwise it's out of scope.
I've also known more than a couple of clients that will negotiate effort, cost, and time,and then scope creep the hell out of a project in order to make themselves look better. Been there, dealt with that.
"But wait, we're a really small operation, and this isn't a big deal" you say? Fine, as long as you can roll with that, but the moment somebody has an requirement expectation that wasn't delivered that can change, forcing you to function as the gatekeeper of requirments in a more formal way.
Service Level Agreement (SLA)
Is there a guarantee of performance that the company has negotiated with the client? If yes, then an initial design assessment needs to take place on whether this is a realistic expectation, as management will often negotiate revenue for performance and penalties for non-performance, and there could be considerable effect on scope and time in order to hit an SLA. This will also drive design as to feed reporting, as many large companies will have a report on their SLA performance.
I’ve been in a few situations where SLA’s where negotiated such that processes would be completed by a time that was either not possible, or not possible given certain requirements, and needed to be handled in the design estimate.
What data / processes / events is this project dependent on
to be successful? I’ve seen plenty of
integration projects put on hold or cancelled because…
Business persons could not agree on key terminology
Other parts of the business have upstream processes that are not completed yet.
Data was not available
Data was available but for a price, and the
business was not willing to pay that price.
The source schema was not finalized so that
development could not begin.
Sample data was not available so development
could not begin.
These expectations need to be identified and managed early
in the project. Also some of these dependencies may not be known to a
business analyst and need to be handled in design.
Things you'll need to know about the source(s) of data going into the ETL
Location of source of data: databases, folder and file location, URL, Web Services task.
Security needed to gain access to this location. In large companies this is often handled by a separate group.
Backup, such as ‘After the file has been processed move it to the x folder’
Backup file retention rules: Various legal requirements that the file be backed up for x days.
Name: Does the name vary based on client, customer, date created, etc.
Data Destinations (Target)
Things you'll need to know about the destination(s) of data going into the ETL
Location of destination databases: Server, Database, any access information.
Security needed to gain access to this location
Are there any calculated values based on source data that need to be created? May not be in requirements but discovered in design.
Auditing columns such as created by, last updated date: Use the fields provided in the file vs. auto generated in the target database?
Source to Target Mapping
The heart of the ETL requirements document. This spells out the schema of the source(s)
and destination(s) in the data feed:
sections such as header and footer, column names, data types, acceptable
values (greater than zero, date no earlier/later than, NULL values)
Figure 1. Source to Target Mapping grid
|Actions for column values
|Source and Target
||Straight pump of data from source column to target column
|Source not in Target
||Ignor this column
|Target, not in Source
Implies a hard-coded or calculated value will be inserted or updated. Can be defined in either requirments or design.
Business Rule Validations - If only a set number of values can be added to a target column, need to know what to do if a value outside of that set is provided.
|Actions for entire rows
|In source, not in target.
||INSERT the row
|Both source and target, but some values are different.
||UPDATE based on the defined primary key
|Both source and target, all values are the same
||Usually no action, but sometimes UPDATE.
|Row is in target, but not in source.
||Sometimes a DELETE, sometimes an UPDATE and set an 'IsActive' column to No and a date column such as 'InactiveDate' with the current datetime.
There are some business analysts that cannot provide a source to target mapping, especially if they don’t have access to the data source, which means the developer has to figure this out themselves. If this is your situation then make sure if it comes to it you’re communicating that you’re doing requirements gathering as well as development. You also may have to state various assumptions in your requirements document on details that were not provided.
Timing / Scheduling
When will the source file(s) be available?
Are these files full-load (meaning an entire set
of data) or incremental (meaning only changes since the last time the file was
Are there any requirements for the timing of
this project, such as ‘This data must be in location x by datetime y so that process z can occur with this new data’
Business Rule Validation
One client I had wanted to make sure that if patient data
came in unless there was a signed contract with the health plan, which meant
there was a related row in a HEALTH_PLAN table.
This is not an error, but a
business rule validation that is handed to catch this data (and possibly stage
it somewhere for later use), and then message various business units that this
happened so that they can negotiate with that health plan.
Another client was an airline and wanted to know if there was ever a flight that after eight hours of the flight leaving there was no data on it landing. Again not an error, but an event of interest to the business.
If data fails a business rule validation, what action does
the ETL take? Accept, accept with
default values, not accept?
After the feed runs, who should receive a message if…
The source files were missing
The ETL job failed and returned an error?
The ETL job ran successfully but threw an error?
The ETL job ran successfully but failed a
business rule validation? For example, Customer sales must be for an existing
The ETL job ran successfully but failed a data
quality validation? For example, Invalid state code such as CAN,
Invalid zip codes, Invalid gender.
The ETL job ran successfully without
The end. Thank you for reading my article, and please email me at jim at jimhorn dot biz with any feedback. I look forward to hearing from you.
Jim ( LinkedIn ) ( Twitter ) ( Experts Exchange ) ( Stack Overflow )