Saturday, 7 April 2012

Cleansing Data with SQL Server 2012 Data Quality Services

I’ve been a bit quiet on the blogging side of things for a while, and in my defence I’ve been pretty heads-down working as a vendor for Microsoft as the lead author for a couple of new courses on SQL Server 2012 data warehousing and BI (courses 10777A and 10778A if you’re interested). As part of this work, I’ve been exploring the new data cleansing capabilities in SQL Server 2012 Data Quality Services (DQS). This article is a simple walkthrough of how to use DQS to cleanse data as part of an Enterprise Information Management (EIM) or Extract, Transform, and Load (ETL) solution.

So, what is data cleansing all about then? Well, most people involved in building or managing data-driven applications and BI solutions will have come across the problem of inconsistent or invalid data values for columns (or “domains”) that are used for business analysis. For example, let’s suppose your database stores customer data, including the customer’s address, and you want to count customer sales by country. When customers or sales employees enter customer data into the system, it’s perfectly possible (and actually quite likely given a large enough volume of customers) that some values will be entered either incorrectly (for example “Unted States” instead of “United States”) or inconsistently (for example, some customers may enter “United States”, others may enter “USA”, and still others may enter “America”). When you try to aggregate sales, you’ll end up with inaccurate counts because there are several values in use for the same country.
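To see the effect, here’s a quick illustration in T-SQL (the table and column names are hypothetical – they just mirror the kind of customer data described above). Counting customers by country splits what is really one country across several rows:

-- Hypothetical query over a Customers table with free-text Country values
SELECT Country, COUNT(*) AS CustomerCount
FROM dbo.Customers
GROUP BY Country
ORDER BY Country;

-- "United States", "USA", "Unted States" and "America" each get their own row,
-- so no single row reflects the true number of US customers.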

Here’s an Excel workbook containing a subset of data extracted from a SQL Server database table to show some typical data quality problems.

Picture1

Note that the data contains a number of problems, including:

  • The City column contains “New York” and “NYC” for New York City.
  • The Country column contains “United States” and “USA” for the US.
  • The Country column also contains “United Kingdom” and “Great Britain” for the UK.

DQS enables you to address this problem by cleansing the data based on a known set of values and rules for the key domains (columns) that exist in your datasets. It does this by enabling you to create and maintain a knowledge base that contains the known valid values for a related set of domains, along with validation rules (for example, an EmailAddress value must include a “@” character) and synonym mappings that correct common variants to a leading value (for example, correcting “USA” and “America” to the leading value “United States”). After you have created a knowledge base, you can use it to cleanse any data that includes the same domains (so, for example, if you create a knowledge base for geographical domains such as City, State, and Country, you can use it to cleanse any data that includes these fields – such as customer data or employee address data). SQL Server 2012 includes the Data Quality Services Client tool (shown below), which you can use to create, maintain, and use DQS knowledge bases.

Picture2
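As a rough illustration of the kind of domain rule mentioned above, here’s the same “an EmailAddress must contain @” check expressed in plain T-SQL against a hypothetical Customers table. This is purely illustrative – in DQS the rule is defined declaratively in the client tool, not written as a query, and the CustomerID and EmailAddress names are assumptions:

-- Find email values that would fail a "must contain @" domain rule
SELECT CustomerID, EmailAddress
FROM dbo.Customers
WHERE EmailAddress IS NULL
   OR EmailAddress NOT LIKE '%@%';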

When you create a new knowledge base, you can do so from scratch, or you can use an existing knowledge base as a starting point. SQL Server 2012 ships with a pre-existing knowledge base for US-based demographic data named DQS Data, and in this example, I’ll use it as the basis for my own CustomerKB knowledge base as shown below.

Picture3

The DQS Data knowledge base includes a number of pre-defined domains, as shown in the image above. I only need some of these domains, and I’ll need to add some additional ones that are specific to my own data; so I’ve initially selected the Domain Management activity as I create the CustomerKB knowledge base. I only intend to use the Country/Region, US – Last Name, and US – State domains from the DQS Data knowledge base, so I’ll delete the others. The domains I’m retaining contain official values for country and US state names, and common last names (surnames) based on US demographic data such as the 2000 US census.

Picture4

Since my customer data includes records for customers all over the world, I’ll rename the domains in my knowledge base to remove the “US” prefix. I’ll also add a new domain named City so that I can validate city names in the data.

Picture5

Note that I can select each domain and view the known values that are currently defined in the knowledge base as shown below. The City domain has no known values (because I’ve just created it), and the others have inherited values from the DQS Data knowledge base. The image below shows the known values for the Country/Region domain. Note that the knowledge base defines leading values for each country (such as “Afghanistan”) and synonyms that, while valid in their own right, should be corrected to the leading value to ensure consistency.

Picture6

I’ve now completed my initial knowledge base, so I’m ready to finish the domain management activity. Clicking Finish produces a prompt to publish the knowledge base as shown below, but before I’m ready to use it I want to populate the known values for the City domain from my existing data by performing some knowledge discovery; so I’ll click No.

Picture7

Knowledge Discovery is an activity in which you connect to a data source and map fields in the source to domains in the knowledge base. DQS can then use the data source to discover new values for the domains defined in the knowledge base. The first step in this process is to open the knowledge base for the Knowledge Discovery activity as shown here. Note that the activity is performed using a wizard interface, with a sequence of steps.

Picture8

After opening the knowledge base, I need to select a data source (I’m using the Excel workbook we saw earlier), and map the columns in the data source to the domains in the knowledge base as shown below. Note that the data source can include columns that are not mapped to domains, and does not need to include a column for every domain in the knowledge base. However, only the mapped domains will be included in the knowledge discovery process.

Picture9

On the next page, I can start the data discovery analysis. DQS will read the source data and identify new values for the domains in the knowledge base, as shown here.

Picture10

On the final page of the wizard, you can view the values that have been discovered for each domain. In this example, the values discovered for the City domain include New York and NYC, as shown below. I can identify these as synonyms by selecting them both and clicking the Set selected domain values as synonyms button.

Picture11

The value I selected first becomes the leading value, as shown here.

Picture12

For the Country/Region domain, DQS has discovered a new “Great Britain” value. I can mark this as invalid and specify an existing value to which it should be corrected (in this case, “United Kingdom”).

Picture13

Clearing the Show Only New checkbox reveals the values that already existed before knowledge discovery, and I can see that “Great Britain” is now under the “United Kingdom” leading value. I can also see that there were 151 instances of the existing “United States” value found, along with a further 42 instances of “USA”, which was already specified as a synonym for “United States”.

Picture14

Now I’m ready to finish the knowledge discovery activity and publish the knowledge base.

After you have published a knowledge base, you can use it to cleanse data from any data source containing columns that can be mapped to the domains defined in it. The simplest way to do this is to create a new data quality project based on the knowledge base and specify the Cleansing activity, as shown here.

Picture15

Again, the activity takes the form of a wizard with sequential steps. The first step is to map the columns in the data source to the domains in the knowledge base, just as I did when performing the knowledge discovery activity previously; only this time I’m using the full Customers table in my CustomerDB SQL Server database instead of the sample data I had extracted to Excel.

Picture16

Next, I run the cleansing process and DQS applies the knowledge base to the source data to identify corrected and suggested values. Corrected values are corrections DQS makes to the data based on known rules and synonyms. Suggested values are further possible corrections or new values that are generated based on a number of data quality heuristics that DQS uses when analyzing data.

Picture17

On the next page, on the Suggestions tab for each domain, I can view the suggestions identified by DQS. Here, DQS has identified a City domain value of “W. York”, which is sufficiently similar to the known value “York” for a correction to be suggested. Note that I can select the value and view the records that contain it to verify that “W. York” is commonly being used to denote “York” in England (as opposed, for example, to “New York” in the United States). I can then choose to approve or reject individual instances of the correction, or accept or reject the suggestion that “W. York” should be considered a synonym of “York” (if I approve the suggestion) or added as a new known value in its own right (if I reject the suggestion).

Picture18

On the New tab, I can view the new values that were discovered for the domain. In this case, a number of new values were identified for the City domain, including Bracknell in England.

Picture19

On the Corrected tab, I can view the values that were corrected based on pre-existing known synonyms or suggestions that I have approved.

Picture20

After reviewing the results of the cleansing activity, I can export the cleansed data to a SQL Server database table, a .csv file, or an Excel workbook. Note that I can choose to export just the cleansed data values, or I can include the cleansing information for further analysis.

Picture21

The exported results are shown in the following image. Note that the results include all of the source columns, and that for each of the columns that was mapped to a domain there are five columns in the results: The source value, the output value, the reason for any corrections, the level of confidence (between 0 and 1) for the correction, and the status of the column (correct or corrected).

Picture22
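If you export the results to a SQL Server table and include the cleansing information, you can review the corrections with ordinary T-SQL. The sketch below assumes the exported table is named dbo.CustomerCleansing and that the output columns follow a ColumnName_Source / _Output / _Reason / _Confidence / _Status pattern like the one visible in the image – both the table name and the exact column suffixes are assumptions:

-- Review corrections DQS made to the Country domain, lowest confidence first
SELECT Country_Source, Country_Output, Country_Reason, Country_Confidence
FROM dbo.CustomerCleansing
WHERE Country_Status = 'Corrected'
ORDER BY Country_Confidence;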

By creating a data cleansing project, a business user who understands the data domains can act as a “data steward” and enforce the quality of the data in application databases or analytical and reporting systems. Additionally, when you are confident in the ability of your knowledge base to cleanse data, you can incorporate DQS data cleansing into a SQL Server Integration Services (SSIS) data flow that extracts data from a source as part of an ETL process for data warehousing or EIM. The following image shows an SSIS data flow that includes the DQS Cleansing transformation.

Picture23

In this example, the CustomerDB data source uses an OLE DB connection to extract data from the Customers table in SQL Server. The DQS Cleansing transformation is then configured to use the CustomerKB knowledge base and map the appropriate columns from the data source to domains for cleansing, as shown here.

Picture24     Picture25

The Staging DB destination uses an OLE DB connection to load data from the data flow into a staging table as part of an ETL process for a data warehousing solution. The output columns for the mapped domains are used to load the cleansed values into the staging table, as shown here.

Picture26

Running the SSIS package extracts the source data, applies the DQS knowledge base to cleanse the mapped columns, and loads the cleansed data into the staging database as shown here.

Picture27
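A quick way to confirm that the cleansing has worked is to re-run the kind of count we started with, this time against the staging table (the staging table and column names here are hypothetical):

-- After the load, only leading values such as "United States" should remain
SELECT Country, COUNT(*) AS CustomerCount
FROM dbo.StagingCustomers
GROUP BY Country
ORDER BY Country;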

This walkthrough provides a simple example of how you can use DQS to cleanse data and improve data quality for reporting and analysis. There are a number of additional features of DQS that are not shown here, including the ability to define composite domains that consist of multiple columns and the ability to include external reference cleansing data from the Windows Azure Data Market in your knowledge base (for example to apply post code validation and correction rules based on standard data from a postal service authority). You can learn more about using DQS to cleanse data by attending course 10777A: Implementing a Data Warehouse with SQL Server 2012.

Tuesday, 22 February 2011

SQL Server “Denali” Integration Services – Projects and Parameters

Some previous posts in this blog have discussed new features in the SQL Server “Denali” database engine. For this post however, I want to focus on some of the key enhancements in SQL Server “Denali” Integration Services (SSIS). SSIS first appeared in SQL Server 2005 as an evolution of the Data Transformation Services (DTS) component in previous releases, and has steadily become a core element of Extract, Transform, and Load (ETL) operations for many data warehousing implementations.

The big news in the “Denali” release of SQL Server Integration Services is a whole new deployment model for SSIS solutions. In previous releases, the only available unit of deployment was the package (a .dtsx file), which could be deployed either to the file system or to the MSDB database in a SQL Server instance. This single-package deployment model is at odds with the development model for SSIS solutions, in which a developer can create a single project that contains multiple packages. Prior to “Denali”, each package had to be deployed individually, and any variables that needed to be set at runtime had to be managed through a package configuration for each individual package. SSIS in “Denali” still supports this “legacy” deployment model, but now also supports project-level deployment to the new Integration Services catalog, and project-level parameters that can be used to set variables across multiple packages within a project.

The first thing you need to do to take advantage of this new deployment model is to create an Integration Services catalog on an instance of SQL Server. The Integration Services catalog is a central database in which SSIS projects can be stored and managed, and you can have one catalog per SQL Server instance. The Integration Services catalog uses the SQLCLR (the .NET common language runtime hosted within SQL Server), so you need to enable this first by using the following Transact-SQL:

sp_configure 'show advanced options', 1;
GO
RECONFIGURE;
GO
sp_configure 'clr enabled', 1;
GO
RECONFIGURE;
GO

Now you’re ready to create an Integration Services catalog, which you can do in SQL Server Management Studio as shown here.

Picture1

When you create the Integration Services catalog, you’re prompted for a password that is used to protect the database master key used to encrypt the data.

Picture2

Picture3

After clicking OK, refreshing the Object Explorer view reveals two new items as shown here. The first is a database named SSISDB, and the second is an SSISDB node beneath the Integration Services folder. The database is a regular SQL Server database that contains a number of tables, views, and stored procedures that you can use to manage and run SSIS projects and packages stored in the catalog. It is also where the projects and packages in your catalog are physically stored. The SSISDB node under the Integration Services folder provides a management interface for the catalog and enables you to define a logical folder structure for your catalog.

Picture4

To create a folder in your catalog, simply right-click the SSISDB node under the Integration Services folder, and click Create Folder. Here I’ve created a folder with the imaginative name My Folder. Note that subfolders named Projects and Environments have automatically been created – we’ll return to these later.
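Incidentally, folders can also be created from T-SQL rather than through Object Explorer. The sketch below assumes the catalog stored procedures and views as they appear in the released SQL Server 2012 build of SSISDB (names may differ slightly in the CTP):

-- Create a catalog folder, then list the folders in the catalog
DECLARE @folder_id bigint;
EXEC SSISDB.catalog.create_folder
     @folder_name = N'My Folder',
     @folder_id = @folder_id OUTPUT;

SELECT folder_id, name, description
FROM SSISDB.catalog.folders;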

OK, so now we have an Integration Services catalog that contains a folder to which we can deploy an SSIS project; so I guess it’s time we went ahead and created a project to deploy. For our purposes, we’ll create a simple SSIS project that includes a data flow task that extracts the list of database names from the sysdatabases system view in the master database and copies it to a table in another database. I’m going to copy the database list to a table in a database called CmSampleDB, and to make matters a little more interesting, I’m going to create two tables that can act as the destination for the list – one to be used when testing the SSIS solution, and another to be used in production. We’ll design the SSIS project to support a project-level parameter so you can specify which table to use at runtime. Here’s my Transact-SQL code to create the destination tables:

USE CmSampleDB
GO
CREATE TABLE TestDBList
(name nvarchar(250))
GO
CREATE TABLE DBList
(name nvarchar(250))
GO

Now we can go ahead and create the SSIS project using SQL Server Business Intelligence Development Studio (BIDS). Creating an SSIS project in “Denali” is exactly the same as in previous versions, just select the Integration Services Project template as shown here:

Picture5

When the new project is created, it will contain a single package named Package.dtsx, which you can rename to suit your own requirements – I’m going to name mine My Package.dtsx. You can add more packages to the project as required, so for example, I’ll add a second package which I’ll name, um, My Other Package.dtsx. In Solution Explorer, my project now looks like this:

Picture6

So far, nothing is very different from how you would create an SSIS project in previous releases of SQL Server, but here’s where we’re going to use a new feature – Project Parameters. Project parameters are, as the name suggests, parameters that can be used to pass variable values to the project at runtime. Because these parameters are scoped at the project level, they can be used by any package in the project. To add a project parameter, right-click the project in Solution Explorer and click Project Parameters, or click Project Parameters on the Project menu. Either of these actions displays the Parameters pane as shown here:

Picture7

As you can see, I’ve used this pane to create a project-level parameter named TableName with a default value of TestDBList. This default value is more correctly known as the Design default value, since it’s used when I run the project within BIDS. When I deploy the project, I can set a Server default value that will override this one when packages in this project are run on the server.

Now I need to create the data flow task that copies the database names from sysdatabases in the master database to the table indicated by the TableName parameter in the CmSampleDB database. To do this I just need to drag a Data Flow task to the design surface of My Package.dtsx as shown here:

Picture8

Next, I’ll double-click the data flow task to view the data flow design surface, and use the Source Assistant item on the SSIS Toolbox to create a new connection to the master database on my SQL Server instance. Then I can configure the OLE DB source that gets created to extract the name column from the sysdatabases system view by using the following SQL command:

SELECT name FROM sysdatabases

The data flow surface now looks like this:

Picture9

Next I’ll use the Destination Assistant to add a connection to the CmSampleDB database on my SQL Server instance, and connect the output from the source to the destination as shown here:

Picture10

To complete the data flow, I need to configure the destination to insert the output from the source into the table specified in the project-level TableName parameter, as shown here:

Picture11

Now I’m ready to build and deploy the project to the Integration Services catalog I created earlier. Building the project in BIDS creates a .ispac file, which you can then import into the catalog using SQL Server Management Studio, or deploy directly to the catalog from BIDS by clicking Deploy on the Project menu (or by right-clicking the project in Solution Explorer and clicking Deploy). Whichever approach you use, deployment to the catalog is accomplished via the Integration Services Deployment Wizard. After the Welcome screen, the wizard prompts you to select the project you want to deploy – in this case, the .ispac file I just built.

Picture12

Next, the wizard loads and validates the project before prompting you for the destination. This consists of the server where the Integration Services catalog is hosted, and the path to the folder where you want to deploy the project. Here, I’ve selected the My Folder folder I created earlier.

Picture13

Finally, the wizard prompts you to set Server default values for any project parameters. You can use the Design default value, specify a new value, or use an environment variable. We’ll look at environment variables shortly, but for now I’ve set the Server default value for the TableName parameter to DBList.

Picture14

Completing the wizard deploys the project to the catalog, which you can verify in SQL Server Management Studio. Note that the project is actually saved to the Projects sub-folder of the path specified in the wizard, and that all packages within the project are deployed as a single unit.

Picture15
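As an aside, the same deployment can be scripted rather than run through the wizard, and the catalog views let you confirm what was deployed. Again, this is a sketch based on the released SSISDB catalog objects, and the .ispac path and MySsisProject project name are just placeholders for your own values:

-- Deploy a built .ispac file to the catalog from T-SQL
DECLARE @ProjectBinary varbinary(max), @operation_id bigint;

SELECT @ProjectBinary = BulkColumn
FROM OPENROWSET(BULK N'C:\SSIS\MySsisProject\bin\Development\MySsisProject.ispac', SINGLE_BLOB) AS ispac;

EXEC SSISDB.catalog.deploy_project
     @folder_name    = N'My Folder',
     @project_name   = N'MySsisProject',
     @project_stream = @ProjectBinary,
     @operation_id   = @operation_id OUTPUT;

-- Confirm the project and its packages are in the catalog
SELECT f.name AS folder_name, p.name AS project_name, pk.name AS package_name
FROM SSISDB.catalog.folders AS f
JOIN SSISDB.catalog.projects AS p ON p.folder_id = f.folder_id
JOIN SSISDB.catalog.packages AS pk ON pk.project_id = p.project_id;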

The final thing I want to do is to define a test environment and a production environment that can be used to control the execution context for the project. To do this, I’ll right-click the Environments folder and click Create Environment. Using this approach I’ve created two environments called Test and Production.

Picture16

You can edit the properties of each environment to create environment variables, which in turn can be used to set project parameters when project packages are run in the context of the environment. For example, here I’m creating an environment variable named tName in the Test environment with a value of TestDBList. I’ve also created an environment variable with the same name in the Production environment and assigned the value DBList.

Picture17
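The same environments and variables can also be created from T-SQL. Here’s a sketch assuming the released catalog stored procedures, with values that mirror the ones set in the dialog above:

-- Create the Test and Production environments and a tName variable in each
EXEC SSISDB.catalog.create_environment
     @folder_name = N'My Folder', @environment_name = N'Test';
EXEC SSISDB.catalog.create_environment
     @folder_name = N'My Folder', @environment_name = N'Production';

EXEC SSISDB.catalog.create_environment_variable
     @folder_name = N'My Folder', @environment_name = N'Test',
     @variable_name = N'tName', @data_type = N'String',
     @sensitive = 0, @value = N'TestDBList';

EXEC SSISDB.catalog.create_environment_variable
     @folder_name = N'My Folder', @environment_name = N'Production',
     @variable_name = N'tName', @data_type = N'String',
     @sensitive = 0, @value = N'DBList';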

Finally, I can hook the environments up to the project by editing the properties of the project in the Integration Services catalog and adding environment references, as shown here…

Picture18

… and setting parameters to get their values from environment variables as shown here (note that in the CTP release, you must click OK after adding the environment references on the References page before re-opening the Properties window and changing the parameter value on the Parameters page):

Picture19
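Environment references and parameter bindings can be scripted too. A sketch, again assuming the released catalog procedures; MySsisProject is a placeholder for whatever the deployed project is actually called:

-- Add a reference from the project to the Test environment in the same folder
DECLARE @reference_id bigint;
EXEC SSISDB.catalog.create_environment_reference
     @folder_name = N'My Folder',
     @project_name = N'MySsisProject',
     @environment_name = N'Test',
     @reference_type = 'R',          -- relative: the environment lives in the project's folder
     @reference_id = @reference_id OUTPUT;

-- Bind the TableName project parameter to the tName environment variable
EXEC SSISDB.catalog.set_object_parameter_value
     @object_type = 20,              -- 20 = project-level parameter
     @folder_name = N'My Folder',
     @project_name = N'MySsisProject',
     @parameter_name = N'TableName',
     @parameter_value = N'tName',    -- with @value_type = 'R', this is the environment variable name
     @value_type = 'R';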

So now we have a project deployed to our Integration Services catalog. The project contains two packages - one of which doesn’t actually do anything, and another that copies the list of database names from the sysdatabases system view in the master database to a table in the CmSampleDB database. There is a project-level parameter that is used to indicate which table the database names should be copied to, and this is set to TestDBList or DBList depending on the environment that the package is executed in. To test this, I can right-click My Package.dtsx in Object Explorer and click Run, which produces the following dialog box:

Picture20

Note that I can select the environment reference I want to use, which will determine the value of the tName environment variable, which will in turn set the value for the TableName project parameter and ultimately determine which table the data is copied to. For this example, I’ll select the Test environment and run the package, and the data is copied to the TestDBList table as shown below:

Picture21
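The same run can be started from T-SQL, which is handy if you want to schedule it with SQL Server Agent. This sketch makes the same assumptions as the earlier ones (released catalog procedures, and MySsisProject as a placeholder project name):

-- Look up the reference to the Test environment, then create and start an execution
DECLARE @reference_id bigint, @execution_id bigint;

SELECT @reference_id = r.reference_id
FROM SSISDB.catalog.environment_references AS r
JOIN SSISDB.catalog.projects AS p ON r.project_id = p.project_id
WHERE p.name = N'MySsisProject' AND r.environment_name = N'Test';

EXEC SSISDB.catalog.create_execution
     @folder_name = N'My Folder',
     @project_name = N'MySsisProject',
     @package_name = N'My Package.dtsx',
     @use32bitruntime = 0,
     @reference_id = @reference_id,
     @execution_id = @execution_id OUTPUT;

EXEC SSISDB.catalog.start_execution @execution_id;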

To review past operations in the catalog, you can right-click SSISDB under the Integration Services folder in Object Explorer and click Operations. This shows a list of all operations that have been performed in the catalog in reverse order of occurrence, so in this example you can see that a project was deployed and then a package was executed.

Picture22

Double-clicking an entry in the list reveals more information about the operation. For example, here are the details for the package execution:

Picture23

Note that clicking the Parameters tab shows the parameter values that were used for this particular execution:

Picture24
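The same operational information is exposed through views in the catalog, so you don’t have to go through the dialogs at all. A sketch assuming the released SSISDB views:

-- Recent catalog operations (deployments, executions, and so on), newest first
SELECT operation_id, operation_type, object_name, status, start_time, end_time
FROM SSISDB.catalog.operations
ORDER BY operation_id DESC;

-- Parameter values used for each execution, including values resolved from environment variables
SELECT e.execution_id, e.package_name, e.environment_name,
       pv.parameter_name, pv.parameter_value
FROM SSISDB.catalog.executions AS e
JOIN SSISDB.catalog.execution_parameter_values AS pv
  ON pv.execution_id = e.execution_id
ORDER BY e.execution_id DESC;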

I’ve used this article to give you a quick tour of the new deployment model for SSIS and how you can use project parameters and environment variables to create a more flexible but manageable ETL solution with SSIS in SQL Server “Denali”. For more information about what’s new in SSIS in “Denali”, see SQL Server Books Online.