Friday 5 October 2012

Role Playing Games with SQL Server 2012 Analysis Services

I’m currently working with Microsoft Learning, writing a course on designing BI solutions with SQL Server 2012. Obviously, this is a huge subject to try to cover, and it raises a whole bunch of really interesting design considerations. One of the new things BI developers need to consider with SQL Server 2012 is whether to use a “traditional” multidimensional data model, or whether to use the new-fangled tabular model. In most cases, from an end-user’s perspective (no pun intended), there is little to pick between the two. In fact, in an Excel PivotTable, most users will struggle to spot any difference. However, for the cube developer, there are some significant differences. There are some things you can do easily in multidimensional projects (or indeed, things that are done automatically for you by SQL Server Data Tools) which require (sometimes extremely complex) custom development in a tabular model. Other things are relatively straightforward to accomplish in both models, but require different implementations. An example of the latter is the implementation of role-playing dimensions. You can do this in both models, but there are some differences.

Role-playing dimensions are used to create multiple cube dimensions that are based on the same underlying dimension in the database. The classic example is a date dimension in which each member represents a calendar date. In your cube, you may have a Sales Order measure group that is related to the Date dimension by multiple keys, for example an Order Date and a Delivery Date. Another example might be an Address dimension that is related to a Shipment measure group by both an Origin key and a Destination key. This multi-use of the same underlying dimension means that the dimension table is defined only once, but users can use it to slice the data by different keys – so for example, a user could view sales by order date or by delivery date (or both).

OK, so first, let’s see how a role-playing dimension is implemented in a multidimensional model. I’m using the AdventureWorksDW2012 sample database, which contains a FactResellerSales table that is related to a DimDate table using three key columns – OrderDateKey, ShipDateKey, and DueDateKey. When I create a data source view from the data warehouse tables in the multidimensional project, all three of the relationships are detected as shown here.

image
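To make the “same table, multiple keys” idea concrete, here’s a purely illustrative T-SQL query against the AdventureWorksDW2012 tables (this isn’t something SSAS generates for you – it’s just a sketch of the relational equivalent of one DimDate table playing three roles):

```sql
-- Illustrative only: the single DimDate table "plays" three roles by being
-- joined to the fact table through three different keys.
SELECT od.CalendarYear     AS OrderYear,
       sd.CalendarYear     AS ShipYear,
       dd.CalendarYear     AS DueYear,
       SUM(f.SalesAmount)  AS ResellerSales
FROM dbo.FactResellerSales AS f
JOIN dbo.DimDate AS od ON f.OrderDateKey = od.DateKey   -- Order Date role
JOIN dbo.DimDate AS sd ON f.ShipDateKey  = sd.DateKey   -- Ship Date role
JOIN dbo.DimDate AS dd ON f.DueDateKey   = dd.DateKey   -- Due Date role
GROUP BY od.CalendarYear, sd.CalendarYear, dd.CalendarYear
ORDER BY OrderYear, ShipYear, DueYear;
```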

Using the wizard to create a cube automatically detects the multiple relationships, and results in a single DimDate dimension in the database but three role-playing dimensions in the cube (Order Date, Ship Date, and Due Date) as shown here.

image

The role-playing dimensions are really just references to the same DimDate dimension, but aggregations will be calculated based on each relationship. I’ll go ahead and add a hierarchy to the DimDate dimension:

image

When a user browses the cube in Excel, each of the three role-playing dimensions is available for them to slice the sales data, and all three of these dimensions have the same Calendar Date hierarchy that I defined for the base DimDate dimension:

image

Now let’s compare the experience with a tabular model. When I import the same tables into a tabular model project, the relationships are detected, and I can create the same hierarchy as before in the DimDate table. However, notice that two of the relationships are shown as dotted lines, while one is shown as a solid line.

image

This indicates that although the relationships have all been detected, only one of them is active at any one time. When a user browses the model in Excel, they only see one DimDate dimension, which will show aggregations for the active relationship (in this case, Order Date, but there’s no easy way for the user to tell that from the user interface):

image

The solution to this problem is obvious. So obvious, in fact, that it took me a while to figure it out! The answer is to import the same table multiple times, and rename it appropriately:

image

After you’ve imported one copy of the table for each role-playing dimension, you simply delete the inactive relationships from the original table and create new ones to join the relevant keys in the fact table to the new dimension tables. You’ll also need to create duplicates of any hierarchies you want to appear in all of the dimensions.

image
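As an aside, if you’re importing from SQL Server and don’t want to pull every column again, one option is to base each additional copy on a query rather than the whole table when you run the Table Import Wizard. A minimal sketch, assuming you only need the key and the calendar columns used by the hierarchy:

```sql
-- Possible source query for a second copy of DimDate, imported as "Ship Date".
-- The column list here is an assumption about what you need for slicing.
SELECT DateKey,
       FullDateAlternateKey AS [Date],
       CalendarYear,
       CalendarQuarter,
       EnglishMonthName AS [Month]
FROM dbo.DimDate;
```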

Now when users browse the model, they’ll see all three dimensions, and as long as you’ve assigned appropriate names to each copy of the table, it should be obvious what each dimension represents.

image

Monday 23 July 2012

PowerPivot and Power View in Excel 2013

It’s just typical of my job that, only a few short weeks after the publication of some Microsoft Official Curriculum courses that I’ve spent months working on, Microsoft should choose to make a preview of the next release of the software on which they are based available! As you may know, we recently published courses 10778A and 40009A, both of which make use of the PowerPivot and Power View features in Excel and SharePoint 2010; so it was with a certain amount of trepidation that I installed the preview of Office 2013 to get a first look at the enhancements that have been made.

The first, and most obvious, change is that the PowerPivot add-in for Excel no longer needs to be installed from a separate package. It’s built into Excel and only needs to be enabled, which you do by configuring the COM Add-ins in Excel’s options as shown here.

Picture1

Note that there’s also a Power View add-in – more about that later!

After the PowerPivot add-in has been enabled, users will see the POWERPIVOT tab on the ribbon, as shown here.

Picture2

With this ribbon, you can not only manage a PowerPivot tabular data model for the workbook as you can in Excel 2010, but you can also create calculated fields and KPIs without having to directly edit the model – making the process a little bit more intuitive for information workers.

Clicking Manage opens the PowerPivot window, which is similar to that of the previous release. There are a few enhancements of course, but anyone familiar with PowerPivot in Excel 2010 will find themselves in familiar territory. In this case, I’ve opened a PowerPivot workbook I created with Excel 2010 based on data in the AdventureWorksDW SQL Server sample database. The changes to this release meant that I was prompted to allow Excel to update the data model and re-save the workbook, so one thing to be aware of is that you can open (and update) Excel 2010 PowerPivot workbooks in Excel 2013, but after they’ve been updated you won’t be able to open them in Excel 2010. You can see the diagram view of my PowerPivot data model below – note that it includes a hierarchy in the Sales Territory table.

Picture3

After you’ve created the data model in your workbook, you can use it as a source for PivotTables, just as you could in Excel 2010. There are, however, one or two nice enhancements on a new ANALYZE tab of the ribbon that make it easier to do things like create slicers. Another new feature is the ability to create timeline filters that make it easier to analyse data based on chronological periods. To add a timeline, just click Insert Timeline and specify any of the time-based attributes that Excel identifies as having a suitable relationship in the model.

Picture4

After you’ve inserted a timeline, you can use it to filter the data in the PivotTable as shown here.

Picture5

Earlier, I mentioned that Excel 2013 includes a Power View add-in. This enables information workers to create Power View reports from the data model in the workbook (and external data sources). Previously, Power View was only available in SharePoint Server 2010, but in Office 2013 you can use it right there in an Excel workbook.

To create a Power View report from the data model in the workbook, just click Power View on the INSERT tab of the ribbon.

Picture6

If necessary, you’ll be prompted to install Silverlight (which is required by the Power View add-in), and after doing so you’ll be able to create a Power View report from the data in your PowerPivot data model as shown here.

Picture7

Note that you can include hierarchies in a Power View report, which wasn’t supported in the previous release. There are several other enhancements in this release, including support for new data visualizations (such as pie charts), and even visualization of geographical data on a Bing Maps map, as shown here.

Picture8

This short article just highlights a few of the improvements to PowerPivot and Power View in Excel 2013. There are many more new features in Excel, as well as greater ability to share BI capabilities across the enterprise through enhancements in SharePoint 2013 and SQL Server 2012 SP1, which I look forward to exploring in more depth.

Tuesday 3 July 2012

Matching Data with SQL Server 2012 Data Quality Services

In a previous post, I described how you can use Data Quality Services (DQS) to create a knowledge base for the domains (data columns) used in your business data and use it to cleanse data by correcting invalid or inconsistent values. Data cleansing is, however, only one side of the coin when it comes to DQS. You can also use DQS to perform data matching – in other words, finding records that are potential duplicates of one another and consolidating them into a single surviving record.

When you think about it, the potential for duplicate data entry in most complex business environments is enormous. For example, let’s imagine an e-commerce site where customers need to register before placing orders. It’s perfectly conceivable that a customer who only uses the site occasionally might re-register with slightly different details because they’ve forgotten that they had registered previously, or can’t remember their login credentials. Even if the site applies a policy that demands a unique email address for each registration, there’s nothing to stop the same customer registering multiple times with different email addresses. For an individual sales order, the fact that the customer is registered multiple times is inconsequential – as long as the payment and delivery address details are correct, the order can be processed successfully. However, when the company wants to use its data to perform any kind of business intelligence (BI) reporting or analysis that aggregates information per customer, the duplicate entries can lead to misleading results.
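To see why that matters, here’s a hedged sketch of a typical per-customer aggregation; the table and column names are purely illustrative, not from a real schema:

```sql
-- With duplicate registrations, an "orders per customer" query counts each
-- registration separately, so one real customer is split across several rows.
-- (Table and column names are illustrative.)
SELECT c.CustomerID,
       c.FirstName,
       c.LastName,
       COUNT(o.OrderID)  AS OrderCount,
       SUM(o.OrderTotal) AS TotalSpend
FROM dbo.Customers AS c
LEFT JOIN dbo.Orders AS o ON o.CustomerID = c.CustomerID
GROUP BY c.CustomerID, c.FirstName, c.LastName;
-- A customer registered as both "Dan Garcia" and "Daniel Garcia" appears twice,
-- with each row understating what that person actually buys.
```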

To use DQS to match data, you must first add a matching policy to a knowledge base. You can use an existing knowledge base that is also used for data cleansing, or you can create one specifically for data matching. In this example, I’m opening an existing knowledge base that contains domains for customer records, and selecting the Matching Policy activity.

Picture1

Just as when performing knowledge discovery, I need to map some sample data to the domains defined in the knowledge base. This enables me to test the matching policy against a known data set as I build it, and therefore verify that it successfully identifies known duplicate records. In this case, I’m using data in an Excel workbook as the source for my sample data, but you can also use a table in a SQL Server database.

Picture2

Having mapped sample data to the domains, I can now define the matching rules for my matching policy. You can include multiple rules, and each one uses a set of weighted comparisons of domain values to identify clusters of records that are potential duplicates of one another.

Picture3

Potential matches are determined based on a score that is calculated from the weighted comparisons you define in the rule. Here are the comparisons I’ve used in my Match Customer rule:

Domain           Similarity   Weight   Prerequisite
Birth Date       Exact                 X
Email Address    Exact        20
Postal Code      Exact        10
Country/Region   Exact        10
First Name       Similar      10
Last Name        Similar      10
Street Address   Similar      20
City             Similar      10
State            Similar      10

Note that an exact match of the Birth Date domain is specified as a prerequisite. In other words, only records where the birth date is an exact match will be considered as candidates for a potential duplicate. Prerequisite domains in a matching rule must use the Exact similarity and have no weighting value. All of the other comparisons are based on an exact or similar match and have weightings, which add up to a total of 100.

Assuming the birth date for the records being compared is a match, DQS then makes the other comparisons defined in the matching rule and adds the specified weighting value for each comparison that is true to produce an overall score. For example, consider two records with identical Birth Date values being compared using the Match Customer rule defined above. If the Email Address domains for both records are an exact match, 20 is added to the score. If the First Name domains are similar (for example, “Rob” and “Robert”), another 10 is added to the score, and so on until all of the comparisons in the rule have been made. The resulting score is then compared to the minimum matching score defined for the matching rule (in this case 80). If the score exceeds the minimum matching score, then the records are considered a match. Multiple records that are considered matches for one another are grouped into a cluster.
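To make the arithmetic concrete, here’s a rough T-SQL sketch of how the Match Customer rule’s score accumulates for each candidate pair. DQS uses its own internal fuzzy-matching algorithms for the “Similar” comparisons, so the simple equality checks below are only stand-ins, and the table and column names are assumptions:

```sql
-- Rough sketch of the weighted score for each candidate pair (a, b).
-- The prerequisite (exact Birth Date match) is applied in the join; the CASE
-- expressions approximate the weighted comparisons. DQS's real "Similar"
-- comparisons use fuzzy matching rather than simple equality.
SELECT a.CustomerID AS RecordA,
       b.CustomerID AS RecordB,
       CASE WHEN a.EmailAddress  = b.EmailAddress  THEN 20 ELSE 0 END
     + CASE WHEN a.PostalCode    = b.PostalCode    THEN 10 ELSE 0 END
     + CASE WHEN a.CountryRegion = b.CountryRegion THEN 10 ELSE 0 END
     + CASE WHEN a.FirstName     = b.FirstName     THEN 10 ELSE 0 END  -- "Similar" in DQS
     + CASE WHEN a.LastName      = b.LastName      THEN 10 ELSE 0 END  -- "Similar" in DQS
     + CASE WHEN a.StreetAddress = b.StreetAddress THEN 20 ELSE 0 END  -- "Similar" in DQS
     + CASE WHEN a.City          = b.City          THEN 10 ELSE 0 END  -- "Similar" in DQS
     + CASE WHEN a.[State]       = b.[State]       THEN 10 ELSE 0 END  -- "Similar" in DQS
       AS MatchScore
FROM dbo.Customers AS a
JOIN dbo.Customers AS b
  ON a.CustomerID < b.CustomerID      -- compare each pair once
 AND a.BirthDate  = b.BirthDate;      -- prerequisite: exact Birth Date match
-- Pairs with MatchScore >= 80 (the minimum matching score) are treated as matches.
```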

After you have defined the matching rules, you can use them to find matches in the sample data you mapped earlier. This gives you the opportunity to verify that the rules behave as expected against a known dataset. In this case, the dataset results in a single cluster of matches that includes two records – one for Daniel Garcia and another for Dan Garcia.

Picture4

Now that I’ve defined my matching policy, I can publish the knowledge base and allow the data stewards in my organization to use it for data matching.

To use a knowledge base to perform data matching, create a new data quality project, specify the knowledge base, and specify the Matching activity as shown here.

Picture5

The first step, as it is in any DQS project, is to map the fields in your data source to the domains in the knowledge base. Just as before, the data source can be a table in a SQL Server database or an Excel file. This time, I’m using the Customers table in the Staging SQL Server database.

Picture6

After you’ve mapped the domains, you can start the matching process. When the process is complete, the clusters of matched records are displayed. In this case, there are two clusters, each containing two matches. At this stage, you can choose to reject any matches that you know aren’t duplicates.

Picture7

When the matches have all been identified, you can export the results to a SQL Server table or an Excel file. You can also export survivors (one record from each cluster that is chosen as the correct one) based on one of the following survivorship rules:

  • Pivot record – A record in the cluster that is chosen arbitrarily by DQS.
  • Most complete and longest record – The record that has the fewest null field values and the longest overall data length.
  • Most complete record – The record that has the fewest null fields.
  • Longest record – The record that has the longest overall data length.

Picture8
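As a rough idea of what the “Most complete record” rule is doing, here’s a hedged T-SQL sketch that picks one survivor per cluster by counting null fields. DQS performs this selection for you during the export; the table, cluster, and column names below are assumed purely for illustration:

```sql
-- Hedged sketch of "most complete record" survivorship: keep the record in each
-- match cluster with the fewest NULL fields. (Schema names are assumed.)
WITH Ranked AS (
    SELECT m.*,
           ROW_NUMBER() OVER (
               PARTITION BY m.ClusterId
               ORDER BY (CASE WHEN m.EmailAddress  IS NULL THEN 1 ELSE 0 END)
                      + (CASE WHEN m.StreetAddress IS NULL THEN 1 ELSE 0 END)
                      + (CASE WHEN m.City          IS NULL THEN 1 ELSE 0 END)
                      + (CASE WHEN m.PostalCode    IS NULL THEN 1 ELSE 0 END)
           ) AS CompletenessRank
    FROM dbo.MatchedCustomers AS m
)
SELECT *
FROM Ranked
WHERE CompletenessRank = 1;   -- one survivor per cluster
```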

The exported results include all of the source data, with additional columns that identify the cluster each matched record belongs to, the matching rule used, the score calculated for each match, and the pivot record for each match cluster.

Picture9

The exported survivors contain all of the non-matching records from the original data source and one version of each matched record based on the survivorship rule you selected. In the following example, I’ve highlighted the surviving records from my matching process.

Picture10

In some cases, you can simply replace the original data set with the survivor records to create a de-duplicated set of records. However, in most business scenarios you’ll need to apply some logic (manual or automated) to handle relationships between duplicate records and other tables. For example, before I eliminate the duplicate customer records identified by the matching process in the above example, I would need to reassign any sales orders that are currently related to those customer records to the surviving records.
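Here’s a hedged sketch of that reassignment step, assuming the matching results have been used to build a mapping table that pairs each duplicate customer key with its surviving record (both table names are hypothetical):

```sql
-- Repoint sales orders from duplicate customer records to the surviving record.
-- dbo.CustomerSurvivorMap is a hypothetical table derived from the exported
-- matching results (DuplicateCustomerKey -> SurvivorCustomerKey).
UPDATE o
SET    o.CustomerKey = m.SurvivorCustomerKey
FROM   dbo.SalesOrders AS o
JOIN   dbo.CustomerSurvivorMap AS m
       ON o.CustomerKey = m.DuplicateCustomerKey;
```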

Hopefully you’ve found this brief introduction to data matching with DQS useful. To learn more about DQS and its use in a data warehousing solution, you can attend Microsoft Official Curriculum (MOC) course 10777A: Implementing a Data Warehouse with SQL Server 2012.

Saturday 7 April 2012

Cleansing Data with SQL Server 2012 Data Quality Services

I’ve been a bit quiet on the blogging side of things for a while, and in my defence I’ve been pretty heads-down working as a vendor for Microsoft as the lead author for a couple of new courses on SQL Server 2012 data warehousing and BI (courses 10777A and 10778A if you’re interested). As part of this work, I’ve been exploring the new data cleansing capabilities in SQL Server 2012 Data Quality Services (DQS). This article is a simple walkthrough of how to use DQS to cleanse data as part of an Enterprise Information Management (EIM) or Extract, Transform, and Load (ETL) solution.

So, what is data cleansing all about then? Well, most people involved in building or managing data-driven applications and BI solutions will have come across the problem of inconsistent or invalid data values in columns (or “domains”) that are used for business analysis. For example, let’s suppose your database stores customer data, including the customer’s address, and you want to count customer sales by country. When customers or sales employees enter customer data into the system, it’s perfectly possible (and actually quite likely given a large enough volume of customers) that some values will be either entered incorrectly (for example “Unted States” instead of “United States”) or inconsistently (for example, some customers may enter “United States”, others may enter “USA”, and others still may enter “America”). When you try to aggregate sales by country, you’ll end up with inaccurate counts because there are several values in use for the same country.
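A hedged illustration of the problem, using assumed table and column names:

```sql
-- Inconsistent values split what should be one group into several.
-- (Table and column names are illustrative.)
SELECT Country, COUNT(*) AS CustomerCount
FROM dbo.Customers
GROUP BY Country
ORDER BY Country;
-- Returns separate rows for 'America', 'Unted States', 'United States', and 'USA',
-- so the figure for the US is fragmented across four different values.
```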

Here’s an Excel workbook containing a subset of data extracted from a SQL Server database table to show some typical data quality problems.

Picture1

Note that the data contains a number of problems, including:

  • The City column contains “New York” and “NYC” for New York City.
  • The Country column contains “United States” and “USA” for the US.
  • The Country column also contains “United Kingdom” and “Great Britain” for the UK.

DQS enables you to address this problem by cleansing the data based on a known set of values and rules for the key domains (columns) that exist in your datasets. It does this by enabling you to create and maintain a knowledge base that contains the known valid values for a related set of domains, along with validation rules (for example, an EmailAddress value must include a “@” character) and synonyms that can be corrected to a leading value (for example, correcting “USA” and “America” to the leading value “United States”). After you have created a knowledge base, you can use it to cleanse any data that includes the same domains (so, for example, if you create a knowledge base for geographical domains such as City, State, and Country, you can use it to cleanse any data that includes these fields – such as customer data or employee address data). SQL Server 2012 includes the Data Quality Services Client tool (shown below), which you can use to create, maintain, and use DQS knowledge bases.

Picture2

When you create a new knowledge base, you can do so from scratch, or you can use an existing knowledge base as a starting point. SQL Server 2012 ships with a pre-existing knowledge base for US-based demographic data named DQS Data, and in this example, I’ll use it as the basis for my own CustomerKB knowledge base as shown below.

Picture3

The DQS Data knowledge base includes a number of pre-defined domains, as shown in the image above. I only need some of these domains, and I’ll need to add some additional ones that are specific to my own data; so I’ve initially selected the Domain Management activity as I create the CustomerKB knowledge base. I only intend to use the Country/Region, US – Last Name, and US – State domains from the DQS Data knowledge base, so I’ll delete the others. The domains I’m retaining contain official values for country and US state names, and common last names (surnames) based on US demographic data such as the 2000 US census.

Picture4

Since my customer data includes records for customers all over the world, I’ll rename the domains in my knowledge base to remove the “US” prefix. I’ll also add a new domain named City so that I can validate city names in the data.

Picture5

Note that I can select each domain and view the known values that are currently defined in the knowledge base as shown below. The City domain has no known values (because I’ve just created it), and the others have inherited values from the DQS Data knowledge base. The image below shows the known values for the Country/Region domain. Note that the knowledge base defines leading values for each country (such as “Afghanistan”) and synonyms that, while valid in their own right, should be corrected to the leading value to ensure consistency.

Picture6

I’ve now completed my initial knowledge base, so I’m ready to finish the domain management activity. Clicking Finish produces a prompt to publish the knowledge base as shown below, but before I’m ready to use it I want to populate the known values for the City domain from my existing data by performing some knowledge discovery; so I’ll click No.

Picture7

Knowledge Discovery is an activity in which you connect to a data source and map fields in the source to domains in the knowledge base. DQS can then use the data source to discover new values for the domains defined in the knowledge base. The first step in this process is to open the knowledge base for the Knowledge Discovery activity as shown here. Note that the activity is performed using a wizard interface, with a sequence of steps.

Picture8

After opening the knowledge base, I need to select a data source (I’m using the Excel workbook we saw earlier), and map the columns in the data source to the domains in the knowledge base as shown below. Note that the data source can include columns that are not mapped to domains, and does not need to include a column for every domain in the knowledge base. However, only the mapped domains will be included in the knowledge discovery process.

Picture9

On the next page, I can start the data discovery analysis. DQS will read the source data and identify new values for the domains in the knowledge base, as shown here.

Picture10

On the final page of the wizard, you can view the values that have been discovered for each domain. In this example, the values discovered for the City domain include New York and NYC, as shown below. I can identify these as synonyms by selecting them both and clicking the Set selected domain values as synonyms button.

Picture11

The value I selected first becomes the leading value, as shown here.

Picture12

For the Country/Region domain, DQS has discovered a new “Great Britain” value. I can mark this as invalid and specify an existing value to which it should be corrected (in this case, “United Kingdom”).

Picture13

Clearing the Show Only New checkbox reveals the values that already existed before knowledge discovery, and I can see that “Great Britain” is now under the “United Kingdom” leading value. I can also see that there were 151 instances of the existing “United States” value found, along with a further 42 instances of “USA”, which was already specified as a synonym for “United States”.

Picture14

Now I’m ready to finish the knowledge discovery activity and publish the knowledge base.

After you have published a knowledge base, you can use it to cleanse data from any data source containing columns that can be mapped to the domains defined in it. The simplest way to do this is to create a new data quality project based on the knowledge base and specify the Cleansing activity, as shown here.

Picture15

Again, the activity takes the form of a wizard with sequential steps. The first step is to map the columns in the data source to the domains in the knowledge base, just as I did when performing the knowledge discovery activity previously; only this time I’m using the full Customers table in my CustomerDB SQL Server database instead of the sample data I had extracted to Excel.

Picture16

Next, I run the cleansing process and DQS applies the knowledge base to the source data to identify corrected and suggested values. Corrected values are corrections DQS makes to the data based on known rules and synonyms. Suggested values are further possible corrections or new values that are generated based on a number of data quality heuristics that DQS uses when analyzing data.

Picture17

On the next page, on the Suggestions tab for each domain, I can view the suggestions identified by DQS. Here, DQS has identified a City domain value of “W. York”, which is sufficiently similar to the known value “York” for a correction to be suggested. Note that I can select the value and view the records that contain it to verify that “W. York” is commonly being used to denote “York” in England (as opposed, for example, to “New York” in the United States). I can then choose to approve or reject individual instances of the correction, or accept/reject the suggestion that “W. York” should be considered a synonym of “York” (if I approve the suggestion) or added as a new known value in its own right (if I reject the suggestion).

Picture18

On the New tab, I can view the new values that were discovered for the domain. In this case, a number of new values were identified for the City domain, including Bracknell in England.

Picture19

On the Corrected tab, I can view the values that were corrected based on pre-existing known synonyms or suggestions that I have approved.

Picture20

After reviewing the results of the cleansing activity, I can export the cleansed data to a SQL Server database table, a .csv file, or an Excel workbook. Note that I can choose to export just the cleansed data values, or I can include the cleansing information for further analysis.

Picture21

The exported results are shown in the following image. Note that the results include all of the source columns, and that for each column that was mapped to a domain there are five columns in the results: the source value, the output value, the reason for any corrections, the level of confidence (between 0 and 1) for the correction, and the status of the column (correct or corrected).

Picture22
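If you export to a SQL Server table, a quick query along these lines lets a data steward focus on the lower-confidence corrections. The column names follow the Source/Output/Reason/Confidence/Status pattern described above, but the exact names depend on your domain names, so treat them as assumptions:

```sql
-- Review corrected Country values whose confidence falls below an arbitrary
-- threshold. (Table and column names are assumed; they follow the pattern of
-- the exported cleansing results described above.)
SELECT Country_Source,
       Country_Output,
       Country_Reason,
       Country_Confidence
FROM dbo.CleansedCustomers
WHERE Country_Status = 'Corrected'
  AND Country_Confidence < 0.8;
```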

By creating a data cleansing project, a business user who understands the data domains can act as a “data steward” and enforce the quality of the data in application databases or analytical and reporting systems. Additionally, when you are confident in the ability of your knowledge base to cleanse data, you can incorporate DQS data cleansing into a SQL Server Integration Services (SSIS) data flow that extracts data from a source as part of an ETL process for data warehousing or EIM. The following image shows an SSIS data flow that includes the DQS Cleansing transformation.

Picture23

In this example, the CustomerDB data source uses an OLE DB connection to extract data from the Customers table in SQL Server. The DQS Cleansing transformation is then configured to use the CustomerKB knowledge base and map the appropriate columns from the data source to domains for cleansing, as shown here.

Picture24     Picture25

The Staging DB destination uses an OLE DB connection to load data from the data flow into a staging table as part of an ETL process for a data warehousing solution. The output columns for the mapped domains are used to load the cleansed values into the staging table, as shown here.

Picture26

Running the SSIS package extracts the source data, applies the DQS knowledge base to cleanse the mapped columns, and loads the cleansed data into the staging database as shown here.

Picture27

This walkthrough provides a simple example of how you can use DQS to cleanse data and improve data quality for reporting and analysis. There are a number of additional features of DQS that are not shown here, including the ability to define composite domains that consist of multiple columns and the ability to include external reference cleansing data from the Windows Azure Data Market in your knowledge base (for example to apply post code validation and correction rules based on standard data from a postal service authority). You can learn more about using DQS to cleanse data by attending course 10777A: Implementing a Data Warehouse with SQL Server 2012.