Monday 23 July 2012

PowerPivot and Power View in Excel 2013

It’s just typical of my job that, just a few short weeks after the publication of some Microsoft Official Curriculum courses that I’ve spent months working on, Microsoft should choose to make a preview of the next release of the software on which they are based available! As you may know, we recently published courses 10778A and 40009A, both of which make use of the PowerPivot and Power View features in Excel and SharePoint 2010; so it was with a certain amount of trepidation that I installed the preview of Office 2013 to get a first look at the enhancements that have been made.

The first, and most obvious, change is that the PowerPivot add-in for Excel no longer needs to be installed from a separate package. It’s built into Excel and only needs to be enabled, which you do by configuring the COM Add-ins in Excel’s options as shown here.

Picture1

Note that there’s also a Power View add-in – more about that later!

After the PowerPivot add-in has been enabled, users will see the POWERPIVOT tab on the ribbon, as shown here.

Picture2

With this ribbon, you can not only manage a PowerPivot tabular data model for the workbook as you can in Excel 2010, but you can also create calculated fields and KPIs without having to directly edit the model – making the process a little bit more intuitive for information workers.

Clicking Manage opens the PowerPivot window, which is similar to that of the previous release. There are a few enhancements, of course, but anyone familiar with PowerPivot in Excel 2010 will find themselves in familiar territory. In this case, I’ve opened a PowerPivot workbook I created with Excel 2010 based on data in the AdventureWorksDW SQL Server sample database. The changes to this release meant that I was prompted to allow Excel to update the data model and re-save the workbook, so one thing to be aware of is that you can open (and update) Excel 2010 PowerPivot workbooks in Excel 2013, but after they’ve been updated you won’t be able to open them in Excel 2010. You can see the diagram view of my PowerPivot data model below – note that it includes a hierarchy in the Sales Territory table.

Picture3

After you’ve created the data model in your workbook, you can use it as a source for PivotTables, just as you could in Excel 2010. There are, however, one or two nice enhancements on a new ANALYZE tab of the ribbon that make it easier to do things like create slicers. Another new feature is the ability to create timeline filters that make it easier to analyse data based on chronological periods. To add a timeline, just click Insert Timeline and specify any of the time-based attributes that Excel identifies as having a suitable relationship in the model.

Picture4

After you’ve inserted a timeline, you can use it to filter the data in the PivotTable as shown here.

Picture5

Earlier, I mentioned that Excel 2013 includes a Power View add-in. This enables information workers to create Power View reports from the data model in the workbook (and external data sources). Previously, Power View was only available in SharePoint Server 2010, but in Office 2013 you can use it right there in an Excel workbook.

To create a Power View report from the data model in the workbook, just click Power View on the INSERT tab of the ribbon.

Picture6

If necessary, you’ll be prompted to install Silverlight (which is required by the Power View add-in), and after doing so you’ll be able to create a Power View report from the data in your PowerPivot data model as shown here.

Picture7

Note that you can include hierarchies in a Power View report, which wasn’t supported in the previous release. There are several other enhancements in this release, including support for new data visualizations (such as pie charts), and even visualization of geographical data on a Bing Maps map, as shown here.

Picture8

This short article just highlights a few of the improvements to PowerPivot and Power View in Excel 2013. There are many more new features in Excel, as well as greater ability to share BI capabilities across the enterprise through enhancements in SharePoint 2013 and SQL Server 2012 SP1, which I look forward to exploring in more depth.

Tuesday 3 July 2012

Matching Data with SQL Server 2012 Data Quality Services

In a previous post, I described how you can use Data Quality Services (DQS) to create a knowledge base for the domains (data columns) used in your business data and use it to cleanse data by correcting invalid or inconsistent values. Data cleansing is, however, only one side of the coin when it comes to DQS. You can also use DQS to perform data matching – in other words, finding records that are potential duplicates of one another and consolidating them into a single surviving record.

When you think about it, the potential for duplicate data entry in most complex business environments is enormous. For example, let’s imagine an e-commerce site where customers need to register before placing orders. It’s perfectly conceivable that a customer who only uses the site occasionally might re-register with slightly different details because they’ve forgotten that they had registered previously or can’t remember their login credentials. Even if the site applies a policy that demands a unique email address for each registration, there’s nothing to stop the same customer registering multiple times with different email addresses. For an individual sales order, the fact that the customer is registered multiple times is inconsequential – as long as the payment and delivery address details are correct, the order can be processed successfully. However, when the company wants to use its data to perform any kind of business intelligence (BI) reporting or analysis that aggregates information per customer, the duplicate entries can lead to misleading results.

To use DQS to match data, you must first add a matching policy to a knowledge base. You can use an existing knowledge base that is also used for data cleansing, or you can create a knowledge base specifically for data matching. In this example, I’m opening an existing knowledge base that contains domains for customer records and selecting the Matching Policy activity.

Picture1

Just as when performing knowledge discovery, I need to map some sample data to the domains defined in the knowledge base. This enables me to test the matching policy against a known data set as I build it, and therefore verify that it successfully identifies known duplicate records. In this case, I’m using data in an Excel workbook as the source for my sample data, but you can also use a table in a SQL Server database.

Picture2

Having mapped sample data to the domains, I can now define the matching rules for my matching policy. You can include multiple rules, and each one uses a set of weighted comparisons of domain values to identify clusters of records that are potential duplicates of one another.

Picture3

Potential matches are determined based on a score that is calculated from the weighted comparisons you define in the rule. Here are the comparisons I’ve used in my Match Customer rule:

Domain            Similarity   Weight   Prerequisite
Birth Date        Exact                 X
Email Address     Exact        20
Postal Code       Exact        10
Country/Region    Exact        10
First Name        Similar      10
Last Name         Similar      10
Street Address    Similar      20
City              Similar      10
State             Similar      10

Note that an exact match of the Birth Date domain is specified as a prerequisite. In other words, only records where the birth date is an exact match will be considered as candidates for a potential duplicate. Prerequisite domains in a matching rule must use the Exact similarity and have no weighting value. Each of the other domains is compared using either an exact or similar match and carries a weighting, and the weightings add up to a total of 100.

Assuming the birth date for the records being compared is a match, DQS then makes the other comparisons defined in the matching rule and adds the specified weighting value for each comparison that is true to produce an overall score. For example, consider two records with identical Birth Date values being compared using the Match Customer rule defined above. If the Email Address values for both records are an exact match, 20 is added to the score. If the First Name domains are similar (for example, “Rob” and “Robert”), another 10 is added to the score, and so on until all of the comparisons in the rule have been made. The resulting score is then compared to the minimum matching score defined for the matching rule (in this case 80). If the score exceeds the minimum matching score, the records are considered a match. Multiple records that are considered matches for one another are grouped into a cluster.
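
To make that arithmetic concrete, here’s a rough Python sketch of this kind of weighted scoring. It’s purely illustrative – the record values are hypothetical and the similar() function is a naive stand-in for the more sophisticated similarity algorithms DQS actually uses – but it shows how the prerequisite, the weights and the minimum matching score interact.

```python
# Illustrative sketch of weighted match scoring - not the actual DQS engine.

def similar(a, b):
    # Naive stand-in for DQS string similarity: case-insensitive equality or
    # one value being a prefix of the other (e.g. "Rob" and "Robert").
    a, b = str(a).strip().lower(), str(b).strip().lower()
    return a == b or a.startswith(b) or b.startswith(a)

# Comparisons from the Match Customer rule above: (domain, comparison, weight).
PREREQUISITES = ["Birth Date"]      # must match exactly; carries no weight
COMPARISONS = [
    ("Email Address",  "Exact",   20),
    ("Postal Code",    "Exact",   10),
    ("Country/Region", "Exact",   10),
    ("First Name",     "Similar", 10),
    ("Last Name",      "Similar", 10),
    ("Street Address", "Similar", 20),
    ("City",           "Similar", 10),
    ("State",          "Similar", 10),
]
MIN_MATCHING_SCORE = 80

def match_score(rec1, rec2):
    """Return the weighted score for a pair of records, or 0 if a prerequisite fails."""
    for domain in PREREQUISITES:
        if rec1[domain] != rec2[domain]:
            return 0                # prerequisite not met - never a candidate
    score = 0
    for domain, comparison, weight in COMPARISONS:
        if comparison == "Exact" and rec1[domain] == rec2[domain]:
            score += weight
        elif comparison == "Similar" and similar(rec1[domain], rec2[domain]):
            score += weight
    return score

# Hypothetical pair of records: identical except for the first name and city.
daniel = {"Birth Date": "1970-03-12", "Email Address": "d.garcia@example.com",
          "Postal Code": "98052", "Country/Region": "United States",
          "First Name": "Daniel", "Last Name": "Garcia",
          "Street Address": "1 Main Street", "City": "Redmond", "State": "WA"}
dan = dict(daniel, **{"First Name": "Dan", "City": "Bellevue"})

score = match_score(daniel, dan)    # 90: every comparison scores except City
print(score, "match" if score >= MIN_MATCHING_SCORE else "no match")
```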

After you have defined the matching rules, you can use them to find matches in the sample data you mapped earlier. This gives you the opportunity to verify that the rules behave as expected against a known dataset. In this case, the dataset results in a single cluster of matches that includes two records – one for Daniel Garcia and another for Dan Garcia.

Picture4

Now that I’ve defined my matching policy, I can publish the knowledge base and allow the data stewards in my organization to use it for data matching.

To use a knowledge base to perform data matching, create a new data quality project, specify the knowledge base, and specify the Matching activity as shown here.

Picture5

The first step, as it is in any DQS project, is to map the fields in your data source to the domains in the knowledge base. Just as before, the data source can be a table in a SQL Server database or an Excel file. This time, I’m using the Customers table in the Staging SQL Server database.

Picture6

After you’ve mapped the domains, you can start the matching process. When the process is complete, the clusters of matched records are displayed. In this case, there are two clusters, each containing two matches. At this stage, you can choose to reject any matches that you know aren’t duplicates.

Picture7

When the matches have all been identified, you can export the results to a SQL Server table or an Excel file. You can also export survivors (one record from each cluster, chosen as the correct one) based on one of the following survivorship rules (there’s a quick sketch of one of them after the list):

  • Pivot record – A record in the cluster that is chosen arbitrarily by DQS.
  • Most complete and longest record – The record that has the fewest null field values and the longest overall data length.
  • Most complete record – The record that has the fewest null fields.
  • Longest record – The record that has the longest overall data length.
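
Here’s a quick Python sketch of how the “Most complete and longest record” rule might pick a survivor from a cluster. It’s just an illustration of my reading of the rule (fewest empty fields first, then longest overall data length), not how DQS implements survivorship internally.

```python
# Illustrative sketch of the "Most complete and longest record" survivorship rule.

def completeness_and_length(record):
    values = [v for v in record.values() if v not in (None, "")]
    filled = len(values)                          # most complete = most non-empty fields
    total_length = sum(len(str(v)) for v in values)
    return (filled, total_length)

def choose_survivor(cluster):
    """Pick the surviving record from a cluster of matched records."""
    return max(cluster, key=completeness_and_length)

cluster = [
    {"First Name": "Dan",    "Last Name": "Garcia", "City": None,      "Phone": ""},
    {"First Name": "Daniel", "Last Name": "Garcia", "City": "Redmond", "Phone": "555-0100"},
]
print(choose_survivor(cluster)["First Name"])     # Daniel - more complete and longer
```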

Picture8

The exported results include all of the source data, with additional columns that identify the cluster each matched record belongs to, the matching rule used and the score calculated for each match, and the pivot record for each cluster.

Picture9

The exported survivors contain all of the non-matching records from the original data source and one version of each matched record based on the survivorship rule you selected. In the following example, I’ve highlighted the surviving records from my matching process.

Picture10

In some cases, you can simply replace the original data set with the survivor records to create a de-duplicated set of records. However, in most business scenarios you’ll need to apply some logic (manual or automated) to handle relationships between duplicate records and other tables. For example, before I eliminate the duplicate customer records identified by the matching process in the above example, I would need to reassign any sales orders that are currently related to those customer records to the surviving records.
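
As a simple illustration of that reassignment step, here’s a minimal pandas sketch. The table and column names (CustomerKey, OrderKey) and the survivor mapping are hypothetical – they aren’t part of the DQS output format – but the pattern of remapping the related keys before dropping the duplicates is the important part.

```python
import pandas as pd

# Hypothetical staging data: customers (with a duplicate) and their orders.
customers = pd.DataFrame({
    "CustomerKey": [1, 2, 3],
    "Name": ["Daniel Garcia", "Dan Garcia", "Kim Abercrombie"],
})
orders = pd.DataFrame({
    "OrderKey": [101, 102, 103],
    "CustomerKey": [1, 2, 3],
})

# Mapping from each duplicate CustomerKey to its surviving CustomerKey,
# derived from the matching results (here, customer 2 duplicates survivor 1).
survivor_map = {2: 1}

# Reassign related orders to the survivors, then drop the duplicate customers.
orders["CustomerKey"] = orders["CustomerKey"].replace(survivor_map)
customers = customers[~customers["CustomerKey"].isin(survivor_map.keys())].reset_index(drop=True)

print(orders)
print(customers)
```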

Hopefully you’ve found this brief introduction to data matching with DQS useful. To learn more about DQS and its use in a data warehousing solution, you can attend Microsoft Official Curriculum (MOC) course 10777A: Implementing a Data Warehouse with SQL Server 2012.