Friday, 12 July 2013

Migrating SQL Server Databases to Windows Azure

When the IT industry started getting excited about this thing called “the cloud” a few years ago, there were many (myself included) who were sceptical about the willingness of organizations to abandon their own on-premises infrastructure and start moving business applications to the web. At first, the idea of relying on IT services hosted and managed by some Internet provider seemed dubious at best – sure, individual consumers could use an email server provided by their ISP, but businesses need to manage their own IT. Don’t they?
Then, gradually, we started to make concessions.
…OK, maybe a hosted Exchange Server would reduce some administrative overheads.
…Yes, maybe using a hosting service can provide better availability for some Web applications.
…Alright, using a software-as-a-service solution for CRM might reduce licensing and hardware costs.
Fast forward to today, and we’ve come a long way – to the point where even that stalwart of enterprise software packages, Microsoft Office, is now delivered to 1 in 4 of Microsoft’s enterprise customers in the form of the Office 365 cloud service rather than as on-premises desktop software. Despite the initial doubts of the naysayers, it’s beginning to look like this cloud thing might just catch on after all! Most of the applications we use every day, as consumers and increasingly as employees, are delivered as cloud services that we can consume anywhere and on an increasing array of mobile devices. Organizations are gaining this extended reach while at the same time reducing overheads for hardware, software licensing, maintenance, and other costs – a win-win scenario if ever I saw one.
However, there’s always been one section of the IT community that is even more conservative than the Finance department. One group of IT professionals that regards even the smallest change with deep suspicion. One last bastion of fierce resistance to newfangled trends. I’m talking, of course, about database administrators. You can move my Exchange Server to the cloud. You can store all of our documents in SharePoint Online. You can even deliver Office applications through a browser. But you’ll have to take my on-premises database from my cold, dead hands!
But even that resistance is beginning to crumble. It makes sense for web-hosted applications to store their data on the web, and as more and more IT services are moved to the cloud, it also makes sense to include traditional business application data stores as a part of that migration. The key issues that need to be addressed are:
  • Can the data be moved to the cloud without compromising security or compliance requirements?
  • Can a hosted solution cope with the volume of data in our databases, and support our transactions without compromising performance?
  • Can a hosted database meet our availability and disaster recovery requirements?
  • Can we migrate some, but not all, of our databases – and still retain centralized, consistent administration of all data stores?
Then, assuming that the answer to those four questions is “yes” (and in many cases, it is), the only remaining question is:
  • Can we migrate our databases with minimal interruption of service, and without breaking our existing applications?
Well, let’s see how Windows Azure shapes up in terms of these critical questions.
Windows Azure is Microsoft’s cloud platform, and it underpins many of the cloud services that the company offers. It also provides Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS) solutions that organizations can use to build new solutions, and migrate existing IT services to the cloud. In terms of database offerings, there are two primary options to consider:
  • Windows Azure SQL Database – a PaaS offering that enables you to host data in a (mostly) SQL Server-compatible service without having to worry about configuring and managing hardware or the operating system.
  • SQL Server in a Windows Azure Virtual Machine – an IaaS offering that is exactly what it sounds like – a virtual machine running Windows with SQL Server installed in it.

Windows Azure SQL Database

OK, let’s start with Windows Azure SQL Database. This was formerly known as SQL Azure, and it provides many of the data storage and management features of SQL Server without the need to manage the operating system or SQL Server instance. If your database consists of traditional relational data in tables, with views and stored procedures used to provide a layer of abstraction, then Windows Azure SQL Database may well be a good option for you. It can’t support “special” data, such as spatial or XML data types, and there are a few other limitations – see General Guidelines and Limitations (Windows Azure SQL Database) on MSDN for details – but it supports all of the functionality required by a large percentage of typical business application databases.
So, how does it stack up against the questions we asked earlier?
Can the data be moved to the cloud without compromising security or compliance requirements?
It depends on your specific requirements; but network connectivity to the database can be restricted to a specific range of IP addresses, and connections can be made over SSL. Client requests are authenticated using SQL Server native authentication based on a login and password, and the same permissions-based authorization scheme used in SQL Server is used to control access to tables and other database objects. In terms of compliance policies, you have control over the geographic region in which the data center hosting your Windows Azure SQL Database server is located.
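For example, at the time of writing you can manage the server-level firewall rules with Transact-SQL as well as through the management portal. Here’s a minimal sketch, run against the master database of your Windows Azure SQL Database server (the rule name and IP range are purely illustrative):

    -- Allow connections only from an illustrative range of IP addresses.
    EXEC sp_set_firewall_rule
        @name = N'HeadOfficeRange',
        @start_ip_address = '203.0.113.1',
        @end_ip_address = '203.0.113.254';

    -- Remove the rule when it is no longer needed.
    EXEC sp_delete_firewall_rule @name = N'HeadOfficeRange';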
Can a hosted solution cope with the volume of data in our databases, and support our transactions without compromising performance?
You specify the size of your database as you create it, but it can grow to a maximum of 150 GB. Additionally, you can use federations to partition data across multiple databases to increase scalability and performance.
Can a hosted database meet our availability and disaster recovery requirements?
Windows Azure provides built-in resiliency by internally replicating your database across three redundant storage locations within the data center where your server is hosted. You can back up a SQL Database by copying it to another SQL Database.
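The copy-based backup is just a Transact-SQL statement. A minimal sketch, run in the master database of the target server (the server and database names are placeholders):

    -- Creates a transactionally consistent copy of the source database.
    CREATE DATABASE SalesDB_Copy AS COPY OF myazureserver.SalesDB;

    -- The copy runs asynchronously; you can check its progress from master.
    SELECT * FROM sys.dm_database_copies;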
Can we migrate some, but not all, of our databases – and still retain centralized, consistent administration of all data stores?
Because SQL Database is a PaaS offering, much of the physical server administration you would typically need to perform for a SQL Server instance is handled for you by Windows Azure. For logical administration tasks, such as managing users or creating database objects, you can use SQL Server Management Studio to connect to Windows Azure SQL Database – enabling you to manage on-premises SQL Server instances and cloud-based Windows Azure SQL Database instances in the same tool.
Can we migrate our databases with minimal interruption of service, and without breaking our existing applications?
Well, let’s take a look at this demonstration and find out:

As you can see, it’s really pretty straightforward to migrate an on-premises SQL Server database to Windows Azure SQL Database. If your database doesn’t depend on any features of SQL Server that aren’t supported by Windows Azure SQL Database, and you’re happy to let Windows Azure look after the physical configuration of your database server, then Windows Azure SQL Database is a good option for you.

SQL Server in a Windows Azure Virtual Machine

So, what about applications where you need to support SQL Server capabilities that aren’t available in Windows Azure SQL Database? Or where you specifically want control over the operating system and server-level configuration of your database server?
In this case, provisioning a virtual machine in Windows Azure might be a better option. There are a number of pre-defined virtual machine images, some of which include an installation of SQL Server; and if none of them suits you, there’s always the option to create your own and install whatever software you require. So how does this option meet our data migration requirements?
Can the data be moved to the cloud without compromising security or compliance requirements?
As with Windows Azure SQL Database, you can choose the geographic region of the data center where your virtual machine will be hosted. Access to the server is controlled through an endpoint that you must define for your Windows Azure virtual machine, and all network connectivity to the virtual machine can be restricted by using Windows Firewall. You can use Windows Azure virtual networking to integrate virtual machines in Windows Azure with your corporate Active Directory infrastructure, and use Windows authentication to connect to SQL Server. Alternatively, you can use SQL Server native authentication through logins and passwords, or even set up certificate-based authentication – exactly the same as with an on-premises instance of SQL Server. Additionally, you can make use of SQL Server’s security-related capabilities such as transparent data encryption and auditing.
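As an illustration, enabling transparent data encryption inside the VM uses exactly the same Transact-SQL you would use on-premises. This is a minimal sketch with placeholder names and password:

    USE master;
    CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPasswordHere>';
    CREATE CERTIFICATE TDECert WITH SUBJECT = 'TDE certificate';

    USE SalesDB;
    CREATE DATABASE ENCRYPTION KEY
        WITH ALGORITHM = AES_256
        ENCRYPTION BY SERVER CERTIFICATE TDECert;

    ALTER DATABASE SalesDB SET ENCRYPTION ON;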
Can a hosted solution cope with the volume of data in our databases, and support our transactions without compromising performance?
When you provision a virtual machine in Windows Azure, you can specify the number of virtual cores and the amount of memory allocated to the VM. At the moment, the largest VM available has 8 cores and 56 GB of memory, but I’d expect that to get larger over time. The VM uses Windows Azure storage for its virtual hard disks, and you can add multiple VHDs and use filegroups to stripe data across them. This technique has been shown to improve IOPS performance.
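For example, here’s a minimal sketch of a database definition that stripes a filegroup across files on two data disks attached to the VM (the drive letters, sizes, and names are placeholders):

    CREATE DATABASE SalesDB
    ON PRIMARY
        (NAME = 'SalesDB_primary', FILENAME = 'F:\Data\SalesDB.mdf', SIZE = 10GB),
    FILEGROUP DataFG
        (NAME = 'SalesDB_data1', FILENAME = 'F:\Data\SalesDB_data1.ndf', SIZE = 50GB),
        (NAME = 'SalesDB_data2', FILENAME = 'G:\Data\SalesDB_data2.ndf', SIZE = 50GB)
    LOG ON
        (NAME = 'SalesDB_log', FILENAME = 'H:\Logs\SalesDB_log.ldf', SIZE = 10GB);

    -- Tables created on DataFG are striped proportionally across its files.
    ALTER DATABASE SalesDB MODIFY FILEGROUP DataFG DEFAULT;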
Can a hosted database meet our availability and disaster recovery requirements?
As with all Windows Azure storage blobs, the VHDs for the VM are replicated across three redundant physical data storage devices in the data center. Additionally, you can use SQL Server HA capabilities, such as AlwaysOn Availability Groups, to protect against the failure of a VM. You can back up databases in a Windows Azure VM just as you would for an on-premises instance of SQL Server, and use SQL Server Agent to automate backup tasks on a schedule.
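A routine backup inside the VM looks just as it would on-premises, and the same statement can be used as a SQL Server Agent job step (the path and database name are placeholders):

    BACKUP DATABASE SalesDB
    TO DISK = 'G:\Backups\SalesDB_Full.bak'
    WITH COMPRESSION, CHECKSUM, INIT;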
Can we migrate some, but not all, of our databases – and still retain centralized, consistent administration of all data stores?
SQL Server in a virtual machine in Windows Azure is still just SQL Server. You can use SQL Server Management Studio to connect to it, and you can use all of the same management tools and agents you use for your on-premises database servers.
Can we migrate our databases with minimal interruption of service, and without breaking our existing applications?
Once again, here’s a demonstration:

Note that this demonstration is based on pre-release software, and may not reflect what actually ships with SQL Server 2014. However, it’s clear that the intention is to include a simple, wizard-based tool that will help you easily migrate on-premises SQL Server databases to Windows Azure virtual machines.

Conclusion

Migration of IT services to the cloud is inevitable. There are simply too many cost, scalability, and mobility advantages to justify not doing it. However, I don’t think it will happen in one big mass movement – and in particular, I think corporate databases will be among the last elements to be migrated. For at least a while, probably many years, we’ll be living in a hybrid world where some data is managed on-premises, and other data is moved to the cloud. To support that scenario, we need tools and technologies that make it easy to move data from one place to the other, and to manage it consistently wherever it’s hosted.
The combination of SQL Server on-premises, Windows Azure SQL Database, and SQL Server in a Windows Azure virtual machine manages to pull this trick off well. With similar merging of private and public cloud network infrastructure support in Windows Server and System Center, the lines between “the cloud” and “the enterprise” are blurring to the point where, from an IT management perspective, it really doesn’t matter where a service is physically located.
If you want to learn more about Windows Azure database options, visit http://www.windowsazure.com/en-us/solutions/data-management/.

Friday, 5 July 2013

What the Heck is Hekaton?

SQL Server 2014 introduces a new in-memory OLTP capability that was previously known by its codename, “Hekaton”. The technology introduces two new concepts to SQL Server: memory-optimized tables and native stored procedures. This article explores these features and provides a simple demonstration of how to use them.

The idea of optimizing data access performance by using in-memory storage is not new. SQL Server has always used caching to keep recently accessed data in memory, and recent releases have seen the addition of in-memory technology for large-volume data analytics (PowerPivot and tabular models in Analysis Services) and high-performance table indexes that primarily benefit data warehouse workloads (columnstore indexes). What’s new in SQL Server 2014 is the ability to optimize an entire table for in-memory storage, effectively eliminating disk i/o for CRUD operations and massively improving query performance.

Memory-Optimized Tables

Memory-optimized tables are tables that you define using CREATE TABLE statements, in a similar fashion to traditional disk-based tables. However, memory-optimized tables are different from disk-based tables in the following ways:

  • The CREATE TABLE statement is used to generate a C struct, which is in turn compiled into a DLL and loaded into memory.
  • All data for the table is stored in memory, and all operations on the data occur in memory. By default, memory-optimized tables are durable (so they’re persisted to disk in order to survive restarts and support high-availability); but when the database is online, the table is always accessed directly in memory with no need to read pages from disk.
  • Columns in memory-optimized tables are indexed using hash indexes (range indexes may be supported in a later build), in which the result of hashing the indexed value determines the in-memory “bucket” in which the row is stored. Rows with the same hashed value are stored as a linked list within the bucket.
  • Table data is persisted to disk as a stream, not in 8K pages like a traditional table. The data must be stored in a filegroup that is created with the CONTAINS MEMORY_OPTIMIZED_DATA option (see the sketch after this list). Indexes are not persisted, and will be regenerated in the event of a restart.
  • Some data types – notably text, image, and nvarchar(max) – are not supported. Similarly, some features such as identity columns and foreign key constraints cannot be used in memory-optimized tables.
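Pulling those points together, here’s a minimal sketch based on the CTP 1 syntax (the database, file path, table, and bucket count are all illustrative, and the details may change before release):

    -- Add a filegroup that can hold memory-optimized data, plus a container for it.
    ALTER DATABASE SalesDB ADD FILEGROUP imoltp_fg CONTAINS MEMORY_OPTIMIZED_DATA;
    ALTER DATABASE SalesDB ADD FILE
        (NAME = 'imoltp_container', FILENAME = 'F:\Data\imoltp_container')
        TO FILEGROUP imoltp_fg;

    -- A durable memory-optimized table with a hash index on the primary key.
    CREATE TABLE dbo.MemoryOrders
    (
        OrderID      INT          NOT NULL
            PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1048576),
        OrderDate    DATETIME2    NOT NULL,
        CustomerCode NVARCHAR(10) NOT NULL
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);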

Native Stored Procedures

Memory-optimized tables can co-exist with disk-based tables, and you can execute Transact-SQL queries that contain joins between disk-based tables and memory-optimized tables. In fact, you can use Transact-SQL to query memory-optimized tables just like any other table, so you can improve the performance of some workloads by changing existing disk-based tables to memory-optimized tables without breaking existing applications that query them. The ability to use regular Transact-SQL to query memory-optimized tables is provided by an interop layer in the SQL Server engine that does the necessary work to convert Transact-SQL statements into C code that can access the in-memory data structures in the compiled DLL for the table.
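For example, a query like this one – joining the illustrative memory-optimized table from the sketch above to a hypothetical disk-based dbo.DimCustomer table – is handled transparently by the interop layer, with no application changes required:

    SELECT c.CustomerName, COUNT(*) AS OrderCount
    FROM dbo.MemoryOrders AS m
    JOIN dbo.DimCustomer AS c ON c.CustomerCode = m.CustomerCode
    GROUP BY c.CustomerName;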

However, if your application code only needs to access data in memory-optimized tables, you can further improve performance by using native stored procedures. Native stored procedures are created using the familiar CREATE PROCEDURE statement to define the Transact-SQL statements you want to execute. The code is then translated into C and compiled into a DLL, just like a memory-optimized table. The DLL is then loaded into memory, and since the instructions it contains are now compiled as native machine code, execution performance is greatly improved. There are some limitations in this release, and only the most commonly used Transact-SQL statements and functions are supported in native stored procedures; but for a large percentage of common database workloads, you should find that using memory-optimized tables and native stored procedures can significantly improve application performance.
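Continuing the illustrative example above, a natively compiled procedure that inserts into the memory-optimized table might look like this (the names are placeholders, and the required options reflect CTP 1 behaviour, so they may change):

    CREATE PROCEDURE dbo.InsertMemoryOrder
        @OrderID INT, @CustomerCode NVARCHAR(10)
    WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
    AS
    BEGIN ATOMIC WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
        -- The body is compiled to native machine code along with the procedure.
        INSERT INTO dbo.MemoryOrders (OrderID, OrderDate, CustomerCode)
        VALUES (@OrderID, SYSDATETIME(), @CustomerCode);
    END;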

The following demonstration shows how to use memory-optimized tables and native stored procedures.

So, should you convert all of your tables and stored procedures to take advantage of this new technology? Probably not (at least, not yet). There are some workloads where the new in-memory capabilities will bring enormous benefits in terms of improved performance; but there are also some cases where current limitations prevent them from being used. Even when an existing disk-based table is fully compatible with a memory-optimized schema, you may find minimal improvement for some i/o workloads.

The important thing to understand when planning to use (or not use) memory-optimized tables is that the performance benefit is not purely a result of storing the data in memory. After all, SQL Server does a pretty good job of caching commonly accessed data in disk-based tables anyway. The crucial difference in the way data in a memory-optimized table is accessed is that no locks or latches are used to support concurrency.

In a disk-based table, if multiple transactions need to access the data concurrently, locks are used to ensure consistency and avoid one transaction’s results being affected by the data modifications of another transaction. Although SQL Server does support row-level locking, transactions that affect multiple rows can quickly lead to locking at the page or table level – causing concurrency issues that affect query performance. This can be especially acute in tables with so-called “hotspots” – for example, a table with a clustered index on an incrementing key value, where all new rows are inserted at the end of the table.

Memory-optimized tables do not use locks to manage concurrency. Instead, a form of row versioning is used to track modifications to rows by multiple transactions – which in any case usually complete so quickly (sub-millisecond) that concurrency clashes are extremely rare. If the i/o pattern for your table typically incurs a lot of locking, then making the table memory-optimized will probably improve performance. If not, then you may not benefit significantly from changing the table. As an example, in the video demo, the 500,000 inserts were wrapped in a single transaction – which, when executed against the disk-based table, incurred locking to support isolation for 500,000 atomic INSERT statements. When creating the demo, I noticed that removing the BEGIN TRAN and COMMIT statements that enclose the loop (so that the inserts were done as 500,000 independent INSERT statements) resulted in a much less significant difference between the time taken to load the disk-based table and the time taken to load the memory-optimized table (typically, the memory-optimized table was around 5-6 seconds quicker).
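For reference, the pattern described above is roughly the following – a hypothetical reconstruction rather than the actual demo script, using the illustrative table from the earlier sketch:

    SET NOCOUNT ON;
    DECLARE @i INT = 1;
    BEGIN TRAN;
    WHILE @i <= 500000
    BEGIN
        -- The SNAPSHOT hint (or the MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT database
        -- option) is needed when a memory-optimized table is accessed inside an
        -- explicit transaction.
        INSERT INTO dbo.MemoryOrders WITH (SNAPSHOT) (OrderID, OrderDate, CustomerCode)
        VALUES (@i, SYSDATETIME(), N'CUST00001');
        SET @i += 1;
    END;
    COMMIT TRAN;

Removing the BEGIN TRAN and COMMIT TRAN lines gives the autocommit variant described above.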

This article is based on the CTP 1 release of SQL Server 2014, and details are liable to change between now and the eventual release of the product. You can download the preview of SQL Server 2014 from http://www.microsoft.com/en-us/sqlserver/sql-server-2014.aspx.

The Transact-SQL code used in this demonstration is available from here.

Wednesday, 3 July 2013

GeoFlow Demo for Course 20467B

Yesterday I posted a demo that Microsoft Certified Trainers can use in course 20467B: Designing Business Intelligence Solutions with Microsoft SQL Server 2012 to show students how to use the Data Explorer add-in for Excel 2013. GeoFlow is another new Excel add-in that you might want to demonstrate in class. It enables users to visualize data that includes geographic and temporal dimensions on an animated map, showing how data points in specific geographic locations change over time. You can get the GeoFlow add-in from the Microsoft Office web site, and you can download the demo steps in a PDF document from here.

Click the thumbnail below to view the demonstration in a new window/tab.

Note that GeoFlow is a preview release at the moment, and is subject to change in the future. The demo is provided as-is, and no support will be provided for it by Microsoft Learning or Content Master.

Enjoy!

 

Update: GeoFlow has been renamed Power Map, and forms part of the Power BI capability being added to Microsoft Office 365.

Tuesday, 2 July 2013

Data Explorer Demo for Course 20467B

Earlier this year we released Microsoft Official Curriculum course 20467B: Designing Business Intelligence Solutions with Microsoft SQL Server 2012. Since then, Microsoft has released a preview of Data Explorer, an add-in for Excel that enables users to browse and query data in a variety of sources. Data Explorer builds on the self-service BI techniques taught in course 20467B, and if you are an instructor delivering the course, you can add value by demonstrating how to use it within the context of a Microsoft-based BI solution. To help you, I’ve put together a simple demo that should be easy to set up and perform in the virtual machine environment for the course. It will probably work best towards the end of module 8 (Designing a Microsoft Excel-Based Reporting Solution), and you can download the steps in a PDF document from here.

Click the thumbnail below to view the demo.

Of course, bear in mind that Data Explorer is a preview release at the moment, and is subject to change in the future. The demo is provided as-is, and no support will be provided for it by Microsoft Learning or Content Master. Nevertheless, I hope you find it useful!

Update: Data Explorer has been renamed Power Query, and forms part of the Power BI capability being added to Microsoft Office 365.

Friday, 5 October 2012

Role Playing Games with SQL Server 2012 Analysis Services

I’m currently working with Microsoft Learning, writing a course on designing BI solutions with SQL Server 2012. Obviously, this is a huge subject to try to cover, and it raises a whole bunch of really interesting design considerations. One of the new things BI developers need to consider with SQL Server 2012 is whether to use a “traditional” multidimensional data model, or whether to use the new-fangled tabular model. In most cases, from an end-user’s perspective (no pun intended), there is little to pick between the two. In fact, in an Excel PivotTable, most users will struggle to spot any difference. However, for the cube developer, there are some significant differences. There are some things you can do easily in multidimensional projects (or indeed, things that are done automatically for you by SQL Server Data Tools) which require (sometimes extremely complex) custom development in a tabular model. Other things are relatively straightforward to accomplish in both models, but require different implementations. An example of the latter is the implementation of role-playing dimensions. You can do this in both models, but there are some differences.

Role-playing dimensions are used to create multiple cube dimensions that are based on the same underlying dimension in the database. The classic example is a date dimension in which each member represents a calendar date. In your cube, you may have a Sales Order measure group that is related to the Date dimension by multiple keys, for example an Order Date and a Delivery Date. Another example might be an Address dimension that is related to a Shipment measure group by both an Origin key and a Destination key. This multi-use of the same underlying dimension means that the dimension table is defined only once, but users can use it to slice the data by different keys – so for example, a user could view sales by order date or by delivery date (or both).
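To see what “role-playing” means at the data level, consider the kind of Transact-SQL you might write directly against the underlying tables – a sketch against the AdventureWorksDW2012 schema used below, in which the same date table is joined twice on different keys:

    SELECT od.CalendarYear    AS OrderYear,
           sd.CalendarYear    AS ShipYear,
           SUM(f.SalesAmount) AS ResellerSales
    FROM dbo.FactResellerSales AS f
    JOIN dbo.DimDate AS od ON f.OrderDateKey = od.DateKey
    JOIN dbo.DimDate AS sd ON f.ShipDateKey = sd.DateKey
    GROUP BY od.CalendarYear, sd.CalendarYear;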

OK, so first, let’s see how a role-playing dimension is implemented in a multidimensional model. I’m using the AdventureWorksDW2012 sample database, which contains a FactResellerSales table that is related to a DimDate table using three key columns – OrderDateKey, ShipDateKey, and DueDateKey. When I create a data source view from the data warehouse tables in the multidimensional project, all three of the relationships are detected as shown here.

image

Using the wizard to create a cube automatically detects the multiple relationships, and results in a single DimDate dimension in the database but three role-playing dimensions in the cube (Order Date, Ship Date, and Due Date) as shown here.

image

The role-playing dimensions are really just references to the same DimDate dimension, but aggregations will be calculated based on each relationship. I’ll go ahead and add a hierarchy to the DimDate dimension:

image

When a user browses the cube in Excel, each of the three role-playing dimensions is available for them to slice the sales data, and all three of these dimensions have the same Calendar Date hierarchy that I defined for the base DimDate dimension:

image

Now let’s compare the experience with a tabular model. When I import the same tables into a tabular model project, the relationships are detected, and I can create the same hierarchy as before in the DimDate table. However, notice that two of the relationships are shown as dotted lines, while one is shown as a solid line.

image

This indicates that although the relationships have all been detected, only one of them is active at any one time. When a user browses the model in Excel, they only see one DimDate dimension, which will show aggregations for the active relationship (in this case, Order Date, but there’s no easy way for the user to tell that from the user interface):

image

The solution to this problem is obvious. So obvious in fact, that it took me a while to figure it out! The answer is to import the same table multiple times, and rename it appropriately:

image

After you’ve imported one copy of the table for each role-playing dimension, you simply delete the inactive relationships from the original table, and create new ones to join the relevant keys in the fact table to the new dimension tables. You’ll also need to create duplicates of any hierarchies you want to appear in all of the dimensions.

image

Now when users browse the model, they’ll see all three dimensions, and as long as you’ve assigned appropriate names to each copy of the table, it should be obvious what each dimension represents.

image

Monday, 23 July 2012

PowerPivot and Power View in Excel 2013

It’s just typical of my job that a few short weeks after the publication of some Microsoft Official Curriculum courses that I’ve spent months working on, Microsoft should choose to make available a preview of the next release of the software on which they are based! As you may know, we recently published courses 10778A and 40009A, both of which make use of the PowerPivot and Power View features in Excel and SharePoint 2010; so it was with a certain amount of trepidation that I installed the preview of Office 2013 to get a first look at the enhancements that have been made.

The first, and most obvious, change is that the PowerPivot add-in for Excel no longer needs to be installed from a separate package. It’s built into Excel and only needs to be enabled, which you do by configuring the COM Add-ins in Excel’s options as shown here.

Picture1

Note that there’s also a Power View add-in – more about that later!

After the PowerPivot add-in has been enabled, users will see the POWERPIVOT tab on the ribbon, as shown here.

Picture2

With this ribbon, you can not only manage a PowerPivot tabular data model for the workbook as you can in Excel 2010, but you can also create calculated fields and KPIs without having to directly edit the model – making the process a little bit more intuitive for information workers.

Clicking Manage opens the PowerPivot window, which is similar to that of the previous release. There are a few enhancements of course, but anyone who has used PowerPivot in Excel 2010 will find themselves in familiar territory. In this case, I’ve opened a PowerPivot workbook I created with Excel 2010 based on data in the AdventureWorksDW SQL Server sample database. The changes in this release meant that I was prompted to allow Excel to update the data model and re-save the workbook, so one thing to be aware of is that you can open (and update) Excel 2010 PowerPivot workbooks in Excel 2013, but after they’ve been updated you won’t be able to open them in Excel 2010. You can see the diagram view of my PowerPivot data model below – note that it includes a hierarchy in the Sales Territory table.

Picture3

After you’ve created the data model in your workbook, you can use it as a source for PivotTables, just as you could in Excel 2010. There are, however, one or two nice enhancements on the new ANALYZE tab of the ribbon that make it easier to do things like create slicers. Another new feature is the ability to create timeline filters that make it easier to analyse data based on chronological periods. To add a timeline, just click Insert Timeline and specify any of the time-based attributes that Excel identifies as having a suitable relationship in the model.

Picture4

After you’ve inserted a timeline, you can use it to filter the data in the PivotTable as shown here.

Picture5

Earlier, I mentioned that Excel 2013 includes a Power View add-in. This enables information workers to create Power View reports from the data model in the workbook (and external data sources). Previously, Power View was only available in SharePoint Server 2010, but in Office 2013 you can use it right there in an Excel workbook.

To create a Power View report from the data model in the workbook, just click Power View on the INSERT tab of the ribbon.

Picture6

If necessary, you’ll be prompted to install Silverlight (which is required by the Power View add-in), and after doing so you’ll be able to create a Power View report from the data in your PowerPivot data model as shown here.

Picture7

Note that you can include hierarchies in a Power View report, which wasn’t supported in the previous release. There are several other enhancements in this release, including support for new data visualizations (such as pie charts), and even visualization of geographical data on a Bing Maps map, as shown here.

Picture8

This short article just highlights a few of the improvements to PowerPivot and Power View in Excel 2013. There are many more new features in Excel, as well as greater ability to share BI capabilities across the enterprise through enhancements in SharePoint 2013 and SQL Server 2012 SP1, which I look forward to exploring in more depth.

Tuesday, 3 July 2012

Matching Data with SQL Server 2012 Data Quality Services

In a previous post, I described how you can use Data Quality Services (DQS) to create a knowledge base for the domains (data columns) used in your business data and use it to cleanse data by correcting invalid or inconsistent values. Data cleansing is, however, only one side of the coin when it comes to DQS. You can also use DQS to perform data matching – in other words, finding records that are potential duplicates of one another and consolidating them to a single surviving record.

When you think about it, the potential for duplicate data entry in most complex business environments is enormous. For example, let’s imagine an e-commerce site where customers need to register before placing orders. It’s perfectly conceivable that a customer who only uses the site occasionally might re-register with slightly different details because they’ve forgotten that they had registered previously or can’t remember their login credentials. Even if the site applies a policy that demands a unique email address for each registration, there’s nothing to stop the same customer registering multiple times with different email addresses. For an individual sales order, the fact that the customer is registered multiple times is inconsequential – as long as the payment and delivery address details are correct, the order can be processed successfully. However, when the company wants to use its data to perform any kind of business intelligence (BI) reporting or analysis that aggregates information per customer, the duplicate entries can lead to misleading results.

To use DQS to match data, you must first add a matching policy to a knowledge base. You can use an existing knowledge base that is also used for data cleansing, or you can create a knowledge base specifically for data matching. In this example, I’m opening an existing knowledge base that contains domains for customer records for the Matching Policy activity.

Picture1

Just as when performing knowledge discovery, I need to map some sample data to the domains defined in the knowledge base. This enables me to test the matching policy against a known data set as I build it, and therefore verify that it successfully identifies known duplicate records. In this case, I’m using data in an Excel workbook as the source for my sample data, but you can also use a table in a SQL Server database.

Picture2

Having mapped sample data to the domains, I can now define the matching rules for my matching policy. You can include multiple rules, and each one uses a set of weighted comparisons of domain values to identify clusters of records that are potential duplicates of one another.

Picture3

Potential matches are determined based on a score that is calculated from the weighted comparisons you define in the rule. Here are the comparisons I’ve used in my Match Customer rule:

Domain           Similarity   Weight   Prerequisite
Birth Date       Exact                 X
Email Address    Exact        20
Postal Code      Exact        10
Country/Region   Exact        10
First Name       Similar      10
Last Name        Similar      10
Street Address   Similar      20
City             Similar      10
State            Similar      10

Note that an exact match of the Birth Date domain is specified as a prerequisite. In other words, only records where the birth date is an exact match will be considered as candidates for a potential duplicate. Prerequisite domains in a matching rule must use the Exact similarity and have no weighting value. All of the other comparisons use an exact or similar match and have weightings, which add up to a total of 100.

Assuming the birth date for the records being compared is a match, DQS then makes the other comparisons defined in the matching rule and adds the specified weighting value for each comparison that is true to produce an overall score. For example, consider two records with identical Birth Date values being compared using the Match Customer rule defined above. If the Email Address domains for both records are an exact match, 20 is added to the score. If the First Name domains are similar (for example, “Rob” and “Robert”), another 10 is added to the score, and so on until all of the comparisons in the rule have been made. The resulting score is then compared to the minimum matching score defined for the matching rule (in this case 80). If the score exceeds the minimum matching score, then the records are considered a match. Multiple records that are considered matches for one another are grouped into a cluster.
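For instance, suppose two records share the same birth date (satisfying the prerequisite), and the comparisons find an exact email address (20), an exact postal code (10), an exact country/region (10), similar first and last names (10 + 10), a similar street address (20), and a similar city (10), but different states (0). The total score of 90 exceeds the minimum matching score of 80, so the two records are placed in the same cluster.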

After you have defined the matching rules, you can use them to find matches in the sample data you mapped earlier. This gives you the opportunity to verify that the rules behave as expected against a known dataset. In this case, the dataset results in a single cluster of matches that includes two records – one for Daniel Garcia and another for Dan Garcia.

Picture4

Now that I’ve defined my matching policy, I can publish the knowledge base and allow the data stewards in my organization to use it for data matching.

To use a knowledge base to perform data matching, create a new data quality project, specify the knowledge base, and specify the Matching activity as shown here.

Picture5

The first step, as it is in any DQS project, is to map the fields in your data source to the domains in the knowledge base. Just as before, the data source can be a table in a SQL Server database or an Excel file. This time, I’m using the Customers table in the Staging SQL Server database.

Picture6

After you’ve mapped the domains, you can start the matching process. When the process is complete, the clusters of matched records are displayed. In this case, there are two clusters, each containing two matches. At this stage, you can choose to reject any matches that you know aren’t duplicates.

Picture7

When the matches have all been identified, you can export the results to a SQL Server table or an Excel file. You can also export survivors (one record from each cluster that is chosen as the correct one) based on one of the following survivorship rules:

  • Pivot record – A record in the cluster that is chosen arbitrarily by DQS.
  • Most complete and longest record – The record that has the fewest null field values and the longest overall data length.
  • Most complete record – The record that has the fewest null fields.
  • Longest record – The record that has the longest overall data length.

Picture8

The exported results include all of the source data, with additional columns that identify each cluster of matching records, the matching rule used, the score calculated for each match, and the pivot record for each cluster.

Picture9

The exported survivors contain all of the non-matching records from the original data source and one version of each matched record based on the survivorship rule you selected. In the following example, I’ve highlighted the surviving records from my matching process.

Picture10

In some cases, you can simply replace the original data set with the survivor records to create a de-duplicated set of records. However, in most business scenarios you’ll need to apply some logic (manual or automated) to handle relationships between duplicate records and other tables. For example, before I eliminate the duplicate customer records identified by the matching process in the above example, I would need to reassign any sales orders that are currently related to those customer records to the surviving records.

Hopefully you’ve found this brief introduction to data matching with DQS useful. To learn more about DQS and its use in a data warehousing solution, you can attend Microsoft Official Curriculum (MOC) course 10777A: Implementing a Data Warehouse with SQL Server 2012.