Wednesday, 14 August 2013

Using the Buffer Pool Extension in SQL Server 2014

SQL Server makes use of physical memory to cache frequently accessed data pages in a buffer pool. This reduces disk I/O and optimizes overall performance. An easy way to improve the performance of I/O-bound workloads is therefore simply to add more physical memory. However, in some cases, adding memory to a server is not possible – for example, because of hardware limitations in the motherboard. Additionally, although the cost of memory continues to drop, on a per-megabyte basis RAM is significantly more expensive than disk devices – including solid state disk (SSD) devices, which provide significant performance improvements over mechanical disks.

SQL Server 2014 introduces the buffer pool extension: a feature that enables you to take advantage of non-volatile storage such as SSD devices, and use them to extend the buffer pool. In some scenarios, this can be a cost-effective way to improve the performance of database workloads when adding more memory to the server is not an option. With the buffer pool extension enabled, SQL Server 2014 uses the non-volatile storage for clean pages (that is, pages whose contents have already been written to disk), making them faster to retrieve than if they had been paged out of the buffer to their disk storage location. By using the buffer pool extension for only clean pages, the risk of data loss in the event of a server or storage device failure is avoided (in the case of storage device failure, the buffer pool extension is automatically disabled).

The following video demonstrates how to enable and disable the buffer pool extension in SQL Server 2014.
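For reference, the operations shown in the video boil down to a single server configuration option. Here's a sketch based on the CTP 1 syntax – the file path and size are placeholders for your own SSD volume:

```sql
-- Enable the buffer pool extension on an SSD volume (path and size are placeholders)
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'E:\SSDCACHE\MyCache.BPE', SIZE = 50 GB);

-- Check the current state of the extension
SELECT path, state_description, current_size_in_kb
FROM sys.dm_os_buffer_pool_extension_configuration;

-- Disable the extension (the cache file is removed)
ALTER SERVER CONFIGURATION SET BUFFER POOL EXTENSION OFF;
```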

This article is based on the CTP 1 release of SQL Server 2014, and details are subject to change between now and the eventual release of the product. For more information about the buffer pool extension in SQL Server 2014, visit

Thursday, 25 July 2013

Power Query for Excel Demo

A couple of weeks ago, I posted a demo that instructors of Microsoft Learning course 20467B can use to demonstrate the Data Explorer add-in for Excel 2013. Since then, Microsoft has rebranded Data Explorer as “Power Query”, and announced that it will form part of the Power BI capabilities in Office 365.

A new version of the add-in is now available here so I’ve updated the demo steps, which you can download from my SkyDrive folder. Other than the renaming of the Data Explorer tab on the ribbon to Power Query, the steps are much the same as they were before, so the following video is still a reasonably good guide to the tool.

Friday, 12 July 2013

Migrating SQL Server Databases to Windows Azure

When the IT industry started getting excited about this thing called “the cloud” a few years ago, there were many (myself included) who were sceptical about the willingness of organizations to abandon their own on-premises infrastructure and start moving business applications to the web. At first, the idea of relying on IT services hosted and managed by some Internet provider seemed dubious at best – sure, individual consumers could use an email server provided by their ISP, but businesses need to manage their own IT. Don’t they?
Then, gradually, we started to make concessions.
…OK, maybe a hosted Exchange Server would reduce some administrative overheads.
…Yes, maybe using a hosting service can provide better availability for some Web applications.
…Alright, using a software-as-a-service solution for CRM might reduce licensing and hardware costs.
Fast forward to today, and we’ve come a long way – to the point where even that stalwart of enterprise software packages, Microsoft Office, is now delivered to 1 in 4 of Microsoft’s enterprise customers in the form of the Office 365 cloud service rather than as on-premises desktop software. Despite the initial doubts of the naysayers, it’s beginning to look like this cloud thing might just catch on after all! Most of the applications we use every day, as consumers and increasingly as employees, are delivered as cloud services that we can consume anywhere and on an increasing array of mobile devices. Organizations are gaining this extended reach, while at the same time reducing overheads for hardware, software licensing, maintenance, and other costs – a win-win scenario if ever I saw one.
However, there’s always been one section of the IT community that is even more conservative than the Finance department. One group of IT professionals that regards even the smallest change with deep suspicion. One last bastion of fierce resistance to newfangled trends. I’m talking, of course, about database administrators. You can move my Exchange Server to the cloud. You can store all of our documents in SharePoint Online. You can even deliver Office applications through a browser. But you’ll have to take my on-premises database from my cold, dead hands!
But even that resistance is beginning to crumble. It makes sense for web-hosted applications to store their data on the web, and as more and more IT services are moved to the cloud, it also makes sense to include traditional business application data stores as a part of that migration. The key issues that need to be addressed are:
  • Can the data be moved to the cloud without compromising security or compliance requirements?
  • Can a hosted solution cope with the volume of data in our databases, and support our transactions without compromising performance?
  • Can a hosted database meet our availability and disaster recovery requirements?
  • Can we migrate some, but not all, of our databases – and still retain centralized, consistent administration of all data stores?
Then, assuming that the answer to those four questions is “yes” (and in many cases, it is), the only remaining question is:
  • Can we migrate our databases with minimal interruption of service, and without breaking our existing applications?
Well, let’s see how Windows Azure shapes up in terms of these critical questions.
Windows Azure is Microsoft’s cloud platform, and it underpins many of the cloud services that the company offers. It also provides Platform-as-a-Service (PaaS) and Infrastructure-as-a-Service (IaaS) solutions that organizations can use to build new solutions, and migrate existing IT services to the cloud. In terms of database offerings, there are two primary options to consider:
  • Windows Azure SQL Database – a PaaS offering that enables you to host data in a (mostly) SQL Server-compatible service without having to worry about configuring and managing hardware or the operating system.
  • SQL Server in a Windows Azure Virtual Machine – an IaaS offering that is exactly what it sounds like – a virtual machine running Windows with SQL Server installed in it.

Windows Azure SQL Database

OK, let’s start with Windows Azure SQL Database. This was formerly known as SQL Azure, and provides many of the data storage and management features of SQL Server without the need to manage the operating system or SQL Server instance. If your database consists of traditional relational data in tables, with views and stored procedures used to provide a layer of abstraction, then Windows Azure SQL Database may well be a good option for you. It can’t support “special” data, such as spatial or XML data types, and there are a few other limitations – see General Guidelines and Limitations (Windows Azure SQL Database) on MSDN for details – but it supports all of the functionality required by a large percentage of typical business application databases.
So, how does it stack up against the questions we asked earlier?
Can the data be moved to the cloud without compromising security or compliance requirements?
It depends on your specific requirements; but network connectivity to the database can be restricted to a specific range of IP addresses, and can be performed over SSL connections. Client requests are authenticated using SQL Server native authentication based on a login and password, and the same permissions-based authorization scheme used in SQL Server is used to control access to tables and other database objects. In terms of compliance policies, you can control the geographic region in which the data center hosting your Windows Azure SQL Database server is located.
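For example, the server-level firewall rules that restrict the allowed IP address range can be managed with Transact-SQL in the master database – a sketch, in which the rule name and address range are hypothetical:

```sql
-- Run in the master database of the Windows Azure SQL Database server.
-- Allows connections only from the specified IP address range.
EXEC sp_set_firewall_rule
    @name = N'HeadOfficeRange',
    @start_ip_address = '203.0.113.1',
    @end_ip_address = '203.0.113.254';

-- Review the current rules
SELECT * FROM sys.firewall_rules;
```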
Can a hosted solution cope with the volume of data in our databases, and support our transactions without compromising performance?
You specify the size of your database as you create it, but it can grow to a maximum of 150 GB. Additionally, you can use federations to partition data across multiple databases to increase scalability and performance.
Can a hosted database meet our availability and disaster recovery requirements?
Windows Azure provides built-in resiliency by internally replicating your database across three redundant storage locations within the data center where your server is hosted. You can back up a SQL Database by copying it to another SQL Database.
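The copy itself is a single Transact-SQL statement, executed against the master database of the destination server – a sketch, in which the server and database names are hypothetical:

```sql
-- Start an asynchronous, transactionally consistent copy
CREATE DATABASE AdventureWorksCopy AS COPY OF myserver.AdventureWorks;

-- Monitor the progress of the copy
SELECT d.name, c.percent_complete
FROM sys.dm_database_copies AS c
JOIN sys.databases AS d ON c.database_id = d.database_id;
```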
Can we migrate some, but not all, of our databases – and still retain centralized, consistent administration of all data stores?
Because SQL Database is a PaaS offering, much of the physical server administration you would typically need to perform for a SQL Server instance is handled for you by Windows Azure. For logical administration tasks, such as managing users or creating database objects, you can use SQL Server Management Studio to connect to Windows Azure SQL Database – enabling you to manage on-premises SQL Server instances and cloud-based Windows Azure SQL Database instances in the same tool.
Can we migrate our databases with minimal interruption of service, and without breaking our existing applications?
Well, let’s take a look at this demonstration and find out:

As you can see, it’s really pretty straightforward to migrate an on-premises SQL Server database to Windows Azure SQL Database. If your database doesn’t depend on any features of SQL Server that aren’t supported by Windows Azure SQL Database, and you’re happy to let Windows Azure look after the physical configuration of your database server, then Windows Azure SQL Database is a good option for you.

SQL Server in a Windows Azure Virtual Machine

So, what about applications where you need to support SQL Server capabilities that aren’t available in Windows Azure SQL Database? Or where you specifically want control over the operating system and server-level configuration of your database server?
In this case, provisioning a virtual machine in Windows Azure might be a better option. There are a number of pre-defined virtual machine images, some of which include an installation of SQL Server; and if none of them suits you, there’s always the option to create your own and install whatever software you require. So how does this option meet our data migration requirements?
Can the data be moved to the cloud without compromising security or compliance requirements?
As with Windows Azure SQL Database, you can choose the geographic region of the data center where your virtual machine will be hosted. Access to the server is controlled through an endpoint that you must define for your Windows Azure virtual machine, and all network connectivity to the virtual machine can be restricted by using Windows Firewall. You can use Windows Azure virtual networking to integrate virtual machines in Windows Azure with your corporate Active Directory infrastructure, and use Windows authentication to connect to SQL Server. Alternatively, you can use SQL Server native authentication through logins and passwords, or even set up certificate-based authentication – exactly the same as with an on-premises instance of SQL Server. Additionally, you can make use of SQL Server’s security-related capabilities such as transparent database encryption and auditing.
Can a hosted solution cope with the volume of data in our databases, and support our transactions without compromising performance?
When you provision a virtual machine in Windows Azure, you can specify the number of virtual cores and the amount of memory allocated to the VM. At the moment, the largest VM available has 8 cores and 56 GB of memory, but I’d expect that to get larger over time. The VM uses Windows Azure storage for its virtual hard disks, and you can add multiple VHDs and use filegroups to stripe data across them. This technique has been shown to improve IOPS performance.
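As a sketch of the striping technique, with two data disks attached to the VM (the database name and drive letters here are hypothetical), a filegroup can spread a database's data files across both disks:

```sql
-- Create a filegroup with one data file on each attached VHD
ALTER DATABASE SalesDB ADD FILEGROUP StripedData;

ALTER DATABASE SalesDB
ADD FILE
    (NAME = 'SalesData1', FILENAME = 'F:\Data\SalesData1.ndf', SIZE = 10GB),
    (NAME = 'SalesData2', FILENAME = 'G:\Data\SalesData2.ndf', SIZE = 10GB)
TO FILEGROUP StripedData;
```

SQL Server fills the files in a filegroup proportionally, so I/O for objects created on this filegroup is spread across both disks.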
Can a hosted database meet our availability and disaster recovery requirements?
As with all Windows Azure storage blobs, the VHDs for the VM are replicated across three redundant physical storage devices in the data center. Additionally, you can use SQL Server HA capabilities, such as AlwaysOn Availability Groups, to protect against failure of a VM. You can back up databases in a Windows Azure VM just as you would for an on-premises instance of SQL Server, and use SQL Server Agent to automate backup tasks on a scheduled basis.
Can we migrate some, but not all, of our databases – and still retain centralized, consistent administration of all data stores?
SQL Server in a virtual machine in Windows Azure is still just SQL Server. You can use SQL Server Management Studio to connect to it, and you can use all of the same management tools and agents you use for your on-premises database servers.
Can we migrate our databases with minimal interruption of service, and without breaking our existing applications?
Once again, here’s a demonstration:

Note that this demonstration is based on pre-release software, and may not reflect what actually ships with SQL Server 2014. However, it’s clear that the intention is to include a simple, wizard-based tool that will help you easily migrate on-premises SQL Server databases to Windows Azure virtual machines.


Migration of IT services to the cloud is inevitable. There are simply too many cost, scalability, and mobility advantages to justify not doing it. However, I don’t think it will happen in one big mass movement – and in particular, I think corporate databases will be among the last elements to be migrated. For at least a while, probably many years, we’ll be living in a hybrid world where some data is managed on-premises, and other data is moved to the cloud. To support that scenario, we need tools and technologies that make it easy to move data from one place to the other, and to manage it consistently wherever it’s hosted.
The combination of SQL Server on-premises, Windows Azure SQL Database, and SQL Server in a Windows Azure virtual machine manages to pull this trick off well. With similar merging of private and public cloud network infrastructure support in Windows Server and System Center, the lines between “the cloud” and “the enterprise” are blurring to the point where, from an IT management perspective, it really doesn’t matter where a service is physically located.
If you want to learn more about Windows Azure database options, visit

Friday, 5 July 2013

What the Heck is Hekaton?

SQL Server 2014 introduces a new in-memory OLTP capability that was previously known by its codename, “Hekaton”. The technology introduces two new concepts to SQL Server: memory-optimized tables and native stored procedures. This article explores these features and provides a simple demonstration of how to use them.

The idea of optimizing data access performance by using in-memory storage is not new. SQL Server has always used caching to keep recently accessed data in memory, and recent releases have seen the addition of in-memory technology for large-volume data analytics (PowerPivot and tabular models in Analysis Services) and high-performance table indexes that primarily benefit data warehouse workloads (columnstore indexes). What’s new in SQL Server 2014 is the ability to optimize an entire table for in-memory storage, effectively eliminating disk I/O for CRUD operations and massively improving query performance.

Memory-Optimized Tables

Memory-optimized tables are tables that you define using CREATE TABLE statements, in a similar fashion to traditional disk-based tables. However, memory-optimized tables are different from disk-based tables in the following ways:

  • The CREATE TABLE statement is used to generate a C struct, which is in turn compiled into a DLL and loaded into memory.
  • All data for the table is stored in memory, and all operations on the data occur in memory. By default, memory-optimized tables are durable (so they’re persisted to disk in order to survive restarts and support high-availability); but when the database is online, the table is always accessed directly in memory with no need to read pages from disk.
  • Columns in memory-optimized tables are indexed using hash indexes (range indexes may be supported in a later build), in which the result of hashing the indexed value determines the in-memory “bucket” in which the row is stored. Rows with the same hashed value are stored as a linked list within the bucket.
  • Table data is persisted to disk as a stream, not in 8K pages like a traditional table. The data must be stored in a filegroup that is created with the CONTAINS MEMORY_OPTIMIZED_DATA option. Indexes are not persisted, and will be regenerated in the event of a restart.
  • Some data types – notably text, image, and nvarchar(max) – are not supported. Similarly, some features such as identity columns and foreign-key constraints cannot be used in memory-optimized tables.
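Putting those pieces together, creating a durable memory-optimized table looks something like this sketch, based on the CTP 1 syntax (the database, filegroup, file path, and table names are all hypothetical):

```sql
-- The database needs a filegroup for memory-optimized data
ALTER DATABASE MyDatabase
ADD FILEGROUP imoltp_fg CONTAINS MEMORY_OPTIMIZED_DATA;

ALTER DATABASE MyDatabase
ADD FILE (NAME = 'imoltp_data', FILENAME = 'C:\Data\imoltp_data')
TO FILEGROUP imoltp_fg;

-- A durable memory-optimized table with a hash index on its key column
CREATE TABLE dbo.ShoppingCart
(
    CartID INT NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    CustomerID INT NOT NULL,
    CreatedDate DATETIME2 NOT NULL
)
WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);
```

Note the BUCKET_COUNT option on the hash index – it sets the number of in-memory buckets described above, and should be sized relative to the number of unique key values you expect.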

Native Stored Procedures

Memory-optimized tables can co-exist with disk-based tables, and you can execute Transact-SQL queries that contain joins between disk-based tables and memory-optimized tables. In fact, you can use Transact-SQL to query memory-optimized tables just like any other table, so you can improve the performance of some workloads by changing existing disk-based tables to memory-optimized tables without breaking existing applications that query them. The ability to use regular Transact-SQL to query memory-optimized tables is provided by an interop layer in the SQL Server engine that does the necessary work to convert Transact-SQL statements into C code that can access the in-memory data structures in the compiled DLL for the table.

However, if your application code only needs to access data in memory-optimized tables, you can further improve performance by using native stored procedures. Native stored procedures are created using the familiar CREATE PROCEDURE statement to define the Transact-SQL statements you want to execute. The code is then translated into C and compiled into a DLL, just like a memory-optimized table. The DLL is then loaded into memory, and since the instructions it contains are now compiled as native machine code, execution performance is greatly improved. There are some limitations in this release, and only the most commonly used Transact-SQL statements and functions are supported in native stored procedures; but for a large percentage of common database workloads, you should find that using memory-optimized tables and native stored procedures can significantly improve application performance.
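A native stored procedure declaration adds a few required options to the familiar syntax. Here's a sketch based on the CTP 1 syntax, assuming a hypothetical memory-optimized table named dbo.ShoppingCart:

```sql
-- Native compilation, schema binding, an execution context, and an
-- atomic block with isolation and language settings are all required.
CREATE PROCEDURE dbo.AddCartItem
    @CartID INT, @CustomerID INT
WITH NATIVE_COMPILATION, SCHEMABINDING, EXECUTE AS OWNER
AS
BEGIN ATOMIC
WITH (TRANSACTION ISOLATION LEVEL = SNAPSHOT, LANGUAGE = N'us_english')
    INSERT INTO dbo.ShoppingCart (CartID, CustomerID, CreatedDate)
    VALUES (@CartID, @CustomerID, SYSDATETIME());
END;
```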

The following demonstration shows how to use memory optimized tables and native stored procedures.

So, should you convert all of your tables and stored procedures to take advantage of this new technology? Probably not (at least, not yet). There are some workloads where the new in-memory capabilities will bring enormous benefits in terms of improved performance; but there are also some cases where current limitations prevent them from being used. Even when an existing disk-based table is fully compatible with a memory optimized schema, you may find minimal improvement for some i/o workloads.

The important thing to understand when planning to use (or not use) memory-optimized tables is that the performance benefit is not purely a result of storing the data in memory. After all, SQL Server does a pretty good job of caching commonly accessed data in disk-based tables anyway. The crucial difference in the way data in a memory-optimized table is accessed is that no locks or latches are used to support concurrency. In a disk-based table, if multiple transactions need to access the data concurrently, locks are used to ensure consistency and avoid one transaction’s results being affected by the data modifications of another transaction. Although SQL Server does support row-level locking, transactions that affect multiple rows can quickly escalate locking to page-level – causing concurrency issues that affect query performance. This can be especially acute in tables with so-called “hotspots” – for example, a table with a clustered index on an incrementing key value, where all new rows are inserted at the end of the table.

Memory-optimized tables do not use locks to manage concurrency. Instead, a form of row-versioning is used to track modifications to rows by multiple transactions; which in any case usually happen so quickly (sub-millisecond) that concurrency clashes are extremely rare. If the I/O pattern for your table typically incurs a lot of locking, then making the table memory-optimized will probably improve performance. If not, then you may not benefit significantly from changing the table. As an example, in the video demo, the 500,000 inserts were wrapped in a single transaction – which, when executed against the disk-based table, incurred locking to support isolation for 500,000 atomic INSERT statements.
When creating the demo, I noticed that removing the BEGIN TRAN and COMMIT statements that enclose the loop (so that the inserts were done as 500,000 independent INSERT statements) resulted in a much less significant difference between the time taken to load the disk-based table and the time taken to load the memory-optimized table (typically, the memory-optimized table was around 5-6 seconds quicker).

This article is based on the CTP 1 release of SQL Server 2014, and details are liable to change between now and the eventual release of the product. You can download the preview of SQL Server 2014 from

The Transact-SQL code used in this demonstration is available from here.

Wednesday, 3 July 2013

GeoFlow Demo for Course 20467B

Yesterday I posted a demo that Microsoft Certified Trainers can use in course 20467B: Designing Business Intelligence Solutions with Microsoft SQL Server 2012 to show students how to use the Data Explorer add-in for Excel 2013. GeoFlow is another new Excel add-in that you might want to demonstrate in class. It enables users to visualize data that includes geographic and temporal dimensions on an animated map, showing how data points in specific geographic locations change over time. You can get the GeoFlow add-in from the Microsoft Office web site, and you can download the demo steps in a PDF document from here.

Click the thumbnail below to view the demonstration in a new window/tab.

Note that GeoFlow is a preview release at the moment, and is subject to change in the future. The demo is provided as-is, and no support will be provided for it by Microsoft Learning or Content Master.



Update: GeoFlow has been renamed Power Map, and forms part of the Power BI capability being added to Microsoft Office 365.

Tuesday, 2 July 2013

Data Explorer Demo for Course 20467B

Earlier this year we released Microsoft Official Curriculum course 20467B: Designing Business Intelligence Solutions with Microsoft SQL Server 2012. Since then, Microsoft has released a preview of Data Explorer, an add-in for Excel that enables users to browse and query data in a variety of sources. Data Explorer builds on the self-service BI techniques taught in course 20467B, and if you are an instructor delivering the course, you can add value by demonstrating how to use it within the context of a Microsoft-based BI solution. To help you, I’ve put together a simple demo that should be easy to set up and perform in the virtual machine environment for the course. It will probably work best towards the end of module 8 (Designing a Microsoft Excel-Based Reporting Solution), and you can download the steps in a PDF document from here.

Click the thumbnail below to view the demo.

Of course, bear in mind that Data Explorer is a preview release at the moment, and is subject to change in the future. The demo is provided as-is, and no support will be provided for it by Microsoft Learning or Content Master. Nevertheless, I hope you find it useful!

Update: Data Explorer has been renamed Power Query, and forms part of the Power BI capability being added to Microsoft Office 365.

Sunday, 3 March 2013

A Lap Around HDInsight

I’m currently working with Microsoft’s Patterns and Practices team, researching and documenting best practices guidance for big data analysis with HDInsight. For those of you who may not know, HDInsight is Microsoft’s distribution of Hadoop – an open source platform for analysis of huge volumes of complex data; and as a database guy with a background in BI, I’m obviously interested in figuring out the best ways to use HDInsight, and how it integrates with the existing SQL Server technologies that I’m used to working with. The results of our work will be published soon, but in the meantime, I thought it might be useful to share some basic steps to help you get started with HDInsight.

Before we start however, it might be useful to explain a bit about HDInsight, Hadoop, and Big Data. If you’re involved in database development or BI, you can hardly have missed the growing buzz about Big Data. There seem to be a lot of definitions being bandied around (and products being advertised), and it can all be very confusing – so, here’s my take on it.

  • “Big Data” is a catch-all term that’s popularly used to describe data that you want to analyze, but is difficult to process with traditional database technologies. The difficulty is usually caused by one or more of the so-called three V’s:
    • Volume: The sheer amount of data makes analysis with a traditional relational database impractical.
    • Variety: The data is in difficult-to-process formats, and may be completely unstructured.
    • Velocity: The data is generated at a rapid pace and must be processed as close to real-time as possible.
  • Hadoop is an Apache open source project for a big data analysis solution that is based on distributing the data processing across multiple server nodes in a cluster. The distributed processing uses a technique pioneered by those bright folks at Google, called Map/Reduce. The data in a Hadoop cluster is stored on a distributed file system called HDFS, which enables each node to work on a subset of the data in parallel (the Map phase), and then the results from each node are consolidated into a single result (the Reduce phase). You can write the Map/Reduce code to process your data in Java (which is the native language for Hadoop), or in other languages supported by the Hadoop distribution you’re using (such as JavaScript, Python, or C#). In addition to the core Hadoop project, there are a number of other projects that add value to a Hadoop cluster. In particular, a project named Pig provides an abstraction layer over Map/Reduce that enables you to write and execute data processing workflow steps in a language named Pig Latin, which are then translated to Map/Reduce jobs by the Pig interpreter. Another significant Hadoop related project is Hive, which provides an abstraction over Map/Reduce in which data in HDFS files can be represented as tables and queried with a SQL-like language named HiveQL.
  • HDInsight is the brand name of the Microsoft distribution of Hadoop. It’s an open source distribution, based on the same Hadoop core as other distributions, and available as an on-premises application for Windows Server, or as a cloud service hosted in Windows Azure.

Provisioning a Cluster

The first thing you need to do is provision an HDInsight cluster, and this is your first major decision point. You can opt to install a local, on-premises HDInsight cluster on Windows Server via the Microsoft Web Platform Installer, or you can create a cloud-based cluster on Windows Azure. At the time of writing, HDInsight is available as a Windows Azure preview service at

To create an HDInsight cluster, you need to enable preview services in your Windows Azure subscription, and subscribe to the HDInsight preview. It usually takes a few days for your subscription activation to arrive by email, and then you can return to the Windows Azure portal to provision your cluster. The Windows Azure-based HDInsight service uses a Windows Azure blob store to host the HDFS file system structures used by the cluster, so you need to create a Storage account for the cluster. The storage account must be co-located with the HDInsight cluster nodes, which in the current preview means that you must create your storage account in the East US data center. For this example, I’ve created a storage account named graemesplace:


After you’ve created your storage account, you can create an HDInsight cluster. The Quick Create option lets you specify a cluster name, choose from a range of pre-configured cluster sizes, specify a password for the default Admin account, and associate the storage account you created earlier with the cluster. You can also use the Custom Create option to take more control over the cluster, for example specifying a user name other than Admin. I’ve used the Quick Create option to create a cluster named graemesplace as shown here:


The DNS name you assign will determine the address for the cluster in the domain. I’ve specified “graemesplace”, so the address of my cluster is

The provisioning process takes several minutes, and when your cluster is ready you can view it in the Windows Azure portal as shown here:


At the bottom of the HDInsight page in the portal, there are links to connect to the cluster (via a remote desktop session) and to manage the cluster. This second option is accomplished through a dashboard page, which is shown here:


Obtaining and Loading Data for Analysis

So now I have an HDInsight cluster, I’m ready to analyze some data. Being British, I am of course obsessed with the weather, so I plan to do some analysis of historic weather in my home town of Dunfermline, Scotland. To obtain this data, I’m going to use the UK Met Office Weather dataset that is available in the Windows Azure Datamarket, as shown here. If you haven’t explored the Azure Datamarket, it’s a one-stop shop for data from government organizations and private commercial data publishers, and includes a pretty eclectic mix of datasets that you can purchase and download. Of course, like any true Scotsman, I’ve chosen the Met office weather data because it’s free!


The Met office data includes weather data from all over the UK, but since I’m only interested in weather statistics for Dunfermline, I’ll filter my search by the site code field (which uniquely identifies each weather station). I can look up the site code for Dunfermline in the Site dataset as shown here.


Now I can filter the data from the DailyForecast dataset based on the ID for the Dunfermline site as shown here.


Note the URL for the current query. This is a REST-based URI that can be used to download the data from a client application. Note also that you can click Show to display a secure primary account key, which is used to authenticate requests from client applications. One such client application is Excel, in which you can download data from the Windows Azure Datamarket as shown here.


To download the data, I just specify the URL and account key from the query page in the Azure Datamarket site, and (after a while) the data appears in the worksheet. I can then remove any columns I don’t need as shown here.


Note that the data includes multiple measurements for each day (taken at different times). To analyze the data properly, I plan to use HDInsight to calculate an average figure for each day. I’m happy that I have the columns I need, so I just remove the headings and save the data as a comma-delimited text file to be uploaded to my HDInsight cluster.

Since the storage for the HDInsight cluster is actually a Windows Azure Storage Vault (ASV), I can upload the data using any tool that can connect to the Azure blob store. However, to keep things simple, I’m going to use the interactive console in the HDInsight dashboard and use the fs.put() JavaScript command to display a dialog box that enables you to upload the data file as shown here:


You can also use the interactive console to navigate the HDFS file system using UNIX-style commands prefixed with a “#” character. For example, to display a directory listing you can type #ls. HDFS has a root folder (indicated by “/”), and each user has a home folder under the /user folder, so my home folder on the cluster is /user/admin. By default, your home folder is the current folder when you connect; and as a shortcut, you can use a “.” character to indicate your home folder, so I can view the contents of a subfolder named weather in my home folder by using either of the following commands:

#ls /user/admin/weather

#ls ./weather

So now I’ve uploaded the comma-delimited text file containing the weather data to ./weather/data.txt, and I’m ready to process it.

Understanding Map/Reduce

Fundamentally, all processing on HDInsight takes the form of a Map/Reduce job. Each node performs a Map job on a subset of the data, and then a Reduce job is performed by a designated node to consolidate the results of the Map jobs and produce a single output file. Natively, Map/Reduce jobs are implemented in Java and compiled to .jar executables, which are uploaded and executed as jobs in the cluster. However, you can also use other languages to implement Map/Reduce code. I’m not a Java programmer (indeed, some would say I’m not a programmer!), so I’ve limited myself to creating a simple example Map/Reduce program in JavaScript as shown here:


The code consists of a map function and a reduce function. The map function is run on all cluster nodes, with each node feeding its subset of the source data to the function one line at a time through the value parameter. The map code splits the line of data into an array, using a comma as a delimiter, and writes a key/value pair, consisting of the date and temperature fields from the source data, to the context object.

The reduce function operates on each distinct key generated from the map operations, and iterates through the values associated with that key. The key value is the date, and the values are the temperature readings for that date. The code adds all of the temperature values, and then divides them by the number of readings for that date to calculate an average value. The consolidated results for each key are then written as a key/value pair, so we should end up with a single date and average temperature for each date value in the source data.
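A script along the lines just described might look something like the following sketch. The context and values objects approximate the interface used by HDInsight’s JavaScript samples and may differ in the real runtime, and the runJob harness at the end is a hypothetical stand-in for the cluster, included only so the logic can be exercised locally:

```javascript
// Sketch of a JavaScript Map/Reduce script for HDInsight (API approximated).
var map = function (key, value, context) {
    // Each line of comma-delimited input has the form: day,date,temp,wind
    var fields = value.split(",");
    // Emit the date as the key and the temperature reading as the value.
    context.write(fields[1], fields[2]);
};

var reduce = function (key, values, context) {
    // 'values' iterates over every temperature reading for one date.
    var sum = 0;
    var count = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
        count++;
    }
    // Emit a single average temperature per date.
    context.write(key, Math.round(sum / count));
};

// Hypothetical local harness (illustration only): groups map output by key
// and feeds each group to reduce, mimicking the shuffle phase of a real job.
function runJob(lines) {
    var groups = {};
    lines.forEach(function (line) {
        map(null, line, {
            write: function (k, v) { (groups[k] = groups[k] || []).push(v); }
        });
    });
    var results = {};
    Object.keys(groups).forEach(function (k) {
        var i = 0;
        var iter = {
            hasNext: function () { return i < groups[k].length; },
            next: function () { return groups[k][i++]; }
        };
        reduce(k, iter, {
            write: function (key, avg) { results[key] = avg; }
        });
    });
    return results;
}
```

Running the harness over a few sample lines yields one key/value pair per distinct date, which is exactly the shape of output the reduce phase produces on the cluster.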

The HDInsight dashboard provides a handy Create Job tile that you can use to run a Map/Reduce job that has been implemented as a Java .jar executable. However, since I’ve chosen to implement my code in JavaScript, I’ll need another way to execute it. The easiest way to run the job is to upload the JavaScript file (in this case to ./weather/getdailytemp.js) and  use Pig to call it, which I can do by entering the following command in the interactive JavaScript console:

pig.from("/user/admin/weather/data.txt").mapReduce("/user/admin/weather/getdailytemp.js", "date, avgtemp:long").to("/user/admin/weather/avgdailytemp");

This command tells Pig to run a Map/Reduce job on the weather data by using the JavaScript file I’ve uploaded and produce results with two columns: a date column (which in the absence of an explicit data type will be a chararray) and an avgtemp column (which is declared as a long integer). The results are then stored as a file in the ./weather/avgdailytemp folder.

When the job has finished (which can take some time), viewing the contents of the output directory reveals an output file (in this case named part-m-00000) and a file named _SUCCESS (which is simply a flag to indicate the outcome of the job).


You can use the #cat command to view the contents of the results file as shown here:


Using Pig

While I could write custom Map/Reduce code for all of the data processing my analysis needs, it would require considerably more programming skills than I currently have, and would be extremely time-consuming. Fortunately, I can use Pig to do more than just run my own Map/Reduce code; I can also use it to create data processing workflows in a high-level language called Pig Latin, and have Pig translate those workflows to Map/Reduce code for me.

You can execute Pig Latin statements interactively in a command line shell called (I kid you not) Grunt, or you can create a text file that includes a sequence of Pig Latin commands to be executed as a script. The easiest way to get to the Grunt shell is to use the Connect link in the Windows Azure portal or the Remote Desktop shortcut in the HDInsight dashboard to open a remote desktop session with the cluster (logging in with the credentials you specified when provisioning the cluster), and then using the Hadoop Command Line shortcut on the desktop to run pig from the c:\apps\dist\pig-0.9.3-SNAPSHOT\bin folder as shown here.


When using Pig to process data, typically you execute a sequence of Pig Latin commands, each of which defines a relation. You can think of a relation as being a result set, often tabular in structure. The first command often uses the Pig Latin LOAD statement to populate a relation from a data file, and subsequent commands create new relations by performing operations on the relations created by previous commands. For example, I can use the following sequence of Pig Latin commands to process the weather data in ./weather/data.txt:

Forecasts = LOAD './weather/data.txt' USING PigStorage(',') AS (day, date, temp:long, wind:long);
GroupedForecasts = GROUP Forecasts BY date;
GroupedAvgs = FOREACH GroupedForecasts GENERATE group, AVG(Forecasts.temp) AS avgtemp, AVG(Forecasts.wind) AS avgwind;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) as date, avgtemp, avgwind;

The first command loads the comma-delimited weather data to create a relation named Forecasts with four columns (day, date, temp, and wind).

The second command creates a new relation named GroupedForecasts that contains the Forecasts relation grouped by date.

The third command creates a relation named GroupedAvgs that calculates the average temp and wind values for each group in GroupedForecasts.

The fourth command creates a relation named AvgWeather, which flattens the date group in the GroupedAvgs relation to create a row for each date with date, avgtemp and avgwind columns.

Pig doesn’t actually generate any Map/Reduce jobs until you call either a DUMP statement (to display the contents of a relation in the console) or a STORE statement (to save a relation as a file). For example, executing the command DUMP AvgWeather after the four commands above would create the necessary Map/Reduce jobs to process the workflow, and display the results in the console window. The command STORE AvgWeather INTO './weather/dailyaverages' would store the results in the ./weather/dailyaverages folder.

To run the commands as a script, I simply save them in a file named ProcessWeather.pig and run them from the command line using the pig program, as shown here:
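For reference, the script is simply the four statements shown earlier followed by a STORE:

```pig
Forecasts = LOAD './weather/data.txt' USING PigStorage(',') AS (day, date, temp:long, wind:long);
GroupedForecasts = GROUP Forecasts BY date;
GroupedAvgs = FOREACH GroupedForecasts GENERATE group, AVG(Forecasts.temp) AS avgtemp, AVG(Forecasts.wind) AS avgwind;
AvgWeather = FOREACH GroupedAvgs GENERATE FLATTEN(group) as date, avgtemp, avgwind;
STORE AvgWeather INTO './weather/dailyaverages';
```

Saved as ProcessWeather.pig, it can be run from the Hadoop command line with the pig program.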


The output from the Pig script is similar to that of the Map/Reduce job I ran earlier. When the script has completed, the ./weather/dailyaverages folder contains a file named part-r-00000, a _SUCCESS file, and a _logs folder. Using #cat to view the part-r-00000 file shows the results of the processing:


Using Hive

Until now, all of the data processing we’ve done with Map/Reduce code and Pig Latin has generated results in text files. You can view the text files in the interactive console or download them to a client application such as Excel, but it would be good to be able just to consume results from HDInsight the same way as you do from a relational database – by executing queries against tables. That’s exactly what Hive is designed to let you do.

With Hive, you can create tables that get their data from files in HDFS folders. The important thing to understand here is that Hive tables don’t contain data; rather, they provide a metadata layer that projects a schema onto data in underlying files. What does that mean? Well, Hive (in common with Pig and all Map/Reduce code) uses an approach called schema on read, in which a table is associated with a folder in the HDFS folder structure (actually, a table could be partitioned across multiple folders, but let’s not get ahead of ourselves!). When you query a table, the data in any files found in the related folder is read, and only then are the columns and data types in the table definition used to apply structure to the data. You can create a Hive table based on an empty folder, and then copy files to that folder (either source files or the results of Map/Reduce or Pig processing). When you query the table, a Map/Reduce job is generated to apply the table schema to the data in the files you have placed in the folder.

For example, I could create a table for average weather readings by using the following HiveQL statement:

CREATE TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)

If you’re used to working with SQL Server, or any other SQL-based database, this should look pretty familiar. However, because the data will actually be retrieved at query-time from underlying files, there are a few additional options that you typically don’t find in relational databases. For example, you can specify the delimiter you want Hive to use when mapping data in the files to columns in the table by including a ROW FORMAT DELIMITED clause as shown in the following screenshot of the Hive interactive console. In this case, the fields are tab-delimited (which matches the output generated by the Pig script I ran earlier):
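For reference, the statement with the delimiter clause would be along these lines (a sketch; the tab delimiter matches the tab-separated output of the Pig script):

```sql
CREATE TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```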


The folder that this table maps to is automatically created in the /hive/warehouse folder in the HDFS file system. If you prefer not to use the default location, you can specify an alternative location as shown in the following code:

CREATE TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)
STORED AS TEXTFILE LOCATION '/user/graeme/weather/avgweather'

The ability to use a non-default location is useful if you want processes outside of Hive to be able to access the data files, or if the files already exist and you just want to create a table over them. However, you should be aware that by default, Hive treats all tables as INTERNAL tables, which means that it tightly binds the lifetime of the folder to that of the table. Put more simply, if you delete a table by using the DROP TABLE command, the folder associated with the table (along with all the files it contains) is deleted. If you want the folder to exist independently of the table, you must create it as an EXTERNAL table as shown here:

CREATE EXTERNAL TABLE avgweather
(weatherdate STRING,
avgtemp FLOAT,
avgwind FLOAT)
STORED AS TEXTFILE LOCATION '/user/graeme/weather/avgweather'

After you have created a table, you can populate it by simply adding files to the folder with which the table is associated. You can do this by using the #cp (copy), #mv (move), #copyFromLocal, or #moveFromLocal HDFS commands in the interactive JavaScript console, or by using the LOAD HiveQL command in the interactive Hive window as shown here:


This statement loads the data by moving (not copying) the part-r-00000 file generated by the Pig script earlier from the dailyaverages folder to the avgweather folder for the table.
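Assuming the paths used earlier in this article, the LOAD statement would be along these lines:

```sql
LOAD DATA INPATH '/user/admin/weather/dailyaverages/part-r-00000' INTO TABLE avgweather;
```

Because LOAD DATA moves rather than copies the file, the dailyaverages folder no longer contains it afterwards.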

Now that the folder associated with the table contains some data, you can query the table to project its schema onto the data, as shown here:
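The query itself can be as simple as a SELECT that returns every row, for example:

```sql
SELECT * FROM avgweather;
```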


Using Hive from Excel

One of the main benefits of Hive is that HDInsight includes an ODBC driver for Hive that enables client applications to connect to HDInsight and execute HiveQL queries, just like they can with SQL Server (or any other ODBC-compliant database server). To install the ODBC driver, click the Downloads tile on the HDInsight dashboard and download the appropriate version for your system (32-bit or 64-bit). Note that the driver download also installs an add-in for Excel, so be sure to choose the CPU architecture that matches your Office installation (and remember that you can install 32-bit Office on 64-bit Windows – I’m just saying!):


After you’ve installed the driver, you can use it from Excel (and other ODBC-aware tools like SQL Server Integration Services, SQL Server Analysis Services (in tabular mode), and SQL Server Reporting Services). Installing the driver also installs a Hive Pane add-in for Excel, which makes it easy for users to connect to HDInsight and create HiveQL queries. However, I experienced a few issues when using this with Excel 2013 (and it seems I’m not alone), so rather than try to create connections explicitly in Excel, I recommend you create a system data source name (DSN) and use that – for some reason it seems a little more stable in this pre-release build!

To create a DSN, go to Administrative Tools and open Data Sources (ODBC). Then create a DSN that uses the Hive ODBC driver and references your HDInsight cluster, as shown here:


Now you’re ready to connect to Hive on your HDInsight cluster from Excel. For example, if you’re using Office Professional Plus 2013, you can use the PowerPivot add-in to connect to an ODBC source as shown here:


Then select the DSN you created earlier and enter the required credentials:


You can use the Table Import Wizard to select a Hive table, or enter a HiveQL query. In this case, I’ve just selected the avgweather table I created earlier:


After the table is imported, you can edit it in the data model, for example to add calculated columns based on DAX expressions or change data types. In this case, I’ve changed the data type of the weatherdate column to Date.


Now that the table has been imported into the data model of the workbook, I can use it to analyze and visualize the data it contains. For example, I can create a Power View report that shows the temperature and wind measurements as a line chart across the entire time period contained in the table.


I’ve only really scratched the surface of HDInsight in this article, so be sure to keep an eye on the Patterns and Practices site for more detailed guidance. In the meantime, if I’ve inspired you to experiment with HDInsight for yourself, the best place to start is