Who's driving this car? At first glance it appears that as a developer, you have little if any control over how MapReduce behaves. In some regards this is an accurate assessment. You have no control over when or where a MapReduce job runs, what data a specific map job will process or which reducer will handle the map's intermediate output. Feeling helpless yet? Don't worry: the truth is that despite all that, there are a number of ninja techniques you can use to take control of how data moves...
In my last post (HERE) we talked about troubleshooting and resolving issues with problematic MDX queries. In this post we will look at techniques to tune and troubleshoot the processing side of your Analysis Services cube. Understanding Cube Processing Some of the common questions I hear as a consultant are "Why does my cube take 4 hours to process?" or "How can I reduce the time it takes to process my cube?". The answer to both of these questions starts with identifying the processing bottlen...
In this post I am going to deviate from Hadoop and HDInsight to focus on SQL Server Analysis Services Multidimensional and, more specifically, MDX queries. As a consultant, one of the common issues I encounter, more so than design, is that of performance. Typically, the performance issues SSAS users encounter occur in one of two realms: cube processing and query execution. While this post will focus on the latter, we start by establishing a higher level of understanding of what happens when an XMLA c...
In my last post, we took a helicopter tour of the MapReduce framework and its many facets. I believe it's important to have a functional understanding of MapReduce even if you never intend to work directly with it, since the more user-friendly abstractions of both Pig and Hive depend on it. In this post we will again turn to Java as we let our fingers do the walking to build our first MapReduce program. For this demo we will start slowly, implementing first the map and reduce functions...
Somewhere between teaching a BI Bootcamp class and wrestling my troop of kids, I promised myself I would get a blog post in this week. Luckily, I've had a few code-heavy posts, so we will dial it back slightly as I briefly introduce MapReduce for Hadoop/HDInsight. Most of the MapReduce posts I've seen to date talk very specifically about how to implement a C# MapReduce job on HDInsight. Before we go there, I think it's a topic that deserves a somewhat more abstract/academic discussion so that w...
Okay...okay...I know...the pig jokes are lame and getting old by now...maybe a picture of a kitten dressed like a Pig will cheer you up. Luckily this is the last of my introductory Pig posts before moving on to MapReduce. In this post we are going to spend some time creating and playing around with Pig User-Defined Functions (UDFs). We will look at what they are, how they are developed and ultimately leveraged as operators within your Pig Latin scripts. So without further ado..... What is a Pig U...
Whew! What a year it's been....It's been a crazy year, from writing books to volunteering and speaking at events throughout the country (all while still managing to do my regular day job). Now that I've got a handle on things it's time to do a little housekeeping..... That being said....I am in the process of moving my blog over to WordPress. I will continue to "simul-post" my work here but will no longer spend the usual 30+ minutes per post that it takes to tweak the layout. Feel free to check out my...
In my last post (see HERE), I introduced the Apache Pig project and showed you the equivalent of the "Hello World" demo in Pig. In this post, we are going to use the GSOD (Global Summary of the Day) station weather reports to calculate the average maximum daily temperature for each station. If you have not loaded the data, please see my previous post on Preparing and Loading data. Notes & Considerations You will need to set up and use the PiggyBank UDFs (User Defined Functions) library. For ...
In previous posts, we have looked at what it takes to get started with Hadoop on Windows using HDInsight. We also looked at Hive, which is the data warehousing framework built on top of Hadoop. In this post, we will dig a little deeper into the Hadoop ecosystem, focusing on the parallel language and runtime known as Pig. Pig, More than just bacon Pig got its start at Yahoo in 2006, originally created as a research tool intended to allow for ad-hoc queries and exploration of large semi-str...
In my next couple of blog entries, I will be focusing on PIG and then MapReduce. Before that however, I need to prepare a dataset and get it loaded in HDFS. The data that I will be working with is weather data, specifically the NOAA Global Summary of the Day (GSOD) data available for over 9,000 weather stations. GSOD data can be downloaded from the NOAA ftp site using the following address: ftp://ftp.ncdc.noaa.gov/pub/data/gsod . For this demo, I am only going to focus on a single full year's wo...
This post will be the holding place where I put misc. tools and tips for HDInsight.
Build Tools
1. Apache ANT ( http://ant.apache.org/manual/install.html ) Extract archive to c:\ant\ then modify the PATH to include Ant:
   set ANT_HOME=c:\ant
   set PATH=%PATH%;%ANT_HOME%\bin
2. Apache IVY ( http://ant.apache.org/ivy/history/latest-milestone/install.html ) Copy Ivy.JAR to Ant lib folder
3. Git Client ( http://git-scm.com/downloads )
Data Preparation/Research Tools
1. CURL ( http://curl.haxx.se/dow...
Hive Introduction Within the Hadoop ecosystem, you can use HDFS to load and store data and MapReduce to do both simple and hardcore processing. One of the missing pieces of the puzzle that is familiar to data warehousing professionals is the ability to interact with the data. Enter HIVE. Hive got its start at Facebook as they struggled to deal with the massive quantity of data accumulating daily within their Hadoop cluster. While it was easy for developers to write MapReduce jobs in a variety of ...
I am passionate when it comes to analytics, data mining and machine learning, and I think most organizations do too little when it comes to this arena. That's why one of my favorite parts of the Hadoop ecosystem is Mahout. Mahout is a scalable machine learning library that includes multiple out-of-the-box machine learning and data mining algorithms including clustering, classification, collaborative filtering and frequent pattern mining. If you are using HDInsight in the cloud, Mahout comes pre-in...
It's been a while since I've had the opportunity to blog, so when I decided to install HDInsight on a VM, I figured what better opportunity to get back in the swing of it. The Jumping Off Point To get things started, I am using VirtualBox as my VM host and I am running a fully patched (all 150+ of them) version of Windows Server 2008 R2. Not that it's relevant, but to be thorough, I've also installed SQL Server 2012 SP1 as it will be used in subsequent blogs. A Tale of Two Installers Before we div...
In Part 6 of this Master Data Services blog series we will look at how we enforce quality standards and ensure accuracy in our master data by implementing business rules. In the prior parts of this series, we have spent time reviewing important Master Data concepts and MDM architectures. We also looked at configuring MDS before learning about the model, entities, attributes and members. In our last post we looked at derived and explicit hierarchies before being introduced to collections. Series ...
In continuing this blog post series on Master Data Services 2012, we will dial in on MDS Hierarchies and Collections. So far in this series we have spent time reviewing important Master Data concepts and MDM architectures. We also looked at configuring MDS. Finally, in the last post we started to dig deeper into MDS by looking at Models, Entities, Attributes and Members. Series Index Part 1 - Understanding Master Data Part 2 - Master Data Management Architectures Part 3 - Installing Master Data ...
Although not new in SSIS 2012, Build Configurations have become exponentially more useful with the introduction of parameters and the new project deployment model. Before we dive in to see how useful this feature is, let's take a moment to review parameters and the project deployment model. Parameters Parameters are a new feature intended to replace and simplify configuration of SSIS packages when running under the new project deployment model. They are treated like read-only variables and have ...
It's been more than a month since my last post and unfortunately this crazy thing called work kept me away for far too long. We will build some momentum and get back on track in this post. In the first three posts of this series we spent time building a foundation for understanding important master data concepts and architectures. We also spent time setting up and configuring Master Data Services. In this post, we will get to the meat of MDS 2012. Specifically, we will dial in on the model, entitie...
In the first two parts of this blog series we spent time talking about master data, master data management (MDM) and the architectural patterns that are prevalent in MDM solutions. In this post we will start narrowing our focus to Master Data Services (MDS) in SQL Server 2012 by starting with the installation and setup process. Pre-Requisites Before we get started there are a couple of pre-requisites to be aware of. The first and one which will cause you problems in numerous places but will not...
SQL Server 2012 introduced a number of new features. One of the more interesting is an extension of the full-text search capability, called semantic search or, more formally, statistical semantic search. The premise behind semantic search is that statistical analysis of unstructured data or documents, built on top of existing full-text indexes, is used to provide meaning or context to search phrases. With this feature we are able to do a number of interesting tasks like extracting relevant terms,...
In my last post, I introduced at a high level the concepts important to both master data and master data management. We also discussed some of the driving factors behind the renewed interest in this arena and how people, process and technology are involved. In this post we are going to leave all that behind to focus solely on the technology aspect of implementing a Master Data Management (MDM) solution. Series Index Part 1 - Understanding Master Data Architecture It's worth spending some time in...
Data rich and knowledge poor. That's a common cliche we hear today and is the mantra or driving force behind the contemporary BI movement. But as more and more companies begin to explore and utilize the wealth of data they have at their fingertips, they are finding new challenges to old problems. In this blog series we are going to explore master data, master data management and the new Master Data Services (MDS) available in SQL Server 2012. This first post is foundational in that we define a b...
In this, the sixth installment of the DQS series we are going to move to the next level and demonstrate the ins and outs of a matching project using the matching policy we created in the last post. For brevity, we will skip over most of the general info associated with Data Quality Projects since it was covered in prior posts. If you have not been following along you can catch up below. DQS Blog Series Index Part 1: Getting Started with Data Quality Services (DQS) 2012 Part 2: Building Out a Kno...
In my last post "Building Out a DQS Knowledge Base" we set up and configured a DQS knowledge base. In this post we are going to explore the DQS knowledge discovery process. Before we get started let's recap our demo scenario. We are working with a set of data that contains all MLB and NFL teams, the league they play in, their stadium names and capacities. Our knowledge base was built using a domain for each data element (Team, League, Stadium, Capacity). We also constructed a composite domain to...
We are wrapping up this series on DQS with an overview of the administrative tasks that you will inevitably encounter in your own adventures with DQS. Specifically we will dial in on the activity monitoring capabilities, configuration and the security model. If you haven't been following along you can catch up using the links below. DQS Blog Series Index Part 1: Getting Started with Data Quality Services (DQS) 2012 Part 2: Building Out a Knowledge Base Part 3: Knowledge Discovery in DQS Part 4: ...