
Communifire Blogs


cprice1979 : Most Recent postings


MapReduce Ninja Moves: Combiners, Shuffle & Doing A Sort

2 days ago by cprice1979  -  Comments: 0  -  Views: [136]

Who's driving this car? At first glance it appears that, as a developer, you have little, if any, control over how MapReduce behaves. In some regards this is an accurate assessment. You have no control over when or where a MapReduce job runs, what data a specific map job will process or which reducer will handle the map's intermediate output. Feeling helpless yet? Don't worry: despite all that, there are a number of ninja techniques you can use to take control of how data moves...
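To make the terms in the title concrete, here is a minimal plain-Python sketch of a word count with a combiner, a shuffle and a sort. It is only a simulation written for this listing: it does not use the Hadoop API, and every function name in it (`map_words`, `combine`, `shuffle_and_sort`, `reduce_counts`, `word_count`) is a hypothetical name chosen for illustration, not something taken from the post.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate one map task's output locally, before the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle_and_sort(map_outputs):
    """Shuffle: group values by key across all map tasks, keys in sorted order."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_counts(key, values):
    """Reduce step: sum the (possibly pre-combined) counts for one key."""
    return key, sum(values)


def word_count(splits, use_combiner=True):
    """Run the simulated job; each split stands in for one map task's input."""
    map_outputs = []
    for split in splits:
        pairs = [pair for line in split for pair in map_words(line)]
        map_outputs.append(combine(pairs) if use_combiner else pairs)
    # Number of records that would have to cross the network in the shuffle.
    shuffled_records = sum(len(pairs) for pairs in map_outputs)
    result = [reduce_counts(key, values)
              for key, values in shuffle_and_sort(map_outputs)]
    return result, shuffled_records
```

Calling `word_count` on the same input with and without the combiner gives the same counts, but fewer records reach the shuffle when the combiner runs. That is the kind of control over data movement the post's title hints at.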


Tuning Multi-Dimensional Cube Processing

10 days ago by cprice1979  -  Comments: 1  -  Views: [398]

In my last post (HERE) we talked about troubleshooting and resolving issues with problematic MDX queries. In this post we will look at techniques to tune and troubleshoot the processing side of your Analysis Services cube. Understanding Cube Processing Some of the common questions I hear as a consultant are "Why does my cube take 4 hours to process?" or "How can I reduce the time it takes to process my cube?". The answer to both of these questions starts with identifying the processing bottlen...


Troubleshooting MDX Queries

10 days ago by cprice1979  -  Comments: 1  -  Views: [572]

In this post I am going to deviate from Hadoop and HDInsight to focus on SQL Server Analysis Services Multidimensional and, more specifically, MDX queries. As a consultant, one of the most common issues I encounter, more so than design, is performance. Typically, the performance issues SSAS users encounter occur in one of two realms: cube processing and query execution. While this post will focus on the latter, we start by establishing a higher level of understanding of what happens when an XMLA c...


MapReduce - First Glance

17 days ago by cprice1979  -  Comments: 0  -  Views: [291]

In my last post, we took a helicopter tour of the MapReduce framework and its many facets. I believe it's important to have a functional understanding of MapReduce even if you never intend to work directly with it, since the more user-friendly abstractions of both Pig and Hive depend on it. In this post we will again turn to Java as we let our fingers do the walking to build our first MapReduce program. For this demo we will start slowly, implementing first the map and reduce functions...


Map/Reduce - A Brief Introduction

28 days ago by cprice1979  -  Comments: 0  -  Views: [599]

Somewhere between teaching a BI Bootcamp class and wrestling my troop of kids, I promised myself I would get a blog post in this week. Luckily, I've had a few code-heavy posts, so we will dial it back slightly as I briefly introduce MapReduce for Hadoop/HDInsight. Most of the MapReduce posts I've seen to date talk very specifically about how to implement a C# MapReduce job on HDInsight. Before we go there, I think it's a topic that deserves a somewhat more abstract/academic discussion so that w...


MMM More Bacon - Pig User-Defined Functions (UDFs)

4/20/2013 by cprice1979  -  Comments: 0  -  Views: [487]

Okay...okay...I know...the pig jokes are lame and getting old by now...maybe a picture of a kitten dressed like a pig will cheer you up. Luckily this is the last of my introductory Pig posts before moving on to MapReduce. In this post we are going to spend some time creating and playing around with Pig User-Defined Functions (UDFs). We will look at what they are, how they are developed and ultimately leveraged as operators within your Pig Latin scripts. So without further ado..... What is a Pig U...


Moving Day!

4/17/2013 by cprice1979  -  Comments: 0  -  Views: [359]

Whew! What a year it's been....It's been a crazy year, from writing books to volunteering and speaking at events throughout the country (all while still managing to do my regular day job). Now that I've got a handle on things it's time to do a little housekeeping..... That being said....I am in the process of moving my blog over to WordPress. I will continue to "simul-post" my work here but will no longer spend the usual 30+ minutes per post that it takes to tweak the layout. Feel free to check out my...


Shakin' Bacon: Using Pig To Process Data

4/17/2013 by cprice1979  -  Comments: 0  -  Views: [995]

In my last post (see HERE), I introduced the Apache Pig project and showed you the equivalent of the "Hello World" demo in Pig. In this post, we are going to use the GSOD (Global Summary of the Day) station weather reports to calculate the average maximum daily temperature for each station. If you have not loaded the data, please see my previous post on Preparing and Loading data. Notes & Considerations You will need to set up and use the PiggyBank UDFs (User Defined Functions) library. For ...
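As a rough illustration of the calculation described above (group the reports by station, then average each station's daily maximum), here is a small plain-Python sketch. It is not the Pig script from the post: the function name `average_max_by_station` and the sample rows are made up for illustration, and it assumes the records have already been parsed into `(station_id, max_temp)` pairs. Parsing real GSOD files is not shown.

```python
from collections import defaultdict


def average_max_by_station(records):
    """Return {station_id: mean max_temp} for (station_id, max_temp) pairs."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for station_id, max_temp in records:
        sums[station_id] += max_temp
        counts[station_id] += 1
    return {station: sums[station] / counts[station] for station in sums}


# Hypothetical, already-parsed sample rows (not real GSOD data).
sample = [("A", 50.0), ("A", 60.0), ("B", 70.0)]
averages = average_max_by_station(sample)
```

In Pig Latin the same shape is usually expressed with a GROUP followed by the built-in AVG function. The sketch only shows the logic, not how the post implements it.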


When Pigs Fly: An Apache Pig Introduction

4/15/2013 by cprice1979  -  Comments: 0  -  Views: [465]

In previous posts, we have looked at what it takes to get started with Hadoop on Windows using HDInsight. We also looked at Hive, which is the data warehousing framework built on top of Hadoop. In this post, we will dig a little deeper into the Hadoop ecosystem, focusing on the parallel language and runtime known as Pig. Pig, More than just bacon Pig got its start at Yahoo in 2006, originally created as a research tool intended to allow for ad-hoc queries and exploration of large semi-str...


Preparing Data for Hadoop

4/12/2013 by cprice1979  -  Comments: 0  -  Views: [435]

In my next couple of blog entries, I will be focusing on Pig and then MapReduce. Before that, however, I need to prepare a dataset and get it loaded into HDFS. The data that I will be working with is weather data, specifically the NOAA Global Summary of the Day (GSOD) data available for over 9,000 weather stations. GSOD data can be downloaded from the NOAA FTP site using the following address: ftp://ftp.ncdc.noaa.gov/pub/data/gsod. For this demo, I am only going to focus on a single full year's wo...


Being Productive with HDInsight

4/9/2013 by cprice1979  -  Comments: 0  -  Views: [450]

This post will be the holding place where I put misc. tools and tips for HDInsight.

Build Tools

1. Apache ANT (http://ant.apache.org/manual/install.html). Extract the archive to c:\ant\ then set ANT_HOME and add Ant to the PATH:
   set ANT_HOME=c:\ant
   set PATH=%PATH%;%ANT_HOME%\bin
2. Apache IVY (http://ant.apache.org/ivy/history/latest-milestone/install.html). Copy Ivy.JAR to the Ant lib folder.
3. Git Client (http://git-scm.com/downloads)

Data Preparation/Research Tools

1. CURL ( http://curl.haxx.se/dow...


HIVE on HDInsight: First Glance

4/7/2013 by cprice1979  -  Comments: 0  -  Views: [614]

Hive Introduction Within the Hadoop ecosystem, you can use HDFS to load and store data and MapReduce to do both simple and hardcore processing. One of the missing pieces of the puzzle that is familiar to data warehousing professionals is the ability to interact with the data. Enter HIVE. Hive got its start at Facebook as they struggled to deal with the massive quantity of data accumulating daily within their Hadoop cluster. While it was easy for developers to write MapReduce jobs in a variety of ...


Installing Mahout for HDInsight on Windows Server

4/4/2013 by cprice1979  -  Comments: 0  -  Views: [616]

I am passionate when it comes to analytics, data mining and machine learning, and I think most organizations do too little when it comes to this arena. That's why one of my favorite parts of the Hadoop ecosystem is Mahout. Mahout is a scalable machine learning library that includes multiple out-of-the-box machine learning and data mining algorithms including clustering, classification, collaborative filtering and frequent pattern mining. If you are using HDInsight in the cloud, Mahout comes pre-in...


Installing HDInsight

4/4/2013 by cprice1979  -  Comments: 0  -  Views: [542]

It's been a while since I've had the opportunity to blog, so when I decided to install HDInsight on a VM, I figured what better opportunity to get back in the swing of it. The Jumping Off Point To get things started, I am using VirtualBox as my VM host and I am running a fully patched (all 150+ of them) version of Windows Server 2008 R2. Not that it's relevant, but to be thorough, I've also installed SQL Server 2012 SP1 as it will be used in subsequent blogs. A Tale of Two Installers Before we div...


MDS 2012: Part 6–Business Rules

7/17/2012 by cprice1979  -  Comments: 2  -  Views: [3521]

In Part 6 of this Master Data Services blog series we will look at how we enforce quality standards and ensure accuracy in our master data by implementing business rules. In the prior parts to this series, we have spent time reviewing important Master Data concepts and MDM architectures. We also looked at configuring MDS before learning about the model, entities, attributes and members. In our last post we looked at derived and explicit hierarchies before being introduced to collections. Series ...


MDS 2012: Part 5–Hierarchies and Collections

7/11/2012 by cprice1979  -  Comments: 0  -  Views: [3289]

In continuing this blog post series on Master Data Services 2012, we will dial in on MDS Hierarchies and Collections. So far in this series we have spent time reviewing important Master Data concepts and MDM architectures. We also looked at configuring MDS. Finally, in the last post we started to dig deeper into MDS by looking at Models, Entities, Attributes and Members. Series Index Part 1 - Understanding Master Data Part 2 - Master Data Management Architectures Part 3 - Installing Master Data ...


Build Configurations in SSIS 2012

7/8/2012 by cprice1979  -  Comments: 1  -  Views: [10038]

Although not new in SSIS 2012, Build Configurations have become exponentially more useful with the introduction of parameters and the new project deployment model. Before we dive in to see how useful this feature is, let's take a moment to review parameters and the project deployment model. Parameters Parameters are a new feature intended to replace and simplify configuration of SSIS packages when running under the new project deployment model. They are treated like read-only variables and have ...


MDS 2012: Part 4–Models, Entities, Attributes & Members

6/29/2012 by cprice1979  -  Comments: 0  -  Views: [4002]

It's been more than a month since my last post and unfortunately this crazy thing called work kept me away for far too long. We will build some momentum and get back on track in this post. In the first three posts of this series we spent time building a foundation for understanding important master data concepts and architectures. We also spent time setting up and configuring Master Data Services. In this post, we will get to the meat of MDS 2012. Specifically, we will dial in on the model, entitie...


MDS 2012: Part 3–Installing Master Data Services

4/23/2012 by cprice1979  -  Comments: 0  -  Views: [7842]

In the first two parts of this blog series we spent time talking about master data, master data management (MDM) and the architectural patterns that are prevalent in MDM solutions. In this post we will start narrowing our focus to Master Data Services (MDS) in SQL Server 2012 by starting with the installation and set-up process. Pre-Requisites Before we get started there are a couple of pre-requisites to be aware of. The first and one which will cause you problems in numerous places but will not...


Preparing for Semantic Search

4/20/2012 by cprice1979  -  Comments: 0  -  Views: [2810]

SQL Server 2012 introduced a number of new features. One of the more interesting is an extension of the full-text search capability, called semantic search or, more formally, statistical semantic search. The premise behind semantic search is that statistical analysis of unstructured data or documents, built on top of existing full-text indexes, is used to provide meaning or context to search phrases. With this feature we are able to do a number of interesting tasks like extracting relevant terms,...


MDS 2012: Part 2–Master Data Management Architectures

4/11/2012 by cprice1979  -  Comments: 0  -  Views: [5918]

In my last post, I introduced, at a high level, the concepts important to both master data and master data management. We also discussed some of the driving factors behind the renewed interest in this arena and how people, process and technology are involved. In this post we are going to leave all that behind to focus solely on the technology aspect of implementing a Master Data Management (MDM) solution. Series Index Part 1 - Understanding Master Data Architecture It's worth spending some time in...


MDS 2012: Part 1 Understanding Master Data

4/4/2012 by cprice1979  -  Comments: 3  -  Views: [3863]

Data rich and knowledge poor. That's a common cliche we hear today and is the mantra or driving force behind the contemporary BI movement. But as more and more companies begin to explore and utilize the wealth of data they have at their fingertips, they are finding new challenges to old problems. In this blog series we are going to explore master data, master data management and the new Master Data Services (MDS) available in SQL Server 2012. This first post is foundational in that we define a b...


Matching Projects in DQS

3/29/2012 by cprice1979  -  Comments: 1  -  Views: [2475]

In this, the sixth installment of the DQS series, we are going to move to the next level and demonstrate the ins and outs of a matching project using the matching policy we created in the last post. For brevity, we will skip over most of the general info associated with Data Quality Projects since it was covered in prior posts. If you have not been following along, you can catch up below. DQS Blog Series Index Part 1: Getting Started with Data Quality Services (DQS) 2012 Part 2: Building Out a Kno...


Knowledge Discovery in DQS

3/29/2012 by cprice1979  -  Comments: 0  -  Views: [3344]

In my last post, "Building Out a DQS Knowledge Base", we set up and configured a DQS knowledge base. In this post we are going to explore the DQS knowledge discovery process. Before we get started, let's recap our demo scenario. We are working with a set of data that contains all MLB and NFL teams, the league they play in, their stadium names and capacities. Our knowledge base was built using a domain for each data element (Team, League, Stadium, Capacity). We also constructed a composite domain to...


Activity Monitoring, Configuration & Security in DQS

3/29/2012 by cprice1979  -  Comments: 3  -  Views: [2341]

We are wrapping up this series on DQS with an overview of the administrative tasks that you will inevitably encounter in your own adventures with DQS. Specifically we will dial in on the activity monitoring capabilities, configuration and the security model. If you haven't been following along you can catch up using the links below. DQS Blog Series Index Part 1: Getting Started with Data Quality Services (DQS) 2012 Part 2: Building Out a Knowledge Base Part 3: Knowledge Discovery in DQS Part 4: ...
