Communifire Blogs

cprice1979: Most Recent Postings

cprice1979

Streaming #Pig

9/27/2013 by cprice1979  -  Comments: 0  -  Views: [1579]

As a C# developer, you have a number of opportunities to write code that is either used by or interacts with a Hadoop/HDInsight cluster. A number of these have been well publicized and documented. In fact, there is an entire .NET SDK for Hadoop (HERE) that will allow you to easily write streaming MapReduce jobs, manage your cluster and even write LINQ-based queries against Hive (among other things). One option that you may not have been aware of is the streaming feature built into ...

Oink: Improving #Pig Development

9/25/2013 by cprice1979  -  Comments: 0  -  Views: [936]

Over the last couple (ok, more than a couple) of months, we've taken a meandering stroll through the different parts and pieces that form the foundation of the Hadoop ecosystem. We've covered Hive, Mahout, Recommendation Engines and even a little bit about Pig. In this post we are going to circle back to Pig again and, instead of focusing on the sexy, whiz-bang features, look at the practical everyday skills needed to work productively with Pig. Pig Scripts After you underst...

Indexes & Views in #Hive

9/7/2013 by cprice1979  -  Comments: 0  -  Views: [4568]

In my last Hive post, we introduced partitions and bucketing, both of which allow you to horizontally slice data to make it more manageable and easier to query. Staying the course, in this post we will introduce two more techniques to improve your experience in Hive: indexes and views. Indexes In the SQL Server world, indexes are critical for efficiently retrieving rows from tables. Without an index, it is necessary to perform a full scan of all the data to retrieve the required ...

Partitions & Buckets in #Hive

8/31/2013 by cprice1979  -  Comments: 0  -  Views: [5728]

In my previous post, we discussed the map, array and struct data types and their implementation in Hive. Continuing on the Hive theme, this post will introduce partitioning and bucketing as methods for segmenting large data sets to improve query performance. Partitions If you have previous experience working in the relational database world, then the concept of partitions and partitioning is certainly not new. Partitions are fundamentally horizontal slices of data which allow large sets of data ...
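
To make the distinction between the two ideas concrete, here is a minimal Python sketch. It is not Hive's implementation: partitioning is modeled as grouping rows by the value of a chosen column, and bucketing as assigning rows to a fixed number of buckets by hashing a column. The column names, the sample rows and the choice of CRC32 as the hash are all assumptions made for this illustration.

```python
import zlib
from collections import defaultdict

# Hypothetical sample rows; the column names are illustrative only.
rows = [
    {"station": "A1", "year": 2012, "temp": 71.0},
    {"station": "B2", "year": 2012, "temp": 64.5},
    {"station": "A1", "year": 2013, "temp": 73.2},
    {"station": "C3", "year": 2013, "temp": 58.9},
]


def partition_rows(rows, column):
    """Group rows by the value of one column (one slice per distinct value)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def bucket_rows(rows, column, num_buckets):
    """Assign each row to one of num_buckets by hashing a column value.

    CRC32 is used only because it is deterministic across runs; it is an
    assumption for this sketch, not the hash function Hive uses.
    """
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        key = str(row[column]).encode("utf-8")
        buckets[zlib.crc32(key) % num_buckets].append(row)
    return buckets


by_year = partition_rows(rows, "year")       # one slice per year
by_station = bucket_rows(rows, "station", 2)  # fixed number of buckets
```

A query that filters on year only needs to look at one slice of by_year, while rows for the same station always land in the same bucket, which is the property that makes both techniques useful for narrowing down how much data has to be read.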

Introduction to #Hive Collections

8/21/2013 by cprice1979  -  Comments: 0  -  Views: [4317]

After a much needed vacation in the sunny Florida Keys and some time away from the work and blogosphere world, it's time to get back on the hamster wheel. Like most RDBMS systems, Hive supports a number of different primitive data types including various size integers, precision floating point, boolean, timestamp and of course strings. Beyond those basics, Hive supports three data types not typically found in other database systems. These three data types (array, struct and map) are collection impl...

Integrating Master Data Services Into The Enterprise

7/10/2013 by cprice1979  -  Comments: 0  -  Views: [1179]

A lot of what is typically discussed in regards to Master Data Services (MDS) revolves around either the Silverlight web-based interface or the Excel plug-in. What's not often discussed are the back-end integration scenarios for loading, extracting (querying) and otherwise handling the day-to-day management or reporting requirements a Master Data solution might involve. In this post, we will look at common integration scenarios and techniques for handling each within MDS. Loading Data (Batch) Out-of-the...

#Mahout Recommendation Engines: Part 3 - Moving Data

7/3/2013 by cprice1979  -  Comments: 0  -  Views: [1233]

In the previous two posts of this series we built a foundation for designing and building a recommendation engine. In the first post we built an understanding of what a recommendation engine looks like and how it works. In the second post we introduced Mahout as a platform for building a recommendation engine. This post builds on that as we start designing a recommendation engine from end to end, beginning with the ELT process for moving data to a Hadoop cluster. Sources of Data If you recall f...

Upcoming Events

6/23/2013 by cprice1979  -  Comments: 0  -  Views: [639]

This weekend starts a summer/fall sprint of speaking engagements. I hope you will consider joining me at one of the many in-person and virtual sessions I will be hosting. Below are some of those happenings; feel free to check out my events page for a more complete list. SQL Saturdays 6/29 - South Florida #226 MDS in Practice: An Integrated Approach ( http://www.sqlsaturday.com/viewsession.aspx?sat=226&sessionid=14901 ) Into the Wild, Taming Unstructured Data ( http://www.sqlsaturday.com/v...

Making the Case for Statistical Semantic Search

6/17/2013 by cprice1979  -  Comments: 0  -  Views: [868]

As we sit on the cusp of SQL Server 2014, it seems a little odd to be writing a blog post whose objective is to introduce File Table and Semantic Search. While both of these features were new in SQL Server 2012 and both received quite a bit of attention, I have found recently that they are still either misunderstood or simply overlooked. Either case is unfortunate because both features are very powerful and open up a number of possibilities in handling what is commonly referred to as unstru...

#Mahout Recommendation Engines: Part 2 - Ride the Elephant

6/14/2013 by cprice1979  -  Comments: 0  -  Views: [3205]

In Part 1 of this blog series we built a foundation by introducing the various techniques that can be used to generate recommendations for products or items to your users. In this post, we begin looking at Mahout as a platform for building a recommender, including setting up a data model, common methods for calculating similarity and finally the algorithms used to generate recommendations. Understanding Recommendations in Mahout Mahout is a machine learning library of algorithms that grew out ...
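
Since the excerpt mentions similarity calculations without showing one, here is a small library-free Python sketch of one common measure, cosine similarity, feeding a very naive user-based recommender. None of this is Mahout's API; the user names, items and ratings are made up for the example, and picking a single nearest neighbour is a simplification.

```python
import math

# Hypothetical user -> {item: rating} preferences, for illustration only.
ratings = {
    "alice": {"book": 5.0, "film": 3.0, "game": 4.0},
    "bob": {"book": 4.0, "film": 3.0, "album": 5.0},
    "carol": {"film": 5.0, "game": 1.0},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse rating vectors."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_user(user, ratings):
    """Return the other user whose ratings are closest to `user`."""
    others = [u for u in ratings if u != user]
    return max(others, key=lambda u: cosine_similarity(ratings[user], ratings[u]))


def recommend(user, ratings):
    """Items the most similar user rated that `user` has not, best first."""
    neighbour = most_similar_user(user, ratings)
    unseen = {i: r for i, r in ratings[neighbour].items() if i not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)


suggestions = recommend("alice", ratings)
```

A real recommender would weigh several neighbours and handle users with no overlap, but the shape is the same: a data model of preferences, a similarity function, and an algorithm that turns similar users into suggestions.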

#Mahout Recommendation Engine: Part 1 - Types of Recommenders

6/14/2013 by cprice1979  -  Comments: 0  -  Views: [3556]

Recommendation engines have become a pervasive and daily part of our digitally connected lives. Whether you're shopping on Amazon or reading new articles on your Yahoo! home page, the products and news you are offered are the result of some implicit or explicit behavior that is used to drive a computational engine that uses patterns to predict (hopefully successfully) your likes and dislikes in order to serve up recommendations. While this technology is nothing new, advancements in toolsets have made th...

Hello My Name is Sqoop

5/30/2013 by cprice1979  -  Comments: 0  -  Views: [1566]

In my previous posts we have looked at different means and methods for loading and subsequently working with data in a Hadoop environment. Largely missing from the discussion to date, however, is how SQL Server and other relational databases play in this sandbox. While there are multiple points of integration, the focus of this post will be on the SQL-to-Hadoop tool better known as Sqoop. Have a Double Sqoop Sqoop is a relatively new command-line tool whose primary purpose is efficiently moving data betw...

#SQLPASS Abstract Review - My Perspective

5/24/2013 by cprice1979  -  Comments: 0  -  Views: [877]

I have been fortunate enough to participate as a team lead for the past two years on the abstract review committee for PASS Summit, and I wanted to take a moment to provide some feedback based on my own personal experiences. First, this year was by far the toughest. The quality of abstracts was phenomenal, which made the job of abstract review and session selection very tough (this is a good thing, btw). Much of this is not new. I am hoping that it will help you make more sense of the abstract review...

PASS Summit 2013

5/22/2013 by cprice1979  -  Comments: 0  -  Views: [716]

It's official!! I will be presenting a session on HDInsight and Predictive Analytics at PASS Summit 2013 in Charlotte, North Carolina. This is the first time the event is being held in Charlotte instead of Seattle, and while I have attended previous Summits for many years in various capacities, this year is special as it will be my first time presenting. I hope you will consider joining me this year at PASS Summit! For more information, check out the official website at: http://www.sqlpass.org...

MapReduce Ninja Moves: Combiners, Shuffle & Doing A Sort

5/20/2013 by cprice1979  -  Comments: 0  -  Views: [2943]

Who's driving this car? At first glance it appears that, as a developer, you have little if any control over how MapReduce behaves. In some regards this is an accurate assessment. You have no control over when or where a MapReduce job runs, what data a specific map job will process or which reducer will handle the map's intermediate output. Feeling helpless yet? Don't worry, the truth is that despite all that, there are a number of ninja techniques you can use to take control of how data moves...
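
To give a feel for where combiners, the shuffle and the sort sit in the flow, here is a single-process Python sketch of a word count. It only imitates the stages in memory and is not Hadoop code; the function names and the two input splits are invented for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for each word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Pre-aggregate one mapper's output locally, before anything is shuffled."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle_and_sort(mapper_outputs):
    """Group values by key across all mappers and order the keys."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Sum the (already partially combined) counts for one key."""
    return key, sum(values)


# Two hypothetical input splits, each handled by its own "mapper".
splits = ["pig hive pig", "hive pig mahout"]
combined = [combine(map_phase(line)) for line in splits]
result = [reduce_phase(key, values) for key, values in shuffle_and_sort(combined)]
```

The point of the combiner is visible in the first split: the mapper emits two separate pairs for "pig", but only one pre-summed pair has to travel to the reducer.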

Tuning Multi-Dimensional Cube Processing

5/11/2013 by cprice1979  -  Comments: 1  -  Views: [3258]

In my last post (HERE) we talked about troubleshooting and resolving issues with problematic MDX queries. In this post we will look at techniques to tune and troubleshoot the processing side of your Analysis Services cube. Understanding Cube Processing Some of the common questions I hear as a consultant are "Why does my cube take 4 hours to process?" or "How can I reduce the time it takes to process my cube?". The answer to both of these questions starts with identifying the processing bottlen...

Troubleshooting MDX Queries

5/11/2013 by cprice1979  -  Comments: 1  -  Views: [2367]

In this post I am going to deviate from Hadoop and HDInsight to focus on SQL Server Analysis Services Multi-dimensional and, more specifically, MDX queries. As a consultant, one of the common issues I encounter, more so than design, is that of performance. Typically, the performance issues SSAS users encounter occur in one of two realms: cube processing and query execution. While this post will focus on the latter, we start by establishing a higher level of understanding of what happens when an XMLA c...

MapReduce - First Glance

5/4/2013 by cprice1979  -  Comments: 0  -  Views: [778]

In my last post, we took a helicopter tour of the MapReduce framework and its many facets. I believe it's important to have a functional understanding of MapReduce even if you never intend to work directly with it, since the more user-friendly abstractions of both Pig and Hive depend on it. In this post we will again turn to Java as we let our fingers do the walking to build our first MapReduce program. For this demo we will start slowly, implementing first the map and reduce functions...

Map/Reduce - A Brief Introduction

4/24/2013 by cprice1979  -  Comments: 0  -  Views: [1345]

Somewhere between teaching a BI Bootcamp class and wrestling my troop of kids, I promised myself I would get a blog post in this week. Luckily, I've had a few code-heavy posts, so we will dial it back slightly as I briefly introduce MapReduce for Hadoop/HDInsight. Most of the MapReduce posts I've seen to date talk very specifically about how to implement a C# MapReduce job on HDInsight. Before we go there, I think it's a topic that deserves a somewhat more abstract/academic discussion so that w...

MMM More Bacon - Pig User-Defined Functions (UDFs)

4/20/2013 by cprice1979  -  Comments: 0  -  Views: [1574]

Okay...okay...I know...the pig jokes are lame and getting old by now...maybe a picture of a kitten dressed like a Pig will cheer you up. Luckily this is the last of my introductory Pig posts before moving on to MapReduce. In this post we are going to spend some time creating and playing around with Pig User-Defined Functions (UDFs). We will look at what they are, how they are developed and ultimately leveraged as operators within your Pig Latin scripts. So without further ado..... What is a Pig U...

Moving Day!

4/17/2013 by cprice1979  -  Comments: 0  -  Views: [849]

Wheww! What a year it's been....It's been a crazy year, from writing books to volunteering and speaking at events throughout the country (all while still managing to do my regular day job). Now that I've got a handle on things it's time to do a little housekeeping..... That being said....I am in the process of moving my blog over to WordPress. I will continue to "simul-post" my work here but will no longer spend the usual 30+ minutes per post that it takes to tweak the layout. Feel free to check out my...

Shakin' Bacon: Using Pig To Process Data

4/17/2013 by cprice1979  -  Comments: 0  -  Views: [3100]

In my last post (see HERE), I introduced the Apache Pig project and showed you the equivalent of the "Hello World" demo in Pig. In this post, we are going to use the GSOD (Global Summary of the Day) station weather reports to calculate the average maximum daily temperature for each station. If you have not loaded the data, please see my previous post on Preparing and Loading data. Notes & Considerations You will need to set up and use the PiggyBank UDFs (User Defined Functions) library. For ...
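
For readers who want to see the calculation itself before getting to the Pig script, here is a plain Python sketch of "average maximum daily temperature per station". It is not the Pig Latin from the post, and the station identifiers and temperatures below are invented; real GSOD records carry many more fields and would need parsing first.

```python
from collections import defaultdict

# Hypothetical (station_id, max_temp) readings, one per station per day.
readings = [
    ("010010", 40.0),
    ("010010", 44.0),
    ("010014", 51.5),
    ("010014", 48.5),
    ("010014", 50.0),
]


def average_max_temp(readings):
    """Group readings by station, then average the daily maximums."""
    totals = defaultdict(lambda: [0.0, 0])  # station -> [running sum, count]
    for station, max_temp in readings:
        totals[station][0] += max_temp
        totals[station][1] += 1
    return {station: total / count for station, (total, count) in totals.items()}


averages = average_max_temp(readings)
```

The two steps, grouping by station and then averaging within each group, are the same two steps any data-flow version of this job has to express.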

When Pigs Fly: An Apache Pig Introduction

4/15/2013 by cprice1979  -  Comments: 0  -  Views: [1278]

In previous posts, we have looked at what it takes to get started with Hadoop on Windows using HDInsight. We also looked at Hive, which is the data warehousing framework built on top of Hadoop. In this post, we will dig a little deeper into the Hadoop Ecosystem, focusing on the parallel language and runtime known as Pig. Pig, More than just bacon Pig got its start at Yahoo in 2006, originally created as a research tool intended to allow for ad-hoc queries and exploration of large semi-str...

Preparing Data for Hadoop

4/12/2013 by cprice1979  -  Comments: 0  -  Views: [1222]

In my next couple of blog entries, I will be focusing on Pig and then MapReduce. Before that, however, I need to prepare a dataset and get it loaded in HDFS. The data that I will be working with is weather data, specifically the NOAA Global Summary of the Day (GSOD) data available for over 9,000 weather stations. GSOD data can be downloaded from the NOAA FTP site using the following address: ftp://ftp.ncdc.noaa.gov/pub/data/gsod . For this demo, I am only going to focus on a single full year's wo...

Being Productive with HDInsight

4/9/2013 by cprice1979  -  Comments: 0  -  Views: [1004]

This post will be the holding place where I put misc. tools and tips for HDInsight.

Build Tools
1. Apache Ant ( http://ant.apache.org/manual/install.html ) Extract the archive to c:\ant\ then set the environment variables so that Ant is on the path:
   set ANT_HOME=c:\ant
   set PATH=%PATH%;%ANT_HOME%\bin
2. Apache Ivy ( http://ant.apache.org/ivy/history/latest-milestone/install.html ) Copy Ivy.JAR to the Ant lib folder
3. Git Client ( http://git-scm.com/downloads )

Data Preparation/Research Tools
1. CURL ( http://curl.haxx.se/dow...