Leveraging your Hadoop cluster better - running performant code at scale
Somebody once said that Hadoop is a way of running highly unperformant code at scale. In this talk I want to show how we can change that and make MapReduce jobs more performant. I will show how to analyze jobs at scale and how to optimize the jobs themselves, instead of just tinkering with Hadoop options. The result is a much better utilized cluster and jobs that run in a fraction of their original time: performant code at scale!

Most of the time when people speak about Hadoop they only consider scale; on closer inspection, however, Hadoop clusters very often run highly unperformant jobs. By actually examining the performance characteristics of the jobs themselves and optimizing and tuning them, far better results can be achieved. Examples include small changes that cut a job from 15 hours down to 2 hours without adding any hardware. The concepts and techniques explained in the talk are applicable regardless of which tool is used to identify the performance characteristics. What matters is that by applying the performance analysis and optimization techniques we have long used on other applications, we can make Hadoop jobs much more effective and performant. Attendees will come away able to understand these techniques and apply them to their own MapReduce, Pig, Hive, or other MapReduce-based jobs.
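To give a flavor of the kind of small, job-level change the talk covers (this sketch is illustrative and not taken from the talk itself), consider the canonical Hadoop word-count job. Two classic job-side optimizations are shown: reusing Writable instances in the mapper to avoid allocating new objects for every record, and registering a Combiner so map output is pre-aggregated locally before the shuffle.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    // Reuse Writable instances instead of allocating one per record:
    // the map loop runs millions of times, so per-record allocation
    // puts needless pressure on the garbage collector.
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The combiner pre-aggregates map output on each node, so far
    // less data is shuffled across the network - often the dominant
    // cost in a MapReduce job.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

How much a given job benefits depends on its shuffle volume and allocation rate, but the point stands: these changes target the job's own code rather than Hadoop configuration options, which is exactly the approach the talk advocates.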