Tutorial

Apache Hadoop FAQ for BI professionals

Staff Writer, SearchBusinessIntelligence.in

Over the past few years, the open source technology Apache Hadoop has become quite popular among business intelligence (BI) and data warehousing (DW) professionals. In this tutorial, we explain the technology by answering some of the frequently asked questions (FAQs) about Hadoop.


In this guide:


  • What is Apache Hadoop? 

Apache Hadoop is a free, Java-based programming framework designed for parallel processing of large amounts of data in a distributed computing environment. Hadoop can scale up, in a fault-tolerant manner, from one machine to thousands. This scalability means the individual machines in the processing cluster can be inexpensive, while the cluster itself stays resilient. With Hadoop, applications can process petabytes of data on thousands of processing nodes.
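
The classic introductory example of Hadoop programming is a word count job. The sketch below, written against the org.apache.hadoop.mapreduce API of a recent Hadoop release (an assumption; class and method names have shifted slightly between versions), is a minimal illustration rather than production code: the mapper emits a (word, 1) pair for every word in its input split, and the reducer sums the counts for each word. The input and output paths are hypothetical HDFS locations supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each mapper processes one split of the input and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for a given word arrive at the same reducer and are summed.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // The framework takes care of splitting the input, scheduling map and reduce
        // tasks across the cluster, and re-running tasks on nodes that fail.
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged as a JAR, such a job would typically be submitted with a command along the lines of hadoop jar wordcount.jar WordCount /user/analyst/input /user/analyst/output, with the paths shown here purely illustrative.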

  • Who supports and funds Hadoop? 

Hadoop is one of the projects of the Apache Software Foundation. A global community of developers contributes to the main Hadoop project, and its sub-projects are backed by some of the world's largest Web companies, including Facebook and Yahoo.


  • Why is Hadoop popular? 

Hadoop's popularity stems partly from the fact that some of the world's largest Internet businesses use it to analyze unstructured data. Hadoop enables distributed applications to handle data volumes on the order of petabytes.

  • Where does Hadoop find applicability in business? 

Hadoop, as a scalable system for parallel data processing, is useful for analyzing large data sets. Examples include search algorithms, market risk analysis, data mining on online retail data, and analytics on user behavior data.

Hadoop's scalability makes it attractive to businesses because the volumes of data they handle keep growing rapidly. Another core strength of Hadoop is that it can handle structured as well as unstructured data, from a variable number of sources.


  • What are the enterprise adoption challenges associated with Hadoop? 

  1. To many enterprises, the Hadoop framework is attractive because it gives them the power to analyze their data, regardless of volume. Not every enterprise, however, has the in-house expertise to drive that analysis in a way that delivers business value.
  2. Scaling up and optimizing Hadoop computing clusters involves custom coding, which can mean a steep learning curve for data analytics developers.
  3. Hadoop was not originally designed with the security features typically required for sensitive enterprise data.
  4. Other potential problem areas for enterprise adoption of Hadoop include integration with existing databases and applications, and the absence of industry-wide best practices.

  • How has Hadoop evolved over the years? 

Hadoop has its origins in MapReduce, a programming model published by Google. Google's MapReduce framework could break a program down into many parallel computations and run them on very large data sets, across a large number of computing nodes. An example use for such a framework is running search algorithms on Web data.
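
The model itself is small enough to show in a few lines. The fragment below is a single-machine sketch in plain Java, with no Hadoop APIs, of the same word count as the job above: a map step that turns each record into key/value pairs, a shuffle step that groups the pairs by key, and a reduce step that collapses each group into a result. What Hadoop adds is running the map and reduce steps in parallel across many nodes and moving the grouped data between them.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceIdea {
    public static void main(String[] args) {
        List<String> records = List.of("big data on hadoop", "hadoop scales out", "big clusters");

        // Map: each input record becomes a set of (key, value) pairs.
        // On a cluster, this step runs in parallel, one task per block of input data.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            for (String word : record.split("\\s+")) {
                pairs.add(Map.entry(word, 1));
            }
        }

        // Shuffle: pairs are grouped by key, so every value for a given word
        // ends up at the same reducer.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce: each group is collapsed into a single result per key.
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int count : group.getValue()) {
                sum += count;
            }
            System.out.println(group.getKey() + "\t" + sum);
        }
    }
}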

Hadoop, initially associated only with web indexing, evolved rapidly to become a leading platform for analyzing big data. Cloudera, an enterprise software company, began providing Hadoop-based software and services in 2008.

In 2012, GoGrid, a cloud infrastructure company, partnered with Cloudera to accelerate the adoption of Hadoop-based business applications. Also in 2012, Dataguise, a data security company, launched a data protection and risk assessment tool for Hadoop.


  • Apache Hadoop: Companion projects 

The Apache Software Foundation maintains several companion projects to Hadoop:

Apache Cassandra is a database management system designed for large amounts of data. Key aspects are fault tolerance, scalability, Hadoop integration, and replication support.

HBase is a non-relational, fault-tolerant distributed database designed for storing large amounts of sparse data.

Hive is a data warehousing system for Hadoop that enables easy data summarization.

Apache Pig consists of a high-level language to create data analysis programs, along with the infrastructure for evaluating those programs.

Apache ZooKeeper is a centralized service for distributed applications. It maintains configuration information and provides a naming registry, distributed synchronization, and group services.
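
As a small illustration of the configuration-registry use case, the sketch below uses the standard ZooKeeper Java client to publish a value under a znode path and read it back; the connection string, path, and value are made-up examples, and the server address assumes a local ZooKeeper instance.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble; a single local server is assumed here.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Publish a configuration value as a persistent znode in the shared namespace.
        String path = "/report-db-url";                              // hypothetical path
        zk.create(path, "jdbc:mysql://dbhost/reports".getBytes(),    // hypothetical value
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any process connected to the same ensemble can now read (and watch) that value.
        byte[] value = zk.getData(path, false, null);
        System.out.println(new String(value));

        zk.close();
    }
}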

Chukwa is a data collection system that monitors large distributed systems; it includes a toolkit for analyzing the results.

The Apache Mahout project aims to produce free implementations of scalable machine learning algorithms that run on the Hadoop platform.


Further reading: 

 

Hadoop enables distributed ‘big data’ processing across clouds

Can Hadoop tools ease the brain pain?

Building massively distributed applications with Hadoop

How Hadoop and open source BI software affect BI architecture

Customary data warehouse concepts vs. Hadoop: Forrester

How do Hadoop and MapReduce compare

Hadoop for big data puts architects on journey of discovery

This was first published in September 2012