Looking out at the media landscape, it’s easy to assume the term big data is automatically tied to the word Hadoop, describing an open source technology used for processing large data sets in a distributed computing
David Menninger, research director for the San Ramon, Calif.-based Ventana Research, recently published The Challenge of Big Data: Benchmarking Large-Scale Data Management Insights, which maps out the big data terrain based on surveys from 163 qualified respondents. According to the report, how a business manages large, rapidly growing sets of structured and unstructured data is still evolving, but deploying only one tool, like Hadoop, is not cutting it. Instead, businesses entering into the big data mix are doing so with basic analytics and a variety of tools, starting with those already in-house. The research also found that fielding the big data challenge means overcoming hurdles, the most significant of which is a skills deficit.
What is ‘big data?’
“Big data” is used to describe the voluminous amount of structured, unstructured and semi-structured data a company creates -- data that in many cases would take too much time and cost too much money to load into a conventional relational database for analysis.
The Challenge of Big Data is the second of a two-part research series that first looked at Hadoop usage (published in July) and then, more generically, at big data. Both reports are based on the same survey results. To avoid influencing respondents, Ventana used the phrase “large-scale data” rather than big data and didn’t include questions about Hadoop until two-thirds of the way through the survey.
Big data technology
For all its buzz, the research shows Hadoop is being used by only about 22% of respondents. Almost half, 45%, indicated they have no plans to evaluate or introduce the technology into their architecture. Menninger said he wasn’t surprised by this finding, nor did he believe the numbers would change if the survey were given today rather than six months ago.
“Hot technology trends have an adoption curve: Early adopters, fast followers, mass market and laggards,” he said. “While Hadoop [interest] is rising dramatically, it’s still going to find the same pattern, and I don’t think we are yet at the point of mass market.”
Instead of Hadoop, the most popular technology for big data today is relatively basic: 89% of respondents indicated the relational database as their primary large-scale data mechanism. Most likely, businesses are using the technology by default until they can’t, Menninger said.
“Organizations have to cross some sort of threshold before those relational databases are not sufficient,” Menninger said.
Of the 89% of respondents using relational databases for big data, 93% indicated they also employed a secondary big data tool. Menninger points to the statistic as evidence that no one technology has emerged as a silver bullet. Instead, businesses are cobbling together bits and pieces of tools and technology.
What did surprise Menninger was how prevalent in-memory technologies are throughout the big data landscape. The survey found 33% of respondents are using in-memory databases. Another 17% indicated they planned to use the technology in the next year or two.
Balancing the basics -- for now
Although businesses may have big data, they still tend to analyze that data using basic techniques. Most respondents, 94%, indicated query and reporting capabilities are available within their organizations for big data analytics, while only 55% said the more advanced analytics of prediction and data mining capabilities are available.
“As the data volumes become larger, doing a simple analysis becomes insufficient,” Menninger said in a webinar on the research findings. “The notion of trying to browse through billions of values to find the ones that are important becomes more challenging and more difficult.”
Sifting big data in search of a particular value is not the most efficient method, but Menninger is unsurprised by this finding as well. He believes that techniques, like technology, follow a trajectory that begins with the basics and graduates to more advanced levels. A 2010 Ventana survey of more than 2,000 organizations supports that belief.
“The more advanced analytics are the least often used,” he said of the 2010 survey findings; advanced analytics includes planning, forecasting, what-if analyses and predictive analytics. “So this isn’t specific to big data.”
Even so, businesses capable of advancing their big data analytics programs are better positioned than those that rely on basic queries and reports, he said.
Part of advancing a big data program may mean investing in new talent or additional training for employees. Menninger’s research found that the biggest obstacles had nothing to do with technology; instead, the two top issues are staffing and training, according to the survey.
Two-thirds of respondents reported having to train staff for current projects and 56% reported they’ll have to train staff for future projects.
“We need more people who can understand how to work with big data volumes and who can apply more advanced analytics techniques to big data volumes,” Menninger said. “We just don’t have enough people who are capable and trained for doing these things.”
Just as many people tie big data to Hadoop, many are beginning to tie advanced analytics to data scientists, a highly skilled workforce capable of digging into data and pulling out unseen patterns and insights. But Menninger said that’s not quite the case.
“There’s a different mind-set and a different set of skills to understand and tune a big data implementation from an implementation that runs on a single machine,” he said. “Understanding that intersection between the types of analysis you’re performing and the way the data is split across machines is still very important in terms of making the systems operate well.”