Organizations that want to start making sense of "big data" have a choice to make. They can employ traditional data warehouse concepts and their existing data warehouse architecture, deploy the increasingly popular open-source Hadoop distributed processing platform or, ultimately, use some combination of both approaches.
For organizations that want to move beyond basic business intelligence (BI) reporting to in-depth data mining and predictive analytics, the third option will most likely be the best way to go, according to James Kobielus, a senior data management analyst with Cambridge, Mass.-based Forrester Research Inc.
SearchDataManagement.com recently got on the phone with Kobielus to find out how organizations today are gaining valuable insights from huge amounts of fast-flowing data. Kobielus discussed the best ways to use existing data warehouse architectures, weighed the advantages and disadvantages of Hadoop, and gave his assessment of data warehousing vendors in the age of big data. Here are some excerpts from that conversation:
I've seen a few different definitions of big data. How does Forrester define the increasingly popular phrase?
James Kobielus: Big data refers to a paradigm for extremely scalable analytics. I like to use the phrase 'extremely scalable analytics' as the heart of what people mean by big data and all of its manifestations. To some degree people talk about the three Vs. There is the volume of data -- terabytes to petabytes and beyond. There is the velocity of data -- or real time acquisition, transformation, query and access. There is the variety of data. There is a huge range of structured and unstructured and semi-structured sources. The analytics aspect refers to everything in the kitchen sink that you might use to extract meaning from all of those data sets.
What do organizations need to know about data warehouse concepts to start making sense of big data today?
Kobielus: I think there's three ways in which a data warehouse helps you make sense of great gobs of data. Number one: In an enterprise data warehouse, you organize data in terms of subject areas, and quite often those subject areas are persisted in, for example, OLAP cubes that are either physically materialized or logically partitioned in a data warehousing architecture. [In other words, you] have customer data in one partition, you have financial data in another, human resources data in a third and so on. That helps you make sense of the data in terms of its relevance to particular downstream applications and users. That is the core of data warehousing database administration. That's Inmon and Kimball and so on, and that's one way in which you should use the data warehouse to make sense of big data.
What is the second way to start making sense of big data?
Kobielus: Number two [centers on] the notion of in-database analytics and using the data warehouse to execute data profiling, data cleansing and data mining or regression analysis to do segmentation of data. In other words, it's about using the full suite of data mining capabilities but executing them within the data warehouse. This helps you make sense of that data because you're using data mining or you're using regression analysis to essentially look for patterns in the data sets. You then use in-database data mining to populate downstream analytical data marts used by data mining and statistical modeling professionals who build visualizations of complex patterns. [For example, they use those patterns to identify] influential customers to whom you should make targeted offers. Using in-database analytics and things like MapReduce to automate more of that data mining in a highly-parallel and highly-scalable database architecture helps you make sense of all that data.
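As a rough illustration of the in-database idea Kobielus describes, the sketch below runs a simple customer segmentation inside the database itself, so that only the small result set leaves it. It uses Python's built-in sqlite3 module as a stand-in for a data warehouse; the purchases table, its columns, the sample rows and the spend thresholds are all hypothetical and were chosen only for this example. A real warehouse would use its own SQL dialect and analytics functions.

```python
import sqlite3

# Hypothetical table and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, 120.0), (1, 80.0), (2, 15.0), (3, 300.0), (3, 450.0), (2, 5.0)],
)

# The aggregation and segmentation run inside the database engine.
# The thresholds (500 and 100) are arbitrary assumptions.
SEGMENT_SQL = """
SELECT customer_id,
       SUM(amount) AS total_spend,
       CASE WHEN SUM(amount) >= 500 THEN 'high'
            WHEN SUM(amount) >= 100 THEN 'medium'
            ELSE 'low' END AS segment
FROM purchases
GROUP BY customer_id
ORDER BY customer_id
"""

# Only one summary row per customer is returned to the application.
segments = conn.execute(SEGMENT_SQL).fetchall()
for customer_id, total_spend, segment in segments:
    print(customer_id, total_spend, segment)
```

The alternative would be to pull every purchase row out of the database and segment it in application code. Pushing the work into the query is what lets a parallel warehouse spread it across its own nodes.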
How prevalent are in-database analytics today? Is everyone doing it?
Kobielus: Not everyone, but a growing number of enterprises are doing it. It's understood as a best practice [or a] target architecture towards which you evolve your data warehousing practices if you're big on data mining. You know, a great many data warehouses in the real world are for operational business intelligence and reporting and ad hoc queries and don't do any data mining. But the bigger you get, the more likely you are to be doing extensive data mining and the more likely you are to be implementing or moving towards in-database analytics. [The goal there is] both to accelerate and scale up your data mining initiatives but also to harmonize all of your data mining initiatives around a common pool of reference data that you maintain in the data warehouse.
What is the third best practice for making sense of big data?
Kobielus: Number three [is using] the data warehouse as the focus of data governance [and having] the master data properly maintained in your data warehouse. When your data warehouse is the focus of data governance and cleansing, that helps you make sense of all that information. You might have dozens or hundreds of source applications that are feeding data into your data warehouse. As the data floods in real time, the data warehouse becomes a critical pivot point in terms of ensuring that big data sets are trustworthy and fit for downstream consumption.
How are the leading data warehouse vendors doing in their efforts to help organizations process big data stores?
Kobielus: Teradata, Oracle-Exadata, IBM-Netezza, HP-Vertica and others all do big data. The vast majority of [data warehouse vendors] can scale now to petabytes in a grid or a cloud architecture and almost all of them can do in-database analytics, namely inside a massively parallel data warehouse grid or cloud fabric. They all also support the ability to do the transforms and the cleansing natively inside the enterprise data warehouse.
With all of the media attention it gets, one might think that Hadoop is by far the best way to process big data stores today. Is this the case?
Kobielus: If you're going to do big data, you're going to do it in your enterprise data warehouse, with Hadoop or with a combination of both. I'm not going to side with the people who seem to imply or state that Hadoop is the only way to do this stuff. You can do the vast majority of what you do in Hadoop in your enterprise data warehouse now. Hadoop's advantages versus traditional closed-source enterprise data warehouse systems are that it is open source. It's free, as in free puppy. You can play with it. But there is going to be a lot of work involved and a lot of hidden costs. Hadoop really is the nucleus of the next generation enterprise data warehouse that is evolving over the next five to 10 years.