In the business intelligence implementation lifecycle, traditional line-of-business applications generally start from OLTP sources and end with dashboards. Data warehousing processes that draw from relational database management system (RDBMS)
Tidal wave of unstructured data
Huge volumes of unstructured data are now generated, and such data cannot be confined to the standardized text inputs that an RDBMS expects. For example, every corporate entity has applications that generate data such as domain competency forum discussions, email and chat conversations, and other such content. These sources are a potential knowledge base that is useful for analytics, but they are produced without any enterprise-wide initiative or implementation and often fall beyond the scope of RDBMS. A majority of industry experts would agree that a healthy knowledge base can be created by blending IT systems into the day-to-day working tools and procedures that people use in an organization. The challenge is to have powerful database platforms that can warehouse and analyze information that is too fluid in nature and too large in volume for any RDBMS. A classic example of the growth in unstructured data is Facebook, which was at one point reported to be adding 12 TB of compressed data and scanning 800 TB of compressed data per day. Could an RDBMS handle this?
Impact of the NoSQL distributed database paradigm on RDBMS
Because of the massive volume and fluid format of the data generated every day, enabling analytics with traditional RDBMS-based warehousing, designed along the Kimball or Inmon methodologies, is not feasible. Data of this nature and volume calls for a database, data warehousing and analytics system that can:
- Eliminate the need to cast the data in particular data types that hold data in normalized forms.
- Eliminate the overhead of preparing the data by transforming it using ETL processes to fit in data warehouses.
- Eliminate the need for remodeling data to fit the fixed dimensional model suited for OLAP engines.
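As a rough illustration of the three points above, the following Python sketch keeps records in their original shape and interprets them only when a question is asked, instead of casting them into typed, normalized tables first. The records, field names and helper functions are all hypothetical, invented for this example; it is a minimal sketch of the idea, not the behavior of any particular product.

```python
import json
from collections import Counter

# Hypothetical raw records, as they might arrive from a forum,
# an email system and a chat tool. No common schema is assumed.
RAW_RECORDS = [
    '{"source": "forum", "author": "asha", "title": "Cube design", "replies": 4}',
    '{"source": "email", "from": "ravi", "to": ["asha"], "subject": "ETL window"}',
    '{"source": "chat", "author": "ravi", "text": "Is the nightly load done?"}',
]


def load_documents(lines):
    """Parse each line as a JSON document without enforcing a shared schema."""
    return [json.loads(line) for line in lines]


def author_of(doc):
    """Resolve the author at read time; sources name this field differently."""
    return doc.get("author") or doc.get("from")


def messages_per_author(docs):
    """Count documents per author, skipping records with no recognizable author."""
    return Counter(author_of(d) for d in docs if author_of(d))


docs = load_documents(RAW_RECORDS)
counts = messages_per_author(docs)
print(counts)
```

The trade-off in this sketch is that the interpretation logic (here, `author_of`) moves from load time to query time, which is what removes the up-front casting, ETL and remodeling steps.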
With the gradual adoption of Not only SQL (NoSQL) database platforms, the spectrum has broadened beyond RDBMS to a new generation of distributed databases suited to the challenges posed by unstructured data and big data. These databases differ from RDBMS in their storage and access methodology, which is tuned to specific problem areas. Examples include Cassandra (column-family store), MongoDB (document database), Dex (graph database) and Amazon SimpleDB (key-value store). Technologies such as Hadoop, combined with such distributed databases on the cloud, are being widely adopted by organizations. Yahoo is believed to have a Hadoop implementation running on more than 100,000 CPUs across over 40,000 computers.
An interesting hypothetical application of data discovery would be extracting intelligence from an unstructured data source such as Wikipedia to study the interconnected world of technology. If Wikipedia were stored in databases suited to unstructured data, then a powerful analytics solution built on Hadoop or similar technologies could generate an intelligent report presented through innovative infographics.
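To make the Wikipedia idea more concrete, here is a minimal single-process Python sketch that mimics the map, shuffle and reduce phases of a MapReduce-style job to count how often each technology article is linked from other articles. The page texts are invented sample data, the function names are hypothetical, and nothing here uses the Hadoop API; on a real cluster the same phases would run in parallel across many machines.

```python
import re
from collections import defaultdict

# Hypothetical sample pages using wiki-style [[Target]] links.
PAGES = {
    "Hadoop": "Hadoop implements [[MapReduce]] and stores data in [[HDFS]].",
    "Hive": "Hive runs queries on [[Hadoop]] using [[MapReduce]].",
    "HDFS": "HDFS is the file system of [[Hadoop]].",
}

# Captures the link target, stopping at "]" or at a "|" display-text separator.
LINK = re.compile(r"\[\[([^\]|]+)")


def map_links(title, text):
    """Map phase: emit (target, 1) for every link found in one page."""
    for target in LINK.findall(text):
        yield target, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce phase: sum the values collected for one key."""
    return key, sum(values)


def inbound_link_counts(pages):
    """Run the three phases in order and return {target: inbound link count}."""
    pairs = (pair for title, text in pages.items() for pair in map_links(title, text))
    return dict(reduce_counts(k, vs) for k, vs in shuffle(pairs).items())


counts = inbound_link_counts(PAGES)
print(counts)
```

Counts like these could feed the kind of infographic described above, for example by sizing each technology node by the number of articles that link to it.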
Sustainability of RDBMS
Data grows from minuscule levels to the scale of federated data centers at a varied pace, depending on the nature of the business. For managing data, an RDBMS is the logical starting point for any start-up business or application with typical OLTP needs. To shield RDBMS from obsolescence and to meet future analytical needs arising from big data, RDBMS vendors have begun incorporating features such as column stores, in-memory processing engines, MPP appliance solutions and cloud-based relational databases into their offerings. The acquisitions of Aster Data by Teradata, Greenplum by EMC, Netezza by IBM, Sybase by SAP and Vertica by HP are all perceived as moves towards adding columnar database capabilities to RDBMS solutions, similar to Oracle Exadata.
Plain-vanilla RDBMS capabilities are inadequate for hosting and processing the different varieties and growing volumes of organizational data. BI professionals see migration away from RDBMS, or a gradual mutation of RDBMS, as just a matter of time, given the storage and analytical challenges posed by unstructured and big data. This is the driving reason for the ever-decreasing focus on RDBMS.
About the Author: Siddharth Mehta works as an associate manager and a technical architect for BI software projects at Accenture Services. He is a recipient of Microsoft’s Most Valuable Professional award, and has written extensively on Microsoft BI software on his blog. Prior to Accenture, Mehta was with Capgemini