
BI architecture choices to aid performance: EDW vs. modular BI system

Mark Whitehorn, Contributor

Many business intelligence (BI) programs are being expanded and scaled out to reach more of the end users within organizations. That’s great, of course – a tangible sign of BI’s growing importance to companies.


But it also raises the stakes for the IT and BI managers who have to ensure that the performance of BI systems doesn’t suffer as more and more people start to use them.

Let’s start by looking at how the corporations that run the world’s biggest BI systems tackle the challenge of maintaining acceptable BI performance levels – or improving them – while putting BI tools in the hands of more business users. Many very large organizations create an enterprise data warehouse (EDW) in which the operational data extracted from business applications and other source systems is stored in one database, on one machine, typically as a set of relational tables. Often, all of the analytical querying done by end users runs on this system as well.

In a sense, this BI architecture puts all of your scalability eggs into one basket. But if the processing capacity of the single central machine can be doubled or quadrupled on demand, then it’s a perfectly reasonable and safe way of planning for broader usage and increased BI performance needs. Such systems are out there, although they tend to be very high-end – we’re talking major global corporations here.

Generally speaking, these systems are based on massively parallel processing (MPP) architectures, so that the “one machine” is actually a set of interconnected processing nodes. More nodes can be added simply by plugging them into the system, and the best examples will basically guarantee performance scalability. In such cases, if your system delivers 0.8-second query response times to 200 users when running on four nodes, you can be certain that expanding it to eight nodes will provide 0.8-second response times to 400 users. The bad news? The expense. Your costs may vary depending on the specific technology and system design, but such installations can be expected to start at multiple millions of dollars and go up from there.
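
To make that scalability arithmetic concrete, here is a back-of-the-envelope sizing sketch in Python. It simply extrapolates the 200-users-on-four-nodes figure from the example above under a strict linear-scaling assumption; the numbers are illustrative, not vendor benchmarks.

```python
# Back-of-the-envelope node sizing under the linear-scaling assumption above:
# if 4 nodes hold response times steady for 200 users, each node covers 50 users.
# Illustrative only; real MPP sizing depends on workload, data volume and vendor.
import math

def nodes_needed(target_users, baseline_users=200, baseline_nodes=4):
    """Return the node count needed to keep response times steady for target_users."""
    users_per_node = baseline_users / baseline_nodes  # 50 users per node in the example
    return math.ceil(target_users / users_per_node)

print(nodes_needed(400))    # 8 nodes, matching the 0.8-second example in the text
print(nodes_needed(1000))   # 20 nodes under the same linear assumption
```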

So, what about the rest of us? What other options are available to the majority of BI system designers and BI project teams? First and foremost, you can build modularity, and hence a level of scalability, into your BI architecture from the start. In this model, the data is still held in a central data warehouse, but the analytical work is performed in smaller data marts, where data is pre-aggregated and stored in the form of online analytical processing, or OLAP, cubes.
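
To show what that pre-aggregation step looks like, here is a minimal sketch. The fact table and its column names are hypothetical, and a real data mart would be loaded with ETL tools or SQL rather than a few lines of Python, but the rollup from detailed rows to a summarized grain is the same idea.

```python
# A minimal sketch of pre-aggregation for a data mart or OLAP cube.
# The fact table and column names are hypothetical placeholders.
import pandas as pd

fact = pd.DataFrame({
    "region":     ["North", "North", "South", "South"],
    "product":    ["Widget", "Gadget", "Widget", "Widget"],
    "order_date": pd.to_datetime(["2010-01-05", "2010-01-20", "2010-02-03", "2010-02-17"]),
    "amount":     [1200.0, 800.0, 950.0, 400.0],
})

# Roll detailed transactions up to the region-by-month grain the mart will serve.
fact["month"] = fact["order_date"].dt.to_period("M")
summary = (fact.groupby(["region", "month"], as_index=False)["amount"]
               .sum()
               .rename(columns={"amount": "total_sales"}))

print(summary)  # queries against this small summary table are fast, but it lags the source data
```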

Pre-aggregating the data has several implications when compared to the EDW approach. For starters, it introduces a delay in availability while the data is aggregated, a process that can easily take several hours to complete. Not only is the delay irksome for end users, it also means that providing real-time BI capabilities becomes very difficult. Several vendors have worked hard to reduce the delay to minutes instead of hours; the problem with such approaches is that they hugely increase the complexity of BI systems. By comparison, an EDW is very expensive but conceptually simpler and better able to support real-time BI processes.

Both approaches usually can be made to provide the same levels of performance against the same volume of data, although, again, there are significant differences in how that is done.

Let’s imagine, for example, that each of the data marts attached to a data warehouse is providing adequate analytical and BI performance to a group of 20 business users. Then one group expands to 30 people. In a simple modular approach, you’d duplicate the now-overworked data mart and connect 15 users to the original one and 15 to the duplicate. Essentially, we’re swapping in more hardware at the analytical layer as required.
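
As a toy illustration of that rebalancing step, the sketch below splits the grown user group between the original mart and its duplicate. The user IDs and mart labels are made up; in practice the assignment would live in the BI tool’s connection settings or behind a load balancer.

```python
# A sketch of the simple modular rebalancing described above: the group grows
# from 20 to 30 users, the mart is duplicated, and the users are split in half.
# User IDs and mart labels are illustrative placeholders.
users = [f"analyst{i:02d}" for i in range(1, 31)]   # the group after it grows to 30

half = len(users) // 2
assignments = {
    "sales_mart_original":  users[:half],   # 15 users stay on the existing mart
    "sales_mart_duplicate": users[half:],   # 15 users move to the new copy
}

for mart, members in assignments.items():
    print(mart, "serves", len(members), "users")
```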

In some ways, this is like adding nodes to an MPP-based EDW system, but the servers in this case are much less expensive than the nodes of an MPP system are. On the other hand, they are much more difficult to install: You have to carefully plan how to set up the servers, what happens when they fail, and so on. With MPP technology, you simply add nodes and the system software will ensure that they are used effectively.

A radical proposition: building your BI architecture around PCs
Once we start thinking about swapping data mart boxes in and out as needed, we can take another step and use PCs rather than servers to run them. Yes, I know that sounds radical, not to say anarchistic – but bear with me for a moment. We usually use servers to power data marts because they have a high level of internal redundancy – the disks are mirrored and the CPUs and power supplies are hot-swappable, so that within the server there is no single point of failure. The problem is cost; servers typically sell for three to five times the price of a comparable desktop system.
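
To put rough numbers on that cost argument, here is a small sketch. The three-to-five-times server premium comes from the paragraph above; the desktop price itself is just a placeholder figure.

```python
# Rough cost comparison behind the server-versus-PC argument above.
# The 3x-5x server premium is from the text; the desktop price is a placeholder.
DESKTOP_PRICE = 1_000                        # illustrative price for one capable PC
SERVER_PREMIUM_LOW, SERVER_PREMIUM_HIGH = 3, 5

server_cost_low = SERVER_PREMIUM_LOW * DESKTOP_PRICE
server_cost_high = SERVER_PREMIUM_HIGH * DESKTOP_PRICE
two_pcs_cost = 2 * DESKTOP_PRICE

print(f"one server: ${server_cost_low:,} to ${server_cost_high:,}")
print(f"two PCs:    ${two_pcs_cost:,}")      # equal or lower cost, plus a second independent machine
```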

PCs aren’t internally redundant: If one part fails, the data mart dies (at least temporarily). But for the same price, or less, a server that supports 20 BI users could be replaced with two PCs, each offering processing power comparable to the server’s, with each machine servicing 10 users. If and when one of the PCs fails, all 20 users can be switched to the other machine, which should run no slower than the original server did.
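
The failover logic itself can be very simple, as the sketch below suggests. The host names and health flags are placeholders; in a real deployment this decision would sit in a load balancer or in the BI server’s connection management.

```python
# A sketch of the two-PC failover idea described above. Host names and the
# health flags are placeholders for whatever monitoring is actually in place.
PC_NODES = {"mart-pc-1": True, "mart-pc-2": True}   # host -> currently healthy?

def route(user_id):
    """Spread users across healthy PCs; collapse onto one machine if the other fails."""
    healthy = [host for host, up in PC_NODES.items() if up]
    if not healthy:
        raise RuntimeError("no data mart hosts available")
    return healthy[hash(user_id) % len(healthy)]

print(route("analyst01"), route("analyst12"))   # users spread across both PCs

PC_NODES["mart-pc-2"] = False                   # simulate one PC failing
print(route("analyst01"), route("analyst12"))   # all 20 users now land on mart-pc-1
```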

In addition, you probably can afford to have several spare PCs sitting around. As soon as a machine fails, you can start loading data onto one of the spares to recreate the data mart. This approach trades internal redundancy for a more cost-effective form of external redundancy. I’ve seen it work, and the extra processing power that it brings can help enormously with BI performance and scalability because you have more system resources that can be tapped to cope with unexpected usage demands.

An obvious extrapolation from the model outlined above is to virtualize some or all of the data mart systems on a single large server, a grid, a server farm or a cluster. That can further reduce costs and provide a huge advantage in terms of load balancing. Of course, there’s the ultimate form of virtualization: the cloud. My best advice at the moment is: don’t go there. Cloud computing has huge promise for BI systems. But issues of data security, performance guarantees and scalability have yet to be fully addressed by cloud vendors. For now, the cloud is still too “virtual” to be a real solution to the problem of maintaining BI performance as the number of your BI users grows.

About the author: Dr. Mark Whitehorn specializes in the areas of data analysis, data modeling and business intelligence (BI). Based in the U.K., Whitehorn works as a consultant for a number of national and international companies and is a mentor with Solid Quality Mentors. In addition, he is a well-recognized commentator on the computer world, publishing articles, white papers and books. Whitehorn is also a senior lecturer in the School of Computing at the University of Dundee, where he teaches the master’s course in BI. His academic interests include the application of BI to scientific research.