How many tables in an EDW?
Posted at 6/16/2009 06:37:00 AM
I have heard stories of enterprise data warehouses that have thousands of tables; I talked to one organization that claimed that they had 500,000 tables in their EDW! What on earth can require 500,000 or 50,000 or even 5,000 tables?
I make the argument that it is just bad design forced by the technical constraints of relational database managers.
If an EDW is a central repository of an organization’s information then it should be organized around the information entities of that organization. In any organization, there are real objects like customer, product, policy, account, vendor, etc that represent master data. In a large complex organization that is involved in multiple lines of business, there may be as many as 50 of this class of objects. Then there are logical objects (relationships or transactions) like orders, shipments, payments, invoices, etc. Again, in a complex organization there may be as many as 50 of these.
If we make the organization extremely complex and double the number of information entities, then there could be 200. Double it again and there are still only 400. Where do the thousands come from?
It is true that a table with 500 columns in an RDBMS used for analytics would be very difficult to work with. Indexing would be a major problem (imagine managing 100 indices on one table for just 20% indexed access) and I/O loads would be extreme. In a value-based structure there is no limit on the number of columns that will affect performance; indexing is always universal. The only limit on the number of columns is the limit of a human’s ability to comprehend the entire set of information. A table with 5,000 columns would probably be incomprehensible for people.
Likewise, redundant values are not much of a problem for column-based systems and no problem at all for the value-based structure. In the value structure, each value is stored once regardless of the number of uses or the location of use. Empty or null values also cause no problem in either column or value structures. Simply the absence of a value defines a null, so large, sparsely filled records are not a problem as they are in record-based systems.
Based on this, I propose that in a column or value structure, adherence to third normal form is unnecessary and a 5,000 table EDW has a serious design problem.
I make the argument that it is just bad design forced by the technical constraints of relational database managers.If an EDW is a central repository of an organization’s information then it should be organized around the information entities of that organization. In any organization, there are real objects like customer, product, policy, account, vendor, etc that represent master data. In a large complex organization that is involved in multiple lines of business, there may be as many as 50 of this class of objects. Then there are logical objects (relationships or transactions) like orders, shipments, payments, invoices, etc. Again, in a complex organization there may be as many as 50 of these.
If we make the organization extremely complex and double the number of information entities, then there could be 200. Double it again and there are still only 400. Where do the thousands come from?
- Normalization
- Reporting tables
- Summary tables
- Application oriented tables
- Cubes and other special structure
- Duplication
- Poor design
It is true that a table with 500 columns in an RDBMS used for analytics would be very difficult to work with. Indexing would be a major problem (imagine managing 100 indices on one table for just 20% indexed access) and I/O loads would be extreme. In a value-based structure there is no limit on the number of columns that will affect performance; indexing is always universal. The only limit on the number of columns is the limit of a human’s ability to comprehend the entire set of information. A table with 5,000 columns would probably be incomprehensible for people.
Likewise, redundant values are not much of a problem for column-based systems and no problem at all for the value-based structure. In the value structure, each value is stored once regardless of the number of uses or the location of use. Empty or null values also cause no problem in either column or value structures. Simply the absence of a value defines a null, so large, sparsely filled records are not a problem as they are in record-based systems.
Based on this, I propose that in a column or value structure, adherence to third normal form is unnecessary and a 5,000 table EDW has a serious design problem.
"Easy to Consume" BI
Posted at 5/26/2009 06:17:00 AM
Recently Claudia Imhoff wrote in her blog that BI should be “easy to consume,” and I most certainly agree. As she accurately points out, there are many forms of BI that are “easy to use,” but easy to consume is a different thing. We at illuminate have been talking about “time to analytics” and “time to answer.” Time to answer is defined as the time it takes from submitting the first question to the time the business issue is resolved. The ability to select and run the query that you want with one click is easy to use. Working your way through a problem with the computer helping you as you go is easy to consume.
The incremental query process is built around the idea that a person may not know the right question to ask until several learning questions have been answered. This “learn as you go” process is what we call an incremental query and what I would propose is exactly "easy to consume" BI.
The incremental query process is built around the idea that a person may not know the right question to ask until several learning questions have been answered. This “learn as you go” process is what we call an incremental query and what I would propose is exactly "easy to consume" BI.
Chaos
Posted at 5/18/2009 07:08:00 AM
If you never read the original Chaos report from the Standish Group you should stop looking at this screen and immediately read it. When you have finished, come back here and read on.
For those of you who read the report fifteen years ago and have forgotten it, this report, among other things, identifies reasons for failure of software projects. The original report in 1994 showed 16% of software projects succeeded, 53% were completed but with cost and time overruns and reduced deliverables and 31% were complete failures. In 2004 the Standish Group reported that this had improved to 29% succeeded and only 18% failed.
Yikes!!! In 10 years, as an industry, we have gone from 31% total failure to 18% total failure. I am really happy that we are not building airplanes!
One of the primary reasons for the limited improvement of the failure rate that we have been able to achieve is the reduction of project timelines. The Chaos report states, “Research at The Standish Group also indicates that smaller time frames… will increase the success rate.” On my scale, a project with a timeline of more than a year has almost no chance of success, while a project with a timeline of a few weeks has a very good chance of success.
What about a data warehouse project? Usually these are measured in months, often exceeding a year. How is the success rate? According to Bill Inmon’s estimate there is a failure rate of 70% – 80%. Other studies have reported failure rates as high as 90%. Given the overall failure rate of software projects from the Standish Group, it seems like data warehouse projects are the primary source of project failures in the IT industry.
Of course, one can respond that building a data warehouse is a big job and requires a lot of time. Where is most of the time spent? Requirements definition, schema design and optimizing indexing strategy.
Now, I would like to challenge the other database vendors to participate in benchmarks where these factors are evaluated rather than the essentially meaningless TPC-H tests. I think most people would agree that a moderate success is better than a large failure and these are the factors that will determine whether your data warehouse project will need sixty days to succeed or sixty weeks to fail.
For those of you who read the report fifteen years ago and have forgotten it, this report, among other things, identifies reasons for failure of software projects. The original report in 1994 showed 16% of software projects succeeded, 53% were completed but with cost and time overruns and reduced deliverables and 31% were complete failures. In 2004 the Standish Group reported that this had improved to 29% succeeded and only 18% failed.Yikes!!! In 10 years, as an industry, we have gone from 31% total failure to 18% total failure. I am really happy that we are not building airplanes!
One of the primary reasons for the limited improvement of the failure rate that we have been able to achieve is the reduction of project timelines. The Chaos report states, “Research at The Standish Group also indicates that smaller time frames… will increase the success rate.” On my scale, a project with a timeline of more than a year has almost no chance of success, while a project with a timeline of a few weeks has a very good chance of success.
What about a data warehouse project? Usually these are measured in months, often exceeding a year. How is the success rate? According to Bill Inmon’s estimate there is a failure rate of 70% – 80%. Other studies have reported failure rates as high as 90%. Given the overall failure rate of software projects from the Standish Group, it seems like data warehouse projects are the primary source of project failures in the IT industry.
Of course, one can respond that building a data warehouse is a big job and requires a lot of time. Where is most of the time spent? Requirements definition, schema design and optimizing indexing strategy.
Now, I would like to challenge the other database vendors to participate in benchmarks where these factors are evaluated rather than the essentially meaningless TPC-H tests. I think most people would agree that a moderate success is better than a large failure and these are the factors that will determine whether your data warehouse project will need sixty days to succeed or sixty weeks to fail.
Labels: Software Development
Colin White Podcast on The Correlation Database
Posted at 3/13/2009 12:54:00 PM
At the recent TDWI event in Las Vegas, Colin White, president and founder of BI Research and BeyeNETWORK blogger, sat down with Andy Fletcher of illuminate Solutions to talk about the correlation database (CDBMS). A widely recognized expert in database technology and frequent speaker at industry conferences, Mr. White found the correlation database technology "fascinating and different" compared with relational and column databases.Among the high points of the discussion:
- There is much less effort prior to load with the CDBMS than with relational or column databases, as the need for physical and logical design is completely eliminated.
- A CDBMS data warehouse can be depoyed in a fraction of the time of other technologies.
- Data is "all indexed, all the time, in all dimensions," enabling rapid performance even when doing train-of-thought analysis—asking questions unplanned at the outset of data exploration.
- A CDBMS is often deployed to supplement an existing enterprise data warehouse, to enable ad hoc analysis on a subset of data.
- Business analysts are the key users, pursuing in-depth analysis rather than standard reporting.
The Information Asset
Posted at 3/04/2009 07:22:00 AM
“Information is a corporate asset” is a mantra that many IT people have been chanting for years. Recognition of information as a real corporate asset would give the IT industry a new status they have longingly sought for decades. However, what kind of asset should it be?
A drill press is an asset and so is a pile of cash. A drill press can be used to efficiently perform one specific type of a manufacturing process and a pile of cash can be used to accomplish almost any business objective in any line of business. If information is rigidly structured, it looks a lot like the drill press. But if it can be accessed in any way, at any time for any purpose, then it is more like cash.

Obviously, information should be an asset like cash, but in most organizations it is much more like the drill press. An enterprise data warehouse that has been built on standard RDBMS structures is optimized for one set of functions and generally performs them well. However, these structures handle different and unplanned information access functions about as badly as a drill press would perform injection molding.
The value of the information asset is directly related to the flexibility of access the business has to that asset. The more aspects of flexibility the organization has to using the information, the higher its value will be. If a business person can see a fixed report that is produced each day and all other information comes from phone calls, the information asset has very limited value. If that user can see a dashboard that indicates a problem or opportunity and then immediately begin analyzing the detailed information behind the dashboard, the information has much more value.
However, if a new business opportunity comes up and the decision makers can find answers to all of the new questions associated with that opportunity in time to make an informed decision, then the information has almost unlimited value. This means providing answers to questions that no one thought of just one day previous, much less at a requirements definition meeting eighteen months ago. It means accessing data through paths that no one thought of in any design session. And it means getting answers to these unplanned questions at the speed of thought.
Only then will information begin to look like cash.
A drill press is an asset and so is a pile of cash. A drill press can be used to efficiently perform one specific type of a manufacturing process and a pile of cash can be used to accomplish almost any business objective in any line of business. If information is rigidly structured, it looks a lot like the drill press. But if it can be accessed in any way, at any time for any purpose, then it is more like cash.

Obviously, information should be an asset like cash, but in most organizations it is much more like the drill press. An enterprise data warehouse that has been built on standard RDBMS structures is optimized for one set of functions and generally performs them well. However, these structures handle different and unplanned information access functions about as badly as a drill press would perform injection molding.
The value of the information asset is directly related to the flexibility of access the business has to that asset. The more aspects of flexibility the organization has to using the information, the higher its value will be. If a business person can see a fixed report that is produced each day and all other information comes from phone calls, the information asset has very limited value. If that user can see a dashboard that indicates a problem or opportunity and then immediately begin analyzing the detailed information behind the dashboard, the information has much more value.
However, if a new business opportunity comes up and the decision makers can find answers to all of the new questions associated with that opportunity in time to make an informed decision, then the information has almost unlimited value. This means providing answers to questions that no one thought of just one day previous, much less at a requirements definition meeting eighteen months ago. It means accessing data through paths that no one thought of in any design session. And it means getting answers to these unplanned questions at the speed of thought.
Only then will information begin to look like cash.
Shared nothing, shared something, shared everything?
Posted at 2/16/2009 04:56:00 AM
The debate over shared computing resources has been going on for years and shows little sign of stopping. Massive shared nothing vendors like Netezza insist that any sharing destroys performance in query processing. Meanwhile, Microsoft, Oracle and others insist that sharing some or all computing resources improves query performance. Which is right? It just doesn’t matter.
How’s that for throwing gas onto the fire?
A few years ago Netezza published a paper (Netezza Performance Server® Appliance: An Architectural Comparison) that very accurately summed up the problems with sharing resources. In a single sentence, sharing resources is a problem when you need to move massive amounts of data constantly. In that paper they say “Unlike OLTP, data warehousing is all about data shuffling: moving large quantities of data through the system’s analysis and processing engine as efficiently as possible, with a minimum of internal thrashing.”
They go on to illustrate how a query will overload a shared system, “For example, a complex join or other complex query may require a number of processing steps. Consider the sheer inefficiency of delivering multiple enormous tables (billions of rows) off disk, across the network and into memory for processing by the DBMS—all to perform one step. The partial results then have to be moved (“materialized”) back to disk in a temporary storage location prior to bringing in another huge bundle of data for the next step.” It makes me tired just to think about that much work.
The problem that all parties in this debate share (even the shared nothing people are sharing this one) is that they start with the erroneous assumption that it is necessary to move lots of records. Let’s start with just one, correct assumption—a human is asking a question and they expect to get back an amount of information that is within human scope of understanding.
This means that the only required movement is a tiny amount of data into the system (the text of a query) and a small amount of data out (a screen or two of result data at most). All of the additional data movement that occurs between those end points is a result of the design of the data structure and the query process. And in any of the record-based structures, this will be a LOT of data movement.
With this assumption in mind, consider how the correlation database works. A query comes in and uses indexed access directly to the selection values. From that it gets a set of internal instance IDs which it uses to project the result set. When there are multiple complex joins, only instance IDs move across I/O resources as the intermediate steps are processed. The only actual data movement that occurs is the transfer of the final data values to the client.
In this environment, more processor power means more performance. The amount of physical data movement is tiny compared to the processing power required. Fortunately, the chip makers are doing an excellent job of building faster chips with multiple cores providing massive processing power in a small, inexpensive package. Using a CDBMS, even low end, commodity servers have enough I/O capacity to meet the needs of a large enterprise data warehouse with many users.
In the Netezza paper they state “The strength of SMP lies in its processing power; however, the architecture is limited in its ability to move large amounts of data as required in data warehousing”. That sounds a lot like the definition of the physical requirements of the CDBMS.
Their solution is to build an architecture that can move massive amounts of data quickly; the CDBMS solution is to stop moving massive amounts of data and that means lower cost and faster performance.
An interesting side effect of this structure is much “greener” computing. A really big commodity server with lots of disks might possibly use a few kilowatts of power and would easily manage a CDBMS data warehouse with ten terabytes of source data in it. An array of shared nothing computers processing the same amount of data in an RDBMS would use multiple times that much electrical power.
How’s that for throwing gas onto the fire?
A few years ago Netezza published a paper (Netezza Performance Server® Appliance: An Architectural Comparison) that very accurately summed up the problems with sharing resources. In a single sentence, sharing resources is a problem when you need to move massive amounts of data constantly. In that paper they say “Unlike OLTP, data warehousing is all about data shuffling: moving large quantities of data through the system’s analysis and processing engine as efficiently as possible, with a minimum of internal thrashing.”
They go on to illustrate how a query will overload a shared system, “For example, a complex join or other complex query may require a number of processing steps. Consider the sheer inefficiency of delivering multiple enormous tables (billions of rows) off disk, across the network and into memory for processing by the DBMS—all to perform one step. The partial results then have to be moved (“materialized”) back to disk in a temporary storage location prior to bringing in another huge bundle of data for the next step.” It makes me tired just to think about that much work.
The problem that all parties in this debate share (even the shared nothing people are sharing this one) is that they start with the erroneous assumption that it is necessary to move lots of records. Let’s start with just one, correct assumption—a human is asking a question and they expect to get back an amount of information that is within human scope of understanding.This means that the only required movement is a tiny amount of data into the system (the text of a query) and a small amount of data out (a screen or two of result data at most). All of the additional data movement that occurs between those end points is a result of the design of the data structure and the query process. And in any of the record-based structures, this will be a LOT of data movement.
With this assumption in mind, consider how the correlation database works. A query comes in and uses indexed access directly to the selection values. From that it gets a set of internal instance IDs which it uses to project the result set. When there are multiple complex joins, only instance IDs move across I/O resources as the intermediate steps are processed. The only actual data movement that occurs is the transfer of the final data values to the client.
In this environment, more processor power means more performance. The amount of physical data movement is tiny compared to the processing power required. Fortunately, the chip makers are doing an excellent job of building faster chips with multiple cores providing massive processing power in a small, inexpensive package. Using a CDBMS, even low end, commodity servers have enough I/O capacity to meet the needs of a large enterprise data warehouse with many users.
In the Netezza paper they state “The strength of SMP lies in its processing power; however, the architecture is limited in its ability to move large amounts of data as required in data warehousing”. That sounds a lot like the definition of the physical requirements of the CDBMS.
Their solution is to build an architecture that can move massive amounts of data quickly; the CDBMS solution is to stop moving massive amounts of data and that means lower cost and faster performance.
An interesting side effect of this structure is much “greener” computing. A really big commodity server with lots of disks might possibly use a few kilowatts of power and would easily manage a CDBMS data warehouse with ten terabytes of source data in it. An array of shared nothing computers processing the same amount of data in an RDBMS would use multiple times that much electrical power.
Load Speed in an EDW
Posted at 12/16/2008 12:04:00 PM
I have been seeing some amazing reports recently about load speeds from a number of database vendors. The most recent was a benchmark run by Vertica that showed a load rate of more than 5 terabytes per hour. My response to this is a simple question—how fast is fast enough?
To answer this question it is necessary to answer two other questions:
The benchmark source data has 30,048,001,906 order line records. To put that figure into perspective, Wal-Mart has nearly 8,000 stores. If the benchmark were simulating all of Wal-Mart sales, that works out to something like 3,750,000 detail transactions per store—and this is just testing load speed, not showing the total content of the enterprise data warehouse (EDW). If this is the initial load of the EDW, then whether the data is loaded in two hours or two days has little effect on the total time to deployment. If this is a daily incremental load, then the holiday season has been truly phenomenal for Wal-Mart. To generate transactions at this daily rate would require about 100 transactions for every living person in the entire United States.
The only realistic way that most organizations would need to get this much data loaded periodically would be by loading multiple copies of redundant data from stovepipe sources in poorly defined processes. Other than in a very small number of organizations, the need for this level of load speed is to continue support for poor designs and practices. What value does it bring an organization to continue to execute poorly conceived processes, only at higher speeds?
How soon can the data be used?
Supremely fast loading is of no benefit to the organization if the analytical power isn't there once the data is loaded. The benchmark above included no query performance testing, just load speed testing. This is fairly typical of benchmarks where there is a completely different architecture and process for load testing and query testing. What would the query response speed have been in this case? There is no way to know, but it clearly would be significantly slower than if the data had been loaded in a query-optimized manner. Of course, the load speed would then be significantly slower.
These kinds of benchmarks are really more a test of the hardware and the skill of the people configuring the system for testing. The test shows how well computer system X can move bytes off the disk, through the disk channels and the system bus to the processor and then back to another location on the disk system. The database manipulation portion is such a small factor that it barely has any effect on the total time, and this is because the heavy duty part of database processing was specifically excluded from the test.
Why not develop a standard database benchmark that measures “time to analytics?” That is, how fast can the system load and structure new data AND make it available to the user? The query portion of the test should include a set of queries that demonstrate reporting speed, analytical power and ad hoc analysis capability. If a single benchmark does not include all of the steps needed, from reading the first source record to delivering the last line of response to the queries, then it is pure marketing spin.
So, how fast is fast enough? Obviously there is no single answer to that question. I had a conversation with a user of a column database and I asked if they had a problem with load times. He answered that there was no problem at all, as the system could load the entire 2 million new records (perhaps 250 megabytes) they added per day without interfering with query activity. In another conversation I was told that ten gigabytes per hour was too slow, that they would never finish incremental loads on time. In both cases, the concern was that the data would not be ready for the user when it was needed, not how many gigabytes per hour could be loaded.
The real answer is based on your needs, and no benchmark will relate to that need in a meaningful way. If you are concerned about load speed, ask the vendor to run a test: see how much data can they load, structure and query in a limited-time test. Then you will have meaningful information.
To answer this question it is necessary to answer two other questions:
- What is being loaded?
- How soon can the data be used?
The benchmark source data has 30,048,001,906 order line records. To put that figure into perspective, Wal-Mart has nearly 8,000 stores. If the benchmark were simulating all of Wal-Mart sales, that works out to something like 3,750,000 detail transactions per store—and this is just testing load speed, not showing the total content of the enterprise data warehouse (EDW). If this is the initial load of the EDW, then whether the data is loaded in two hours or two days has little effect on the total time to deployment. If this is a daily incremental load, then the holiday season has been truly phenomenal for Wal-Mart. To generate transactions at this daily rate would require about 100 transactions for every living person in the entire United States.
The only realistic way that most organizations would need to get this much data loaded periodically would be by loading multiple copies of redundant data from stovepipe sources in poorly defined processes. Other than in a very small number of organizations, the need for this level of load speed is to continue support for poor designs and practices. What value does it bring an organization to continue to execute poorly conceived processes, only at higher speeds?How soon can the data be used?
Supremely fast loading is of no benefit to the organization if the analytical power isn't there once the data is loaded. The benchmark above included no query performance testing, just load speed testing. This is fairly typical of benchmarks where there is a completely different architecture and process for load testing and query testing. What would the query response speed have been in this case? There is no way to know, but it clearly would be significantly slower than if the data had been loaded in a query-optimized manner. Of course, the load speed would then be significantly slower.
These kinds of benchmarks are really more a test of the hardware and the skill of the people configuring the system for testing. The test shows how well computer system X can move bytes off the disk, through the disk channels and the system bus to the processor and then back to another location on the disk system. The database manipulation portion is such a small factor that it barely has any effect on the total time, and this is because the heavy duty part of database processing was specifically excluded from the test.
Why not develop a standard database benchmark that measures “time to analytics?” That is, how fast can the system load and structure new data AND make it available to the user? The query portion of the test should include a set of queries that demonstrate reporting speed, analytical power and ad hoc analysis capability. If a single benchmark does not include all of the steps needed, from reading the first source record to delivering the last line of response to the queries, then it is pure marketing spin.
So, how fast is fast enough? Obviously there is no single answer to that question. I had a conversation with a user of a column database and I asked if they had a problem with load times. He answered that there was no problem at all, as the system could load the entire 2 million new records (perhaps 250 megabytes) they added per day without interfering with query activity. In another conversation I was told that ten gigabytes per hour was too slow, that they would never finish incremental loads on time. In both cases, the concern was that the data would not be ready for the user when it was needed, not how many gigabytes per hour could be loaded.
The real answer is based on your needs, and no benchmark will relate to that need in a meaningful way. If you are concerned about load speed, ask the vendor to run a test: see how much data can they load, structure and query in a limited-time test. Then you will have meaningful information.