Queries From Hell
A blog about data warehousing, correlation databases, associative and incremental queries, value-based storage, metadata, on-the-fly indexing, automatic data-driven schemas, BI tools, data mining, visual mapping, pattern recognition, and the limitations of standard SQL in answering "queries from Hell."
Or, how to discover what you don't know you don't know.

Airplanes and Data Warehouses

Posted at 3/26/2008 12:19:00 AM
Recently I have been flying more than a pilot, and while idly passing time waiting until I could get my computer back out I thought about how similar the progression of flying is to the progression of data warehouses. A hundred years ago, a pilot could use a plane to go places where there were no direct roads. He could fly faster than a car could drive or a horse could run. Fifty years ago a technician could run a program in a computer that would store and retrieve information faster and more accurately than any person.

Airplanes got faster, better and cheaper. Soon ordinary people like me routinely flew around the world. Computers got faster, better and cheaper. Soon every person in their business and personal life was using them every day.

However, I had to travel from Barcelona, Spain to San Diego, California. To accomplish this simple task, I flew from Barcelona to Amsterdam. There I waited in an airport for a couple of hours before flying to Minneapolis. After a few hours there, I flew on to San Diego. The return to Barcelona was a similar trip.

With the computer, if I want to see information about sales for a day, I have to start by accumulating all of the daily transactions. Then I group them and sum the important statistics like volume and net prices. Then I need to build my star schema. Now I can use by BI tool to finally see the sales. If I want to see all of the sales of non-food products to female customers that were made on weekday mornings, it would be more difficult than a trip from an Inuit village in Alaska to a jungle village deep in the Congo basin.

The planes meet 80% of the travel needs well and the computers meet 80% of the information needs well. And better, faster, cheaper technology will not improve that ratio much for either technology. In fact, as people become more used to the technologies, the 80% portion is taken for granted and there is increased demand for the 20% part. Soon the ratio changes to 70/30 or 60/40.

New technologies like a personal transporter or a correlation database are needed to change the old cast-in-concrete methods. I don’t know where to find the personal transporter, but with a correlation database you can ask any question directly from the raw data as soon as it is loaded. You can change the direction of the questioning at any time. You can move from one question to the next without starting over and tracing out a new path. It gives you the ability to navigate through your information as you need to rather than through the “airports” that were built to designate paths for you. You can arrive quickly at your desired destination even if you weren't sure of it when you started your journey.

Add to Technorati Favorites      | Blogger Comments (0) |    

But how fast can you load? Part 2

Posted at 3/18/2008 12:46:00 PM
The key consideration in load time for a data warehouse is time to value: the elapsed time between having a raw data file ready to load into the data warehouse and being able to use that data for analysis.

A correlation database, using the value-based storage model, provides faster time to value than record-based relational or column-oriented columnar database approaches because there is no predefinition of business requirements needed, no need to predefine database schema, and no post-design of data cubes. The correlation database creates its own optimized index structure during data loading, which includes all relationships within the data. And it never needs restructuring.

So how easy is to load data and have it ready for analysis? This easy:


Add to Technorati Favorites      | Blogger Comments (1) |    

An incremental approach

Posted at 3/18/2008 09:47:00 AM
Here is a real query from hell in the making. One of our customers wanted to prepare a list of customers for a promotion. They wanted the usual selection criteria like geographic and demographic selections and they wanted at least 5,000 good names and not more than 6,000.

With the incremental approach, you ask for one qualification at a time and you can review the results between each step. The marketer who was preparing the promotion was able to select the customers from the area where the promo was going to take place. A quick look showed that there were far more customers there than they wanted to contact so another selection on a demographic code narrowed the result but there were still too many names.

Scanning through some of the results, it was quickly clear that there were many old customers who had not shopped recently. Now the marketer had the option to go after the current customers and focus the promo on retention or go for the old ones to try to get them back. Selecting the old ones resulted in a few more names than were really desired. Another quick scan through the data showed a wide range in average spending. An addition to the query selected the big spenders resulted in the right sized group and the promotion was launched. Total time for the discovery process was less than one hour.

Imagine attempting that with a typical DW! What was the original requirement? It wasn’t really known. If this had to be done as an IT project, it would never start since the requirements definition would take longer then the cycle between promotions.

If the marketer had access and skill in SQL, then the same process could be attempted. However, the probability of having all of the required fields indexed would be near zero. If the exploration and business decisions had taken another direction and they went after active customers, a different set of fields would be required. What are the chances that they would also be indexed?

Without a doubt, there is a query from hell somewhere in this business process.

Add to Technorati Favorites      | Blogger Comments (0) |    

Value Based Storage

Posted at 3/12/2008 12:31:00 AM
Just what is value based storage? This is a common question I often need to answer. It sounds like some kind of marketing jingle but, in fact, it is the underlying iLuminate storage structure technology that we are talking about.

The iLuminate internal storage looks dramatically different from either row based storage (Oracle, SQL Server, etc) or column based storage (Sybase IQ, Vertica, etc). Value Based Storage (VBS) stores the data values in sets (the "value pool") based on data type and each distinct value is stored only once. The original record definition is maintained by an indexing system that links to values the value pool. These two objects are enhanced by the extended metadata giving the iLuminate database its complete functionality.

This storage structure provides several benefits. The database is very small, in some cases even smaller than the raw data that went into it. In addition to simply being small, it concentrates the most active elements of the data into very compact sets. The value pool is often small enough to fit entirely in memory. As a result, the slowest part of database processing, transfer of data from disk to memory, is dramatically reduced.

The VBS also provides some data access features that can not be duplicated in any other structure. A search can be executed starting with a data value in the pool. By going from the value pool back to the index, it is possible to quickly locate every use of the value wherever is may be used in the logical record structures.

This structure also enables our incremental query capability. As the result of a query, the database returns a set of instance identifiers rather than a set of records. This is because there are no records, only pointers and values. With the response being a set of pointers, it is a simple matter to perform the next query step and then get the union or difference between the two sets of pointers for the result of the second query step. This process can be continued indefinitely with the result set shrinking or growing as the new results are merged with the old.

The extended metadata and the indexing structure and logic provide more features but that can be the subject of another discussion.

Add to Technorati Favorites      | Blogger Comments (0) |    

But how fast can you load?

Posted at 3/05/2008 07:06:00 AM
When I talk to people about the Illuminate CDBMS, the first reaction I usually get from them is, “That sounds great, but how fast can you load data?”

My usual answer is “How fast is fast enough?”

This is not a flippant answer, but an honest question that is intended to start a discussion on the meaning of load speed. The real measurement of load speed should be the time needed from starting the transfer of source data from the operational data stores, to the availability of the business information for the end user. In a multi-layer information infrastructure, that will include time for:

  • Importing raw data

  • Indexing the imported data

  • Running all aggregation processes and building intermediate layer access structures as needed

  • Further aggragations, rebuilding and re-indexing star schemas, OLAP cubes and any other customized, end use information delivery structures.

In an RDBMS environment, steps 1 and 2 are very fast since there is very little indexing on the foundation data layer. Depending on the hardware, communications methods and schema, this could be at a rate of perhaps 2 to 10 gigabytes per hour or even more in extreme cases with extreme hardware.

Steps 3 and 4 are the most unpredictable since every organization will have differing amounts of work in each layer. However, it is safe to say that this will be the longest part of the load process. A factor of ten times the duration of steps 1 and 2 would not be unusual. An overnight load process that loads for thirty minutes would then need five more hours to complete all of the remaining preparation steps.

In the illuminate environment, the load is a single step with some aggregations needed in limited cases. When data is loaded into iLuminate, the load process indexes everything. This means that step 2 in this case is zero.

Because everything is indexed, there is no star schema or OLAP cube building required. The foundation layer with 100% indexing already provides the full functionality of those structures. Also, due to the indexing, aggregations will not be needed as soon as they are with an RDBMS. The iLuminate database will minimize the I/O needed for the aggregation so the computational speed of the computer is the determining factor for aggregations. Small aggregations can be done as a part of the query process without impacting the user response time.

While an exact comparison can not be made because of variability of hardware, schema design and the number of information layers, an example comparison can demonstrate the conceptual difference between loading with a conventional system and loading with iLuminate.


So the answer to the original question is that iLuminate loads very fast. More fast enough for almost any enterprise.

Add to Technorati Favorites      | Blogger Comments (0) |    

Analytical Queries???

Posted at 3/03/2008 06:50:00 AM
Analysis – A breaking up of a whole into its parts to find out their nature (Websters New World Dictionary).

It seems strange to talk about analytical queries when the entire analysis must first be defined in order to create the SQL query. The analysis is really in the preparation, and the actual query becomes just another report generation process. If the analysis is not exactly what is needed, there is no way to know that until the report is done. Then the analysis/report process starts again or, all too frequently, the process is abandoned without a result.

The incremental query moves the analysis into the query process where it really belongs. The person who needs the result of the analysis can ask questions until the analysis is complete and the answer has been found.

For example, a retailer wants to prepare a list of customers for a promotion. They want the usual criteria like geographic and demographic selections and they want at least 5,000 good names and not more than 6,000.

The initial geographic and demographic questions are asked and there are 25,000 customers qualified. Now, the retailer can just take the first 5,000 names, or the quality of the list can be improved through some analysis, narrowing the list to the best 5,000—not simply the first 5,000.

With the incremental query process of Illuminate, they continue by asking for only customers who had previously purchased a specific type of product. Now the 25,000 names have been reduced to 10,000 better-qualified customers. Add another query step to select only those who shop on weekends. Now they have the 5,000 customers who precisely fit the profile that they want.

With the incremental approach, you ask for one qualification at a time and you can review the results between each step. The process just described would take a minute or two. In an RDBMS environment, reaching the same result would require either an extensive analytical process to define the final query or multiple sessions of: query design, examination of results and query redesign until the final result is reached.

Due to the cost of the old approach, the typical process will be to stop with the first query step and business user will just take 5000 off the list. The quality of the promotion will be compromised—but if the promotion planning is done on a weekly basis, there is no time for the alternative.

Result? - The business suffers due to the limits of technology.

Add to Technorati Favorites      | Blogger Comments (0) |