Queries From Hell
A blog about data warehousing, correlation databases, associative and incremental queries, value-based storage, metadata, on-the-fly indexing, automatic data-driven schemas, BI tools, data mining, visual mapping, pattern recognition, and the limitations of standard SQL in answering "queries from Hell."
Or, how to discover what you don't know you don't know.

What's in a Name?

Posted at 5/15/2008 06:04:00 AM
Naming of data objects has always been an interesting issue. Long ago, programmers named files and fields arbitrarily, at their own whim. Later, most IT departments standardized naming rules. Now there are industry standard naming conventions like International Standard ISO/IEC 11179-5 and complete naming systems like the Dublin Core Metadata Initiative. Yet most business users today could not even begin to interpret the meaning of the names used for the hundreds or even thousands of tables and the multiple thousands of columns in an enterprise information infrastructure. These names identify the information they use every day but the names are so unintelligible that no one could interpret them without a written standard.
Why should objects that are used every day by the business user have names that are incomprehensible?
The primary reason is that data objects are structured to optimize computer performance rather than to reflect the business structures. A common thing like a customer record quickly becomes split into five or ten or more tables in third normal form and business information users can no longer find anything.
The best solution to date has been to build supplementary data structures especially for user access and name all objects in those structures according to a centrally managed naming control system. The business users then only see a data mart with a small subset of all data objects and are usually able to learn the meaning of that subset of names.
A better solution for naming is to keep just one data structure for each real world object and give the attributes of that structure names that align with the business names. In this environment, every business entity is represented in a single table and each attribute occurs once and has a meaningful name. Each customer has one customer record and each product has one product record. All information access systems use these tables and access the same data using the same names. The data marts, star schemas and OLAP cubes of conventional systems are just ordinary queries in this environment.
This concept, as simple as it may seem, has a couple of issues that keep it from being implemented. The first issue is computer performance. The massive redundancy created by thi9s structure would be crippling in an RDBMS environment. The solution for this issue is simply to use the Value Based Structure in the CDBMS. In this structure redundant data storage does not cause any problem and normalizing provides no gain.
The second problem is application design. Applications are still designed to interact with data stores the same way they did thirty years ago. New data storage capabilities provide functions that didn’t exist thirty years ago and applications should be upgraded to use the best that is available. However, that is a discussion for another day.
Add to Technorati Favorites

Add to Technorati Favorites      | Blogger Comments (0) |    

When is data mining not data mining?

Posted at 4/19/2008 07:51:00 AM
Product affinity is one of those processes that used to be relegated to the statisticians in the back room. The process was so time consuming and expensive that it wasn’t done except for very high-profile situations. Also, the results frequently proved to be of little value. On one discovery project I found an amazing affinity between bananas and dog food. When I told the client, he laughed and said, “Check for bananas and toilet paper.” Sure enough there was a strong affinity there as well. It seems that bananas have an affinity with almost everything in the store.

With high costs, and results like "bananas and everything," affinity was one of the more obscure data mining processes. Now though, we can do product affinity as a simple query. To find the products that sell with a selected product, we select the product, ask for the baskets that contained the product and then select all of the products sold in those baskets. Ordering the results by units sold gives us a list of the top items sold with our selected product.

Using the incremental query feature of the iCorrelate query screen, the real value of affinity can be extracted from tons of raw data. For example, when we get the baskets that contain our selected product, we can select only those baskets from a specific time period or a store or region or only the baskets from weekends or mornings. Whatever behavior characteristic we are interested can be used to get the affinity of a selected subset of all baskets.

We can also extend the affinity beyond baskets if customer information is available. When we have the desired baskets, we ask for the customers who purchased those baskets. Then asking for the baskets related to those customers, then the items in those baskets, we get product affinity at the customer level. Rather than market basket analysis, we are doing customer purchase analysis. Selecting only the baskets from a promotion provides another analysis of promotion effectiveness.

The old data mining process for product affinity had limited value and high cost. The incremental query method however, has low cost and high value making it an excellent tool for product managers, promotion planners and other business people who need to analyze shopping behavior.

Add to Technorati Favorites      | Blogger Comments (0) |    

An inexpensive data warehouse?

Posted at 4/09/2008 08:24:00 AM
What makes a data warehouse system expensive or affordable? Let’s take a quick look at two examples; one based on an RDBMS that is free and another based on a CDBMS that sells for $100K. To keep it simple, I’ll use 100 gigabytes of raw data as the input to these systems and assume a small number of users directly on the warehouse and a larger number of users with OLAP screens, dashboards using star schemas, etc. The costs of the front end systems will be essentially the same for both systems so they are not included.

Design – RDBMS: Six calendar months, about $150,000 in labor cost.
Get user requirements defined, design level one schema, design level two schema, design all required OLAP cubes, star schemas and other information access structures, prepare load, indexing and aggregation processes, prepare physical design.

Design – CDBMS: Three calendar days, about $750 in labor cost
Define aggregations, prepare load and aggregation processes.


Startup – RDBMS: 20 calendar days, about $15,000 in labor cost
Load data, index level one. Build and index level two. Extract data for access structures, build and index structures. Validate all levels for accuracy and completeness.

Startup – CDBMS: Four calendar days, about $1000 in labor cost
Load data and validate for accuracy and completeness


Operation – RDBMS: One full time person, about $75,000 per year
Weekly load and update and rebuild of all access structures. Design and build of new access structures as needed. Periodic reorganization of level one.

Operation – CDBMS: Less than 1/10 of one full time person
Weekly loads.

It looks like even "free" gets expensive if you can’t reduce the time required to design, build and maintain the data warehouse. Of course, this comparison ignores the opportunity cost of not having the information available for six months. And this sample is a small data warehouse. The cost of implementing a data warehouse is roughly proportional to the amount of data and the number of users. A system with a terabyte of data and a thousand users would be much higher. The cost of creating the warehouse with both the RDBMS and the CDBMS would increase by at least a factor of ten.

Hmmm, ten times $165,000 or ten times $1,750; that seems like an easy decision.

Add to Technorati Favorites      | Blogger Comments (0) |    

Airplanes and Data Warehouses

Posted at 3/26/2008 12:19:00 AM
Recently I have been flying more than a pilot, and while idly passing time waiting until I could get my computer back out I thought about how similar the progression of flying is to the progression of data warehouses. A hundred years ago, a pilot could use a plane to go places where there were no direct roads. He could fly faster than a car could drive or a horse could run. Fifty years ago a technician could run a program in a computer that would store and retrieve information faster and more accurately than any person.

Airplanes got faster, better and cheaper. Soon ordinary people like me routinely flew around the world. Computers got faster, better and cheaper. Soon every person in their business and personal life was using them every day.

However, I had to travel from Barcelona, Spain to San Diego, California. To accomplish this simple task, I flew from Barcelona to Amsterdam. There I waited in an airport for a couple of hours before flying to Minneapolis. After a few hours there, I flew on to San Diego. The return to Barcelona was a similar trip.

With the computer, if I want to see information about sales for a day, I have to start by accumulating all of the daily transactions. Then I group them and sum the important statistics like volume and net prices. Then I need to build my star schema. Now I can use by BI tool to finally see the sales. If I want to see all of the sales of non-food products to female customers that were made on weekday mornings, it would be more difficult than a trip from an Inuit village in Alaska to a jungle village deep in the Congo basin.

The planes meet 80% of the travel needs well and the computers meet 80% of the information needs well. And better, faster, cheaper technology will not improve that ratio much for either technology. In fact, as people become more used to the technologies, the 80% portion is taken for granted and there is increased demand for the 20% part. Soon the ratio changes to 70/30 or 60/40.

New technologies like a personal transporter or a correlation database are needed to change the old cast-in-concrete methods. I don’t know where to find the personal transporter, but with a correlation database you can ask any question directly from the raw data as soon as it is loaded. You can change the direction of the questioning at any time. You can move from one question to the next without starting over and tracing out a new path. It gives you the ability to navigate through your information as you need to rather than through the “airports” that were built to designate paths for you. You can arrive quickly at your desired destination even if you weren't sure of it when you started your journey.

Add to Technorati Favorites      | Blogger Comments (0) |    

But how fast can you load? Part 2

Posted at 3/18/2008 12:46:00 PM
The key consideration in load time for a data warehouse is time to value: the elapsed time between having a raw data file ready to load into the data warehouse and being able to use that data for analysis.

A correlation database, using the value-based storage model, provides faster time to value than record-based relational or column-oriented columnar database approaches because there is no predefinition of business requirements needed, no need to predefine database schema, and no post-design of data cubes. The correlation database creates its own optimized index structure during data loading, which includes all relationships within the data. And it never needs restructuring.

So how easy is to load data and have it ready for analysis? This easy:


Add to Technorati Favorites      | Blogger Comments (1) |    

An incremental approach

Posted at 3/18/2008 09:47:00 AM
Here is a real query from hell in the making. One of our customers wanted to prepare a list of customers for a promotion. They wanted the usual selection criteria like geographic and demographic selections and they wanted at least 5,000 good names and not more than 6,000.

With the incremental approach, you ask for one qualification at a time and you can review the results between each step. The marketer who was preparing the promotion was able to select the customers from the area where the promo was going to take place. A quick look showed that there were far more customers there than they wanted to contact so another selection on a demographic code narrowed the result but there were still too many names.

Scanning through some of the results, it was quickly clear that there were many old customers who had not shopped recently. Now the marketer had the option to go after the current customers and focus the promo on retention or go for the old ones to try to get them back. Selecting the old ones resulted in a few more names than were really desired. Another quick scan through the data showed a wide range in average spending. An addition to the query selected the big spenders resulted in the right sized group and the promotion was launched. Total time for the discovery process was less than one hour.

Imagine attempting that with a typical DW! What was the original requirement? It wasn’t really known. If this had to be done as an IT project, it would never start since the requirements definition would take longer then the cycle between promotions.

If the marketer had access and skill in SQL, then the same process could be attempted. However, the probability of having all of the required fields indexed would be near zero. If the exploration and business decisions had taken another direction and they went after active customers, a different set of fields would be required. What are the chances that they would also be indexed?

Without a doubt, there is a query from hell somewhere in this business process.

Add to Technorati Favorites      | Blogger Comments (0) |    

Value Based Storage

Posted at 3/12/2008 12:31:00 AM
Just what is value based storage? This is a common question I often need to answer. It sounds like some kind of marketing jingle but, in fact, it is the underlying iLuminate storage structure technology that we are talking about.

The iLuminate internal storage looks dramatically different from either row based storage (Oracle, SQL Server, etc) or column based storage (Sybase IQ, Vertica, etc). Value Based Storage (VBS) stores the data values in sets (the "value pool") based on data type and each distinct value is stored only once. The original record definition is maintained by an indexing system that links to values the value pool. These two objects are enhanced by the extended metadata giving the iLuminate database its complete functionality.

This storage structure provides several benefits. The database is very small, in some cases even smaller than the raw data that went into it. In addition to simply being small, it concentrates the most active elements of the data into very compact sets. The value pool is often small enough to fit entirely in memory. As a result, the slowest part of database processing, transfer of data from disk to memory, is dramatically reduced.

The VBS also provides some data access features that can not be duplicated in any other structure. A search can be executed starting with a data value in the pool. By going from the value pool back to the index, it is possible to quickly locate every use of the value wherever is may be used in the logical record structures.

This structure also enables our incremental query capability. As the result of a query, the database returns a set of instance identifiers rather than a set of records. This is because there are no records, only pointers and values. With the response being a set of pointers, it is a simple matter to perform the next query step and then get the union or difference between the two sets of pointers for the result of the second query step. This process can be continued indefinitely with the result set shrinking or growing as the new results are merged with the old.

The extended metadata and the indexing structure and logic provide more features but that can be the subject of another discussion.

Add to Technorati Favorites      | Blogger Comments (0) |