Use counts to analyze textual information

When people think about OLAP, they think about analyzing numbers—financial data or sales data, for example. However, many OLAP solutions exist outside traditional numeric applications. Some of the most valuable analysis solutions don't start with numeric information. You can analyze any database—with or without numeric information—by using counts. A count can measure the quantity of transactions such as sales, calls in a support center, or defects in a product. Count analysis can be a very revealing exercise because it can show trends over time or relationships between numbers (e.g., the count of sales in Europe versus the count of sales in the United States). Let's examine count analysis in detail, then look at an example of how you can analyze data that has no numeric information.

You can easily aggregate textual information as counts, and you can use counts to derive trends or percentages. For example, say you have a simple text database that includes an entry for each article in SQL Server Magazine. The database would include fields containing such information as title, author, topic, applicable SQL Server version, and date published. Each of these fields is a candidate to become a dimension. The primary measure might be number of articles. Another dimension could be SQL Server version, in which the count would tally articles about SQL Server 2000 within the SQL Server 2000 dimension member and SQL Server 7.0 articles within the SQL Server 7.0 dimension member. To analyze the mix of content in the magazine, you could derive a new dimension member (a calculated member) in the SQL Server version dimension that measures the ratio of SQL Server 2000 articles to SQL Server 7.0 articles. Charting this new dimension member over time would reveal how the ratio of SQL Server 2000 articles to SQL Server 7.0 articles has changed. I suspect you'd see a spike of SQL Server 2000 content around the time of the initial release of SQL Server 2000, then a drop followed by gradual growth as the SQL Server 2000 product matures.

The previous example demonstrates how you might use a text database analysis; count analysis is also appropriate for databases that contain numeric information. For example, Analysis Services' FoodMart database includes a measure called Sales Count, which is a simple count of all the sales transactions in the database and is useful for determining statistical averages such as Sales Average (Store Sales divided by Sales Count). Sometimes a count measure isn't useful, but the average (or other calculated measures such as ratio) derived from the count is. In such a situation, you can create a count measure but mark it in Analysis Manager as hidden. A hidden count measure is available only for creating calculated measures—it isn't available for use directly.

Count measures can be easy or complex. In the FoodMart database, setting up the Sales Count measure is simple. You just create a new measure based on the fact table's primary key and set the measure's type to count. But let's look at a more complex example. Say a table called fact contains line items (entries) for invoices. An invoice can contain one or more entries. So you probably want a count measure that counts invoices, not entries. To count invoices, you want to count only the groups of entries that make up an invoice.

One way to solve this problem is to create a distinct count measure based on the fact table's invoice number column. This measure will give you the invoice count values you want, but distinct count measures have two serious limitations. First, each cube can contain only one distinct count measure. Second, Analysis Services doesn't aggregate distinct counts through dimension levels as it does other measures. The distinct count values are predetermined and stored at each cell with other measures, so you can't create new aggregations during an MDX query's execution. In an MDX query or a calculated member definition, using a function such as

Aggregate( \[USA\], \[Mexico\] )

won't work with a distinct count measure selected; determining the result of the function would require rescanning the fact table because the cube doesn't contain enough information to determine the function's result. Analysis Services can't rescan the source table, but even if it could, the process would be prohibitively slow. The effect of this restriction is that distinct count measures don't generally work well with other calculated members or sets.

A second solution is to create an extra column in the source table to store an invoice count. Fill one entry for each invoice with a value of 1; fill all other instances of the invoice count field with values of 0. You can then create an Invoice Count measure that's a sum of this invoice count column. This solution works as long as you select in the cube a cell that encompasses a group of entries that make up complete invoices. If your cell includes only some of the entries in an invoice, the invoice count column might not include the entry that contains the 1 value and thus would produce a sum of 0 instead of 1 for that invoice.

A third solution is to use a combination of the two approaches. Create an invoice distinct count measure, an invoice sum count measure, and an invoice count calculated measure that helps you determine which of the other two measures to use based on the cell that's selected. The invoice distinct count measure will return the correct answer when only some of the entries in an invoice are selected, and the invoice sum count will work in all other situations. The invoice sum count also gives you the benefit of working when custom combinations of members are selected. This invoice count example shows that, in real-world situations, count measures can get complicated because the count might depend on a distinct combination of a group of fact table columns.

However, count-based analysis can be valuable. Consider the following solution, which was created for the prosecuting attorneys office of a county criminal justice department. Such an organization frequently has to respond to requests for crime statistics. The department maintains the criminal case filings in a custom application that uses DB2 on an AS/400 system, but the application wasn't designed for analysis, so it can only look up individual cases or report about a series of cases according to simple criteria.

A typical inquiry for this database might be "Has the number of burglaries increased or decreased in the past 5 years?" The DB2 data contains all the information you need to respond to such an inquiry. The data—all textual—includes the date of the incident, the defendant, the type of crime, the judge, the prosecuting attorney, and the result of the case. The source data contains no dollar amounts, counts, or any other numeric information. Figure 1 shows the dimensions and dimension levels in the OLAP cube that you'd use to respond to the inquiry.

The fact table consists of incident filings from each case. Because each case can have more than one filing, the fact table includes two count measures. One measure, Count of Incidents, gives the total number of incident filings, and the other measure, Count of Cases, is the number of cases. The Count of Cases measure is set up as a distinct count measure on the Case number column of the fact table.

So, to answer the question about the trend of burglaries, you'd create a query that displays the Count of Incidents for each type of burglary for each of the past 5 years. You'd also select a status of Guilty to be sure you weren't counting false charges. You'd use the Count of Incidents rather than the distinct count because a charge of burglary could be just one of many charges in a case.

After the cube in this example was constructed, a calculated measure was added to return the conviction rate as a percentage. The formula for this measure is the count of guilty incidents divided by the total count of incidents. In MDX, the formula is

(\[Count of Incidents\], \[Status\].\[All\].\[Guilty\])
 / (\[Count of Incidents\], \[Status\].\[All\])

Unfortunately, the prosecuting attorneys office doesn't track the name of the defending attorney. If the defending attorneys' names were included, you could analyze the relationship between the conviction rate and the combination of prosecuting attorney and defending attorney. But as it is, the cube can help you determine whether the combination of a particular prosecuting attorney and a particular judge affects conviction rate. Such an analysis could suggest a judge's bias toward or against a particular attorney.

Besides revealing this type of controversial finding, this cube can help authorities analyze many types of crime trends. For example, you might ask what types of crimes are seasonal. If you integrate local weather data, you could determine how temperature, precipitation, or moon phase affects crimes such as car theft. Even without weather data, you can uncover such information as which crimes are most common during the holidays and what the conviction rate is for holiday crimes.

You can summarize nearly any kind of database by using counts. If the data already includes numeric information, then count information is complementary, offering a new way to analyze the same information. Counts also combine well with numeric information to create averages or ratios. If you want to count only fact table records in your summarized information, the implementation is simple. But if you're counting derived information such as invoices, you can use one or more of the techniques I've presented here. Thanks to Michael Patterson, a senior consultant at ProClarity, for providing some details of the criminal cases cube for this article.