Most large businesses have an untapped asset: their warehouses of data about individual customer purchases. Market research firms, for example, will pay for this information because they forecast future buying trends based on demographic analysis of past activity. But while this market could be lucrative, privacy concerns make most enterprises reluctant to sell their customer data.

Privacy is important. Organizations have an ethical obligation to respect their customers' privacy. Organizations also have a legal obligation associated with recent legislation aimed at protecting individual privacy in the computer age. (For information about recent privacy legislation, see Sean Maloney's article "For Your Eyes Only," page 15, InstantDoc ID 42615.) Consumers are becoming increasingly aware that their purchases and online activity leave trails of personal information scattered throughout cyberspace. They fear that companies, criminals, or even the government could knit this disparate data together to form a profile of them that's too personal for comfort.

Companies don't necessarily have to lock up or destroy their customer data. But they do have to be diligent about ensuring that the data they provide doesn't permit identification of their customers. Several well-known techniques in the business intelligence (BI) industry let organizations disseminate data about customers while protecting individual customer privacy. These techniques fall into four main areas:

  • Data-access control
  • Microaggregation
  • De-identification
  • Release-data modification

With data-access control, you control end users' access to individual records. In the microaggregation approach, you aggregate groups of individual records into an accurate summary of a population while hiding the records of specific individuals. De-identification means that you can release personal records only after scrubbing field data of any identifying information. And modifying release data, an extension of de-identification, means scrubbing the entire data set in a prescribed way so that you can release individual data with quantifiable measures of privacy. The first two techniques apply when you release only statistical summary data; the last two apply when you disseminate individual records. Let's examine each of these techniques and look at some examples of how organizations have used them successfully.

    Limited Access

    Data-access control is an appropriate strategy when an organization wants to host its data on its own Web site and let external researchers access statistical aggregates of the data. In this approach, the organization can use a tool such as SQL Server Analysis Services to create an OLAP cube of the data, then sell online access to this data. Analysis Services lets you specify access-control limits that restrict access to particular cube cells.

    However, restricting access to particular cells isn't enough. Suppose an organization allows access to aggregate data that summarizes individual cell information. You can roll this data up to several dimensions (for example, purchases by ZIP code or purchases by age group), but even if you restrict access to the particular cell for an individual, a user can still query several dimensions and infer the characteristics of a particular cell. For example, individual cells might represent outliers—individual cases that are extreme enough (such as a family with a dozen children) to skew the summary data and unusual enough that users can deduce the characteristics of a particular cell. You need to develop an intelligent monitor to detect these cases. A monitor records data previously sent to a client and limits new query results to ensure that the accumulated data won't let users reliably infer individual data. The monitor stands between the user and Analysis Services, intercepting requests and modifying rows that are returned.
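The monitor idea can be illustrated with a minimal Python sketch. It is not the NASS algorithm and not an Analysis Services feature; the class name `QueryMonitor`, the `min_group` threshold, and the record IDs are hypothetical. The sketch assumes one monitor instance per client, refuses totals over very small groups, and refuses any total that, compared with a single earlier answer, would differ by only a few records. A production monitor would also have to consider combinations of more than two answers.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional


class QueryMonitor:
    """Hypothetical monitor between one client and the aggregate store."""

    def __init__(self, values: Dict[str, float], min_group: int = 3) -> None:
        self.values = values            # record id -> sensitive value
        self.min_group = min_group      # smallest group a total may describe
        self.answered: List[FrozenSet[str]] = []  # record sets already released

    def _is_safe(self, ids: FrozenSet[str]) -> bool:
        # Rule 1: never answer for a group that is too small.
        if len(ids) < self.min_group:
            return False
        # Rule 2 (conservative): compared with any single earlier answer,
        # the records that differ must be either none or a large enough group.
        for previous in self.answered:
            differing = ids ^ previous
            if 0 < len(differing) < self.min_group:
                return False
        return True

    def total(self, ids: Iterable[str]) -> Optional[float]:
        """Return the total for the requested records, or None if refused."""
        requested = frozenset(ids)
        if not self._is_safe(requested):
            return None
        self.answered.append(requested)
        return sum(self.values[i] for i in requested)
```

In this sketch, a client that has already received the total for records a, b, c, and d is refused the total for a, b, and c, because subtracting the two answers would expose record d.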

    The National Agricultural Statistics Service (NASS) database serves as an example of how to implement an intelligent monitor. (See "Privacy Resources," page 14, for the NASS URL and more information about privacy designs.) The database holds survey data about various farms' chemical fertilizer usage. The system provides query tools that let researchers discover accurate aggregate statistical data about geographical areas while preventing researchers from inferring information about specific farms. NASS maintains privacy by aggregating several smaller geographical units into larger ones to make it more difficult to infer the contribution of tiny clusters of farms to aggregate totals. Because the system allows ad hoc queries, it also monitors the successive data sent to clients to ensure that the aggregate data is sufficiently general to prevent identification of individual farms.

    Privacy isn't an absolute—you can characterize it only in a probabilistic sense, by quantifying the likelihood that a set of information will let users correctly guess who an individual is. Researchers who view a set of demographic data about many individuals can still identify one individual if only one person has that particular combination of characteristics. If two people have that particular combination of characteristics, the system is a little more private—a researcher can correctly guess half the time which individual is which. In the case of NASS, the system defines the privacy constraint by ensuring that it doesn't disclose data about small counties that contain only a few farms. The monitor aggregates these counties into adjacent areas to make the risk of identification smaller. In the NASS implementation, the monitor performs this aggregation dynamically; as the user makes successive queries, it limits the information that the application can disclose based on the previous information the user received. (The reference links in the "Privacy Resources" box provide the exact algorithm.) It ensures that researchers can't use the complete set of information that you provide to reliably infer data about specific farms.
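The probabilistic view of privacy described above can be made concrete with a short Python sketch. The function name `guess_probability` and the field names are hypothetical, and the sample rows are invented purely for illustration. The sketch counts how many records share each combination of characteristics and reports 1 divided by that count as the chance of a correct guess.

```python
from collections import Counter
from typing import Dict, Mapping, Sequence, Tuple


def guess_probability(
    records: Sequence[Mapping[str, str]], keys: Sequence[str]
) -> Dict[Tuple[str, ...], float]:
    """Chance of picking the right person within each combination of traits."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    # One matching person: certainty (1.0). Two: a coin flip (0.5). And so on.
    return {combo: 1.0 / n for combo, n in counts.items()}


# Invented sample data for illustration only.
people = [
    {"age_band": "30-39", "zip3": "152"},
    {"age_band": "30-39", "zip3": "152"},
    {"age_band": "60-69", "zip3": "152"},
]
risk = guess_probability(people, ("age_band", "zip3"))
```

Any combination with a probability of 1.0 identifies exactly one person, which is the situation that aggregating small areas into larger ones is meant to avoid.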

    Think Small

    NASS hosts its database system on its own Web site, so it can control the sequence of statistics that it provides for successive queries. An organization that sells a complete statistical data set has a harder problem. Once you release a complete data set, an end user can subject it to exhaustive analysis. This means that the originating organization has to aggregate individual information into higher-level aggregates, a technique called microaggregation. The paper "A Comparative Study of Microaggregation Techniques" by Josep M. Mateo-Sanz and Josep Domingo-Ferrer, published in Qüestiió (see "Privacy Resources"), summarizes various microaggregation techniques that privacy researchers have developed. In general, these techniques analyze a source data set of individual records and generate a set of aggregated records. Each aggregated record is a statistical summary of a cluster of individuals. The intent is to create a data set that is accurate for analysis but that doesn't contain any information that would let users infer individual data.
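As a rough illustration of the general idea, the following Python sketch applies a simple fixed-size microaggregation to one numeric field. It is only one basic variant and is not claimed to be any specific method from the paper cited above; the function name `microaggregate` and the parameter `k` (the minimum cluster size) are assumptions made for this example.

```python
from statistics import mean
from typing import List, Tuple


def microaggregate(values: List[float], k: int) -> List[Tuple[int, float]]:
    """Summarize values as (cluster size, cluster mean) with clusters of >= k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(values) < k:
        raise ValueError("need at least k records to form one cluster")
    # Sort so that each cluster contains similar values, which limits
    # the distortion introduced by replacing them with their mean.
    ordered = sorted(values)
    groups = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    # A short final cluster would be too revealing; fold it into its neighbor.
    if len(groups) > 1 and len(groups[-1]) < k:
        tail = groups.pop()
        groups[-1].extend(tail)
    return [(len(g), mean(g)) for g in groups]
```

Because each cluster is replaced by its size and mean, the overall total of the field is preserved while no released record describes fewer than k individuals.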

    No ID Required

    An organization can use the first two techniques when selling access to statistical information that it hosts or when selling data in bulk. Sometimes, however, only individual data is useful. For example, medical research requires diagnosis and treatment information about specific people. The problem in this case is how to provide detailed information about individuals while making sure that researchers can't determine exactly who those people are.

    One solution is de-identification, in which you scrub individual data records of identifying data fields. The Health Insurance Portability and Accountability Act (HIPAA) defines 18 classes of information that can identify individuals and sets policies about how to modify or entirely suppress each of them to preserve medical privacy. Obviously, you need to remove specific identifiers such as Social Security numbers. You need to generalize birth dates to birth years and, because few individuals are age 90 or older, combine everyone age 90 or older into a single category. In addition, you can provide only the first three digits of a ZIP code. This restriction is part of a general geographic constraint that limits locality specification to areas with at least 20,000 individuals. All these data changes are designed to give end users a "fuzzy" view of each individual: specific enough to be valuable to researchers, but vague enough to prevent identification of particular people.
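A minimal Python sketch of the scrubbing steps described above follows. It covers only the three rules mentioned in this article, not the full set of HIPAA identifier classes, and should not be read as a compliance tool. The field names (`name`, `ssn`, `birth_date`, `zip`), the `top_age` cutoff parameter, and the `sparse_zip3` set of three-digit ZIP prefixes covering too few people are all hypothetical.

```python
from datetime import date
from typing import Any, Dict, FrozenSet

DIRECT_IDENTIFIERS = ("name", "ssn")  # hypothetical field names


def deidentify(
    record: Dict[str, Any],
    as_of: date,
    top_age: int = 90,
    sparse_zip3: FrozenSet[str] = frozenset(),
) -> Dict[str, Any]:
    """Return a copy of record with identifying fields removed or generalized."""
    dropped = DIRECT_IDENTIFIERS + ("birth_date", "zip")
    out = {k: v for k, v in record.items() if k not in dropped}
    # Keep only the birth year, and cap it so that everyone born in or
    # before the cutoff year shares a single value.
    out["birth_year"] = max(record["birth_date"].year, as_of.year - top_age)
    # Keep only the first three ZIP digits; mask prefixes that the caller
    # has flagged as covering too few people.
    zip3 = record["zip"][:3]
    out["zip3"] = "000" if zip3 in sparse_zip3 else zip3
    return out
```

The caller is assumed to supply `sparse_zip3` from population data, because the code itself has no way to know how many people live in a given area.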

    Who Was That Masked Man?

    De-identification is a good start, but to have real confidence that you're preserving your customers' privacy, you need to take further steps. When disseminating data, the originating organization must realize that the end user might have other sets of data about the same individuals, so it's essential to ensure that the end user can't link your disseminated data with other data sets. Users might easily have access to public information such as Social Security death indexes, voter registration records, and motor-vehicle data. The Carnegie Mellon Data Privacy Lab Web site provides case studies and online demonstrations of how users can make the link between disclosed information and public records. These studies demonstrate that organizations that rely purely on de-identification can inadvertently compromise their customers' privacy.
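The linking risk can be sketched in a few lines of Python. This is an illustration of the general idea, not a reproduction of the Data Privacy Lab demonstrations; the function name `unique_links` and all field names are hypothetical. The sketch indexes a public list by the fields it shares with the released data, then reports every released row that matches exactly one public record.

```python
from collections import defaultdict
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def unique_links(
    released: Sequence[Row], public: Sequence[Row], keys: Sequence[str]
) -> List[Tuple[Row, Row]]:
    """Pair each released row with its public record when the match is unique."""
    index: Dict[Tuple[Any, ...], List[Row]] = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person)
    links = []
    for row in released:
        matches = index.get(tuple(row[k] for k in keys), [])
        # Exactly one candidate means the "anonymous" row is re-identified.
        if len(matches) == 1:
            links.append((row, matches[0]))
    return links
```

Every pair that this function returns is a released record whose remaining fields, such as birth year, partial ZIP code, and sex, were still distinctive enough to single out one named person.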

    Techniques for modifying released data usually begin with record de-identification. For example, Dr. Latanya Sweeney of the Data Privacy Lab has developed a technique called k-Anonymity. k-Anonymity is one of a class of algorithms designed to transform data sets in a way that lets you make quantifiable statements of privacy. The general approach is to first apply standard de-identification techniques, which essentially consist of generalizing individual characteristics such as age and location. Next, the k-Anonymity algorithm further generalizes individual characteristics to ensure that, for any one individual, at least k-1 others have the same set of key individual characteristics. This means that if users relate the released data to any conceivable set of external data, at least k individuals will always match a row of released data. Here, k serves as a quantitative measure of privacy.
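The following Python sketch shows the k-anonymity property and one very simple way to reach it. It is a heavy simplification and is not Dr. Sweeney's published algorithm: it applies the same generalization to every row and simply tries coarser levels until each combination of age band and ZIP prefix appears at least k times. The `LEVELS` table, the function names, and the `age` and `zip` field names are assumptions made for this example, and ZIP codes are assumed to be five-character strings.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence

Row = Dict[str, Any]

# Hypothetical generalization levels: (age band width, ZIP digits kept).
LEVELS = [(1, 5), (5, 4), (10, 3), (20, 2), (200, 0)]


def is_k_anonymous(rows: Sequence[Row], keys: Sequence[str], k: int) -> bool:
    """True if every combination of key values appears in at least k rows."""
    counts = Counter(tuple(r[q] for q in keys) for r in rows)
    return all(n >= k for n in counts.values())


def generalize(row: Row, width: int, digits: int) -> Row:
    """Replace age with a band and mask trailing ZIP digits with '*'."""
    low = row["age"] // width * width
    return {
        **row,
        "age": f"{low}-{low + width - 1}",
        "zip": row["zip"][:digits].ljust(5, "*"),
    }


def anonymize(rows: Sequence[Row], k: int) -> List[Row]:
    """Return rows at the least coarse level in LEVELS that is k-anonymous."""
    for width, digits in LEVELS:
        candidate = [generalize(r, width, digits) for r in rows]
        if is_k_anonymous(candidate, ("age", "zip"), k):
            return candidate
    raise ValueError("fewer than k rows; no level in LEVELS is sufficient")
```

Coarser levels protect privacy at the cost of analytical detail, which is why practical algorithms search for the least generalization that still satisfies the chosen k.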

    Under Cover

    Organizations can profit by disseminating information from their databases to researchers, but privacy concerns are real. These industry techniques enable organizations to ensure data privacy for their customers. But using these techniques requires detailed planning and analysis. You need to select the right technique based on whether you're delivering summary data or individual records and whether individual records have fields that end users could link to other, public information about individuals. The Carnegie Mellon Data Privacy Lab is an excellent place to begin evaluating the techniques in detail.

    Privacy Resources
    "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression"
    by Dr. Latanya Sweeney:

    "A Comparative Study of Microaggregation Techniques"
    by Josep M. Mateo-Sanz and Josep Domingo-Ferrer, in Qüestiió:

    The Carnegie Mellon Data Privacy Lab:

    Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule:

    HIPAA Final Privacy Rule:

    The National Agricultural Statistics Service (NASS):

    NASS Privacy Design: