Maintain accurate data on mobile populations

In the process of designing and developing databases, I sometimes encounter problems that I can't solve by creating table structures alone. Tracking people across the space-time continuum, for example, demands a more complex solution. I confronted this challenge when a client asked me for advice about developing a statewide method of assessing student and school performance in grades K-12.

Establishing the Requirements


Let's briefly review the background for this case study. One of the project's business requirements was to track students as they moved among school districts within the state—an effort that hadn't been successful in the past. Tracking needed to be accurate because the state had instituted school and school-district report cards to which extra funding was tied. In addition, both state and federal governments relied on accurate student numbers to establish appropriate basic funding levels, and the federal government relied on those numbers to fund programs for special-needs children.

The state also had to meet a related business requirement: avoid counting students more than once. Double counting is a big problem—especially in districts that include mobile populations—because it renders statistics inaccurate. Double counting occurs because some parents are reluctant to divulge their family's previous location, students move in from out of state, and students leave the state before graduating from high school.

Initially, you might think that maintaining accurate numbers of students at each school would be an easy task—you'd just require that every student's Social Security number be on file, then you'd track students by those numbers. After all, the methods used today to issue Social Security numbers to children of school age are quite reliable, according to publications of the Social Security Administration. (For a sample publication, see http://www.ssa.gov/policy/congcomm/testimony_072299.html.) In the real world, however, even this simple solution doesn't work all the time. For example, parents can still misrepresent legal documents, and new state residents and immigrants can become confused and give the wrong information about their children. Because of obstacles such as these, you can't successfully track every student through every school, but you can take steps to mitigate the problem.

Developing the Model


Instead of relying solely on Social Security numbers, you need to develop a multifaceted approach to tracking students within a state. Assuming that a state has established a centralized database that every school can access for student-identity information, you can use the positive identification issued to each student (the Social Security number, in this case) to set up supplemental checks to confirm the student's identity. Code or user processes in the form of triggers and stored procedures or code at a client- or mid-tier level can enable these supplemental checks, and table architecture should support the checks.

Because the first step in tracking students—setting up a centralized database—is beyond the scope of this article, I assume that a database is already in place. The next step is to establish use of the positive identification for each student. Figure 1, page 48, is a PowerDesigner 9.0 conceptual data model that I used in developing the tracking process. In Figure 1, I created four entities. The CHILD entity represents a person of the age required to attend school in grades K-12. SCHOOL is a geopolitical entity, usually a building or collection of buildings, where children are educated in grades K-12. SCHOOL_DISTRICT is a political entity composed of a collection of schools. The district sets policy, administers the schools, and acts as an intermediary between the schools and the state Department of Education. REGISTERED_STUDENT represents a child enrolled in a specific school. This entity is used for the duration of the child's academic experience.

The relationships among these entities are as follows: A SCHOOL_DISTRICT might administer zero to many SCHOOLs, whereas a SCHOOL belongs to only one SCHOOL_DISTRICT. A SCHOOL might include zero to many REGISTERED_STUDENTs, whereas a REGISTERED_STUDENT attends only one SCHOOL at a time. A CHILD can become a REGISTERED_STUDENT zero or many times (depending on how many different schools the child has attended), whereas a REGISTERED_STUDENT refers to one CHILD.

The PowerDesigner 9.0 conceptual data model doesn't show foreign keys in each entity's attribute list. For example, you won't see ChildID or StudentID listed as attributes of REGISTERED_STUDENT until you convert the conceptual data model to a physical data model. (For more information about how to read Figure 1, see the sidebar "Understanding the Notation: Defining and Using Domains," page 50.

Identifying a Primary Key


Let's start with the CHILD table and identify candidates for the primary key. The Social Security number might be reliable, but is it a viable candidate for primary key of this table? A candidate for primary key must meet certain conditions:

  • No component of the primary key can be NULL.
  • The value of the primary key should be short, simple (no case sensitivity, special characters, or embedded spaces), nonidentifying (implied meanings can change over time), and unchanging.
  • SQL Server should be able to quickly process the primary key's data type (e.g., an integer).

Social Security numbers belonging to school-age children violate several of these requirements. Children new to this country might not have Social Security numbers when they're trying to register for school. Also, Social Security numbers, when formatted as they appear on a Social Security card, contain dashes (special characters), which under-trained data-entry staff can easily transfer to the centralized database. A Social Security number has implied meaning, although the meaning shouldn't change unless an administrative error was made when the Social Security card was issued. Parents might mistakenly (or intentionally) present the wrong Social Security number when enrolling a child for school. Although violations and system abuses like these are rare, the few exceptions would create major problems in a database design unable to accommodate such deviations. If the Social Security number is the primary key of the CHILD table and any of these violations occurred, you might see the following results:

  • A child would be unable to register for school.
  • Keystroke errors on data entry could cause one child's records to become confused with another child's records or could cause a perceived loss of student records.
  • A change of Social Security number could orphan records, leaving a child with an incomplete set of student records.
  • Manual-intervention procedures established to alleviate these problems could easily break down.

Naming Additional Attributes


Clearly, the Social Security number isn't a good candidate for the CHILD table's primary key, so I've created the surrogate ChildID primary key. However, the Social Security number can still help you track students across space and time when you couple it with other attributes in the CHILD table. Figure 1 shows examples of these other attributes, specifically GenderCode, BirthDate, BirthCountry, BirthCity, and BirthState. All these attributes are stored as values in the CHILD table. Parents who are registering their children for school have to present proof of identification such as a Social Security card, a birth certificate, or a transcript of attendance from the school last attended. Attributes such as gender, date of birth, and place of birth have a very low probability of change, discounting tampering with or forging of student documents.

Coupling the value of the Social Security number with a comparison of these gender- and birth-related attributes gives you an effective way to track students across space and time. When a child changes schools, the system compares the information in the school records provided by the child's parents with the Social Security number, gender, and birth values on the child's transcript as stored in the centralized database to positively verify the child's identity. If the automated system doesn't positively identify the child, the school district has to resort to an alternative procedure for verification such as interviewing the child and the child's parents.

Notice that I haven't included attributes such as ChildFirstName and ChildLastName in the identity verification procedure. Although names are very important local identifiers of a person, they're not a reliable way to track people. Duplicate names are commonplace, and names can change quickly and frequently. Parental custody is an invalid method of tracking because that, too, can change. A child's appearance alters drastically over short periods of time, so storing and comparing images isn't an effective technique for identification either. I also avoided biometric approaches to identification, such as fingerprints or finger images, cornea scans, and voice prints, because biometric identification technology is expensive and few school districts can afford such equipment.

Choose a Group of Permanent Attributes


The solution to tracking individuals across space and time is to create a multifaceted approach to matching identities by using easily gathered and readily stored attributes. Don't rely on a single-valued solution, such as the Social Security number, because it will prove too limited for successful tracking. Instead, identify a group of attributes that describe individuals in ways that don't change over time, and use that set of values to track them.