For medical researchers, a good database design is a lifesaver
Tracking the spread of infectious disease is a serious business. While researchers struggle to collect information and draw conclusions about a disease—such as where it originated and where it's likely to spread—doctors in the field are trying to save lives. Organizations such as the Centers for Disease Control and Prevention in Atlanta have massive software and hardware systems to assist in this task. But smaller organizations in the United States and elsewhere don't have the money to acquire and maintain huge, integrated databases and sometimes must devise their own—sometimes manual—systems for disease tracking. In this article, I describe a database template (one in a series of metamodels) that I developed to help automate this task.
Pooja works for a private organization in India that monitors outbreaks of various infectious diseases, including her current project, which is tracking polio outbreaks. For each polio case, she collects information that includes the geographic location of the patient, the severity of the disease, and the aftereffects on the patient. Pooja enters the data she receives from field sources into the organization's database. She then can pull reports for her managers, showing how extensive an outbreak is based on the number of cases reported for a geographic area. She can also track trends and do some predictive modeling to assess the probability of another outbreak.
Pooja was trying to normalize her database and was having trouble organizing the data so that she could efficiently and effectively pull the types of reports that she needs. She asked me to help her design a new, well-normalized database. She sent me some information about the data she's collecting. For privacy and security reasons, Pooja can't store participants' names or exact residence addresses in the database. She stores this information in another system. Pooja assigns each participant a unique identifier (UID), which lets her track a participant throughout the system. Each participant's record includes the following information:
- Classification as a member of a "community"—a distinct socioeconomic group.
- The geographic area where each participant lives, identified by state, district, and block. A state comprises many districts, and a district comprises many blocks.
- The participant's gender and date of birth.
- The participant's history with the disease, including date of onset, date first reported, and date first investigated.
- Symptoms of the disease, including presence or absence of fever and presence of any level of paralysis.
- Medications, dosage, date each dose was administered, any immediate result, and next examination date.
- A final diagnosis, as appropriate.
Figure 1 is the conceptual model of the database I designed for Pooja's work. The main user table is called Participant. Each row in the Participant table represents a person who is sick with an infectious disease and whose progress Pooja is tracking. In the Participant table, I've included gender and birth-date information, along with diagnostic information. The Participant_Symptom table includes each participant's symptoms, and each participant's treatments are in the Meds_Administered table.
I added one user table to this model that Pooja didn't list in her requirements—the Project table. This table will let Pooja organize her investigation into different projects and even use the same database to track diseases besides polio. With this ability, Pooja can document causes and effects of the outbreak and possibly identify secondary outbreaks of infectious disease.
To enforce data integrity, I've used validation tables—State, District, Block, Community, Symptom, and Medicine—which have enforced, mandatory relationships to user tables. Each validation table restricts the values that Pooja can use in the comparable attribute of the user table that references the validation table.
For instance, the State table validates the StateCode in the Participant table. (In a conceptual model, foreign keys aren't listed in the user table; they're represented by the relationship between the validation table and the user table.) Pooja will populate the State table with a set of valid values for state codes and state names. The client application that Pooja and her data-entry staff will use to insert and modify rows in the Participant table limits the values in the Participant.StateCode field to those that exist in the State table.
Using validation tables to limit data entry in the database is a good way to enforce data integrity because you can't enter invalid data into fields that a validation table defines and that mandatory relationships enforce. Preventing inaccurate data entry will help Pooja produce better reports because the data in the user tables will be consistent and accurate.
Pooja can now efficiently and effectively track polio outbreaks. With the meaningful reports she's able to generate, her supervisors can see the geographic areas that have been hit hardest by the disease, and experienced epidemiologists can even speculate where the disease is likely to spread. Pooja can use her database to track any number of other infectious diseases with little or no modification to the database. This database saves the researchers in Pooja's organization a lot of money by giving them an inexpensive in-house tool for performing their mission—and it might save some lives, too.