Like an architect, a good data modeler works with the client’s needs
Data modeling is a design discipline and a design activity. A data modeler is an architect who can bridge the gap between the business world and the technical world by correctly interpreting corporate data. Just as a good architect incorporates your living style into a home design or remodeling project, a good data modeler incorporates business requirements into the data model. In data modeling, as in all design disciplines, you rarely have one correct solution. Given a set of requirements, a good data modeler can readily identify incorrect or inadequate solutions and choose from many potentially workable solutions to find the one that is best for the company. A good data modeler on your team can help you develop a solution that truly works, much as a good architect can remodel your house to fit your family's needs and lifestyle.
To understand this concept, let's look at a simple but common example: a time-recording system. Depending on the business applications your time-recording system is supposed to integrate with, you can design it any number of ways. For this article, we'll look at a basic model, examine its inherent problems, and see how creating a better model can avoid those problems.
The most basic time-recording system contains entries for start time and end time, as the conceptual model in Figure 1 shows. Let's say you're going to record the length of time it takes an employee to do a task. In addition to the start time and the end time, you'll need to add to the EMPLOYEETASK table an employee name and a task to record who did what and how long it took.
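To make the discussion concrete, here's a minimal sketch of that single-table model, built in SQLite through Python. The column names (EmployeeName, TaskName, StartTime, EndTime) come from the article's description of Figure 1; the data types and sample values are my assumptions.

```python
import sqlite3

# A minimal sketch of the Figure 1 single-table model.
# Column names follow the article; data types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EmployeeTask (
        EmployeeName TEXT NOT NULL,   -- variable character identifier
        TaskName     TEXT NOT NULL,   -- variable character identifier
        StartTime    TEXT NOT NULL,   -- ISO-8601 date/time
        EndTime      TEXT             -- NULL until the record is closed out
    )
""")

# One record per employee per shift: who did what, and for how long.
conn.execute(
    "INSERT INTO EmployeeTask VALUES (?, ?, ?, ?)",
    ("Joe Smith", "Milling", "2024-03-01 08:00", "2024-03-01 16:30"),
)

# Elapsed hours: difference between the two timestamps.
hours = conn.execute("""
    SELECT EmployeeName,
           ROUND((julianday(EndTime) - julianday(StartTime)) * 24, 2)
    FROM EmployeeTask
""").fetchone()
print(hours)  # ('Joe Smith', 8.5)
```

Reporting against this table really is trivial, which is exactly why the model looks attractive at first.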
This basic single-table model might seem easy to maintain and easy to generate reports from, but it's the least likely type of model to give you an accurate measurement of time spent on a project or task. The first and most obvious problem with this model is that you can record only the beginning and end of a time period. If your time-recording system were going to be used on a factory floor, for example, this model might not be suitable. I doubt that the company accountant wants to pay for lunch and coffee breaks. You could simply deduct a half-hour or an hour from the time recorded for every employee, but assuming that all employees take the same amount of break time might lead to quarrels over accuracy and dissension in the employee ranks.
A second problem with this model is the use of variable character identifiers such as employee name (EmployeeName) and task name (TaskName). To understand the kinds of problems you might encounter, imagine a payroll report; to create this report, you need to add up the total time each employee works in a pay period. For example, employee Joe Smith is a machinist who works on the factory floor and has to sign in and out every day. On day one, Joe signs in for work as Joe Smith. On day two, he signs in as Joseph Smith. On days three and four, he signs in as Joe Smith again. On day five he signs in as J. Smith. Compiling Joe's hours is now complicated because he's used three different representations of his name. If you're compiling manually (visually scanning the daily records) and you know Joe, you'll be able to identify all five entries as belonging to Joe Smith. But if you're using a compilation software program, the program is likely to issue three paychecks: one for Joe Smith, one for Joseph Smith, and one for J. Smith.
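You can reproduce the three-paycheck problem directly. The sketch below assumes a hypothetical compilation query that groups time records by EmployeeName; the three name variants come straight from the example above.

```python
import sqlite3

# Reproducing the paycheck problem: a compilation program that groups
# by the variable character identifier EmployeeName. The hours column
# and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EmployeeTask (EmployeeName TEXT, Hours REAL)")
conn.executemany(
    "INSERT INTO EmployeeTask VALUES (?, ?)",
    [("Joe Smith", 8.0),     # day one
     ("Joseph Smith", 8.0),  # day two
     ("Joe Smith", 8.0),     # day three
     ("Joe Smith", 8.0),     # day four
     ("J. Smith", 8.0)],     # day five
)

# Grouping by name splits one person's hours into three "employees" --
# and three paychecks.
paychecks = conn.execute("""
    SELECT EmployeeName, SUM(Hours)
    FROM EmployeeTask
    GROUP BY EmployeeName
    ORDER BY EmployeeName
""").fetchall()
print(len(paychecks))  # 3
print(paychecks)       # [('J. Smith', 8.0), ('Joe Smith', 24.0), ('Joseph Smith', 8.0)]
```

A human scanning the records would merge the variants; the GROUP BY cannot.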
The problems get more complex when you hire a second machinist named Joe Smith to work on the factory floor. To separate the time records for these two people, you have to institute use rules, such as asking the first Joe Smith to use only the name Joe Smith and the second Joe Smith to use the name Joseph Smith. Use rules are the least secure and least reliable method of operation because employees can easily overlook, ignore, or break them.
Yet another problem is that this model requires lots of storage space and can adversely affect processing time because it uses variable character identifiers such as EmployeeName and TaskName, which require more storage space and longer processing times than numeric identifiers or short, fixed-length codes. The longer the strings that your application has to read from the hard disk into memory, the greater the number of cycles needed to process the data and the longer the processing time. In addition, the lookups needed to compare and match long varchar values cause overhead and performance degradation. In small databases, you might never notice a performance problem, but in larger databases with greater numbers of transactions per hour, the performance impact can be significant.
This data model gets even more complicated when you need to be more accurate when capturing time spent on a task, so you require each person to sign in and sign out multiple times during the work shift. Now when Joe Smith reports for work, instead of signing in at the beginning of his shift and signing out at the end, he must sign out when he takes a break and sign back in again when he resumes work. Each time he takes a break, he closes out the old time record (inserts a date and time value in EndTime). When he comes back from break, he starts a new time record (enters his name, the task name, and a value in StartTime). Under the original scheme, Joe would have one time record for each work shift. Under the new scheme, Joe will have multiple time records for a shift.
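The multiple-sign-in scheme looks like this in practice: one shift produces several rows, and summing the row durations recovers the time actually worked, excluding breaks. The specific times and the break pattern below are illustrative assumptions.

```python
import sqlite3

# Under the multiple-sign-in scheme, one shift yields several rows.
# Summing the row durations excludes Joe's breaks from his paid time.
# The table layout and sample times are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE EmployeeTask (
        EmployeeName TEXT, TaskName TEXT, StartTime TEXT, EndTime TEXT)
""")
conn.executemany(
    "INSERT INTO EmployeeTask VALUES (?, ?, ?, ?)",
    [  # Joe signs out for a coffee break and lunch, then back in
       ("Joe Smith", "Milling", "2024-03-01 08:00", "2024-03-01 10:15"),
       ("Joe Smith", "Milling", "2024-03-01 10:30", "2024-03-01 12:00"),
       ("Joe Smith", "Milling", "2024-03-01 12:30", "2024-03-01 16:30")],
)

worked = conn.execute("""
    SELECT ROUND(SUM(julianday(EndTime) - julianday(StartTime)) * 24, 2)
    FROM EmployeeTask
    WHERE EmployeeName = 'Joe Smith'
""").fetchone()[0]
print(worked)  # 7.75 hours worked, versus 8.5 wall-clock hours
```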
Creating multiple time records for each person during a work shift will definitely give you a more accurate picture of who is working and for how long—and more accurate payroll numbers. That change is worth keeping. However, you still need to minimize or eliminate the complications of using variable character identifiers.
Notice that I use "variable character identifier" to describe the EmployeeName and TaskName attributes in Figure 1. In this case, EmployeeName and TaskName are identifiers in the EmployeeTask table that refer to employees and tasks that should be listed in other tables in the corporate databank (i.e., the collection of databases that store company data—most companies have more than one database). If you remember the first article I wrote about data normalization (SQL by Design, "Why You Need Database Normalization," March 1999, InstantDoc ID 4887), you know that you can summarize the rules for properly normalizing data in a single thought: one table, one subject. Therefore, you should store all data about an employee—such as name, date of birth, and personal identifiers—in one table. You should store task data—such as task name, complexity, and a task identifier—in another table. If a task can be part of a project, you should store the project data in one table, store the task data in its own table, and relate the two tables to each other.
Figure 2 is a conceptual model that is more complex, but it matches our business requirements better than Figure 1's model. Employee data is in the EMPLOYEE table, tasks are in the TASK table, and projects are in the PROJECT table. A task may or may not be associated with a project, as the zero-to-many relationship between PROJECT and TASK indicates. The EMPLOYEETIME table is similar to Figure 1's EMPLOYEETASK table, with one major difference: I replaced the variable character identifiers in EMPLOYEETASK with simple integer identifiers for both employees and tasks in EMPLOYEETIME. EmployeeID, which is the primary key value for each employee in the EMPLOYEE table, is also a foreign key in EMPLOYEETIME. Likewise, TaskID, the primary key value for each task in the TASK table, is a foreign key in EMPLOYEETIME. This model eliminates the complexities inherent in long variable character identifiers.
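Here's a sketch of that normalized model in SQLite. The table names and the EmployeeID and TaskID keys follow the article's description of Figure 2; the remaining columns, sample data, and the query are my assumptions. Note how two machinists who share a name no longer collide, because the integer EmployeeID, not the name, identifies each time record.

```python
import sqlite3

# A sketch of the Figure 2 normalized model. Table names and the
# EmployeeID/TaskID keys follow the article; other columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Project (
        ProjectID   INTEGER PRIMARY KEY,
        ProjectName TEXT NOT NULL
    );
    CREATE TABLE Task (
        TaskID    INTEGER PRIMARY KEY,
        TaskName  TEXT NOT NULL,
        ProjectID INTEGER REFERENCES Project(ProjectID)  -- NULL allowed:
    );                                                   -- zero-to-many
    CREATE TABLE Employee (
        EmployeeID   INTEGER PRIMARY KEY,
        EmployeeName TEXT NOT NULL
    );
    CREATE TABLE EmployeeTime (
        EmployeeID INTEGER NOT NULL REFERENCES Employee(EmployeeID),
        TaskID     INTEGER NOT NULL REFERENCES Task(TaskID),
        StartTime  TEXT NOT NULL,
        EndTime    TEXT
    );
    -- Two different machinists, both named Joe Smith:
    INSERT INTO Employee VALUES (1, 'Joe Smith'), (2, 'Joe Smith');
    INSERT INTO Task VALUES (10, 'Milling', NULL);
    INSERT INTO EmployeeTime VALUES
        (1, 10, '2024-03-01 08:00', '2024-03-01 12:00'),
        (2, 10, '2024-03-01 08:00', '2024-03-01 16:00');
""")

# Grouping by the integer key keeps the two Joe Smiths separate.
totals = conn.execute("""
    SELECT e.EmployeeID,
           ROUND(SUM(julianday(t.EndTime) - julianday(t.StartTime)) * 24, 2)
    FROM EmployeeTime t
    JOIN Employee e ON e.EmployeeID = t.EmployeeID
    GROUP BY e.EmployeeID
    ORDER BY e.EmployeeID
""").fetchall()
print(totals)  # [(1, 4.0), (2, 8.0)]
```

No use rules required: the database, not the employees, keeps the identities straight.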
If you're an old hand at modeling data, the process that I just walked you through was elementary and the end result intuitive. But I chose this example for a reason: I wanted to illustrate the logic—the thought processes—that go into modeling data. I wanted to show you why we model data the way we do. At first glance, you might think that data models like the one in Figure 1 are more desirable because they're simple and easy to program against—at least at first. But they mask the inherent complexities of how people use data. If you think your situation warrants a simple data model—perhaps you store only a little data, you need only a minimal description for each person and task, or you need to capture only one start and end time per person for an ongoing task—a basic data model like the one in Figure 1 might work for you. However, Figure 1's data model isn't well normalized, and over time, you'll inherit all the problems associated with badly normalized or under-normalized data.
Figure 2's well-normalized data model is my preferred solution. Regardless of how many employees, tasks, or projects I might have or how many time periods I might need to record, the data model in Figure 2 will accurately accommodate them all. Moreover, as my business situation increases in complexity, I can modify the tables and relationships to reflect the changes, much as I would expand or remodel my home to accommodate changes over time in my living situation.