Christopher B. Browne's Home Page
cbbrowne@acm.org

8. Other Database Stuff - Database Design

8.1. Other Database Models

Historically, the other major database models have been:

The recent popularity of XML represents, in effect, a return to the hierarchical data model, where each element has attributes, enclosed by a hierarchy of parent elements.

Unfortunately, queries on hierarchical and network databases inherently involve writing a program to navigate from record to record through the hierarchy.

The other two major " models" one might think of are that of:

" Object databases" have also emerged, such as POET, Gemstone, and numerous other such schemes for providing persistent storage for " objects". These, too, tend to be expressions of some combination of " network" and " hierarchical" database models.

8.2. Relational Algebra

The relational database model is distinct from the preceding sorts of database models particularly in that it provides an algebra for describing the results of searches, rather than requiring that you write a program to navigate the database.

According to the seminal paper, [ A Relational Model of Data for Large Shared Data Banks (E.F. Codd)], one of the vital differences between the relational model and its predecessors is that it rejects the mandate to use indexes to impose some form of network or tree structure. In many such systems,

Application programs developed to work with these systems tend to be logically impaired impaired if the trees or networks are changed in structure.

In effect, the " shape " of your applications must conform exactly to the structure imposed by the navigation system.

The relational model, instead, defines sets of information, and provides operations for describing queries using whatever structure the data can support. If there is no convenient index, queries may run less quickly, but they can still be expressed by your programs.

The traditional theoretical operators are thus:

In all of this, you need to describe the relationships between the sets of data relations you are looking for, but you are not required to elaborate on how the search needs to take place, as is typically the case with with hierarchical and network models.

8.3. Normalization

Normalization of a database table involves making sure that it satisfies a set of constraints that are intended to reduce data redundancy as well as the chances of the information becoming inconsistent.

8.3.1. Why Normalize?

Why is this sort of process necessary?

  • To prevent data inconsistencies

    In a poorly structured database, it is possible to present inconsistent facts.

  • To minimize the amount of data that must be entered, validated, and stored.

    If a person's name is being stored in a whole bunch of places, there has to be some sort of process to put it there and to maintain it.

    If a fact, such as a person's name, is stored just once, then it only needs to enter the system once, and if it needs to be corrected, there is only one place where that change needs to be made.

It is common for people to propose denormalizing tables to "improve performance". They imagine that reassembling the data, via JOIN queries, will involve additional disk I/O, so that denormalized data, which avoids this " reassembly", will be faster.

This tends to be a fallacy. Normalization leads to the tables being smaller due to the elimination of redundant data. Today's increasingly efficient query optimizers cut the cost. Consider that the " denormalize" attitude sprung up when the popular DBMS systems were things like Oracle, which used a rule-based optimizers through '80s and early '90s.

And this all misses the real value of normalization, and the cost of not doing so, which is all about data integrity. In a normalized database schema, integrity constraints such as foreign keys can readily be defined by users to be enforced by the DBMS. In a denormalized schema, that capability is lost, and integrity must be managed within applications. Making direct changes to the database becomes hazardous, and the risks of having inconsistent data increase, massively.

There are five usual forms; each gets a little more obscure than the previous one.

8.3.2. First Normal Form

This eliminates any repeating groups by putting them into a separate table. If employees can have multiple children, you don't have fields CHILD1, CHILD2, and such; you instead have a table with fields EMPLOYEE ID / CHILD.

tutorial=# \d not_1nf Table "public.not_1nf" Column | Type | Modifiers --------+---------+----------- id | integer | name | text | child1 | text | age1 | integer | child2 | text | age2 | integer | child3 | text | age3 | integer | tutorial=# select * from not_1nf; id | name | child1 | age1 | child2 | age2 | child3 | age3 ----+--------+----------+------+--------+------+--------+------ 1 | Chris | | | | | | 2 | David | Jonathan | 1 | | | | 3 | Harold | Chris | 25 | David | 21 | Brad | 19 (3 rows)

What do you do if someone has more children than there are columns available? After all, " Harold" had four sons, not three. What do you do if you need to do queries relating to the children themselves? Programs working with " child" information will be considerably complicated by the need to extract it out of many fields. Another terrible aspect of the design of the table is the use of childrens' " ages"; that varies, from year to year, whereas their birth date does not.

Here is a vastly better schema:

tutorial=# \d employees Table "public.employees" Column | Type | Modifiers --------+---------+----------- id | integer | name | text | tutorial=# \d child Table "public.child" Column | Type | Modifiers -------------+--------------------------+----------- employee_id | integer | name | text | birthdate | timestamp with time zone | tutorial=# select * from employees; id | name ----+-------- 1 | Chris 2 | David 3 | Harold (3 rows) tutorial=# select * from child; employee_id | name | birthdate -------------+----------+------------------------------- 2 | Jonathan | 2002-06-13 00:00:00-04 3 | Chris | 1978-10-01 21:47:59.380895-04 3 | David | 1982-10-01 21:47:05.023232-04 3 | Brad | 1984-10-01 21:47:12.579425-04 (4 rows)

Here, there is no redundancy; an employee with no children has no entries in the " Children" table. An employee with many children may have many entries. The information about the children is also more useful, as it presents a birthdate, from which you may compute an age whenever needed.

The same approach might be used to represent a spouse relationship, albeit with the limitation that no more than one spouse is permitted by law at any given time!

In later examples, you will see that the later normalized forms can be exactly mapped onto the denormalized forms via doing queries. In the case of 1NF, it is actually downright inconvenient to present the multiple columns.

A reasonable " relational" way to express the data is via the query SELECT e.id, e.name, c.name AS child, (now() - c.birthdate) AS age FROM child c, employees e WHERE (e.id = c.employee_id);

tutorial=# select * from emp_children; id | name | child | age ----+--------+----------+--------------------------- 2 | David | Jonathan | 475 days 23:09:46.294683 3 | Harold | Chris | 9131 days 01:22:46.913788 3 | Harold | David | 7670 days 01:22:41.271451 3 | Harold | Brad | 6939 days 01:22:33.715258 (4 rows)

8.3.3. Second normal form

This eliminates fields with functional dependencies on a partial key by moving them to a separate table.

Consider the following " business rules":

  • Staff are assigned to activities within projects;

  • Managers and their departments are assigned to projects.

tutorial=# select * from assignments; project | activity | manager | department | staff_level ---------+----------+---------+------------+------------- ADM101 | MGR | 10 | ADMIN | 1.50 ADM101 | SEC | 10 | ADMIN | 1.25 ADM101 | AUD | 10 | ADMIN | 0.75 ADM101 | TEMP | 10 | ADMIN | 17.00 HR171 | MGR | 10 | ADMIN | 2.00 HR171 | SEC | 10 | ADMIN | 3.20 HR171 | AUD | 10 | ADMIN | 9.00 HR171 | INT | 10 | ADMIN | 35.00 OP101 | MGF | 11 | TECHOP | 2.00 OP101 | SEC | 11 | TECHOP | 3.00 OP101 | PGR | 11 | TECHOP | 6.00 OP101 | OPS | 11 | TECHOP | 7.00 (12 rows)

This obviously has no "repeated groups", but different parts of the table are dependent on different keys. Staff Usage is dependent on Project and Activity, whereas the fields Manager and Department are only dependent on the Project.

As a result, it makes sense to split this into two tables, one indicating Staff Usage, and the other indicating the assignment of employees to projects.

tutorial=# select * from project_staffing; project | activity | staff_level ---------+----------+------------- ADM101 | MGR | 1.50 ADM101 | SEC | 1.25 ADM101 | AUD | 0.75 ADM101 | TEMP | 17.00 HR171 | MGR | 2.00 HR171 | SEC | 3.20 HR171 | AUD | 9.00 HR171 | INT | 35.00 OP101 | MGF | 2.00 OP101 | SEC | 3.00 OP101 | PGR | 6.00 OP101 | OPS | 7.00 (12 rows) tutorial=# select * from manager_assignments; project | manager | department ---------+---------+------------ ADM101 | 10 | ADMIN HR171 | 10 | ADMIN OP101 | 11 | TECHOP (3 rows)

Looking at table project_staffing, while there are repetitions of both projects and activities, they do not depend on one another, and staffing levels are evidently dependent on both values, so there is no reason to try to normalize this table further.

On the other hand, there is some redundancy in the handling of manager_assignments. Managers are tied to projects, and managers are tied to departments, and there is something separable about this.

But summing up, given the two "child" tables that we have defined, it is easy to calculate the poorly normalized " table" we started with, with the following query. In fact, this example was actually worked together backwards based on this VIEW definition:

tutorial=# \d assignments View "public.assignments" Column | Type | Modifiers -------------+----------------------+----------- project | character varying(6) | activity | character varying(6) | manager | integer | department | character varying(6) | staff_level | numeric(6,2) | View definition: SELECT p.project, p.activity, m.manager, m.department, p.staff_level FROM manager_assignments m, project_staffing p WHERE (m.project = p.project);

This points to a vital point about normalization, namely that when you are defining what data is to be stored, you shouldn't do this simply based on what reports you plan to generate. Reports are surely useful, but the point of the database schema is to allow you to compute the contents of reports, not to express them.

8.3.4. Third normal form

This form eliminates dependencies on non-key fields by putting those fields into a separate table.

At this point, there is a one-to-one relationship between all the non-key fields and the entire key.

In the previous section, we had the manager_assignments table where there were separate dependancies:

  • Department was dependent on manager;

  • Project was dependent on manager.

We can separate this into two tables, joined together as follows:

tutorial=# select * from project_managers; project | manager ---------+--------- ADM101 | 10 HR171 | 10 OP101 | 11 (3 rows) tutorial=# select * from manager_departments; manager | department ---------+------------ 10 | ADMIN 11 | TECHOP (2 rows) tutorial=# \d manager_assignments View "public.manager_assignments" Column | Type | Modifiers ------------+----------------------+----------- project | character varying(6) | manager | integer | department | character varying(6) | View definition: SELECT a.project, a.manager, b.department FROM project_managers a, manager_departments b WHERE (a.manager = b.manager); tutorial=# select * from manager_assignments; project | manager | department ---------+---------+------------ ADM101 | 10 | ADMIN HR171 | 10 | ADMIN OP101 | 11 | TECHOP (3 rows)

8.3.5. Fourth and Fifth Normal Form

Fourth normal form separates facts stored in one table into separate tables when there are values in one column that depend on values in another column.

Fifth normal form breaks out more abstruse data redundancies not covered by the previous forms, usually involving multi-table joins.

Google
Contact me at cbbrowne@acm.org