Relational Databases (considered harmful relative to OO databases)
SQL for Web Nerds (an excellent Greenspun overview of the use of SQL for database applications)
This web page provides reasonable explanations of the Hierarchical and Network Database Models. From there, it gets a bit biased towards apologia for the vendor's product, but still has a lot of useful points about various more modern database models.
#2519 : Why should I consider using an auxiliary calendar table?
This is an article on the use of "calendar tables". The idea is that you create a table consisting of a whole series of consecutive days, with relevant additional attributes to classify how they are interesting, such as whether or not a given day is a weekday, weekend day, work day, or holiday.
You can store five years' worth of days in under two thousand rows, fitting on a handful of pages, and use this to answer questions involving working days efficiently and elegantly.
In principle, it might be possible to express this via truly microscopic tables consisting of lists of holidays; in practice, enumerating all the days and excluding holidays as needed tends to be the "elegant way."
The article is oriented to Microsoft SQL Server usage; the availability of set-generating functions such as generate_series(), with PostgreSQL, allows even more compact usage...
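For illustration, here is a minimal sketch of such a table in PostgreSQL; the table and column names are invented for this sketch, and holidays would be flagged afterwards from some authoritative list.

-- Build a five-year calendar of consecutive days.
CREATE TABLE calendar (
    cal_date   date PRIMARY KEY,
    is_weekend boolean NOT NULL,
    is_holiday boolean NOT NULL DEFAULT false
);

INSERT INTO calendar (cal_date, is_weekend)
SELECT d::date,
       extract(isodow FROM d) IN (6, 7)   -- Saturday or Sunday
  FROM generate_series(date '2024-01-01',
                       date '2028-12-31',
                       interval '1 day') AS d;

-- Count the working days in a quarter:
SELECT count(*)
  FROM calendar
 WHERE cal_date BETWEEN '2024-01-01' AND '2024-03-31'
   AND NOT is_weekend AND NOT is_holiday;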
Creating flatfile databases with OpenOffice.org and StarOffice 6
LibreOffice can be connected to SQL databases using ODBC as well as to LDAP data sources.
The procedure is pretty fiddly for PostgreSQL; hopefully time will smooth the process...
An interesting article proposing that every developer have their own DBMS instance to work with, which mandates having excellent tools for managing large numbers of databases as well as for merging together DBMS schema information.
ChkDB - Open Source Database Checking
This set of tools allows defining sets of rules that may be used to validate the state of a database. When you run the set of rules, the system collects information on violations into its own set of tables.
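ChkDB's own rule syntax is not shown here, but the general shape of such a check, expressed in plain SQL with hypothetical table names, might look like this sketch:

-- A "rule": every order must reference an existing customer.
-- Violations are collected into a table of their own for review.
INSERT INTO rule_violations (rule_name, detected_at, details)
SELECT 'orphaned-order',
       current_timestamp,
       'order ' || o.order_id || ' has no customer'
  FROM orders o
  LEFT JOIN customers c ON c.customer_id = o.customer_id
 WHERE c.customer_id IS NULL;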
OCP Training Guide : Oracle DBA from New Riders
This book looks to be better than either of the other publishers' guides.
OraFAQ.net - Oracle FAQ Web Site
Other Sybase resources...
System R, developed by IBM, was the first relational database system; however, the Multics Relational Data Store (MRDS) was the first to be sold as a commercial product.
SQSH Replaces Sybase's ISQL interactive query tool
This company produces database replication software. I know the "founding staff" as I worked with them in my days in Toronto.
They offer free subscriptions if you give them marketing information to make them feel good about their "free" subscriber base...
DeZign for databases - entity relationship diagram, visual database design
Oracle TimesTen - Real-Time Event Processing Software using in-memory database technology.
They claim greatly improved performance by having databases stored in RAM and by organizing them around "T-Trees," a binary-tree index structure designed for main-memory databases.
Mariposa Distributed Database Management System
A package that allows DBMSes that are far apart to work together to process queries using an economic paradigm where sites buy and sell data and query processing services.
The company Cohera was founded to commercialize this, using the buzzword "Data Federation System"; they have since been acquired by PeopleSoft.
This somewhat parallels CORBA, where the Naming and Trading Services may be used to construct what they term "federations" of services.
Joe Celko is one of the top experts in SQL; he writes a column for DBMS Magazine called SQL for Smarties, and has written several books on SQL, including one based on that title.
One of the more clever ideas he has presented is that of Nested Set Trees. This is a way of handling trees that appears faster and more general than what vendors usually provide for.
It rather resembles the way one would construct a binary "heap" in an array; each node in the tree has (typically integer) fields for "left" and "right," and the range between those values represents the set of nodes subordinate to the current node.
Thus, if the tree reflects "who's boss," we might look up (boss) Bill Smith, find that for him left = 123 and right = 175, and then search for people in that group with salaries over $75K that work for him via the query SELECT * FROM ETREE WHERE LEFT BETWEEN 123 AND 175 AND SALARY >= 75000.
This can be much quicker than the (oft-used) "adjacency list" schemes because it accomplishes the whole query in one fell swoop, whereas with "adjacency," you normally have to essentially walk the tree by hand, submitting a query at each node to find the children, which can get very expensive.
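A minimal sketch of the scheme in SQL (hypothetical table and column names; "lft" and "rgt" are used because LEFT and RIGHT are reserved words in many dialects):

CREATE TABLE etree (
    emp_id integer PRIMARY KEY,
    name   text NOT NULL,
    salary numeric(10,2),
    lft    integer NOT NULL,   -- left bound of this node's subtree
    rgt    integer NOT NULL    -- right bound of this node's subtree
);

-- Everyone in Bill Smith's group earning $75K or more, fetched
-- in a single query, with no tree-walking. (The BETWEEN includes
-- the boss himself; add "AND c.emp_id <> b.emp_id" to exclude him.)
SELECT c.name, c.salary
  FROM etree b, etree c
 WHERE b.name = 'Bill Smith'
   AND c.lft BETWEEN b.lft AND b.rgt
   AND c.salary >= 75000;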
Open Source XML Repository for RDBMS - using the Nested Set model to store XML data.
Open Source Development Lab Data Base Test 1 (OSDL-DBT-1)
See also SourceForge.net: Project Info - OSDL Database Test Suite
This book describes how one may store data in a database in a somewhat encrypted form, thus allowing data to remain private even from people who have permission to access the database.
This allows collecting really confidential data for quasi-public purposes. For instance, a national rape crisis organization may need to share statistical information across the country, but keep identities confidential.
A project proposing to create a would-be successor language to SQL.
Optimistic Locking
If you have a situation where pre-locking objects is expensive, and you rarely actually need the locks, this technique can save much of the cost...
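A common way to express this in SQL is a version (or timestamp) column; this sketch uses a hypothetical "accounts" table:

-- 1. Read the row, remembering its version; no lock is taken.
SELECT balance, version FROM accounts WHERE id = 42;
--    (say this returns balance = 100.00, version = 7)

-- 2. Apply the change only if nobody else has touched the row.
UPDATE accounts
   SET balance = 150.00,
       version = version + 1
 WHERE id = 42
   AND version = 7;   -- the version we read in step 1

-- 3. If the UPDATE reports zero rows affected, another transaction
--    won the race; re-read the row and retry, rather than having
--    held a lock the whole time.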
An Optimistic Locking Technique for Concurrency Control in Distributed Databases
A method called optimistic method with dummy locks (ODL) is suggested for concurrency control in distributed databases. It is shown that by using long-term dummy locks, the need for the information about the write sets of validated transactions is eliminated and, during the validation test, only the related sites are checked. The transactions to be aborted are immediately recognized before the validation test, reducing the costs of restarts. Usual read and write locks are used as short-term locks during the validation test. The use of short-term locks in the optimistic approach eliminates the need for the system-wide critical section and results in a distributed and parallel validation test. The performance of ODL is compared with strict two-phase locking (2PL) through simulation, and it is found that for low-conflict cases they perform almost the same, but for highly conflicting cases, ODL performs better than strict 2PL.
An online course in SQL
When Not to Normalize your SQL Database
A somewhat controversial essay describing cases where you might not want to fully normalize a database. It is essential to be careful about such essays; while they may be correct, they tend to have some important underlying context that must not be missed.
The point isn't that "performance problems are generally solved by denormalizing".
To the contrary, naive implementers (the ones that tend to see and follow these sorts of essays) tend to use this to justify creating really terrible database schemas, imagining that "denormalizing makes things faster." This is nonsense, and they usually get the worst of all worlds, namely nonperformant systems filled with logically corrupt data.
The reality is that you need to start by "properly understanding" normalization, and taking advantage of it to the degree possible to minimize the amount of data stored.
However, you may discover that there are particular operations in the system that require extensive manipulation of (for instance) aggregate information, where storing that in fully normalized form leads to doing a lot of extra aggregation work. In such cases, having customized ways to pre-compute those aggregates has essential performance benefits.
Knowing that you are doing the right thing requires having an appreciation of what should be normalized as well as what shouldn't.
This suggests an approach where each piece of data has a unique column name, save for those columns that are used as foreign key references. The interesting part is that this means you may store all of the column names (that indicate new data) into a table, making sure they are unique, and analytically determine if the database is suitably normalized.
I disagree very slightly; I would expect large databases to have some patterns where similar transaction tables or similar "object child" tables would be expected to include dates that would be named using common column names indicating that they store analogous information.
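On PostgreSQL, a first cut at such an analysis can be drawn straight from the system catalogs; this sketch lists column names that recur across tables, which would then be vetted by hand as foreign key references, common date patterns, or genuine normalization problems:

-- Column names appearing in more than one table of schema "public".
SELECT column_name,
       count(*) AS appearances,
       string_agg(table_name, ', ') AS found_in
  FROM information_schema.columns
 WHERE table_schema = 'public'
 GROUP BY column_name
HAVING count(*) > 1
 ORDER BY appearances DESC;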
Database folk get very excited about race conditions (myself included!). This sort of thinking is very important for totally automated processes, such as replication systems, where data changes are induced by arbitrary sets of things going on in parallel.
But, as this essay suggests, the same is not true for business processes. It is mighty unusual for business processes to have race conditions in them. If they appear to, it is quite likely that not enough analysis has been done of the process.
Intuit/Quickbase "Ultimate Web Guide to SQL Database Language"
Documents all sorts of aspects of the use of SQL.
Historically, the other major database models have been:
Hierarchical data model
In this model, data is organized in a "tree" fashion, with a hierarchy of parent and child segments.
The notion of a "table" would involve having all related records as children of a "parent". There is a 1:N mapping between record types; each "child" has exactly one "parent".
The classic example of such a system is IBM's IMS (Information Management System), created in the 1960s.
Network data model
In the 1970s, shortcomings were found in the hierarchical model, notably in that some "children" needed to be associated with more than one "parent". The classic application of this sort is the Bill of Materials, where subsystems are reused over and over again. Records may be "members" or "owners" in a variety of "record sets".
This was codified by the Conference on Data Systems Languages (CODASYL) in its 1971 Data Base Task Group report, and was widely used in many COBOL-based systems.
The recent popularity of XML represents, in effect, a return to the hierarchical data model, where each element has attributes, enclosed by a hierarchy of parent elements.
Unfortunately, queries on hierarchical and network databases inherently involve writing a program to navigate from record to record through the hierarchy.
The other two major "models" one might think of are:
ISAM database systems
These involve some variation on the network model, where software must be written to "navigate" through some hierarchy of records.
You have two choices:
To walk, sequentially, through the set of records, or
To search for a key value based on an index, and then walk through the records based on the ordering provided by the index.
Nonsequential searches
The traditional Unix dbm hashed database scheme allows you to either walk records in what amounts to a random order, or to search for record matches based on some exact key.
" Object databases" have also emerged, such as POET, Gemstone, and numerous other such schemes for providing persistent storage for " objects". These, too, tend to be expressions of some combination of " network" and " hierarchical" database models.
The relational database model is distinct from the preceding sorts of database models particularly in that it provides an algebra for describing the results of searches, rather than requiring that you write a program to navigate the database.
According to the seminal paper, "A Relational Model of Data for Large Shared Data Banks" (E.F. Codd), one of the vital differences between the relational model and its predecessors is that it rejects the mandate to use indexes to impose some form of network or tree structure. In many such systems, the "shape" of your applications must, in effect, conform exactly to the structure imposed by the navigation system. Application programs developed to work with these systems tend to be logically impaired if the trees or networks are changed in structure.
The relational model, instead, defines sets of information, and provides operations for describing queries using whatever structure the data can support. If there is no convenient index, queries may run less quickly, but they can still be expressed by your programs.
The traditional theoretical operators are thus:
SELECT
This operator is used to allow you to limit the results of a query to include only those relations satisfying certain restrictions, such as SINs starting with a "4", people with brown eyes, or other such things.
select * from employees where birthdate < now() - interval '30 years';
PROJECT
A relation may contain a large number of domains, but you may only be interested in a few of them. PROJECT allows you to pick out merely the interesting ones.
In SQL, this is also extended (in much the same way projections work in linear algebra) to include transformations on fields. For instance, you might "project" a name by changing it to upper case, and perhaps combining it with other fields, as with the query:
select upper(firstname) || ' ' || upper(surname) from employees;
JOIN
JOIN allows combining tables based on some shared criterion. For instance, we might want to find children of employees, thus:
select * from employee e, children c where e.id = c.parentid;
UNION, INTERSECTION, EXCEPT
These provide additional set operations that, along with SELECT, PROJECT, and JOIN, allow virtually any sort of manipulation of the data.
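For instance (sketched with a hypothetical contractors table alongside the employees table defined in later examples):

-- Names on either list, duplicates removed:
select name from employees
union
select name from contractors;

-- Names on both lists:
select name from employees
intersect
select name from contractors;

-- Employees who are not also contractors:
select name from employees
except
select name from contractors;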
In all of this, you need to describe the relationships between the sets of data you are looking for, but you are not required to elaborate on how the search needs to take place, as is typically the case with hierarchical and network models.
Look at the Rant ... with reference to what "relationally complete" means...
Normalization of a database table involves making sure that it satisfies a set of constraints that are intended to reduce data redundancy as well as the chances of the information becoming inconsistent.
Why is this sort of process necessary?
To prevent data inconsistencies
In a poorly structured database, it is possible to present inconsistent facts.
To minimize the amount of data that must be entered, validated, and stored.
If a person's name is being stored in a whole bunch of places, there has to be some sort of process to put it there and to maintain it.
If a fact, such as a person's name, is stored just once, then it only needs to enter the system once, and if it needs to be corrected, there is only one place where that change needs to be made.
It is common for people to propose denormalizing tables to "improve performance". They imagine that reassembling the data, via JOIN queries, will involve additional disk I/O, so that denormalized data, which avoids this "reassembly", will be faster.
This tends to be a fallacy. Normalization leads to smaller tables, due to the elimination of redundant data, and today's increasingly efficient query optimizers cut the cost of joins. Consider that the "denormalize" attitude sprang up when the popular DBMS systems were things like Oracle, which used a rule-based optimizer through the '80s and early '90s.
And this all misses the real value of normalization, and the cost of not doing so, which is all about data integrity. In a normalized database schema, integrity constraints such as foreign keys can readily be defined by users to be enforced by the DBMS. In a denormalized schema, that capability is lost, and integrity must be managed within applications. Making direct changes to the database becomes hazardous, and the risks of having inconsistent data increase, massively.
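As a minimal sketch (anticipating the employees and child tables used in the 1NF example below), a normalized schema lets the DBMS reject nonsense outright:

CREATE TABLE employees (
    id   integer PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE child (
    employee_id integer NOT NULL REFERENCES employees(id),
    name        text NOT NULL,
    birthdate   timestamp with time zone
);

-- This insert fails immediately, rather than quietly introducing
-- a child record for a nonexistent employee:
INSERT INTO child VALUES (999, 'Nobody', now());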
There are five usual forms; each gets a little more obscure than the previous one.
First normal form eliminates any repeating groups by putting them into a separate table. If employees can have multiple children, you don't have fields CHILD1, CHILD2, and such; you instead have a table with fields EMPLOYEE ID / CHILD.
tutorial=# \d not_1nf
     Table "public.not_1nf"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 name   | text    |
 child1 | text    |
 age1   | integer |
 child2 | text    |
 age2   | integer |
 child3 | text    |
 age3   | integer |

tutorial=# select * from not_1nf;
 id |  name  |  child1  | age1 | child2 | age2 | child3 | age3
----+--------+----------+------+--------+------+--------+------
  1 | Chris  |          |      |        |      |        |
  2 | David  | Jonathan |    1 |        |      |        |
  3 | Harold | Chris    |   25 | David  |   21 | Brad   |   19
(3 rows)
What do you do if someone has more children than there are columns available? After all, "Harold" had four sons, not three. What do you do if you need to do queries relating to the children themselves? Programs working with "child" information will be considerably complicated by the need to extract it out of many fields. Another terrible aspect of the design of the table is the use of children's "ages"; those vary from year to year, whereas their birth dates do not.
Here is a vastly better schema:
tutorial=# \d employees
    Table "public.employees"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 name   | text    |

tutorial=# \d child
                Table "public.child"
   Column    |           Type           | Modifiers
-------------+--------------------------+-----------
 employee_id | integer                  |
 name        | text                     |
 birthdate   | timestamp with time zone |

tutorial=# select * from employees;
 id |  name
----+--------
  1 | Chris
  2 | David
  3 | Harold
(3 rows)

tutorial=# select * from child;
 employee_id |   name   |           birthdate
-------------+----------+-------------------------------
           2 | Jonathan | 2002-06-13 00:00:00-04
           3 | Chris    | 1978-10-01 21:47:59.380895-04
           3 | David    | 1982-10-01 21:47:05.023232-04
           3 | Brad     | 1984-10-01 21:47:12.579425-04
(4 rows)
Here, there is no redundancy; an employee with no children has no entries in the child table. An employee with many children may have many entries. The information about the children is also more useful, as it presents a birthdate, from which you may compute an age whenever needed.
The same approach might be used to represent a spouse relationship, albeit with the limitation that no more than one spouse is permitted by law at any given time!
In later examples, you will see that the more normalized forms can be mapped exactly back onto the denormalized forms via queries. In the case of 1NF, it is actually downright inconvenient to reconstruct the multiple-column presentation.
A reasonable "relational" way to express the data is via the query used to define the view emp_children: SELECT e.id, e.name, c.name AS child, (now() - c.birthdate) AS age FROM child c, employees e WHERE (e.id = c.employee_id);
tutorial=# select * from emp_children;
 id |  name  |  child   |            age
----+--------+----------+---------------------------
  2 | David  | Jonathan | 475 days 23:09:46.294683
  3 | Harold | Chris    | 9131 days 01:22:46.913788
  3 | Harold | David    | 7670 days 01:22:41.271451
  3 | Harold | Brad     | 6939 days 01:22:33.715258
(4 rows)
Second normal form eliminates fields with functional dependencies on only part of the key by moving them to a separate table.
Consider the following "business rules":
Staff are assigned to activities within projects;
Managers and their departments are assigned to projects.
tutorial=# select * from assignments;
 project | activity | manager | department | staff_level
---------+----------+---------+------------+-------------
 ADM101  | MGR      |      10 | ADMIN      |        1.50
 ADM101  | SEC      |      10 | ADMIN      |        1.25
 ADM101  | AUD      |      10 | ADMIN      |        0.75
 ADM101  | TEMP     |      10 | ADMIN      |       17.00
 HR171   | MGR      |      10 | ADMIN      |        2.00
 HR171   | SEC      |      10 | ADMIN      |        3.20
 HR171   | AUD      |      10 | ADMIN      |        9.00
 HR171   | INT      |      10 | ADMIN      |       35.00
 OP101   | MGF      |      11 | TECHOP     |        2.00
 OP101   | SEC      |      11 | TECHOP     |        3.00
 OP101   | PGR      |      11 | TECHOP     |        6.00
 OP101   | OPS      |      11 | TECHOP     |        7.00
(12 rows)
This obviously has no "repeating groups", but different parts of the table are dependent on different keys. The staffing level is dependent on both Project and Activity, whereas the Manager and Department fields are dependent only on the Project.
As a result, it makes sense to split this into two tables, one indicating staffing levels, and the other indicating the assignment of managers and their departments to projects.
tutorial=# select * from project_staffing;
 project | activity | staff_level
---------+----------+-------------
 ADM101  | MGR      |        1.50
 ADM101  | SEC      |        1.25
 ADM101  | AUD      |        0.75
 ADM101  | TEMP     |       17.00
 HR171   | MGR      |        2.00
 HR171   | SEC      |        3.20
 HR171   | AUD      |        9.00
 HR171   | INT      |       35.00
 OP101   | MGF      |        2.00
 OP101   | SEC      |        3.00
 OP101   | PGR      |        6.00
 OP101   | OPS      |        7.00
(12 rows)

tutorial=# select * from manager_assignments;
 project | manager | department
---------+---------+------------
 ADM101  |      10 | ADMIN
 HR171   |      10 | ADMIN
 OP101   |      11 | TECHOP
(3 rows)
Looking at table project_staffing, while there are repetitions of both projects and activities, they do not depend on one another, and staffing levels are evidently dependent on both values, so there is no reason to try to normalize this table further.
On the other hand, there is some redundancy in the handling of manager_assignments. Managers are tied to projects, and managers are tied to departments, and there is something separable about this.
But summing up, given the two "child" tables that we have defined, it is easy to compute the poorly normalized "table" we started with, using the following query. In fact, this example was actually worked out backwards from this VIEW definition:
tutorial=# \d assignments
          View "public.assignments"
   Column    |         Type         | Modifiers
-------------+----------------------+-----------
 project     | character varying(6) |
 activity    | character varying(6) |
 manager     | integer              |
 department  | character varying(6) |
 staff_level | numeric(6,2)         |
View definition:
 SELECT p.project, p.activity, m.manager, m.department, p.staff_level
   FROM manager_assignments m, project_staffing p
  WHERE (m.project = p.project);
This illustrates a vital point about normalization, namely that when you are defining what data is to be stored, you shouldn't do so simply based on what reports you plan to generate. Reports are surely useful, but the point of the database schema is to allow you to compute the contents of reports, not to express them.
Third normal form eliminates dependencies on non-key fields by putting those fields into a separate table.
At this point, there is a one-to-one relationship between all the non-key fields and the entire key.
In the previous section, we had the manager_assignments table where there were separate dependencies:
Department was dependent on manager;
Project was dependent on manager.
We can separate this into two tables, joined together as follows:
tutorial=# select * from project_managers;
 project | manager
---------+---------
 ADM101  |      10
 HR171   |      10
 OP101   |      11
(3 rows)

tutorial=# select * from manager_departments;
 manager | department
---------+------------
      10 | ADMIN
      11 | TECHOP
(2 rows)

tutorial=# \d manager_assignments
     View "public.manager_assignments"
   Column   |         Type         | Modifiers
------------+----------------------+-----------
 project    | character varying(6) |
 manager    | integer              |
 department | character varying(6) |
View definition:
 SELECT a.project, a.manager, b.department
   FROM project_managers a, manager_departments b
  WHERE (a.manager = b.manager);

tutorial=# select * from manager_assignments;
 project | manager | department
---------+---------+------------
 ADM101  |      10 | ADMIN
 HR171   |      10 | ADMIN
 OP101   |      11 | TECHOP
(3 rows)
Fourth normal form separates independent multi-valued facts into separate tables; a table should not store two or more independent one-to-many relationships about the same entity.
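The classic illustration: if employees' skills and spoken languages are independent facts, storing them together forces a row for every combination. A sketch, with hypothetical tables:

-- Not in 4NF: skill and language are independent, so every
-- (skill, language) pairing must be stored for each employee:
--     emp_skill_lang(emp_id, skill, language)

-- In 4NF, the independent multi-valued facts get their own tables:
CREATE TABLE emp_skills (
    emp_id integer NOT NULL,
    skill  text    NOT NULL,
    PRIMARY KEY (emp_id, skill)
);

CREATE TABLE emp_languages (
    emp_id   integer NOT NULL,
    language text    NOT NULL,
    PRIMARY KEY (emp_id, language)
);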
Fifth normal form breaks out more abstruse data redundancies not covered by the previous forms, those involving join dependencies, where a table can only be reconstructed losslessly by joining several smaller tables.
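The standard example involves a three-way relationship; this sketch assumes hypothetical agents, companies, and products, with the business rule that an agent sells a product whenever the agent represents a company that makes that product:

-- Under that rule, the three-column table
--     agent_company_product(agent, company, product)
-- can be losslessly rebuilt by joining three binary tables,
-- so 5NF stores it as:
CREATE TABLE agent_company (
    agent   text NOT NULL,
    company text NOT NULL,
    PRIMARY KEY (agent, company)
);

CREATE TABLE company_product (
    company text NOT NULL,
    product text NOT NULL,
    PRIMARY KEY (company, product)
);

CREATE TABLE agent_product (
    agent   text NOT NULL,
    product text NOT NULL,
    PRIMARY KEY (agent, product)
);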