1 – Normalization
Database normalization is the process of organizing data. There are a bunch of rules governing how you should do it, when you should undo it, and how you can’t do it. My hope is to lay out what normalization means in Texas English, which means clear and simple. I rely on you, the reader, to let me know when its not clear and how it can be more clear by treating this as a living document.
There are now, as of last year, seven or eight normal forms. They began when E.F. Codd first proposed the relational model in his A Relational Model for Large Shared Data Banks paper. There’s also a concept of Domain-Key Normal Form (DKNF). According to some, this belongs between fifth and sixth normal forms. DKNF comes to use by way of Ronald Fagin, in his A Normal Form for Relational Databases That is Based on Domains and Keys paper.
This blog page will try to cover and blend these two ideas to help gain perspective on how you may model your databases. Ultimately, if you’re very interested in the topic you should consider spending some money to buy Logic and Databases: The Roots of Relational Theory by C.J. Date.
Normalization Definitions
Summary
Database normalization attempts to organize data in such a way as to prevent SQL statements from creating insertion, update, or deletion anomalies. As a practice third normal form (3NF) is often considered normalized because most 3NF tables are free of insertion, update or delete anomalies. The key word is most, not all.
Therefore, normalization design attempts to achieve the Highest Normal Form (HNF) possible. A table is in HNF whether it meets or fails to meet any normal form definition. A Zero Normal Form ) (0NF) exists and it is equivalent to an Unnormalized Normal Form (UNF), basically a table with repeating rows. UNF is synonymous with 0NF and means that a table contains one or more repeating groups. It is probably important to note that a table may be in UNF and HNF at the same time. Likewise, UNF and HNF are equivalent to 0NF.
Normalization is the process of organizing data into tables that act as single subjects when acted upon individually or through external relationships. A single subject is also known as a domain. You act on data by querying it or transacting against it. You query data by writing a SELECT
statement. You typically transact against data by writing an INSERT
, UPDATE
, or DELETE
statement.
Terms and Definitions
Terms and definitions are very important in any topic. The normalization process has its share of terms. Here are a few that you should understand before reading this document.
Dictionary ↓
Word | Definition |
Anomaly | A deviation from the common rule, type, arrangement, or form; or an incongruity or inconsistency. |
Attribute | Describes a column in a table. It comes from relational algebra, where a column is an attribute and a row is a tuple. |
Binary relationship | A reciprocal set of relations between two things, in databases the two things are tables, views (semi-permanent result sets), or temporary result sets.. |
Candidate key | A unique key that you may choose as a primary key. |
Column | Describes a vertical element in a table. It comes from spreadsheets, where a column defines the vertical axis of data. A column is a single element of a data structure that is found in every row. Columns always have a value in a data structure when the structure constrains its creation to demand one. Database let you allow or disallow a null value when you create a table (or structure). |
Composite key | A key that is made up of two or more columns. It is possible that this term can be applied to many different keys, and that it is interchangeable with a compound key. You will see compound key used more frequently. |
Compound key | A key that is made up of two or more columns. It is possible that this term can be applied to many different keys, and that it is interchangeable with a composite key. This is typically the more widely used word. |
Data structure | Describes the definition of a type of data, like an integer or string and the collection of a group of various pre-defined data types into a group. The latter is the best corollary to a record in a file system, or a row in a database. You can make a numerically indexed array of any base data type (often described as scalar or primitive), or compound data type, which effectively creates a 2-dimensional structure known as a database table. |
Field | Describes a column in a table. It comes from file systems, which predate databases. A field is either a positionally fixed or delimited element in a list of values. Fields are always found but may be null or empty values. |
File system | Describes the use of files as a data repository, where each file contains rows of data organized as data structures. Procedural programming languages access the files based the programmer’s knowledge of their definitions, which is normally maintained in a definitional file or document. |
Foreign key | A key that maps to a value in a primary key list, where the list exists in the same or another table. |
Functional dependency | Means an attribute or column depends on exactly one other unique attribute or set of attributes. The unique attribute is a single column natural key chosen as a primary key, while the unique set of attributes is a compound natural key, likewise chosen as a primary key. You write the functional dependency: A → B Columns that have a mandatory reliance on another column or set of columns are typically a non-key column or set of columns on a primary key found in the current row. A foreign key column is also functionally dependent on a primary key in the current table or other table. |
Key | A column that contains a value that identifies or helps in conjunction with other key columns to identify a row as unique. |
Many-to-many | A non-specific relationship between two tables, where one row in one table may map to one to many rows in the other and vice versa. You map these two tables by using a third table that holds a foreign key from both in the same row. The third table is known as an association or translation table. Both of the original tables have a one to many relationship to the association table, and both relationships resolve through the association table. |
N-ary relationship | A non-specific relationship between three or more tables, where one row in one table may have a many-to-many-to-many relationship between one or both of the other tables. You map these three or more tables by using another table that holds a foreign key from all of them in the same row. The other table is known as an association or translation table. Typically, all of the original tables have a one-to-many relationship to the association table, and all relationships resolve through the association table. |
Natural key | A unique key that identifies a row of data, or instance of data. A natural key is automatically a candidate key that you may choose as a primary key. All other columns in the table should enjoy a direct and full functional dependency on the natural key. If you adopt a surrogate key for joins, the surrogate key plus the natural key should become a unique index to speed searches through the table. |
Nominated key | A unique key that you may choose as a primary key, and it is also known as a candidate key. The only subtle difference that I’ve found is that some people use nominated to indicate the candidate key they’ve tentatively chosen before making a final decision. |
Non-key | A column that contains a descriptive value that doesn’t identify or help identify a row as unique but provides a characteristic to a row of data. All non-key columns should have a full functional dependency on the natural key, or primary key. |
Non-specific relationship |
A logical reciprocal set of relations between two things where no row in either set has a possible intersection with the other. A non-specific relationship is also known as a binary many-to-many relationship. These are logical relationships that convert to two physical relationships known as specific relationships. Specific relationships are either one-to-one or one-to-many binary relationships. Non-specific relationships are resolved by two one-to-many relationships and an association set. The association set holds rows of foreign keys that point respectively to both sets. Each row in the association table lets you resolve the relationship between a row in one and a row in the other through an INNER JOIN . |
Object instance |
An object instance is a data set inserted into a defined object type. This can occur at runtime, or in the context of databases through an INSERT statement. An Oracle database may contain nested object instances when a column relies on an object type, which are known as standalone objects.
|
Object type | An object type in the context of a database is a data structure, or the definition of a table. Definitions of tables are stored in the database catalog and built upon pre-existing data types. Some databases support User-Defined Types (UDT). Where UDTs are available the data structure may use them when they’re defined before the object type. Object types are a generalization of tables user-defined types in an Oracle database. |
One-to-many | A specific relationship between two table, where one row in one table maps to one to many rows in another table. You map these two tables by using the primary key of one table as a foreign key in the other. This makes the table that holds the foreign key functionally dependent on the primary key in the other table. The one side of the relationship is always the independent row, and it always donates a copy of its primary key to the dependent row. |
One-to-one | A specific relationship between two table, where one row in one table maps to one and only one row in another table. You map these two tables by using the primary key of one table as a foreign key in the other. This makes the table that holds the foreign key functionally dependent on the primary key in the other table. While a one-to-one relationship allows you to choose either as the independent row, it is important that you identify the business relationship of the two tables and make the primary task element the independent row. The independent row donates a copy of its primary key to the dependent row. |
Partial dependency | A partial dependency exists when the primary key is a compound key of two or more columns, and one or more columns depends on less than all of the columns in the compound primary key. |
Primary key | A candidate key that you chose to serve as the primary key. |
Record | Describes a horizontal element in a table. It comes from file systems, which predate databases. A record is a row of data, or an instance of a defined data structure. As such, it is row inside a file. |
Row | Describes a horizontal element in a table. It comes from spreadsheets, where a row defines the horizontal axis of data. A row is also an instance of the data structure defined by a table definition, and the nested array of a structure inside an ordinary array. |
Specific relationship | A reciprocal set of relations between two things where one row in a result set finds one row in another result set. Another example is where one row in a result set finds many matches in another result set. These binary relationships are respectively one-to-one and one-to-many. Specified relationships have equijoin or non-equijoin resolution. The first matches values, like the process in a nested loop, and the second matches values through a range or inequality relationship. Equijoins typically have a primary and foreign key, and the one-side holds the primary key while the many side holds a foreign key. In the specialized case of a one-to-one relationship, you must choose which table holds the primary key that becomes a functional dependency as a foreign key in the other. |
Superkey |
A key that identifies a set of rows, like a gender column that lets you identify male or females in your data model. |
Surrogate key | A key that identifies uniqueness for rows, like an automatic numbering sequence. It is superior solution to a natural key because you create indexes by using the surrogate key column followed by the primary key column(s). If you discover more about the domain later and need to add a column to the natural key, you need only drop the index and recreate with the new list of columns. |
Transitive dependency | A column that depends on another column before relying on the primary key of the table. It may exist in tables with three or more columns that are in second normal form. |
Tuple | Describes a row in a table. It comes from relational algebra, where a column is an attribute and a row is a tuple. |
Unique key | A column or set of columns that uniquely identifies a row of data. |
User-defined type | A data type defined by the user in a schema (Oracle) or database (MySQL and Microsoft SQL Server). |
Normalization Anomalies
Anomalies can occur at various levels of normalization but the majority occur when you have tables that are less than third normal form. The frequent types of anomalies are illustrated by referencing a second normal form table. You can find formal problem and solution resolution of this second normal form table later in the blog page.
Insertion Anomalies ↓
There are two major types of insertion anomalies. They both occur when a table contains more than one subject. The following example table and data support the discussion of insertion anomalies in the context of a first normal form table.
The table includes two subjects. One is a rental and the other an item. The item information is highlighted in cyan, and item columns have a partial dependence on the compound primary key.
- If you want to insert a new item, like a just released DVD, you must create a row that leaves all non-item related columns empty. However, the database level
NOT NULL
constraints prevent such an insertion. Disabling the database constraints would compromise the integrity of the model to support row with rental and item data. - If you want to insert a new rental but don’t have the item available, you must first create the item information to insert a row. Like the first anomaly, disabling the constraints makes the insert possible but it disables the model’s integrity.
Update Anomalies ↓
The potential update anomalies may occur when you make any change to an item related column because more than one rental row may hold a copy of the item data. You must therefore update all rows that contain the same item data. This means there is a likelihood of modifying rows incorrectly, and creating inconsistent copies of item data between rows in the sample table.
If you decompose the table into two tables (rental
and item
), the lossless-join property ensures that any row of the rental
table can identify its relationship to the smaller item
table. Also, the dependency preservation property ensures that you can maintain a constraint on the original relationship. This means that you can put a foreign key in the rental
table that is fully dependent on the item
table’s primary key.
Deletion Anomalies ↓
A deletion anomaly could occur with this table when you archive rentals that occurred more than six months ago and certain items have not been rental since that time. The deletion of the rental information would remove these items from your inventory. You would decompose the table (also known as a relation) into two tables as describe in the update anomalies section above. It would fix this deletion anomaly too.
Normalization Concepts
Normalization concepts show the conceptual or generalized idea and practice for normalization. The illustrations show you how to make columns atomic and to avoid repeating rows groups for first normal form. Then, they illustrate how to conceptually eliminate multiple subjects dependencies from tables beyond first normal form.
1st Normal Form (1NF) ↓
First Normal Form (1NF) exists when all columns, or attributes, of a table have a single data type and there are no repeating rows of data, which means rows are unique. This requires that:
- Column data types should be atomic, which means columns shouldn’t have repeating groups, comma delimited groups, or other subatomic parts. This raises a question whether compund variables like Oracle’s arrays, lists, and objects violate first normal form.
- Column names are unique in tables and arbitrarily ordered, which means their order doesn’t impose any constraint on the table.
- Rows in a table have no implicit or explicit ordering required for their access and use.
- Columns shouldn’t contain a repeating group in the context of a candidate key. This means that when you repeat all columns but one or a subset of columns, your model fails to meet the definition of first normal form.
Moreover, first normal form modeling finds the non-repeating rows and moves them from a base table to their own table. It removes all trace of the non-repeating rows from the base table. It also puts a foreign key in the new table that points back to the primary key of the base table. The foreign key in the new table is functionally dependent on the primary key in the base table.
What are atomic data types? ↓
The question of what constitutes an atomic value in first normal form is interesting because Oracle’s database engine runs as an Object Relational Database Management System (ORDBMS); also known as Extended Relational Database Management System. Are atomic column values limited to scalar variables? Scalar variables only hold one thing at a time, like an Integer.
If so, this would exclude Oracles’ specialized types. An alternative may ask that question differently, like “Are atomic column values scalar or compound variables?” Compound variables could be a comma separated list in a string data type, or either a SQL lists or Oracle’s User Defined Type (UDT).
The idea of comma separate lists can be dismissed as violating first normal form but the others aren’t so easy to toss out. While it might be heresy, I’d accept Oracle SQL lists or Oracle’s UDT (stored as object constructors with actual parameters) for two reasons. They are atomic in the scope of the database engine and most effectively transferred to active status inside a run-time view, which is then joined or acted upon by ordinary SQL commands. On the contrary point, lists are invariably ordered and that means they can’t by themselves be in first normal form. They become navigable ordered structures that you place into an aggregate table with a TABLE
function call. Does the structured ordering of inside an atomic type disqualify it from being in first normal form? I’d argue that the nature of the thing stored as no effect on how it is accessed by SQL, and therefore it meets first normal form regardless of how you navigate it once you query it.
Examples & Solutions: ↓
A table is in first normal form when all columns, or attributes, have a single data type and there are no repeating rows or repeating groups. The first two elements of this idea are generally understood and applied almost intuitively. The idea of a single data type is clear because most people understand any column can only have one data type, at any given time. The idea of uniqueness makes sense because you generally only want to act on one copy of anything at a time.
The third point is the one that is sometimes misunderstood. The idea of not being able to have repeating groups can confuse people because they often look at tables as a group of unique rows. Repeating groups indicate that you’ve got a multiple subject table design. The repeating rows indicate the base table is on the one side of a one-to-many relationship. The columns with non-repeating rows typically become a new table because they are on the many side of a one-to-many binary relationship.
Problem Description ↓
The example in this section uses a sample address
table to demonstrate how you move a table from UNF, or a HNF of nothing, to 1NF. This is the data definition of a version of the address
table. It has a design that lets you violate first normal form whenever the street_address
column requires more than a single entry or more than one row.
This type of table design lets you violate the atomic rule of first normal form because the street_address
column may need to accommodate one or more street address values. Typically, many developers see no harm in putting multiple address lines into a column as comma separated values. The data would look like the following:
Another type of table design lets you violate two rules of first normal form. It modifies the previous example by adding a second column to the primary key. This makes the primary key a compound or composite key. The following provides an example of such a table.
The second column allows for multiple rows to have an internal relationship. The second column in the primary key lets you order rows. The ordering of rows violates the first normal form rule that there will be no implicit or explicit ordering of rows. This type of design lets all columns except the ordering column of the compound primary key and street_address
column repeat. It violates the no repeating group rule of first normal form. The data set is also a mirror to what you see when joining two tables that have a one-to-many relationship. The columns on the one-side repeat while the columns on the many-side are unique.
In the preceding illustration the non-repeating columns and rows are highlighted. They illustrate that together they’re really a separate subject that should be removed to another table. The solution section covers approaches that resolve these design errors.
Problem Solution ↓
Solving first normal form problems can be a challenge. Sometimes it is more difficult than second or third normal form because at first normal form there are options. The conceptual solutions are: (1) remove the columns and put them in another table; (2) flatten the table, which is a form of denomalization; and (3) in those databases where they are supported implement a nested table or list.
Relational Solution ↓
Removing columns to another table is the best solution and practice because it offers you the most future flexibility. This is the frequently recommended alternative for online transaction processing (OLTP) systems.
If you query the data with an INNER JOIN
, the data will look very similar to the original problem data. This query demonstrates the problem.
SELECT a.contact_id , a.address_id , sa.address_line_id , sa.street_address , a.city , a.state_province , a.postal_code FROM address a INNER JOIN street_address sa ON a.address_id = sa.address_id; |
It generates the following as a result from the join. You should note that the one-side of the one-to-many binary relationship is repeating while the many side is unique. The process of repeating the result in the query return set is somewhat like but different than a Cartesian Product. It displays the projected columns of the join in each row, which means the one-side repeats in the return set. You need to perform a nested join to keep only a unique set of the one side in a procedural programming language. As a side note, it’s unfortunate that SQL doesn’t support a FLOWR statement like XQuery because it would eliminate the need of a procedure programming language.
Flattening (Data Warehousing) Solution ↓
Flattening a non-repeating column into a series of columns requires creating a column for every unique row that may occur. The following is a definition of such a table:
The data in a flattened model appear like this:
Object Relational Solution ↓
Implement a non-repeating column as a list. You first create a scalar collection type, and then implement the collection in an address table. You create a SQL scalar collection type of variable length strings with the following syntax:
CREATE TYPE street_list IS TABLE OF VARCHAR2(30); / |
After you create the type, you can implement it as a column data type. This particular SQL data type is not an array but a list. A list requires a specialized clause for storage of the nested table, as shown in the following create table statement.
CREATE TABLE address_book ( address_book_id NUMBER , street_address STREET_LIST , city VARCHAR2(30) , state VARCHAR2(2) , postal_code VARCHAR2(10)) NESTED TABLE street_address STORE AS street_table; |
The table description does not qualify the specialized storage instructions. You have to inspect the data catalog information to see how the database manages these.
You can insert rows of data but you must use constructors to insert into a nested table structure. The following enters two rows of data:
INSERT INTO address VALUES ( 1, street_list('203 North La Salle','Suite 1400'), 'Chicago', 'IL', '60601'); INSERT INTO address VALUES ( 2, street_list('203 North La Salle','Suite 1800'), 'Chicago', 'IL', '60601'); |
A query of the data in the table yields:
You can find more about joining and querying nested tables in this prior blog post.
Setup Scripts ↓
The scripts for first normal form are more numerous than others. You’ll find them preceded by brief descriptions, and should consider working through them sequentially.
UNF address
table with a non-atomic street_address
column ↓
-- Conditionally drop the table. BEGIN FOR i IN (SELECT TABLE_NAME FROM user_tables WHERE TABLE_NAME IN ('ADDRESS')) LOOP EXECUTE IMMEDIATE 'DROP TABLE '||i.table_name||' CASCADE CONSTRAINTS'; END LOOP; END; / -- Create the address table. CREATE TABLE address ( address_id NUMBER CONSTRAINT pk_ad_1 PRIMARY KEY , contact_id NUMBER CONSTRAINT nn_ad_1 NOT NULL , address_type NUMBER CONSTRAINT nn_ad_2 NOT NULL , street_address VARCHAR2(60) CONSTRAINT nn_ad_3 NOT NULL , city VARCHAR2(30) CONSTRAINT nn_ad_4 NOT NULL , state_province VARCHAR2(30) CONSTRAINT nn_ad_5 NOT NULL , postal_code VARCHAR2(20) CONSTRAINT nn_ad_6 NOT NULL); -- Insert a row with a comma delimited street address. INSERT INTO address VALUES ( 2001, 2001, 1008,'1111 Broadway, Suite 500, MS-5045','Oakland','CA','94604'); -- Query the data. SELECT street_address , city , state_province , postal_code FROM address; |
UNF address
table with atomic columns but a repeating group ↓
-- Conditionally drop the table. BEGIN FOR i IN (SELECT TABLE_NAME FROM user_tables WHERE TABLE_NAME IN ('ADDRESS')) LOOP EXECUTE IMMEDIATE 'DROP TABLE '||i.table_name||' CASCADE CONSTRAINTS'; END LOOP; END; / -- Create the address table. CREATE TABLE address ( address_id NUMBER , address_line_id NUMBER , contact_id NUMBER CONSTRAINT nn_ad_1 NOT NULL , address_type NUMBER CONSTRAINT nn_ad_2 NOT NULL , street_address VARCHAR2(60) CONSTRAINT nn_ad_3 NOT NULL , city VARCHAR2(30) CONSTRAINT nn_ad_4 NOT NULL , state_province VARCHAR2(30) CONSTRAINT nn_ad_5 NOT NULL , postal_code VARCHAR2(20) CONSTRAINT nn_ad_6 NOT NULL , CONSTRAINT pk_ad_1 PRIMARY KEY (address_id,address_line_id)); -- Insert data into table. INSERT INTO address VALUES ( 2001, 1, 2001, 1008,'1111 Broadway','Oakland','CA','94604'); INSERT INTO address VALUES ( 2001, 2, 2001, 1008,'Suite 500','Oakland','CA','94604'); INSERT INTO address VALUES ( 2001, 3, 2001, 1008,'MS-5045','Oakland','CA','94604'); -- Query the table. SELECT contact_id , address_id , address_line_id , street_address , city , state_province , postal_code FROM address; |
How to resolve partial dependencies to achieve 2NF ↓
-- Conditionally drop the table. BEGIN FOR i IN (SELECT TABLE_NAME FROM user_tables WHERE TABLE_NAME IN ('ADDRESS','STREET_ADDRESS')) LOOP EXECUTE IMMEDIATE 'DROP TABLE '||i.table_name||' CASCADE CONSTRAINTS'; END LOOP; END; / -- Create address and street_address tables. CREATE TABLE address ( address_id NUMBER CONSTRAINT pk_ad1 PRIMARY KEY , contact_id NUMBER CONSTRAINT nn_ad1 NOT NULL , address_type NUMBER CONSTRAINT nn_ad2 NOT NULL , city VARCHAR2(30) CONSTRAINT nn_ad4 NOT NULL , state_province VARCHAR2(30) CONSTRAINT nn_ad5 NOT NULL , postal_code VARCHAR2(20) CONSTRAINT nn_ad6 NOT NULL); CREATE TABLE street_address ( street_address_id NUMBER CONSTRAINT pk_sa1 PRIMARY KEY , address_id NUMBER CONSTRAINT nn_sa1 NOT NULL , address_line_id NUMBER CONSTRAINT nn_sa2 NOT NULL , street_address VARCHAR2(60) CONSTRAINT nn_sa3 NOT NULL , CONSTRAINT fk_sa1 FOREIGN KEY(address_id) REFERENCES address(address_id)); -- Insert data. INSERT INTO address VALUES ( 2001, 1, 1008,'Oakland','CA','94604'); INSERT INTO street_address VALUES ( 3001, 2001, 1,'1111 Broadway'); INSERT INTO street_address VALUES ( 3002, 2001, 2,'Suite 500'); INSERT INTO street_address VALUES ( 3003, 2001, 3,'MS-5045'); -- Query the data set to demonstrate repeating and non-repeating rows. SELECT a.contact_id , a.address_id , sa.address_line_id , sa.street_address , a.city , a.state_province , a.postal_code FROM address a INNER JOIN street_address sa ON a.address_id = sa.address_id; |
How to eliminate repeating column groups ↓
-- Conditionally drop the table. BEGIN FOR i IN (SELECT TABLE_NAME FROM user_tables WHERE TABLE_NAME IN ('ADDRESS')) LOOP EXECUTE IMMEDIATE 'DROP TABLE '||i.table_name||' CASCADE CONSTRAINTS'; END LOOP; END; / -- Create the address table. CREATE TABLE address ( address_id NUMBER CONSTRAINT pk_ad_1 PRIMARY KEY , contact_id NUMBER CONSTRAINT nn_ad_1 NOT NULL , address_type NUMBER CONSTRAINT nn_ad_2 NOT NULL , street_address1 VARCHAR2(20) CONSTRAINT nn_ad_3 NOT NULL , street_address2 VARCHAR2(20) CONSTRAINT nn_ad_4 NOT NULL , street_address3 VARCHAR2(20) CONSTRAINT nn_ad_5 NOT NULL , city VARCHAR2(30) CONSTRAINT nn_ad_6 NOT NULL , state_province VARCHAR2(30) CONSTRAINT nn_ad_7 NOT NULL , postal_code VARCHAR2(20) CONSTRAINT nn_ad_8 NOT NULL); -- Insert a row with atomic column values. INSERT INTO address VALUES ( 2001, 2001, 1008,'1111 Broadway','Suite 500','MS-5045','Oakland','CA','94604'); -- Query the data. SELECT street_address1 , street_address2 , street_address3 , city , state_province , postal_code FROM address; |
2nd Normal Form (2NF) ↓
Second Normal Form (2NF) exists when a table is already in first normal form and all non-key columns depend on all of the key columns, where the list of key columns goes from 1 to n. The list of key columns makes up the natural key of a table, or the list of columns that makes any row unique in a table. A table that has only a single column as the natural key, like a Vehicle Identification number, is automatically in second normal form.
A table that has two or more columns as a natural key may contain one or more non-key columns that has a partial dependency on one or a set of columns less than all the columns in the key (or primary key). This typically means you have created a table that contains two subjects. The following illustrations shows the architecture of the 1NF table before converting it to 2NF.
You divide the table into two tables to fix this problem. Second normal form modeling finds and moves columns from a base table when they have a partial dependence on part of a primary key. They are generally made into an independent table. A foreign key replaces the column or columns in the base table, and it points to or has a functional dependency on the primary key in the new table.
The foreign key in the base table becomes part of the primary key. The remaining non-key columns depend wholly on the natural key of the original or base table. The following drawing represents the simplified outcome of creating second normal form table.
Examples & Solutions: ↓
A table that has one or more columns with a partial dependency on one or more columns of a compound primary key isn’t in second normal form.
Problem Example ↓
You’ll notice that the following example table has a compound (or composite) primary key made up of the rental_id
and item_id
that is highlighted in yellow. There is also a group of columns highlighted in cyan that have a partial dependency on only the item_id
column of the primary key.
This partial dependency violates the rule of second normal form that all of the non-key columns are functionally dependent on the whole key. The whole key is the primary key. This violation indicates that the rental_item
table contains two subjects.
Problem Solution ↓
You should break the table into two tables where each focuses on a single subject. The following shows two new tables.
The item_id
column in the rental_item
table becomes a foreign key in the base table and a primary key in the new table. The new item
table contains all the non-key columns that had a partial dependency on the item_id
. You can see that the column or set of columns that have a partial dependency become their own table. This repeats how you resolved a repeating group to ensure a table met the first normal form rules. Both of these new tables are in at least second normal form.
Setup Scripts ↓
You run this script to create a table with a partial dependency that violates second normal form.
-- Conditionally drop the table. BEGIN FOR i IN (SELECT TABLE_NAME FROM user_tables WHERE TABLE_NAME = 'RENTAL_ITEM') LOOP EXECUTE IMMEDIATE 'DROP TABLE '||i.table_name||' CASCADE CONSTRAINTS'; END LOOP; END; / -- Create the 1st normal form table. CREATE TABLE rental_item ( rental_id NUMBER , item_id NUMBER , customer_id NUMBER CONSTRAINT nn_ri2 NOT NULL , check_out_date DATE CONSTRAINT nn_ri3 NOT NULL , return_date DATE CONSTRAINT nn_ri4 NOT NULL , item_barcode VARCHAR2(20) CONSTRAINT nn_ri5 NOT NULL , item_type NUMBER CONSTRAINT nn_ri6 NOT NULL , item_title VARCHAR2(60) CONSTRAINT nn_ri7 NOT NULL , item_subtitle VARCHAR2(60) , item_rating VARCHAR2(8) CONSTRAINT nn_ri8 NOT NULL , item_rating_agency VARCHAR2(4) CONSTRAINT nn_ri9 NOT NULL , item_release_date DATE CONSTRAINT nn_ri10 NOT NULL , CONSTRAINT pk_ri PRIMARY KEY (rental_id, item_id)); |
You run this script to create a tables that meet second normal form rules. The previous partial dependency is eliminated by taking the columns out of one table and creating a new table. This script also serves as your starting point in the 3rd normal form example.
-- Conditionally drop the table. BEGIN FOR i IN (SELECT TABLE_NAME FROM user_tables WHERE TABLE_NAME IN ('ITEM','RENTAL')) LOOP EXECUTE IMMEDIATE 'DROP TABLE '||i.table_name||' CASCADE CONSTRAINTS'; END LOOP; END; / -- Create the 2nd normal form tables. CREATE TABLE rental ( rental_id NUMBER CONSTRAINT pk_r1 PRIMARY KEY , item_id NUMBER CONSTRAINT nn_r1 NOT NULL , customer_id NUMBER CONSTRAINT nn_r2 NOT NULL , check_out_date DATE CONSTRAINT nn_r3 NOT NULL , return_date DATE CONSTRAINT nn_r4 NOT NULL); CREATE TABLE item ( item_id NUMBER CONSTRAINT pk_i1 PRIMARY KEY , item_barcode VARCHAR2(20) CONSTRAINT nn_i1 NOT NULL , item_type NUMBER CONSTRAINT nn_i2 NOT NULL , item_title VARCHAR2(60) CONSTRAINT nn_i3 NOT NULL , item_subtitle VARCHAR2(60) , item_rating VARCHAR2(8) CONSTRAINT nn_i4 NOT NULL , item_rating_agency VARCHAR2(4) CONSTRAINT nn_i5 NOT NULL , item_release_date DATE CONSTRAINT nn_i6 NOT NULL); -- Add foreign key constraints. ALTER TABLE rental ADD CONSTRAINT fk_r1 FOREIGN KEY(item_id) REFERENCES item(item_id); |
3rd Normal Form (3NF) ↓
Third Normal Form (3NF) exists when a table is already in second normal form and there are no transitive dependencies. A transitive dependency means a non-key column or set of columns are dependent on another column that generally isn’t part of the natural key.
Since all non-key columns should be wholly dependent on the primary key, a transitive dependency exists when a column’s functional dependency routes through another column or set of columns on its way to the primary key. This type of relationship indicates that the there are at least two subjects in a table, and appears as follows conceptually:
You should generally make the column or set of columns that acts as the intermediary column the natural primary key in a new table. The column or columns that had a transitive dependency on the intermediary column should also move to the new table. A foreign key to the new table should replace those columns in the original table, as shown in the following conceptual drawing:
Some exceptions qualify when a transitive dependency doesn’t disqualify a table from being in third normal form. The exceptions are:
- When the transitive dependency is on a column that is part of the natural key.
- When the transitive dependency is on a superkey, like a gender column.
- When the transitive dependency is trivial, which typically means the dependency doesn’t compel the addition of an extra row to a table.
Moreover, you break up the table into two parts to fix any transitive dependencies when moving from second to third normal form. This mirrors what you did to fix partial dependencies when moving from first to second normal form. The only difference between these two exercises is that a partial dependency relates to a part of the primary key, while a transitive dependency relates to a non-key column. The non-key column acts as a portal to the primary key of the second normal form table.
Therefore, third normal form modeling finds and moves columns from a base table when they have a transitive dependence on a column outside the primary key. These columns are generally made into an independent table. A foreign key replaces the columns removed from the base table. The foreign key depends wholly on the primary key of the new table.
Examples & Solutions ↓
A table that has one or more columns with a transitive dependency on one or more non-key columns isn’t in third normal form.
Problem Example ↓
You’ll notice that the following example table has a rating
that has a transitive dependency on the rating_agency
column. This isn’t a partial dependency because the rating_agency
column isn’t part of the primary key.
Problem Solution ↓
You fix this by putting the two columns in a new table, and making the column that is the transitive dependency a foreign key in the base table. Alternatively, you can add a surrogate key to the new table. If you add the surrogate key in the new table, remove both columns from the base table and add a foreign key column that points to the new surrogate key (primary key) column.
Instead of creating the new table with a natural key, the new table has a surrogate key. The old table therefore adds a foreign key of the same name that maps back to the surrogate key of the item_rating
table.
Setup Scripts ↓
While the second script in the prior section lets you create the item
table as your starting point, the next script lets you fix the transitive dependency in the previous item
table.
-- Conditionally drop the table. BEGIN FOR i IN (SELECT TABLE_NAME FROM user_tables WHERE TABLE_NAME IN ('ITEM','RENTAL')) LOOP EXECUTE IMMEDIATE 'DROP TABLE '||i.table_name||' CASCADE CONSTRAINTS'; END LOOP; END; / -- Create the 2nd normal form tables. CREATE TABLE rental ( rental_id NUMBER CONSTRAINT pk_r1 PRIMARY KEY , item_id NUMBER CONSTRAINT nn_r1 NOT NULL , customer_id NUMBER CONSTRAINT nn_r2 NOT NULL , check_out_date DATE CONSTRAINT nn_r3 NOT NULL , return_date DATE CONSTRAINT nn_r4 NOT NULL); CREATE TABLE item ( item_id NUMBER CONSTRAINT pk_i1 PRIMARY KEY , item_barcode VARCHAR2(20) CONSTRAINT nn_i1 NOT NULL , item_type NUMBER CONSTRAINT nn_i2 NOT NULL , item_title VARCHAR2(60) CONSTRAINT nn_i3 NOT NULL , item_subtitle VARCHAR2(60) , item_rating_id NUMBER CONSTRAINT nn_14 NOT NULL , item_rating VARCHAR2(8) CONSTRAINT nn_i5 NOT NULL , item_rating_agency VARCHAR2(4) CONSTRAINT nn_i6 NOT NULL , item_release_date DATE CONSTRAINT nn_i7 NOT NULL); CREATE TABLE item_rating ( item_rating_id NUMBER CONSTRAINT pk_ir1 PRIMARY KEY , item_rating VARCHAR2(8) CONSTRAINT nn_ir1 NOT NULL , item_rating_agency VARCHAR2(4) CONSTRAINT nn_ir2 NOT NULL); -- Add foreign key constraints. ALTER TABLE rental ADD CONSTRAINT fk_r1 FOREIGN KEY(item_id) REFERENCES item(item_id); ALTER TABLE item ADD CONSTRAINT fk_r1 FOREIGN KEY(item_rating_id) REFERENCES item_rating(item_rating_id); |
I’m a bit confused on the definition of attribute and tuple. The two have the exact same definition, but in the definition they appear to be two different things. This is the line I am referring to, “A column is an attribute, and a row is a tuple.” Could you please help clear up my confusion on the matter? Thanks.
Jordan
21 Jun 10 at 9:34 pm
An attribute is the mathematical definition of a placeholder in a matrix (or a table in a database). A matrix can hold 1 to many attributes in a row. In mathematics the row is actually called a tuple. The post is trying to expose you to the different ways to label elements in a table. The phrase attribute is synonymous with column, and row has the same relation to tuple.
michaelmclaughlin
21 Jun 10 at 9:39 pm
I loved the definition table – it would have been helpful the first day of class. I do have one suggestion though- don’t use the term you are trying to define in a defination (i.e. – primary key).
Chris Staber
20 Sep 10 at 12:17 pm
I’ll add a table by itself to Lesson #1. Thanks!
michaelmclaughlin
3 Oct 10 at 2:46 pm
This page states that there is “no zero normal form”. Is that correct? It seems that we have spoken of
0NF
andUNF
being the same thing in class.Hyrum Denney
26 Nov 10 at 4:16 pm
Under “Insertion Anomalies” you have written a word twice in this sentence *”The table includes includes two subjects.” I also agree with Chris, the definitions are awesome and I would have liked to have them earlier.
Sam Brubaker
2 Feb 11 at 11:34 pm
Under the Normalization Definitions Summary section, it is stated that there is no zero normalized form when there actually is.
Tyler Hull
18 Mar 11 at 10:59 am
Perhaps to keep things simpler in your blog you could put some of the extraneous details as footnotes. For example, the E. F. Codd reference.
Eli
4 May 11 at 8:28 pm
I think that there needs to be a slight modification to the “item” table in the Setup Script for 3NF. The script shows that you’ve setup a new third table called “item_rating” with the attributes item_rating_id (primary key), item_rating, and item_rating_agency. The latter two attributes needed to be removed from the “item” table in order to achieve third normal form (previously they were transitive dependencies in the base table). You’ve then altered the ‘item’ table and set a foreign key on the item_rating_id, but the original transitive dependencies are still in the ‘item’ table after everything has been shifted around for 3NF.
Tighe
29 Jan 13 at 12:00 pm
Hyrum, Thanks for pointing it out. It’s fixed now, and yes there’s a 0NF, which is basically a table with repeating rows.
michaelmclaughlin
25 Oct 14 at 6:40 pm