NULLs Data Model and SQL Smells

Nulls - An SQL Smell?
Nulls – An SQL Smell?

NULLs are a perennial problem. Nobody likes them. They confuse developers and users and many analysts do not really understand them.

The concept of NULL allows us to say that there are things that we do not know.

In his article on SQL Smells, Phil Factor associates several smells with NULLs. In this post I’ll explain how to avoid using NULLs and how to use them properly when they are necessary.

Why do we need NULLs at all? What is the benefit and what is the cost?

NULL has a simple meaning with wide-ranging and surprising consequences. NULL means the value is unknown. And this in turn means that the result of any calculation or concatenation which uses this value must also be unknown!

NULL is a feature of SQL. The benefit of allowing NULL or “three valued logic” (TRUE/FALSE/UNKNOWN) is that it allows a database to record that there are things we do not know. The cost of having them is that any calculation or concatenation which uses this value must also be unknown! This confuses many people.

Reasons for needing NULL

There are many reasons why we might not have data to put into a column. Thinking about why we are considering defining a column as NULLable will encourage us to consider alternatives.

The structural NULL – Permanent sub-types

Customer with Sub-types which may cause NULLs
Customer with Sub-types which may cause NULLs

Sometimes we want to combine two entities into a single “super-type” table. There are attributes of Person will never be used for Business (and vice-versa). These missing values will need NULLs.

The structural NULL – Lifecycle subtypes

Order with Lifecycle sub-types which may cause NULLs
Order with Lifecycle sub-types which may cause NULLs

Something similar can happen if we combine all the steps of a entities lifecycle into a single table. The attributes of the later stages will always be empty (NULL) until that stage is reached. These later steps often contain dates or times.

In both cases involving sub-types it may be possible to splitting the sub-types into separate tables. Consider whether it is worth the effort and make sure you avoid the pitfalls of sub-types in SQL.

Data that will never be there

Attributes and Values for an "Address"
Attributes and Values for an “Address”

Entities like “Address” are frequently modelled with attributes like “AddressLine_”. In many cases there will never be values for the later lines. They will not be mandatory in the user interface, but do they need to be NULL? Consider whether allowing them to default to “spaces” or an empty string, would be better and whether it would have any bad effects.

Things which should never allow NULL

Always decide whether you expect an attribute to have a value. Don’t leave it to chance.

There are some things which should hardly ever allow NULL. This includes all keys and identifiers. Avoid allowing short titles or descriptions to be NULL (For long descriptions allowing NULLs is understandable).

Unavoidable NULLs

Attributes and Values for a "Person" Entity - Sometimes NULL is hard to avoid
Attributes and Values for a “Person” Entity – Sometimes NULL is hard to avoid

There are some attributes where allowing NULL is hard to avoid. Life insurance and pensions companies may need a “date of death” for their customers! Having a column with allows NULL is often the easiest way of handling this.

You should resist the temptation to use “magic dates” or inappropriate data-types in order to avoid allowing NULL. The consequences are far worse than the problem.

Summary

NULLs are a problem, nobody likes them but they are necessary. Many problems with NULLs can be avoided by two rules:

  • Remember that NULL means “unknown value” and this has consequences.
  • Ask “_why_ don’t we have this data?”

In many cases NULLs can be avoided by data modelling – that means the analyst has to do work in the Conceptual or Logical Model.

Where next?

The next article is about another smell: having the same name for different things!

Indexes, Analysts and SQL Smells

SQL Indexes - A Goldilocks problem for analysts or an SQL Smell
SQL Indexes – A Goldilocks problem for analysts or an SQL Smell

Database Indexes are something which a lot of analysts ignore as being “too technical”. This is a pity.

Several SQL Smells in Phil Factor’s article point at possible bad decisions. Thinking in the  Logical Model can improve these decisions.

Choosing the correct indexes is a typical “Goldilocks Problem”: not too few, not too many, just the right number! Bad or inadequate requirements will contribute to designers making bad decisions. Phil Factor describes having the wrong indexes as an SQL Smell!

What is an Index and what does it do for us?

If you are an analyst, you may not know exactly what an index is. In non-technical terms, an index provides a quick way for the database manager to find the rows it needs in a table. There are several sorts of index.

An index is a “thing” in its own right. An index takes up space in the database. Updating an index costs effort. The main benefit of an index is that it makes select or read operations faster.

Types: Unique, Non-Unique, Clustered

There are three main kinds of index: Unique, Non-Unique and Clustered.

Imagine that we have a very simply database consisting of 3 tables:

  • Customer (not shown in the diagram)
  • Order
  • OrderLine
Two tables: Order and OrderLine - Where to put the indexes?
Two tables: Order and OrderLine – Where to put the indexes?

Unique Indexes:

Candidate Unique Indexes on an "Order" SQL table
Candidate Unique Indexes on an “Order” SQL table

In the Order table we can see three columns which might be used to identify the order

If you are not familiar with GUIDs, they are a way of assigning identifiers or “keys”. They are worth finding out about. It would be unusual to expect a human being to type in a GUID. An “OrderNum” (which paradoxically might contain letters!) would be more convenient for the users.

We expect all three: OrderId, GUID and OrderNum, to be unique. Therefore, all three are candidates for Unique Indexes. If an application attempts to create a duplicate value in a column which has a unique index, then the database manager will raise an error and reject the transaction.

Non-Unique Indexes for Foreign Keys

Candidate Indexes on an "OrderLine" SQL table
Candidate Indexes on an “OrderLine” SQL table

In the OrderLine table you can see two columns which identify things in other tables: OrderId and ProductId. These are Foreign Keys. In this case we cannot say they are unique, but they are candidates for non-unique indexes.

We could also have used OrderNum or OrderGUID as Foreign Keys into Order.

It is good practice for the rows in the OrderLine table to have a unique identifier. There are two common ways of doing this.

  • We can assign an OrderLineId (which is unique across the whole table) or
  • we can use the combination of OrderId and OrderLineNum which together would identify a row.

In this example, both OrderLineId and the OrderId and OrderLineNum combination are (seperate) candidate unique indexes.

Clustered Indexes

The order of the rows in an SQL table is specified by the Clustered Index. Each table can have only one clustered index. The clustered index must be unique.

People often make the “primary key” the clustered index, but it is worth considering other options. In the example, OrderLines can be added to an order after it has been created.

Using the OrderId, OrderLineNum index as the clustered index would make the database store all the “lines” for one order together (whenever they were added to the order). That may be more efficient for retrieval. Phil Factor identifies two smells with the choice of clustered indexes.

Non-Unique Indexes for Searching

Candidate Indexes on a "Customer" SQL table - A Non-Unique index on Name would help searches
Candidate Indexes on a “Customer” SQL table – A Non-Unique index on Name would help searches

Columns which will be used for searching should be considered candidates for a non-unique index.

The role of Analysts in choosing Indexes and Index types

Indexes are usually specified in the “Physical Model”. The analyst can help the database designer make the right decisions, by applying a little thought. The analyst should not try to pre-empt the designers decisions. They should aim to assist by identifying relevant “candidates”.

"<yoastmark

Summary

Indexes enforce business rules like uniqueness in an SQL database. They influence database performance. Considering candidate indexes in the Logical Model and even the Conceptual Model will help database designers make better decisions.

Where next?

The next article is about the smell of nothing, or “Nulls”. Nulls present problems for developers and database designers.

Check Constraint to limit values – SQL Smell?

Do Check Constraints Cause an SQL Smell?
Do Check Constraints Cause an SQL Smell?

In his article on SQL Smells, Phil Factor is not in favour of using Check Constraint to limit the values in columns.

In my previous post I explained why Phil Factor recommends referential integrity. I am going to explain this apparent contradiction.

What are CHECK CONSTRAINTs and what do they do for us?

Using CHECK CONSTRAINT to Restrict Allowed Values
Using CHECK CONSTRAINT to Restrict Allowed Values

Like REFERENCES, the CHECK CONSTRAINT also says what are values can be stored in a column. The logical expression must be true for the value to be allowed.

The logical expression can be limits or even a list of permitted values (as in the illustration).

What is the problem with Check Constraints?

This sounds like a marvellous idea! The benefit is clear. Constraints will exclude invalid data from the database, even when it is loaded using a utility (eg BULK INSERT). We can define constraints which will protect the database from bad data. So what is the problem?

Why does Phil Factor regard this as an SQL Smell? The answer is that using Check Constraints in this way embeds constant values in the structure of the SQL database!

The database structure is “locked down” in commercial systems. Only authorised people are allowed to make changes to the structure. Phil Factor wants to avoid these changes.

When should we use Check Constraints?

This criticism does not mean that we should never use Check Constraints. The ideal constraint will not change in the lifetime of the database.

Possible candidates for using CHECK CONSTRAINTs to Restrict Values
Possible candidates for using CHECK CONSTRAINTs to Restrict Values
  • For attributes which represent “classifications” and “types” we should note how many different values we are expecting, and how frequently the allowed values change. Short lists which change very rarely may be acceptable.
  • On the other hand, consider re-designing a CHECK CONSTRAINT as a FOREIGN KEY by adding an additional table to contain the valid values. This has the benefit of making adding a new value a simple data change!
  • Do not use Check Constraints to enforce arbitrary limits.

Check Constraint and Requirements

We can identify candidates for Check Constraint when we construct the Conceptual Model. We should note:

  • The number of options
  • The expected frequency of change.
  • That information will enable us to make an informed decision about how to validate the values of that column.

Unfortunately, some of the examples using Check Constraints perform the checks against arbitrary values. These examples will work technically but copying them may cause the problems Phil Factor wants us to avoid.

Summary

Check Constraints provide a way of validating data values. They are appropriate for checking against values which do not change.

For lists, lookup-tables with Foreign Key Constraints may be better.

Do not use Check Constraints against arbitrary values or values which change frequently.

Where next?

The next article covers “Indexes”. I will explain how an Analyst can influence some design decisions.

Not using Referential Integrity – An SQL Smell!

Not Using Referential Integrity - SQL Smell?
Not Using Referential Integrity – SQL Smell?

In his article on SQL Smells, Phil Factor describes identifies two “smells” associated with Constraints in SQL. He regards not using Referential Integrity as an SQL Smell.

In this post I will explain what a Constraint is, and how one kind of Constraint provides Referential Integrity.

What are CONSTRAINTs and what do they do for us?

A Constraint limits the values which are allowed in a column. The SQL database manager will use constraints to validate data. There are several types of constraint.

Phil Factor thinks that not enforcing Referential Integrity is usually bad (and I agree).

Referential Integrity and the FOREIGN KEY REFERENCES Constraint

The FOREIGN KEY REFERENCES constraint on a column makes it a Foreign Key to another table. The Foreign Key must be present as a Primary Key in the referenced table. This is called Referential Integrity.

In the illustration, it is not possible for the Order table to contain a CustomerID which does not exist in the table Customer.

Enforcing Referential Integrity with REFERENCES
Enforcing Referential Integrity with REFERENCES

This is very powerful. The constraint makes the database manager check every change. Statements which would violate the Foreign Key constraint are rejected.

Performing the validation costs effort.

Phil Factor points out an additional benefit. The database manager may be able to use the Foreign Keys to improve performance of some queries.

Why wouldn’t you use Referential Integrity?

The arguments most frequently used for not implementing referential integrity are:

  • Performance: I have already mentioned the “cost” of checking the constraints. This argument:
    • Prejudice against logic in the database
    • A desire to minimise the “load” on the database manager.
  • Dirty Data: An “upstream” system may provide data which violates the constraint. Accepting this argument may mean importing faulty data, and the associated problems into your system!

“Staging tables” may be a better solution to this problem.

Both of these arguments are valid in some circumstances. There are times alternatives are better. Ask the following questions:

  • How are you going to perform the validation?
  • Which system components are going to perform the validation?
  • Do you expect the alternative solution to perform better?
  • Will the alternative solution be better in some way?

Not having answers to these questions is not really acceptable. Try and make a rational decision, based on numbers.

How Referential Integrity and Requirements interact

The Conceptual Data Model identifies business entities and relationships. Those relationships define the referential integrity requirements.

Not using referential integrity implies that:

  • The system is going to allow invalid data, or
  • The system is going to validate the data in some other way.

Summary

Foreign Key Constraints exclude some invalid data from the system. Orphaned records (eg non-existent Customers) are impossible.

Exceptions to this rule require a rational justification.

Where next?

In the next article I will look at “Check Constraints” which are a different way of ensuring valid data.

SQL Hierarchies – SQL Smells

SQL Hierarchy - Hierarchies are a problem in SQL - An SQL smell
SQL Hierarchy – Hierarchies are a problem in SQL – An SQL smell

Phil Factor describes “Storing the hierarchy structure in the same table as the entities that make up the hierarchy”, and hierarchies as an SQL Smell. It is one of the longer titles in his article!

If you are not familiar with this construction, it is when a table refers back to itself using a “one-to-many” relationship.

Hierarchies are everywhere in the real world: product categories, nested geographic areas and organisation structures are all reasons we use hierarchies. Hierarchies are essential in order to model real business requirements, but they are difficult to get right. They are difficult to explain to business people and they can be difficult to specify properly to developers. When you decide you need them, then you should look at them closely.

I go a stage further than Phil Factor and say that all requirements for hierarchies should be reviewed, because they can be tricky.

How does a single table represent a hierarchy?

Product Hierarchy as a Single SQL table
Product Hierarchy as a Single SQL table

The figure shows a single table (called “Product_Category”). Each row can “point at” its parent using a Foreign Key in the “ParentCategoryId” column. In the example: Bikes, Components, Clothing and Accessories are all top-level categories. Mountain Bikes, Road Bikes and Touring Bikes are all sub-categories of Bikes. You can see how the row for Mountain Bikes “points at” Bikes.

This is the construction Phil Factor regards as an “SQL Smell” in his article. He says:

  • This approach “mixes relationships and values” (a philosophical objection), and
  • He states that the “closure table” pattern is more suitable for modelling real-life hierarchies.

I agree with Phil Factor on both counts. I’m not entirely against using the single table hierarchy, because it can be very simple and effective, but we need to recognise its limitations. People use the single table hierarchy because they are seduced by its apparent elegance and because they are not aware of any alternatives.

How the “closure table” represents a hierarchy in two tables

Two ways of modelling a hierarchy in SQL tables - single self-referencing table and closure table
Two ways of modelling a hierarchy in SQL tables

A two table “closure table” design is my (and Phil Factor’s) preferred way of storing hierarchies. In this pattern the information about the categories is held in the Category table and the information about the relationships between categories is held as foreign keys in the Product_Category_Closure table.

I like the “closure table” pattern because it is extremely flexible. There is an excellent technical description of how to use it here.

I am still cautious about the “closure table”, because we need to make sure we understand the requirements for the hierarchy we are designing.

Hierarchies as a Requirements Smell for Analysts

Phil Factor identifies the “Single Table Hierarchy” as an SQL Smell. I would go further and identify Hierarchies in general as a Requirements Smell. We can spot hierarchies at the earliest stages of creating the Conceptual Model of the database.

The issue is not “having Hierarchies” in your model. The issue is that both the “Single Table” and “Closure Table” patterns may seduce us into thinking that we have understood what the Business wants, or what the System needs to do, because we have added a table or two into our model.

Hierarchies are an area where the model of the data structure we use in the system can have a real impact (for good and bad) on what the system is capable of.

  • It is not really reasonable to expect business people to understand the nuances of choosing one Conceptual Model over another.
  • Neither is it right to expect Developers or Database designers to decide the importance of things like sequencing of categories, or rules about “depth”.
  • This is especially a problem if the Analyst has created a Model which contains a single table linked to itself and used that as a short-hand for “we need some sort of hierarchy here”, but has not investigated it or specified what it is.

Consequently, there is a danger here that we can either get into “analysis paralysis”, investigating hierarchies, on the one hand, or creating a database with an over-simplified or over complex solution for the hierarchy on the other.

How an analyst should approach hierarchies

In order to manage the risks of “analysis paralysis” or an inappropriate design, I suggest the following approach:

  • Recognise that the Business People may not fully understand the consequences of what they are agreeing to.
  • Make sure you illustrate the definition of the hierarchy with realistic examples. It won’t be wasted effort. The examples will be good for reflecting back the Requirements to the Business and will be useful for explaining the Requirements to whoever is designing the database. The examples will also be useful as the basis for test data.
  • The Business may review their requirements after they have seen the hierarchy implementation. That makes hierarchies a likely candidate for iterative development whether we want it or not!

Summary

Hierarchies are normal and essential Business Requirements. The “single table hierarchy” pattern may over-simplify Requirements, and even the “Closure Table” pattern may seduce us into specifying a design which we do not fully understand.

Hierarchies are a Requirements Smell. If the business needs hierarchies we should not try to eliminate them. Hierarchies deserve particular attention. They should be recognised as a potential risk in the project plan.

Where next?

“Hierarchies” are a problem with understanding real requirements and then converting a Conceptual Model into a Logical Model. They are an unavoidable Requirements Smell.

In the next article I’m going to look deeper than usual into table definitions. I’m going to consider the SQL Smells and Requirements Smells around – “SQL Constraints – for Referential Integrity and for Column Values”.