Same column name in different tables- SQL Smells

Two Columns: Same Name, Different Data-type = Nasty! SQL Smell
Two Columns: Same Name, Different Data-type = Nasty! SQL Smell

Here’s another of Phil Factor’s SQL Smells. “Using the same column name in different tables but with different data-types”.

At first glance this seems harmless enough, but it can cause all sorts of problems. All of them are avoidable. If you are an analyst, make sure you are not encouraging this to happen. If you creating the Physical Model or DDL for a database, “Just don’t do it!”

As Phil Factor says, this is: “an accident waiting to happen.”

Two “rights” can make a “wrong”!

The problem here is not “using the same column name in different tables”. That is perfectly ok. Similarly, using “different data-types for different columns” cannot be wrong. That’s exactly what you should expect to do.

The problem is doing both at the same time. The issues are: practical and technical.

The Practical Problem of the same column name with different data-types

Any human user is likely to think that the same name refers to the same type of thing. They won’t check that the definitions of both “names”. No amount of “procedures” or “standards” will make them do anything different.

Sooner or later this will cause one of the technical problems.

The Technical Problems from the same column name with different data-types

Technical problems will occur when a value moves from one column to the other, or when comparing the two columns. Data may be truncated and those data transformations cost effort.

These problems may not manifest themselves immediately. The consequences will be data-dependent bugs and poor performance.

The Solution to this Issue

This smell and the associated problems can be avoided by following some simple rules:

  1. If two columns refer to the same thing (like a foreign key and a primary key), make sure they are the same data type.
  2. If two columns refer to different things, then give them distinct names. (Do not resort to prefixing every column name with the table name. That’s horrible)
  3. Having columns with have different names and the same data-type is perfectly OK.

Summary

“Using the same column name in different tables with different data-types” in an SQL database is simply “an accident waiting to happen.” It is easily avoided. Don’t do it and don’t do anything to encourage it.

Where next?

The next article is about the smells which come from dates and times.

Check Constraint to limit values – SQL Smell?

Do Check Constraints Cause an SQL Smell?
Do Check Constraints Cause an SQL Smell?

In his article on SQL Smells, Phil Factor is not in favour of using Check Constraint to limit the values in columns.

In my previous post I explained why Phil Factor recommends referential integrity. I am going to explain this apparent contradiction.

What are CHECK CONSTRAINTs and what do they do for us?

Using CHECK CONSTRAINT to Restrict Allowed Values
Using CHECK CONSTRAINT to Restrict Allowed Values

Like REFERENCES, the CHECK CONSTRAINT also says what are values can be stored in a column. The logical expression must be true for the value to be allowed.

The logical expression can be limits or even a list of permitted values (as in the illustration).

What is the problem with Check Constraints?

This sounds like a marvellous idea! The benefit is clear. Constraints will exclude invalid data from the database, even when it is loaded using a utility (eg BULK INSERT). We can define constraints which will protect the database from bad data. So what is the problem?

Why does Phil Factor regard this as an SQL Smell? The answer is that using Check Constraints in this way embeds constant values in the structure of the SQL database!

The database structure is “locked down” in commercial systems. Only authorised people are allowed to make changes to the structure. Phil Factor wants to avoid these changes.

When should we use Check Constraints?

This criticism does not mean that we should never use Check Constraints. The ideal constraint will not change in the lifetime of the database.

Possible candidates for using CHECK CONSTRAINTs to Restrict Values
Possible candidates for using CHECK CONSTRAINTs to Restrict Values
  • For attributes which represent “classifications” and “types” we should note how many different values we are expecting, and how frequently the allowed values change. Short lists which change very rarely may be acceptable.
  • On the other hand, consider re-designing a CHECK CONSTRAINT as a FOREIGN KEY by adding an additional table to contain the valid values. This has the benefit of making adding a new value a simple data change!
  • Do not use Check Constraints to enforce arbitrary limits.

Check Constraint and Requirements

We can identify candidates for Check Constraint when we construct the Conceptual Model. We should note:

  • The number of options
  • The expected frequency of change.
  • That information will enable us to make an informed decision about how to validate the values of that column.

Unfortunately, some of the examples using Check Constraints perform the checks against arbitrary values. These examples will work technically but copying them may cause the problems Phil Factor wants us to avoid.

Summary

Check Constraints provide a way of validating data values. They are appropriate for checking against values which do not change.

For lists, lookup-tables with Foreign Key Constraints may be better.

Do not use Check Constraints against arbitrary values or values which change frequently.

Where next?

The next article covers “Indexes”. I will explain how an Analyst can influence some design decisions.

Not using Referential Integrity – An SQL Smell!

Not Using Referential Integrity - SQL Smell?
Not Using Referential Integrity – SQL Smell?

In his article on SQL Smells, Phil Factor describes identifies two “smells” associated with Constraints in SQL. He regards not using Referential Integrity as an SQL Smell.

In this post I will explain what a Constraint is, and how one kind of Constraint provides Referential Integrity.

What are CONSTRAINTs and what do they do for us?

A Constraint limits the values which are allowed in a column. The SQL database manager will use constraints to validate data. There are several types of constraint.

Phil Factor thinks that not enforcing Referential Integrity is usually bad (and I agree).

Referential Integrity and the FOREIGN KEY REFERENCES Constraint

The FOREIGN KEY REFERENCES constraint on a column makes it a Foreign Key to another table. The Foreign Key must be present as a Primary Key in the referenced table. This is called Referential Integrity.

In the illustration, it is not possible for the Order table to contain a CustomerID which does not exist in the table Customer.

Enforcing Referential Integrity with REFERENCES
Enforcing Referential Integrity with REFERENCES

This is very powerful. The constraint makes the database manager check every change. Statements which would violate the Foreign Key constraint are rejected.

Performing the validation costs effort.

Phil Factor points out an additional benefit. The database manager may be able to use the Foreign Keys to improve performance of some queries.

Why wouldn’t you use Referential Integrity?

The arguments most frequently used for not implementing referential integrity are:

  • Performance: I have already mentioned the “cost” of checking the constraints. This argument:
    • Prejudice against logic in the database
    • A desire to minimise the “load” on the database manager.
  • Dirty Data: An “upstream” system may provide data which violates the constraint. Accepting this argument may mean importing faulty data, and the associated problems into your system!

“Staging tables” may be a better solution to this problem.

Both of these arguments are valid in some circumstances. There are times alternatives are better. Ask the following questions:

  • How are you going to perform the validation?
  • Which system components are going to perform the validation?
  • Do you expect the alternative solution to perform better?
  • Will the alternative solution be better in some way?

Not having answers to these questions is not really acceptable. Try and make a rational decision, based on numbers.

How Referential Integrity and Requirements interact

The Conceptual Data Model identifies business entities and relationships. Those relationships define the referential integrity requirements.

Not using referential integrity implies that:

  • The system is going to allow invalid data, or
  • The system is going to validate the data in some other way.

Summary

Foreign Key Constraints exclude some invalid data from the system. Orphaned records (eg non-existent Customers) are impossible.

Exceptions to this rule require a rational justification.

Where next?

In the next article I will look at “Check Constraints” which are a different way of ensuring valid data.

SQL Hierarchies – SQL Smells

SQL Hierarchy - Hierarchies are a problem in SQL - An SQL smell
SQL Hierarchy – Hierarchies are a problem in SQL – An SQL smell

Phil Factor describes “Storing the hierarchy structure in the same table as the entities that make up the hierarchy”, and hierarchies as an SQL Smell. It is one of the longer titles in his article!

If you are not familiar with this construction, it is when a table refers back to itself using a “one-to-many” relationship.

Hierarchies are everywhere in the real world: product categories, nested geographic areas and organisation structures are all reasons we use hierarchies. Hierarchies are essential in order to model real business requirements, but they are difficult to get right. They are difficult to explain to business people and they can be difficult to specify properly to developers. When you decide you need them, then you should look at them closely.

I go a stage further than Phil Factor and say that all requirements for hierarchies should be reviewed, because they can be tricky.

How does a single table represent a hierarchy?

Product Hierarchy as a Single SQL table
Product Hierarchy as a Single SQL table

The figure shows a single table (called “Product_Category”). Each row can “point at” its parent using a Foreign Key in the “ParentCategoryId” column. In the example: Bikes, Components, Clothing and Accessories are all top-level categories. Mountain Bikes, Road Bikes and Touring Bikes are all sub-categories of Bikes. You can see how the row for Mountain Bikes “points at” Bikes.

This is the construction Phil Factor regards as an “SQL Smell” in his article. He says:

  • This approach “mixes relationships and values” (a philosophical objection), and
  • He states that the “closure table” pattern is more suitable for modelling real-life hierarchies.

I agree with Phil Factor on both counts. I’m not entirely against using the single table hierarchy, because it can be very simple and effective, but we need to recognise its limitations. People use the single table hierarchy because they are seduced by its apparent elegance and because they are not aware of any alternatives.

How the “closure table” represents a hierarchy in two tables

Two ways of modelling a hierarchy in SQL tables - single self-referencing table and closure table
Two ways of modelling a hierarchy in SQL tables

A two table “closure table” design is my (and Phil Factor’s) preferred way of storing hierarchies. In this pattern the information about the categories is held in the Category table and the information about the relationships between categories is held as foreign keys in the Product_Category_Closure table.

I like the “closure table” pattern because it is extremely flexible. There is an excellent technical description of how to use it here.

I am still cautious about the “closure table”, because we need to make sure we understand the requirements for the hierarchy we are designing.

Hierarchies as a Requirements Smell for Analysts

Phil Factor identifies the “Single Table Hierarchy” as an SQL Smell. I would go further and identify Hierarchies in general as a Requirements Smell. We can spot hierarchies at the earliest stages of creating the Conceptual Model of the database.

The issue is not “having Hierarchies” in your model. The issue is that both the “Single Table” and “Closure Table” patterns may seduce us into thinking that we have understood what the Business wants, or what the System needs to do, because we have added a table or two into our model.

Hierarchies are an area where the model of the data structure we use in the system can have a real impact (for good and bad) on what the system is capable of.

  • It is not really reasonable to expect business people to understand the nuances of choosing one Conceptual Model over another.
  • Neither is it right to expect Developers or Database designers to decide the importance of things like sequencing of categories, or rules about “depth”.
  • This is especially a problem if the Analyst has created a Model which contains a single table linked to itself and used that as a short-hand for “we need some sort of hierarchy here”, but has not investigated it or specified what it is.

Consequently, there is a danger here that we can either get into “analysis paralysis”, investigating hierarchies, on the one hand, or creating a database with an over-simplified or over complex solution for the hierarchy on the other.

How an analyst should approach hierarchies

In order to manage the risks of “analysis paralysis” or an inappropriate design, I suggest the following approach:

  • Recognise that the Business People may not fully understand the consequences of what they are agreeing to.
  • Make sure you illustrate the definition of the hierarchy with realistic examples. It won’t be wasted effort. The examples will be good for reflecting back the Requirements to the Business and will be useful for explaining the Requirements to whoever is designing the database. The examples will also be useful as the basis for test data.
  • The Business may review their requirements after they have seen the hierarchy implementation. That makes hierarchies a likely candidate for iterative development whether we want it or not!

Summary

Hierarchies are normal and essential Business Requirements. The “single table hierarchy” pattern may over-simplify Requirements, and even the “Closure Table” pattern may seduce us into specifying a design which we do not fully understand.

Hierarchies are a Requirements Smell. If the business needs hierarchies we should not try to eliminate them. Hierarchies deserve particular attention. They should be recognised as a potential risk in the project plan.

Where next?

“Hierarchies” are a problem with understanding real requirements and then converting a Conceptual Model into a Logical Model. They are an unavoidable Requirements Smell.

In the next article I’m going to look deeper than usual into table definitions. I’m going to consider the SQL Smells and Requirements Smells around – “SQL Constraints – for Referential Integrity and for Column Values”.

Inappropriate data-types – SQL Smells

Inappropriate Data-types cause SQL Smells!
Inappropriate SQL Data-types cause SQL Smells!

People choose inappropriate data-types. This isn’t surprising. There are lots of SQL data-types, so people make inappropriate choices. Phil Factor names “Using inappropriate data-types” as a smell in his article on SQL Smells.

I’m going to concentrate on dates and numbers in this post. I will explain why people choose inappropriate data-types. I will also describe an approach which will encourage you to choose the right ones.

This will be a superficial treatment. I’m going to look at the problem from a high level. Dates and numbers can suffer from detailed technical problems as well.

Why do people choose inappropriate data-types?

Confusion of Storage, Interchange and Presentation forms of data causes selection of inappropriate data-types
SQL: Storage, Interchange and Presentation forms of data

Phil Factor identifies the main reason for this problem as:

“Confusing how data will be presented with how it will be stored”.

I agree. Here are some reasons Analysts choose inappropriate data-types:

  • We approach problems from the outside and take the users’ point of view. We should consider presentation. Inside the database, data should be stored in an appropriate form.
  • The same argument applies for interfaces and interchange formats. Interface requirements should not determine the way data is stored internally. Interfaces are still important. Where possible, standard interchange formats should be used.
  • Spreadsheets have made us lazy. You don’t have to think about the “data-type” when you key something into a cell. Format and validation are often added afterwards.
  • There are “folk memories” about problems with data in old file-based systems. These systems did not have the rich range of data-types of modern databases and languages.

Consequences of using inappropriate data-types

Inappropriate data types can have serious consequences for a system we are building. Some of these problems are not obvious. Many of these problems apply to all systems. Some of these problems become more important with larger databases.

  • Having the wrong format makes validation harder. It prevents the database engine from checking the content and increases the risk of “garbage data” getting into the system.
  • The possible presence of garbage data makes error handling throughout the system harder.
  • Storing data in “display format” imbeds that format deep inside the system.
  • “Dates” are associated with useful functions which simplify program design.
  • Inappropriate data-types can change how data is sorted. This influences how indexes work and cause performance issues.
  • The “correct” data-types are usually very space efficient. Using alternatives can waste space in the database for no benefit.

Let’s look at some specific examples:

Inappropriate data-types for Numbers, especially currency

SQL: Storing numbers as character strings - inappropriate data-types
SQL: Storing numbers as character strings

It is possible to present numbers as strings, even including the decimal and thousands separators and any related currency symbols.

Interchange files can contain numbers as text, because it is convenient.

Numbers stored as strings are harder to validate.

Numbers stored as strings are sorted differently to numbers stored as numbers. If you doubt me, then try the experiment of illustrated in the Figure: “Number versus Character sorting” in your favourite spreadsheet.

SQL: Number versus Character sorting. Consequences of inappropriate data-types
SQL: Number versus Character sorting

Inappropriate data-types for Dates

"<yoastmark

Some early databases did not handle dates very well. This encouraged designers to do-it-themselves with varying degrees of success.

It is possible to represent a date as an integer. Such a “date” will sort as you expect, but needs its own validation and will not help you with date arithmetic.

It is also possible (and unfortunately common) to store dates in character fields. In most cases this is simply “an accident waiting to happen”!

All these do-it-yourself options are vulnerable to the problem that Americans tend to specify dates “mm-dd-yy” and Europeans (including the British) tend to specify dates “dd-mm-yy”. There is nothing we as analysts can do about this except to make sure that the test data for any system always includes a date with “13th of the Month”!

Benefits of using the appropriate data-types

The benefits of using the appropriate data-types far outweigh any perceived costs. Most of the “cost” is simply being aware that there are options and then not choosing the inappropriate data-types!

Using the appropriate data-types will:

  • Help protect you from “garbage data” (The database will reject an incorrect leap year 29th of February!)
  • Sort as you (and the business) expect without the need to work out the details.
  • Allow you to specify the presentation separately from the storage. Many languages and presentation frameworks have these facilities built in.
  • Take up less storage space.
  • Make your system perform better!

How to prevent appropriate data-types

Choosing the Appropriate Data-type - Problem Prevention. Preventing inappropriate data-types
Choosing the Appropriate Data-type – Problem Prevention

People choose inappropriate data-types in the transition from the “Conceptual Model” to the “Logical Model” (or possibly the “Physical Model”). We have selected the entities and attributes the system needs, but we have chosen inappropriate data-types.

The solution is to separate different aspects of the data and the decisions we need to make in our minds.

Here is the approach I recommend:

In the Conceptual Model:

  • Decide what data you need (the “attribute”) and decide what kind of data it is. Do this from a “non-computer” point of view.
  • If it is a number, say what it counts or measures and what the units are.
  • Treat “money” as a unit.
  • For dates and times, label them loosely as “date”, “time” or “date-time”.
  • Record any “limits”. The database designer may use them in detailed design.
  • Say how the users of the system will expect to see it presented. You will use this information in the user-interface design.

In the Logical Model:

  • Decide what kind of “bucket” the database should put it into. A database professional may help you with this.
  • If it is a number, and a precise value, say it is an Integer (of some kind). For decimals, say how many decimal places you need.
  • Look hard at dates and times. Do you mean “date” or “time of day”? Do you need an “elapsed time” or a “point in time”?

In the Physical Model:

  • Decide exactly what SQL Data-type you are going to use. Many of the basic data-types have alternatives. There are several types of “Integer” and quite a lot of “Date and Time” types.
  • This is a good time to talk to a database professional.

Summary

There are two main reasons for choosing inappropriate data-types in SQL:

  • Concentrating too much on how data will be presented, rather than how it will be stored
  • Making decisions about the Physical Model prematurely

Using inappropriate data-types can have wide-ranging harmful effects on your database and system.

Avoid the problems by following a simple process:

  1. Concentrate on what data the system needs in the Conceptual Model.
  2. Outline how that data should be stored in the Logical Model.
  3. Confirm the exact SQL data-type in the Physical Model.

This does not have to be difficult or time consuming. It fits perfectly well with “Agile” development.

Where next?

“Inappropriate data-types” was a problem with converting a Conceptual Model into a Logical Model. In the next article I’m going to look at the SQL Smells and Requirements Smells around – “Using Hierarchies”.