In his article on SQL Smells, Phil Factor is not in favour of using Check Constraints to limit the values in columns.
In my previous post I explained why Phil Factor recommends referential integrity. I am going to explain this apparent contradiction.
What are CHECK CONSTRAINTs and what do they do for us?
Like REFERENCES, the CHECK CONSTRAINT also restricts which values can be stored in a column. The logical expression must be true for the value to be allowed.
The logical expression can define limits or even a list of permitted values (as in the illustration).
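As a sketch of the idea, here is a minimal, hypothetical example using SQLite through Python's sqlite3 module; the Account table and its Status values are invented for illustration:

```python
import sqlite3

# A CHECK constraint limiting a column to a short list of permitted values.
# Table and column names here are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Account (
        AccountID INTEGER PRIMARY KEY,
        Status    TEXT NOT NULL
                  CHECK (Status IN ('Open', 'Suspended', 'Closed'))
    )
""")

# A permitted value is accepted.
conn.execute("INSERT INTO Account (AccountID, Status) VALUES (1, 'Open')")

# A value outside the list is rejected by the database manager.
try:
    conn.execute("INSERT INTO Account (AccountID, Status) VALUES (2, 'Frozen')")
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```

The rejection happens inside the database manager itself, so the check applies however the data arrives.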
What is the problem with Check Constraints?
This sounds like a marvellous idea! The benefit is clear. Constraints will exclude invalid data from the database, even when it is loaded using a utility (eg BULK INSERT). We can define constraints which will protect the database from bad data. So what is the problem?
In commercial systems, the database structure is “locked down”: only authorised people are allowed to make changes to it. Changing a CHECK constraint, even just to add a new permitted value, counts as a change to the structure, and these are the changes Phil Factor wants to avoid.
When should we use Check Constraints?
This criticism does not mean that we should never use Check Constraints. The ideal constraint will not change in the lifetime of the database.
For attributes which represent “classifications” and “types” we should note how many different values we are expecting, and how frequently the allowed values change. Short lists which change very rarely may be acceptable.
On the other hand, consider re-designing a CHECK CONSTRAINT as a FOREIGN KEY by adding an additional table to contain the valid values. This has the benefit of making adding a new value a simple data change!
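A hedged sketch of that redesign, again using SQLite via Python's sqlite3 module; the OrderStatus lookup table and its values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on

# The valid values live in a lookup table instead of a CHECK expression.
conn.execute("CREATE TABLE OrderStatus (Status TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO OrderStatus VALUES (?)",
                 [("Open",), ("Shipped",), ("Closed",)])

conn.execute("""
    CREATE TABLE CustomerOrder (
        OrderID INTEGER PRIMARY KEY,
        Status  TEXT NOT NULL REFERENCES OrderStatus (Status)
    )
""")

# Adding a new permitted value is now a simple data change, not a structure change.
conn.execute("INSERT INTO OrderStatus VALUES ('Cancelled')")
conn.execute("INSERT INTO CustomerOrder VALUES (1, 'Cancelled')")  # accepted
```

No authorised structure change was needed to allow the new value.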
Do not use Check Constraints to enforce arbitrary limits.
Check Constraints and Requirements
We can identify candidates for Check Constraints when we construct the Conceptual Model. We should note:
The number of options
The expected frequency of change.
That information will enable us to make an informed decision about how to validate the values of that column.
Unfortunately, some of the examples using Check Constraints perform the checks against arbitrary values. These examples will work technically but copying them may cause the problems Phil Factor wants us to avoid.
Check Constraints provide a way of validating data values. They are appropriate for checking against values which do not change.
For lists, lookup-tables with Foreign Key Constraints may be better.
Do not use Check Constraints against arbitrary values or values which change frequently.
The next article covers “Indexes”. I will explain how an Analyst can influence some design decisions.
Referential Integrity and the FOREIGN KEY REFERENCES Constraint
The FOREIGN KEY REFERENCES constraint on a column makes it a Foreign Key to another table. The value of the Foreign Key must be present as a Primary Key value in the referenced table. This is called Referential Integrity.
In the illustration, it is not possible for the Order table to contain a CustomerID which does not exist in the table Customer.
This is very powerful. The constraint makes the database manager check every change. Statements which would violate the Foreign Key constraint are rejected.
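A minimal illustration of that rejection, using SQLite through Python's sqlite3 module (the Customer and Order tables are invented for illustration, and SQLite only enforces foreign keys once the pragma is switched on):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute("""
    CREATE TABLE "Order" (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customer (CustomerID)
    )
""")

conn.execute("INSERT INTO Customer VALUES (1, 'Alice')")
conn.execute('INSERT INTO "Order" VALUES (100, 1)')  # accepted: Customer 1 exists

# An Order for a non-existent Customer is rejected.
try:
    conn.execute('INSERT INTO "Order" VALUES (101, 99)')
except sqlite3.IntegrityError as e:
    print("Rejected:", e)
```

Every change to either table is checked, which is where the processing cost comes from.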
Performing the validation costs effort.
Phil Factor points out an additional benefit. The database manager may be able to use the Foreign Keys to improve performance of some queries.
Why wouldn’t you use Referential Integrity?
The arguments most frequently used for not implementing referential integrity are:
Performance: I have already mentioned the “cost” of checking the constraints. This argument is often reinforced by:
Prejudice against logic in the database
A desire to minimise the “load” on the database manager.
Dirty Data: An “upstream” system may provide data which violates the constraint. Accepting this argument may mean importing faulty data, and the associated problems, into your system!
“Staging tables” may be a better solution to this problem.
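One hedged sketch of the staging-table approach, using SQLite via Python's sqlite3 module; the table names and the sample data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY);
    CREATE TABLE "Order" (
        OrderID    INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES Customer (CustomerID)
    );
    -- The staging table has no constraints: it accepts whatever upstream sends.
    CREATE TABLE Order_staging (OrderID INTEGER, CustomerID INTEGER);

    INSERT INTO Customer VALUES (1);
    INSERT INTO Order_staging VALUES (100, 1), (101, 99);  -- 99 is dirty
""")

# Move only the rows which satisfy referential integrity;
# the dirty rows stay in staging for investigation.
conn.execute("""
    INSERT INTO "Order"
    SELECT s.OrderID, s.CustomerID
    FROM Order_staging s
    JOIN Customer c ON c.CustomerID = s.CustomerID
""")
print(conn.execute('SELECT OrderID FROM "Order"').fetchall())  # [(100,)]
```

The constrained table stays clean, and the faulty upstream rows are still visible rather than silently lost.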
Both of these arguments are valid in some circumstances. There are times when alternatives are better. Before accepting one, ask the following questions:
How are you going to perform the validation?
Which system components are going to perform the validation?
Do you expect the alternative solution to perform better?
Will the alternative solution be better in some way?
Not having answers to these questions is not really acceptable. Try to make a rational decision, based on numbers.
How Referential Integrity and Requirements interact
The Conceptual Data Model identifies business entities and relationships. Those relationships define the referential integrity requirements.
Not using referential integrity implies that:
The system is going to allow invalid data, or
The system is going to validate the data in some other way.
Foreign Key Constraints exclude some invalid data from the system. Orphaned records (eg Orders for non-existent Customers) are impossible.
Exceptions to this rule require a rational justification.
In the next article I will look at “Check Constraints” which are a different way of ensuring valid data.
Have you heard of the “Entity Attribute Value” (EAV) Model or pattern? You may have, even if you don’t recognise the name.
The Entity Attribute Value pattern allows someone to add extra attributes to some entities. You know for sure that you have the EAV if you have entities or tables with names like “attribute” and “attribute_value”. You can substitute words like “property” or “feature” instead of “attribute”. That’s right – you have an entity called “attribute”!
Phil Factor identifies the EAV as a potential “SQL Smell” in this article. I regard the Entity Attribute Value model as a “Requirements Smell” too. There are legitimate uses for EAV, but having an EAV may also indicate a problem. The problem may be with your Conceptual Model, or with the way it has been turned into a Logical Database Design.
My previous article dealt with tables which are very wide (with lots of columns). The EAV has the potential to produce a table (the “attribute_value” table) which is very narrow (typically only 3 or 4 columns) and very long (with lots of rows).
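As an illustration of that shape, here is a minimal, hypothetical EAV schema in SQLite (via Python's sqlite3 module); the table and column names follow the generic pattern described above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The "thing" itself, with almost no columns of its own.
    CREATE TABLE entity (
        entity_id INTEGER PRIMARY KEY,
        name      TEXT
    );
    -- One row per user-defined attribute.
    CREATE TABLE attribute (
        attribute_id   INTEGER PRIMARY KEY,
        attribute_name TEXT
    );
    -- The narrow, long table: one row per entity per attribute.
    CREATE TABLE attribute_value (
        entity_id    INTEGER REFERENCES entity (entity_id),
        attribute_id INTEGER REFERENCES attribute (attribute_id),
        value        TEXT,   -- everything squeezed into one loose data-type
        PRIMARY KEY (entity_id, attribute_id)
    );
""")
```

Note that the value column is a single loose type: that is where the validation difficulties discussed below come from.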
Good Reasons to use an Entity Attribute Value (EAV) Model
The Entity Attribute Value model is not necessarily wrong. There are legitimate reasons for having an EAV:
Conscious modelling of data abstractions: EAVs can be used to take your requirements to a more abstract level. They are commonly found deep inside of “modelling tools”, CASE tools and other software packages which are intended to be configurable.
Consciously making the Conceptual Model “Extensible”: Allowing “user-defined” (or administrator defined) attributes for things is another legitimate use.
Anticipating very sparse data: EAV is a way of handling lots of NULLable columns. It uses space efficiently, but at the cost of more complex processing.
Bad Reasons to use an Entity Attribute Value (EAV) Model
It’s “cool”! I confess. I have done this. When I first discovered the EAV model, I tried to apply it everywhere. This is not a good idea.
Laziness is a very bad reason for using the EAV model. The argument goes something like: “we’re not sure what attributes the users need, so we will allow them to define their own”. The problem with this approach is that these “user-defined” attributes are hard to validate, and process design becomes significantly harder.
Benefits of using an Entity Attribute Value (EAV) Model
Creates opportunities for re-use: Using the EAV model can create opportunities for reusing code. All those user-defined attributes are maintained by the same code. It can work very well with abstract Object-Oriented design.
Can make for very elegant and compact code: Well thought-out EAV code can be compact and elegant. This is one reason why you will find EAV models inside many packages.
Can make the data very compact: Using an EAV model can reduce the number of tables you need and the space the data takes up.
Disadvantages of using an Entity Attribute Value (EAV) Model
Data validation becomes harder: The value column on the “Attribute_Value” table tends to be given a data-type like “varchar”. This makes validation of the data harder. Of course you can add validation yourself, but that adds complexity.
Code becomes abstract and hard to understand: Code written to use an EAV always has to go through extra steps compared to having the column you want directly on the table.
Data becomes abstract and hard to understand. One solution to this is to add SQL views or a layer of code which transforms the abstract data in the EAV into something closer to what the business users are expecting.
The application may need “seed data” which is almost part of the code. This is what happens in some packages.
The application may require a complex “configuration” process. Again, this is what you find in some packages: you have to select which values and attributes your installation will use.
Performance: EAV requires two joins to retrieve a value together with its attribute name. This has performance implications.
Where the complexity and performance impact come from
We can imagine a simple EAV model where the “Entity” contains a single attribute called “Name”. Retrieving the values of “Name” is straightforward. Using the EAV model we can create user-defined attributes called: “Type, Colour, Length and Width”. We can record values for these attributes for any row in the entity table. It is hard to validate the data.
We can retrieve the value of a user-defined attribute using a JOIN. Getting the “name” of the attribute requires a second join. This can get messy!
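A minimal sketch of those two joins, using SQLite via Python's sqlite3 module with invented sample data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (entity_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, attribute_name TEXT);
    CREATE TABLE attribute_value (
        entity_id INTEGER, attribute_id INTEGER, value TEXT,
        PRIMARY KEY (entity_id, attribute_id));

    INSERT INTO entity VALUES (1, 'Widget');
    INSERT INTO attribute VALUES (10, 'Colour');
    INSERT INTO attribute_value VALUES (1, 10, 'Red');
""")

# Two joins just to recover "Widget / Colour / Red" -
# something a plain Colour column would give us directly.
row = conn.execute("""
    SELECT e.name, a.attribute_name, av.value
    FROM entity e
    JOIN attribute_value av ON av.entity_id = e.entity_id
    JOIN attribute a        ON a.attribute_id = av.attribute_id
""").fetchone()
print(row)  # ('Widget', 'Colour', 'Red')
```

Compare this with `SELECT name, Colour FROM entity` on a conventional design: the EAV version pays for its flexibility on every query.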
Using the values in the EAV tables to identify rows in the Entity table is possible. This may present a challenge to the designers.
Summary – Does the Entity Attribute Value (EAV) Model smell bad?
I agree with Phil Factor. Think hard about whether you should use the EAV model. Using EAV inappropriately can have bad effects.
The Entity Attribute Value model may indicate the requirements are not understood. That is always a bad thing.
In the next article I’m going to look at another “SQL Smell”. Phil Factor calls this one “Polymorphic Association”.
One of the “SQL smells” Phil Factor identifies in his article is the presence of “God Objects” in your Database or design. I agree with him, except that I would call them “very wide tables”. If you find them, then you may have a problem with the Conceptual Model you are using, or possibly you should be considering using a different tool. In other words, you have a problem with your requirements. You have a “Requirements Smell”.
How many columns make a “God Object” or wide table?
Let’s start with the obvious question: how many columns make a “God Object” or wide table? The maximum number of columns you are allowed to have in a table varies with the database manager. For example, SQL Server allows 1,024 columns in an ordinary table, and PostgreSQL allows 1,600.
What the actual numbers are can depend on a lot of technical things. One hundred is still a big number.
Database management software will handle wide tables up to their limits. As with most things, when you approach the limit you will start to encounter difficulties, but that is missing the point. Even 100 columns may indicate a problem.
Why are “God Objects” or wide tables a problem?
The reasons why “God Objects” or wide tables cause an SQL Smell are technical, practical, and what you might term business, or even philosophical, problems. I’m a Business Analyst, so I’m going to start from the “Conceptual” end, with the Requirements for the database, and then look at the problems which these tables may cause during development and then when the system is in operation. Remember too that if we eliminate problems at the conceptual end, we will not encounter them further on. Wide tables are most certainly a problem which starts at the “Conceptual Model” stage.
“Conceptual Model” or philosophical problems
Each row in a relational table is supposed to represent something. The “something” may be a concrete object in the real world, or it may be something abstract like a contract or a transaction. Would you be able to explain to the users of your system, or your business owners what a single row represents? If not, you are likely to encounter problems.
Thinking about the columns in this wide table, each column contains a value. How are you going to present or update those values? 1000 fields would make for a very busy screen. Even some sort of graphical representation is likely to be complex. Do your users really need to see all this data together? While there isn’t a rule which says that the whole of an entity has to be presented on a single screen, or as a single report, it has to represent something. Finally, every column in a row provides one value for one thing at one time. Is that really so in your wide table?
Problems during development
“God objects” or wide tables encourage handling one big lump of data. That in turn is going to encourage the creation of complicated code. Maybe life would be easier for everyone if the data and the process descriptions were much more focused.
If you are in an Analyst role, then think about how you are going to explain what should (and should not) be happening with all these columns.
Remember, SQL tables have no concept of “grouping” of the columns. The columns have an order, but it is not something you should be relying on. If you can form columns into groups, then you should probably consider “normalizing” them into other tables.
Problems in operation
“God objects” or wide tables can cause problems when the system is being used. The volume of data each row contains may cause performance problems when rows are read from the table, when rows are updated and when new rows are created.
Why do we get “God objects”?
Wide tables often start from trying to convert large and complex paper forms or spreadsheets straight into table designs. It seems like a good idea at first, but it can get bogged down in unexpected complexity.
Think about your least favourite paper form, especially if it runs to several pages – maybe it’s a tax return or something similar. Obviously the physical form represents something. If you were specifying a system to work with it, then you would be tempted to have a single table where each row represented a single form, there was a column for every question and each cell contained one person’s answer to a question. It would be just like an enormous spreadsheet. Some early commercial computer systems were like that. They worked but they were inflexible.
One clue that something is going wrong (apart from the number of columns) is the number of columns which need to allow “NULL” values. How many times does “Not Applicable” appear when you are filling in the paper form?
How do we solve the problem of the wide table?
The answer is to think about what all these columns mean, and then start applying Data Modelling or normalization techniques to break the data into more manageable and useable chunks. If you can form groups of columns, then those groups may be candidate entities and therefore candidate tables.
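As a hedged sketch of that normalization, here is a hypothetical repeating group of telephone columns moved into a child table (SQLite via Python's sqlite3 module; all names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Before: Phone1, Phone2, Phone3 sat as a repeating group on one wide row,
    -- mostly NULL. After: the group becomes a candidate entity of its own.
    CREATE TABLE Person (
        PersonID INTEGER PRIMARY KEY,
        Name     TEXT
    );
    CREATE TABLE PersonPhone (
        PersonID INTEGER REFERENCES Person (PersonID),
        Seq      INTEGER,   -- replaces the 1/2/3 baked into the column names
        Phone    TEXT,
        PRIMARY KEY (PersonID, Seq)
    );
""")

conn.execute("INSERT INTO Person VALUES (1, 'Ann')")
conn.executemany("INSERT INTO PersonPhone VALUES (1, ?, ?)",
                 [(1, "555-0100"), (2, "555-0101")])
```

A person with one phone number now occupies one child row instead of a row full of NULL columns, and a person with ten numbers needs no structure change at all.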
If you need to use the order of similar columns then maybe you should be considering a different table design like the “Entity Attribute Value” (EAV) Pattern. But beware, because that can give rise to a bad smell too!
Excuses for “God Objects” and wide tables
Nothing in Information Technology is ever clear-cut. There are usually grey areas. One person may regard a table as too wide and another may regard it as OK. There is always room for some discussion. There are times when using a table that is a little wider than we would normally like is acceptable. Here are some of the reasons (or maybe that should be excuses) that you may hear for wide tables.
It gets all the work done in one place, so that other programs can use the data. I don’t really buy this one. I suspect that someone is guessing what these other programs need. If the guess is wrong then someone is going to have to re-design the big, wide table. I continue to maintain that having discrete data and performing discrete actions is better.
Here is a specific case I found where someone wanted to retrieve data from 2000 sensors. This is a case where using something other than a relational database might be better in the first instance. Depending on the details it might also be a case where using the Entity Attribute Value (EAV) model is appropriate as well.
We are being given the data in the wide form from another system. This excuse I will accept, because it is really an external requirement being imposed on us. But! If you need to do this, then you will still need to work out what all those many columns mean, and you may have to break the wide row down into its constituent parts.
That’s addressed the “God Object” or “Wide table” smell. I’ve already mentioned the “Entity Attribute Value” (EAV) model a couple of times. I’m going to address why it may give rise to a bad smell in the next post.
Recently I read an article by Phil Factor on the subject of “SQL Smells”. Phil (apparently not his real name), identifies a number of “smells” which he thinks indicate that a database design or SQL code needs to be reviewed. He classifies some of these as “Problems with Database Design”. I would go further and say some of them are problems with database requirements! In other words, your SQL smells because your Requirements smell!
“Requirements Smells cause SQL smells!”
I no longer claim to be a “Developer” and I have never claimed to be a DBA (Database Administrator), though I have found myself in the position of being an “accidental DBA”. The thought that Requirements could smell bad concerned me.
This realisation made me think about problems with Requirements in general and problems with databases in particular. It is better to avoid a problem rather than cure it, so I’m writing a series of blog posts on how to recognise problems in Requirements and prevent them from becoming “SQL Smells”.
Database design and SQL smells
Any computer system contains a “model” of the world it works with. This model forms the foundations of the system. If the system does not contain a concept, then it cannot work with it!
When people start to create a system they have to decide what concepts their system needs. This is the “Conceptual Model”. This model is transformed through a “Logical Model” until it finally becomes the “Physical Model”, which is the design for the database. The Conceptual and Logical models are not just first-cut versions of the Physical Model, different design decisions and compromises are made at each stage.
This has nothing to do with “Waterfall”, “Agile” or any specific development process. In fact, this approach is pretty universal, whether it is followed formally or not. Some people combine the different stages, but there are risks in doing that.
A simple way of looking at the Conceptual Model is to say that it is concerned with finding out:
What things the business and the system need to deal with: at the conceptual stage these are known as “Entities”
What we need to know about those things: these are the “Attributes” of the Entities
We also need to document “Business Rules”: some of these will be represented as “Relationships”.
During the design and development process:
Entities will tend to become table definitions
Attributes will become the columns within those tables
Business Rules may become so-called “constraints”.
A poor Conceptual Model or bad design decisions can lead to systems which are difficult to build, maintain and use, and which do not perform well either. Once again,
“Requirements Smells will cause SQL Smells”
The idea of “smells” can help us address potential problems earlier and more cheaply.
Where are these “Requirements smells”?
I’m going to group my bad smells in a slightly different way to Phil Factor. I primarily work as a Business Analyst, so I am going to concentrate on “smells” to look for at the Conceptual and Logical Stages of specifying the Requirements for a database, starting with the smell that Phil describes as “The God Object”!