NULLs Data Model and SQL Smells

Nulls - An SQL Smell?

NULLs are a perennial problem. Nobody likes them. They confuse developers and users and many analysts do not really understand them.

The concept of NULL allows us to say that there are things that we do not know.

In his article on SQL Smells, Phil Factor associates several smells with NULLs. In this post I’ll explain how to avoid using NULLs and how to use them properly when they are necessary.

Why do we need NULLs at all? What is the benefit and what is the cost?

NULL has a simple meaning with wide-ranging and surprising consequences. NULL means the value is unknown. And this in turn means that the result of any calculation or concatenation which uses this value must also be unknown!

NULL is a feature of SQL. The benefit of allowing NULL, and the “three-valued logic” (TRUE/FALSE/UNKNOWN) which comes with it, is that a database can record that there are things we do not know. The cost is the consequence already described: expressions and comparisons which use a NULL give an unknown result. This confuses many people.
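To make the "unknown in, unknown out" behaviour concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a convenient stand-in; other database managers may differ in detail (for example in how concatenation treats NULL).

```python
import sqlite3

# In-memory SQLite database, used purely for illustration.
conn = sqlite3.connect(":memory:")

# Every expression below involves NULL. SQLite reports NULL as None.
row = conn.execute(
    """
    SELECT 1 + NULL,        -- arithmetic with an unknown value
           'Mr ' || NULL,   -- concatenation with an unknown value
           NULL = NULL,     -- comparing two unknowns is also unknown
           NULL IS NULL     -- IS NULL is the reliable test
    """
).fetchone()

print(row)  # (None, None, None, 1)
conn.close()
```

The third result is the one which catches people out: "NULL = NULL" is not TRUE, so a WHERE clause written that way will not select the rows you expect. Use IS NULL instead.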

Reasons for needing NULL

There are many reasons why we might not have data to put into a column. Thinking about why we are considering defining a column as NULLable will encourage us to consider alternatives.

The structural NULL – Permanent sub-types

Customer with Sub-types which may cause NULLs

Sometimes we want to combine two entities into a single “super-type” table. There are attributes of Person which will never be used for Business (and vice-versa). These missing values will need NULLs.

The structural NULL – Lifecycle subtypes

Order with Lifecycle sub-types which may cause NULLs

Something similar can happen if we combine all the steps of an entity's lifecycle into a single table. The attributes of the later stages will always be empty (NULL) until that stage is reached. These later steps often contain dates or times.

In both cases involving sub-types it may be possible to split the sub-types into separate tables. Consider whether it is worth the effort and make sure you avoid the pitfalls of sub-types in SQL.
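As a rough sketch of what splitting the sub-types out might look like, the example below uses Python's sqlite3 module. The table and column names (PersonCustomer, BusinessCustomer, DateOfBirth, RegistrationNumber) are my own assumptions for illustration; they are not taken from the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Super-type: attributes shared by every customer.
    CREATE TABLE Customer (
        CustomerId INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    -- Sub-types: a row exists only where the attributes apply.
    CREATE TABLE PersonCustomer (
        CustomerId  INTEGER PRIMARY KEY REFERENCES Customer(CustomerId),
        DateOfBirth TEXT NOT NULL
    );
    CREATE TABLE BusinessCustomer (
        CustomerId         INTEGER PRIMARY KEY REFERENCES Customer(CustomerId),
        RegistrationNumber TEXT NOT NULL
    );
    INSERT INTO Customer VALUES (1, 'Ann Example'), (2, 'Example Ltd');
    INSERT INTO PersonCustomer VALUES (1, '1980-01-31');
    INSERT INTO BusinessCustomer VALUES (2, 'REG-0001');
    """
)

# No stored column allows NULL. NULLs only reappear if we choose to
# LEFT JOIN the sub-types back into one wide result.
rows = conn.execute(
    """
    SELECT c.Name, p.DateOfBirth, b.RegistrationNumber
    FROM Customer c
    LEFT JOIN PersonCustomer   p ON p.CustomerId = c.CustomerId
    LEFT JOIN BusinessCustomer b ON b.CustomerId = c.CustomerId
    ORDER BY c.CustomerId
    """
).fetchall()
print(rows)
conn.close()
```

The price is an extra join whenever you want the whole picture, which is part of the "is it worth the effort?" question.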

Data that will never be there

Attributes and Values for an "Address"

Entities like “Address” are frequently modelled with attributes like “AddressLine_”. In many cases there will never be values for the later lines. They will not be mandatory in the user interface, but do they need to be NULL? Consider whether allowing them to default to “spaces” or an empty string, would be better and whether it would have any bad effects.
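Here is a small sketch of the empty-string default, again using Python's sqlite3 module. The three-line Address table is an assumed, cut-down version of the one in the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Address (
        AddressId    INTEGER PRIMARY KEY,
        AddressLine1 TEXT NOT NULL,
        AddressLine2 TEXT NOT NULL DEFAULT '',
        AddressLine3 TEXT NOT NULL DEFAULT ''
    )
    """
)
# Only the first line is supplied; the others take the default.
conn.execute("INSERT INTO Address (AddressLine1) VALUES ('1 High Street')")

# With empty-string defaults the concatenated address is still usable.
with_defaults = conn.execute(
    "SELECT TRIM(AddressLine1 || ' ' || AddressLine2 || ' ' || AddressLine3) "
    "FROM Address"
).fetchone()[0]

# For contrast: a single NULL line would make the whole result NULL.
with_null = conn.execute("SELECT '1 High Street' || ' ' || NULL").fetchone()[0]

print(repr(with_defaults), repr(with_null))  # '1 High Street' None
conn.close()
```

One possible bad effect to weigh up: an empty string cannot distinguish "there is no second line" from "nobody has entered the second line yet". For address lines that rarely matters.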

Things which should never allow NULL

Always decide whether you expect an attribute to have a value. Don’t leave it to chance.

There are some things which should hardly ever allow NULL. This includes all keys and identifiers. Avoid allowing short titles or descriptions to be NULL (For long descriptions allowing NULLs is understandable).

Unavoidable NULLs

Attributes and Values for a "Person" Entity - Sometimes NULL is hard to avoid

There are some attributes where allowing NULL is hard to avoid. Life insurance and pensions companies may need a “date of death” for their customers! Having a column which allows NULL is often the easiest way of handling this.

You should resist the temptation to use “magic dates” or inappropriate data-types in order to avoid allowing NULL. The consequences are far worse than the problem.
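The sketch below shows one way a “magic date” can do damage. It uses Python's sqlite3 module; the Person table, the MagicDeath column and the value 9999-12-31 are all assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Person (
        PersonId    INTEGER PRIMARY KEY,
        DateOfDeath TEXT,              -- allows NULL: no date recorded
        MagicDeath  TEXT NOT NULL      -- same data, using a "magic date"
    );
    INSERT INTO Person VALUES
        (1, '2020-05-01', '2020-05-01'),
        (2, NULL,         '9999-12-31'),
        (3, NULL,         '9999-12-31');
    """
)

# Aggregate functions skip NULLs, so the NULLable column answers sensibly.
deaths_recorded, latest_death = conn.execute(
    "SELECT COUNT(DateOfDeath), MAX(DateOfDeath) FROM Person"
).fetchone()

# The magic date is treated as real data and pollutes both answers.
magic_recorded, magic_latest = conn.execute(
    "SELECT COUNT(MagicDeath), MAX(MagicDeath) FROM Person"
).fetchone()

print(deaths_recorded, latest_death)  # 1 2020-05-01
print(magic_recorded, magic_latest)   # 3 9999-12-31
conn.close()
```

Every query which touches the magic column now has to remember to exclude the special value. With NULL the database manager does that for you.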

Summary

NULLs are a problem. Nobody likes them, but they are necessary. Many problems with NULLs can be avoided by following two rules:

  • Remember that NULL means “unknown value” and this has consequences.
  • Ask “_why_ don’t we have this data?”

In many cases NULLs can be avoided by data modelling – that means the analyst has to do work in the Conceptual or Logical Model.

Where next?

The next article is about another smell: having the same name for different things!

Indexes, Analysts and SQL Smells

SQL Indexes - A Goldilocks problem for analysts or an SQL Smell

Database Indexes are something which a lot of analysts ignore as being “too technical”. This is a pity.

Several SQL Smells in Phil Factor’s article point at possible bad decisions. Thinking in the Logical Model can improve these decisions.

Choosing the correct indexes is a typical “Goldilocks Problem”: not too few, not too many, just the right number! Bad or inadequate requirements will contribute to designers making bad decisions. Phil Factor describes having the wrong indexes as an SQL Smell!

What is an Index and what does it do for us?

If you are an analyst, you may not know exactly what an index is. In non-technical terms, an index provides a quick way for the database manager to find the rows it needs in a table. There are several sorts of index.

An index is a “thing” in its own right. An index takes up space in the database. Updating an index costs effort. The main benefit of an index is that it makes select or read operations faster.

Types: Unique, Non-Unique, Clustered

There are three main kinds of index: Unique, Non-Unique and Clustered.

Imagine that we have a very simple database consisting of 3 tables:

  • Customer (not shown in the diagram)
  • Order
  • OrderLine
Two tables: Order and OrderLine - Where to put the indexes?

Unique Indexes:

Candidate Unique Indexes on an "Order" SQL table

In the Order table we can see three columns which might be used to identify the order: OrderId, GUID and OrderNum.

If you are not familiar with GUIDs, they are a way of assigning identifiers or “keys”. They are worth finding out about. It would be unusual to expect a human being to type in a GUID. An “OrderNum” (which paradoxically might contain letters!) would be more convenient for the users.

We expect all three: OrderId, GUID and OrderNum, to be unique. Therefore, all three are candidates for Unique Indexes. If an application attempts to create a duplicate value in a column which has a unique index, then the database manager will raise an error and reject the transaction.
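Here is a minimal sketch of a unique index doing its job, using Python's sqlite3 module. The index names and sample values are my own; in SQLite the error surfaces in Python as sqlite3.IntegrityError, and other database managers will report it in their own way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- "Order" is a reserved word in SQL, so the table name is quoted.
    CREATE TABLE "Order" (
        OrderId   INTEGER PRIMARY KEY,
        OrderGUID TEXT NOT NULL,
        OrderNum  TEXT NOT NULL
    );
    CREATE UNIQUE INDEX ux_order_guid ON "Order" (OrderGUID);
    CREATE UNIQUE INDEX ux_order_num  ON "Order" (OrderNum);
    """
)
conn.execute("""INSERT INTO "Order" VALUES (1, 'guid-0001', 'A1001')""")

try:
    # Different OrderId and GUID, but the same OrderNum as the first row.
    conn.execute("""INSERT INTO "Order" VALUES (2, 'guid-0002', 'A1001')""")
    duplicate_rejected = False
except sqlite3.IntegrityError as exc:
    duplicate_rejected = True
    print("Rejected:", exc)

order_count = conn.execute('SELECT COUNT(*) FROM "Order"').fetchone()[0]
print(duplicate_rejected, order_count)  # True 1
conn.close()
```

The business rule "order numbers are unique" is being enforced by the database itself, not by application code which somebody might forget to call.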

Non-Unique Indexes for Foreign Keys

Candidate Indexes on an "OrderLine" SQL table

In the OrderLine table you can see two columns which identify things in other tables: OrderId and ProductId. These are Foreign Keys. In this case we cannot say they are unique, but they are candidates for non-unique indexes.

We could also have used OrderNum or OrderGUID as Foreign Keys into Order.

It is good practice for the rows in the OrderLine table to have a unique identifier. There are two common ways of doing this.

  • We can assign an OrderLineId (which is unique across the whole table) or
  • we can use the combination of OrderId and OrderLineNum which together would identify a row.

In this example, both OrderLineId and the OrderId and OrderLineNum combination are (separate) candidate unique indexes.
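The candidates for OrderLine could be declared roughly as follows. This is a sketch using Python's sqlite3 module; the index names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE OrderLine (
        OrderLineId  INTEGER PRIMARY KEY,   -- unique across the whole table
        OrderId      INTEGER NOT NULL,      -- foreign key into Order
        OrderLineNum INTEGER NOT NULL,
        ProductId    INTEGER NOT NULL       -- foreign key into Product
    );
    -- Second candidate key: the combination must be unique.
    CREATE UNIQUE INDEX ux_orderline_order_line
        ON OrderLine (OrderId, OrderLineNum);
    -- Non-unique index: many lines may refer to the same product.
    CREATE INDEX ix_orderline_product ON OrderLine (ProductId);

    INSERT INTO OrderLine VALUES (1, 100, 1, 7), (2, 100, 2, 7), (3, 101, 1, 7);
    """
)

try:
    # Order 100 already has a line number 1.
    conn.execute("INSERT INTO OrderLine VALUES (4, 100, 1, 8)")
    duplicate_line_rejected = False
except sqlite3.IntegrityError:
    duplicate_line_rejected = True

lines_for_product_7 = conn.execute(
    "SELECT COUNT(*) FROM OrderLine WHERE ProductId = 7"
).fetchone()[0]
print(duplicate_line_rejected, lines_for_product_7)  # True 3
conn.close()
```

Because OrderId is the leading column of the composite unique index, that index can usually serve lookups by OrderId as well, so a separate non-unique index on OrderId may not be needed. That is a decision for the database designer.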

Clustered Indexes

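The order in which the rows of an SQL table are stored is specified by the Clustered Index. Each table can have only one clustered index. Ideally the clustered index should be unique. Some database managers (SQL Server, for example) will accept a non-unique one, but they then have to add a hidden value to tell the duplicate rows apart.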

People often make the “primary key” the clustered index, but it is worth considering other options. In the example, OrderLines can be added to an order after it has been created.

Using the OrderId, OrderLineNum index as the clustered index would make the database store all the “lines” for one order together (whenever they were added to the order). That may be more efficient for retrieval. Phil Factor identifies two smells with the choice of clustered indexes.
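How you ask for a clustered index depends on the database manager. SQL Server has a CREATE CLUSTERED INDEX statement. SQLite, which I am using for these sketches, has no such statement; its nearest equivalent is a WITHOUT ROWID table, where the rows are stored in primary key order. The example below is therefore only an approximation of the idea, with assumed sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- In a WITHOUT ROWID table the primary key acts as the clustered
    -- index, so lines for the same order are stored next to each other.
    CREATE TABLE OrderLine (
        OrderId      INTEGER NOT NULL,
        OrderLineNum INTEGER NOT NULL,
        ProductId    INTEGER NOT NULL,
        PRIMARY KEY (OrderId, OrderLineNum)
    ) WITHOUT ROWID;

    -- Line 2 of order 100 is added after order 101 has been created.
    INSERT INTO OrderLine VALUES (100, 1, 7);
    INSERT INTO OrderLine VALUES (101, 1, 8);
    INSERT INTO OrderLine VALUES (100, 2, 9);
    """
)

lines_for_order = conn.execute(
    "SELECT OrderLineNum, ProductId FROM OrderLine "
    "WHERE OrderId = ? ORDER BY OrderLineNum",
    (100,),
).fetchall()
print(lines_for_order)  # [(1, 7), (2, 9)]
conn.close()
```

The query result is the same whichever index is clustered. What changes is how much work the database manager has to do to find the rows, and that is why this is a performance decision and not a functional one.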

Non-Unique Indexes for Searching

Candidate Indexes on a "Customer" SQL table - A Non-Unique index on Name would help searches

Columns which will be used for searching should be considered candidates for a non-unique index.
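You can ask most database managers how they intend to run a query. The sketch below uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module to check that a search by Name uses the index. The Customer columns, the index name and the sample names are assumptions, and the exact wording of the plan varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customer (
        CustomerId INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL,
        Town       TEXT NOT NULL
    );
    -- Non-unique: two customers are allowed to share a name.
    CREATE INDEX ix_customer_name ON Customer (Name);
    INSERT INTO Customer VALUES
        (1, 'Smith', 'Leeds'), (2, 'Jones', 'York'), (3, 'Smith', 'Hull');
    """
)

query = "SELECT CustomerId, Town FROM Customer WHERE Name = ? ORDER BY CustomerId"

# The last column of each plan row is a human-readable description.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("Smith",)).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

matches = conn.execute(query, ("Smith",)).fetchall()
print(matches)  # [(1, 'Leeds'), (3, 'Hull')]
conn.close()
```

An analyst does not need to read query plans, but knowing that "users will search by name" is exactly the kind of requirement which tells the designer where an index like this is worth having.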

The role of Analysts in choosing Indexes and Index types

Indexes are usually specified in the “Physical Model”. The analyst can help the database designer make the right decisions, by applying a little thought. The analyst should not try to pre-empt the designer's decisions. They should aim to assist by identifying relevant “candidates”.

Summary

Indexes enforce business rules like uniqueness in an SQL database. They influence database performance. Considering candidate indexes in the Logical Model and even the Conceptual Model will help database designers make better decisions.

Where next?

The next article is about the smell of nothing, or “Nulls”. Nulls present problems for developers and database designers.

Packing lists into SQL columns – SQL Smells

Packing Lists into SQL Columns creates an SQL Smell

Some SQL data types are amazingly flexible. As a consequence people are tempted to put all kinds of data into character columns. In his article on SQL Smells Phil Factor identifies packing lists or complex data into a column as one of the “smells”. To be frank, it stinks! One row should contain one value for each column. That value should mean a single thing. Doing anything else is inviting problems.

Let’s look at how to recognise this particular “smell”, where it comes from, the consequences of allowing designs containing it and how to remove it. I’ll also touch on the limited circumstances when it is acceptable.

Recognising “packing lists into a column”

Examples of lists packed into SQL Columns

If you are an analyst or designer, working with the Conceptual Design or Requirements for the database then you will know when you are tempted to do this. My advice is – Don’t do it! The explanation will come later.

On the other hand, this smell can be hard to recognise if it actually gets into the database design. There will be evidence in three places:

  • There may be signs of “lists” in the database design. The names of affected columns may be plural, or something like “List_of_…”. The column is likely to be defined as a character type.
  • There will be evidence in the data. This is the easiest place to find the evidence. There will usually be a separator character between the different elements of the list, like “1,2,3,4”. Beware! Fixed length character columns, divided into fields, mimicking an ancient punched card are not entirely unknown.
  • The code will provide evidence. You will know it when you see it. The code will parse the offending column into separate values based on either a separator value or column positions.

Limitations of the packing approach

This approach is not using a relational database as intended. This will impose limitations on your system.

  • You will not be able to search on the individual fields without unpacking them.
  • You won’t be able to update the fields without unpacking them and then re-packing them.
  • Changes to the implied record structure will mean fundamental changes to the database and the associated code.
  • All the packing and unpacking will cause poor performance.
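The first limitation is easy to demonstrate. In the sketch below (Python's sqlite3 module, with an assumed List_of_Product column holding comma-separated product numbers) a search for product 1 goes wrong unless every row is unpacked in application code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customer (
        CustomerId      INTEGER PRIMARY KEY,
        List_of_Product TEXT NOT NULL       -- packed list, e.g. '1,2,3'
    );
    INSERT INTO Customer VALUES (1, '1,2,3'), (2, '11,12'), (3, '21');
    """
)

# Naive search for product 1: LIKE also matches '11', '12' and '21'.
naive = [
    row[0]
    for row in conn.execute(
        "SELECT CustomerId FROM Customer "
        "WHERE List_of_Product LIKE '%1%' ORDER BY CustomerId"
    )
]

# A correct search has to read every row and unpack it.
correct = [
    customer_id
    for customer_id, packed in conn.execute(
        "SELECT CustomerId, List_of_Product FROM Customer ORDER BY CustomerId"
    )
    if "1" in packed.split(",")
]

print(naive)    # [1, 2, 3]
print(correct)  # [1]
conn.close()
```

The split(",") call is also the tell-tale sign in the code: wherever you see a column being parsed like this, you have found the smell.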

Excuses for packing a list into a column

There are two reasons you may be tempted to pack a list in this way:

  • You think it will somehow be “more efficient”. Take Phil Factor’s word for it. It won’t!
  • Another system expects something in this form. In this case you would be better to “do things properly” and do the conversion close to the interface with the other system. That way you limit the effect the other system is having on you.

The only acceptable excuse is that the data in the column is going to be treated as a black box. All your system is doing is storing it.

How to avoid packing a list into a column

There is one main strategy for avoiding packing a list (or other complex data) in a column. You should aim to understand the list or data you are packing into the column. Consider breaking the complex data out into a new entity. Apply the techniques of data modelling or normalization.
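As a sketch of "breaking the data out into a new entity", suppose the packed column held a list of product numbers for each customer. The CustomerProduct table below is an assumed name for the new entity; the example uses Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Customer (
        CustomerId INTEGER PRIMARY KEY
    );
    -- One row per list element instead of one packed string per customer.
    CREATE TABLE CustomerProduct (
        CustomerId INTEGER NOT NULL REFERENCES Customer(CustomerId),
        ProductId  INTEGER NOT NULL,
        PRIMARY KEY (CustomerId, ProductId)
    );
    INSERT INTO Customer VALUES (1), (2), (3);
    INSERT INTO CustomerProduct VALUES
        (1, 1), (1, 2), (1, 3), (2, 11), (2, 12), (3, 21);
    """
)

# Searching on an individual element is now a plain WHERE clause.
customers_with_product_1 = [
    row[0]
    for row in conn.execute(
        "SELECT CustomerId FROM CustomerProduct WHERE ProductId = 1"
    )
]

# Updating one element no longer needs unpacking and re-packing.
conn.execute("DELETE FROM CustomerProduct WHERE CustomerId = 1 AND ProductId = 2")
remaining = [
    row[0]
    for row in conn.execute(
        "SELECT ProductId FROM CustomerProduct "
        "WHERE CustomerId = 1 ORDER BY ProductId"
    )
]

print(customers_with_product_1, remaining)  # [1] [1, 3]
conn.close()
```

If another system really does need the comma-separated form, it can be produced from these rows at the interface, which keeps the packed format out of your own data model.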

Summary

A column containing complex data indicates a problem with the Conceptual Model of the database. You should review the data model and apply the rules of normalization.

Where next?

Having dealt with a problem of poor data modelling, in the next article I’m going to look at a general problem of data design and an associated SQL Smell – “Using inappropriate data types”.

Do “Entity Attribute Value” (EAV) Models smell bad?

Does the Entity Attribute Value (EAV) model have an SQL smell?

Have you heard of the “Entity Attribute Value” (EAV) Model or pattern? You may have, even if you don’t recognise the name.

The Entity Attribute Value pattern allows someone to add extra attributes to some entities. You know for sure that you have the EAV if you have entities or tables called:

  • …attribute
  • …attribute_value

You can substitute words like: property or feature instead of attribute. That’s right – you have an entity called attribute!

Phil Factor identifies the EAV as a potential “SQL Smell” in his article. I regard the Entity Attribute Value model as a “Requirements Smell” too. There are legitimate uses for EAV, but having an EAV may also indicate a problem. The problem may be with your Conceptual Model, or with the way it has been turned into a Logical Database Design.

My previous article dealt with tables which were very wide (have lots of columns). The EAV has the potential to produce a table (the “attribute_value” table) which is very narrow (typically only 3 or 4 columns) and is very long (with lots of rows).

Good Reasons to use an Entity Attribute Value (EAV) Model

The Entity Attribute Value model is not necessarily wrong. There are legitimate reasons for having an EAV:

  • Conscious modelling of data abstractions: EAVs can be used to take your requirements to a more abstract level. They are commonly found deep inside of “modelling tools”, CASE tools and other software packages which are intended to be configurable.
  • Consciously making the Conceptual Model “Extensible”: Allowing “user-defined” (or administrator defined) attributes for things is another legitimate use.
  • Anticipating very sparse data: EAV is a way of handling lots of NULLable columns. It uses space efficiently, but at the cost of more complex processing.

Bad Reasons to use an Entity Attribute Value (EAV) Model

  • It’s “cool”! I confess. I have done this. When I first discovered the EAV model, I tried to apply it everywhere. This is not a good idea.
  • Laziness is a very bad reason for using the EAV model. The argument goes something like: “we’re not sure what attributes the users need, so we will allow them to define their own”. The problem with this approach is that these “user-defined” attributes are hard to validate and process design becomes significantly harder.

Benefits of using an Entity Attribute Value (EAV) Model

  • Creates opportunities for re-use: Using the EAV model can create opportunities for reusing code. All those user-defined attributes are maintained by the same code. It can work very well with abstract Object-Oriented design.
  • Can make for very elegant and compact code: Well thought-out EAV code can be compact and elegant. This is one reason why you will find EAV models inside many packages.
  • Can make the data very compact: Using an EAV model can reduce the number of tables you need and the space the data takes up.

Disadvantages of using an Entity Attribute Value (EAV) Model

  • Data validation becomes harder: The value on the “Attribute_Value” table tends to be stored in a data-type like “varchar”. This makes validation of the data harder. Of course you can start to add validation yourself, but this is adding complexity.
  • Code becomes abstract and hard to understand: Code written to use an EAV always has to go through extra steps compared to having the column you want directly on the table.
  • Data becomes abstract and hard to understand. One solution to this is to add SQL views or a layer of code which transforms the abstract data in the EAV into something closer to what the business users are expecting.
  • The application may need “seed data” which is almost part of the code. This is what happens in some packages.
  • The application may require a complex “configuration” process. Again, this is what you find in some packages. You have to select which values
  • Performance: EAV requires 2 joins to get a value and the attribute name. This has performance implications.

Where the complexity and performance impact come from

How "user-defined" values are stored in the Entity Attribute Value (EAV) model

We can imagine a simple EAV model where the “Entity” contains a single attribute called “Name”. Retrieving the values of “Name” is straightforward. Using the EAV model we can create user-defined attributes called: “Type, Colour, Length and Width”. We can record values for these attributes for any row in the entity table. It is hard to validate the data.

We can retrieve the value of a user-defined attribute using a JOIN. To get the “name” of the attribute will require a second join. This can get messy!
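Here is a sketch of that simple model using Python's sqlite3 module. The table names (Entity, Attribute, Attribute_Value), the column names and the sample data are my assumptions about how the diagram might be implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Entity (
        EntityId INTEGER PRIMARY KEY,
        Name     TEXT NOT NULL
    );
    CREATE TABLE Attribute (
        AttributeId   INTEGER PRIMARY KEY,
        AttributeName TEXT NOT NULL UNIQUE
    );
    CREATE TABLE Attribute_Value (
        EntityId    INTEGER NOT NULL REFERENCES Entity(EntityId),
        AttributeId INTEGER NOT NULL REFERENCES Attribute(AttributeId),
        Value       TEXT NOT NULL,      -- everything is stored as text
        PRIMARY KEY (EntityId, AttributeId)
    );
    INSERT INTO Entity VALUES (1, 'Widget');
    INSERT INTO Attribute VALUES
        (1, 'Type'), (2, 'Colour'), (3, 'Length'), (4, 'Width');
    INSERT INTO Attribute_Value VALUES
        (1, 2, 'Red'), (1, 3, '10'), (1, 4, 'banana');
    """
)

# The "real" column is retrieved directly.
name = conn.execute("SELECT Name FROM Entity WHERE EntityId = 1").fetchone()[0]

# A user-defined attribute needs two joins: one to reach the value and
# one to reach the attribute's name.
eav_query = """
    SELECT v.Value
    FROM Entity e
    JOIN Attribute_Value v ON v.EntityId = e.EntityId
    JOIN Attribute a       ON a.AttributeId = v.AttributeId
    WHERE e.EntityId = ? AND a.AttributeName = ?
"""
colour = conn.execute(eav_query, (1, "Colour")).fetchone()[0]
width = conn.execute(eav_query, (1, "Width")).fetchone()[0]

print(name, colour, width)  # Widget Red banana
conn.close()
```

Notice that nothing stopped 'banana' being stored as a Width. Because every value shares one text column, the database cannot apply a data type or a range check for a particular attribute without extra work.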

SELECT will follow a different path for "user-defined" attributes in the EAV model

Using the values in the EAV tables to identify rows in the Entity table is possible. This may present a challenge to the designers.
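To give a feel for the challenge, the sketch below finds the entities which are Red and have a Length of 10. It is one possible approach among several, written with Python's sqlite3 module and using assumed table names and sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Entity (EntityId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE Attribute (
        AttributeId   INTEGER PRIMARY KEY,
        AttributeName TEXT NOT NULL UNIQUE
    );
    CREATE TABLE Attribute_Value (
        EntityId    INTEGER NOT NULL REFERENCES Entity(EntityId),
        AttributeId INTEGER NOT NULL REFERENCES Attribute(AttributeId),
        Value       TEXT NOT NULL,
        PRIMARY KEY (EntityId, AttributeId)
    );
    INSERT INTO Entity VALUES (1, 'Widget'), (2, 'Gadget'), (3, 'Gizmo');
    INSERT INTO Attribute VALUES (1, 'Colour'), (2, 'Length');
    INSERT INTO Attribute_Value VALUES
        (1, 1, 'Red'),  (1, 2, '10'),
        (2, 1, 'Red'),  (2, 2, '20'),
        (3, 1, 'Blue'), (3, 2, '10');
    """
)

# With ordinary columns this would be:
#   WHERE Colour = 'Red' AND Length = 10
# With EAV each condition lives on a different row, so we select the
# rows matching either condition and keep entities which matched both.
red_and_10 = [
    row[0]
    for row in conn.execute(
        """
        SELECT e.Name
        FROM Entity e
        JOIN Attribute_Value v ON v.EntityId = e.EntityId
        JOIN Attribute a       ON a.AttributeId = v.AttributeId
        WHERE (a.AttributeName = 'Colour' AND v.Value = 'Red')
           OR (a.AttributeName = 'Length' AND v.Value = '10')
        GROUP BY e.EntityId, e.Name
        HAVING COUNT(*) = 2
        ORDER BY e.Name
        """
    )
]
print(red_and_10)  # ['Widget']
conn.close()
```

The HAVING COUNT(*) = 2 has to change every time a condition is added or removed, and the comparison on Length is a text comparison, not a numeric one. Both are the sort of detail which makes EAV queries harder to get right.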

Summary – Does the Entity Attribute Value (EAV) Model smell bad?

I agree with Phil Factor. Think hard about whether you should use the EAV model. Using EAV inappropriately can have bad effects.

The Entity Attribute Value model may indicate the requirements are not understood. That is always a bad thing.

Where next?

In the next article I’m going to look at another “SQL Smell”. Phil Factor calls this one “Polymorphic Association”.

 

Problems with “God Objects” or very wide tables

Wide Tables, Wide SQL Tables

One of the “SQL smells” Phil Factor identifies in his article is the presence of “God Objects” in your Database or design. I agree with him, except that I would call them “very wide tables”. If you find them, then you may have a problem with the Conceptual Model you are using, or possibly that you should be considering using a different tool. In other words, you have a problem with your requirements. You have a “Requirements Smell”.

How many columns make a “God Object” or wide table?

How many columns can you have in an SQL table?

Let’s start with the obvious question: How many columns make a “God Object” or wide table? The maximum number of columns you are allowed to have in a table varies with database manager. For example:

The actual numbers can depend on a lot of technical details. One hundred is still a big number.

Database management software will handle wide tables up to their limits. As with most things, when you approach the limit you will start to encounter difficulties, but that is missing the point. Even 100 columns may indicate a problem.

Why are “God Objects” or wide tables a problem?

Wide tables or "God objects" are a problem which originates in the Conceptual Model

The reasons why “God Objects” or wide tables cause an SQL Smell are technical, practical and what you might term business, or even philosophical. I’m a Business Analyst, so I’m going to start from the “Conceptual” end, with the Requirements for the database, and then look at the problems which these tables may cause in Development and then when the system is in operation. Also remember that if we eliminate problems at the conceptual end, then we’re not going to encounter them further on. Wide tables are most certainly a problem which starts at the “Conceptual Model” stage.

“Conceptual Model” or philosophical problems

Each row in a relational table is supposed to represent something. The “something” may be a concrete object in the real world, or it may be something abstract like a contract or a transaction. Would you be able to explain to the users of your system, or your business owners what a single row represents? If not, you are likely to encounter problems.

Thinking about the columns in this wide table, each column contains a value. How are you going to present or update those values? 1000 fields would make for a very busy screen. Even some sort of graphical representation is likely to be complex. Do your users really need to see all this data together? While there isn’t a rule which says that the whole of an entity has to be presented on a single screen, or as a single report, it has to represent something. Finally, every column in a row provides one value for one thing at one time. Is that really so in your wide table?

Problems during development

“God objects” or wide tables encourage handling one big lump of data. That in turn is going to encourage the creation of complicated code. Maybe life would be easier for everyone if the data and the process descriptions were much more focused.

If you are in an Analyst role, then think about how you are going to explain what should (and should not) be happening with all these columns.

Remember, SQL tables have no concept of “grouping” of the columns. The columns have an order, but it is not something you should be relying on. If you can form columns into groups, then you should probably consider “normalizing” them into other tables.

Problems in operation

“God objects” or wide tables can cause problems when the system is being used. The volume of data each row contains may cause performance problems when rows are read from the table, when rows are updated and when new rows are created.

Why do we get “God objects”?

Wide tables often start from trying to convert large and complex paper forms or spreadsheets straight into table designs. It seems like a good idea at first, but it can get bogged down in unexpected complexity.

Think about your least favourite paper form, especially if it runs to several pages – maybe it’s a tax return or something similar. Obviously the physical form represents something. If you were specifying a system to work with it, then you would be tempted to have a single table where each row represented a single form, there was a column for every question and each cell contained one person’s answer to a question. It would be just like an enormous spreadsheet. Some early commercial computer systems were like that. They worked but they were inflexible.

One clue that something is going wrong (apart from the number of columns) is the number of columns which need to allow “NULL” values. How many times does “Not Applicable” appear when you are filling in the paper form?
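If the table already exists, you can count the NULLable columns instead of guessing. The sketch below uses SQLite's PRAGMA table_info through Python's sqlite3 module; other database managers expose the same information through their own catalogue views. The TaxReturn table and the nullable_columns helper are hypothetical, invented for the illustration.

```python
import sqlite3


def nullable_columns(conn, table):
    """Return the names of the columns in `table` which allow NULL.

    `table` must be a trusted name: it is placed directly in the statement.
    """
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    info = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        name
        for _cid, name, _type, notnull, _default, pk in info
        if not notnull and not pk
    ]


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE TaxReturn (
        FormId        INTEGER PRIMARY KEY,
        TaxpayerName  TEXT NOT NULL,
        SpouseName    TEXT,     -- "Not Applicable" for many people
        PensionIncome REAL,
        FarmIncome    REAL
    )
    """
)

nullable = nullable_columns(conn, "TaxReturn")
total = len(conn.execute("PRAGMA table_info(TaxReturn)").fetchall())
print(f"{len(nullable)} of {total} columns allow NULL: {nullable}")
conn.close()
```

There is no magic threshold, but a table where most of the columns allow NULL deserves a second look.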

How do we solve the problem of the wide table?

The answer is to think about what all these columns mean and then start applying Data Modelling or normalization techniques to break the data into more manageable and useable chunks. If you can form groups of columns then those groups may be candidate entities and therefore candidate tables.
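As a small sketch of turning a group of columns into a candidate table, imagine a form with three "dependant" boxes. The WideForm, Form and Dependant tables are assumptions made up for this example, which uses Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- The wide design: one column for every box on the form.
    CREATE TABLE WideForm (
        FormId         INTEGER PRIMARY KEY,
        ApplicantName  TEXT NOT NULL,
        Dependant1Name TEXT,
        Dependant2Name TEXT,
        Dependant3Name TEXT
    );
    INSERT INTO WideForm VALUES
        (1, 'Ann', 'Bob', 'Cat', NULL),
        (2, 'Dan', NULL,  NULL,  NULL);

    -- The group of similar columns becomes an entity of its own.
    CREATE TABLE Form (
        FormId        INTEGER PRIMARY KEY,
        ApplicantName TEXT NOT NULL
    );
    CREATE TABLE Dependant (
        FormId       INTEGER NOT NULL REFERENCES Form(FormId),
        DependantNum INTEGER NOT NULL,
        Name         TEXT NOT NULL,
        PRIMARY KEY (FormId, DependantNum)
    );

    INSERT INTO Form SELECT FormId, ApplicantName FROM WideForm;
    -- Only boxes which were actually filled in become rows.
    INSERT INTO Dependant
        SELECT FormId, 1, Dependant1Name FROM WideForm
            WHERE Dependant1Name IS NOT NULL
        UNION ALL
        SELECT FormId, 2, Dependant2Name FROM WideForm
            WHERE Dependant2Name IS NOT NULL
        UNION ALL
        SELECT FormId, 3, Dependant3Name FROM WideForm
            WHERE Dependant3Name IS NOT NULL;
    """
)

dependants = conn.execute(
    "SELECT FormId, DependantNum, Name FROM Dependant "
    "ORDER BY FormId, DependantNum"
).fetchall()
print(dependants)  # [(1, 1, 'Bob'), (1, 2, 'Cat')]
conn.close()
```

The narrower tables have no NULLable columns, and a fourth dependant no longer means changing the table design.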

If you need to use the order of similar columns then maybe you should be considering a different table design like the “Entity Attribute Value” (EAV) Pattern. But beware, because that can give rise to a bad smell too!

Excuses for “God Objects” and wide tables

Nothing in Information Technology is ever clear-cut. There are usually grey areas. One person may regard a table as too wide and another may regard it as OK. There is always room for some discussion. There are times when using a table that is a little wider than we would normally like is acceptable. Here are some of the reasons (or maybe that should be excuses) that you may hear for wide tables.

  • It gets all the work done in one place, so that other programs can use the data. I don’t really buy this one. I suspect that someone is guessing what these other programs need. If the guess is wrong then someone is going to have to re-design the big, wide table. I continue to maintain that having discrete data and performing discrete actions is better.
  • Here is a specific case I found where someone wanted to retrieve data from 2000 sensors. This is a case where using something other than a relational database might be better in the first instance. Depending on the details it might also be a case where using the Entity Attribute Value (EAV) model is appropriate as well.
  • We are being given the data in the wide form from another system. This excuse I will accept, because it is really being imposed as an external requirement. But! If you need to do this, then you will need to do the work of working out what all those many columns mean, and you may have to break the wide row down into constituent parts.

Where next?

That’s addressed the “God Object” or “Wide table” smell. I’ve already mentioned the “Entity Attribute Value” (EAV) model a couple of times. I’m going to address why that may give rise to a bad smell in the next post.