News

Inappropriate data-types – SQL Smells

Inappropriate Data-types cause SQL Smells!
Inappropriate SQL Data-types cause SQL Smells!

People choose inappropriate data-types. This isn’t surprising. There are lots of SQL data-types, so people make inappropriate choices. Phil Factor names “Using inappropriate data-types” as a smell in his article on SQL Smells.

I’m going to concentrate on dates and numbers in this post. I will explain why people choose inappropriate data-types. I will also describe an approach which will encourage you to choose the right ones.

This will be a superficial treatment. I’m going to look at the problem from a high level. Dates and numbers can suffer from detailed technical problems as well.

Why do people choose inappropriate data-types?

Confusion of Storage, Interchange and Presentation forms of data causes selection of inappropriate data-types
SQL: Storage, Interchange and Presentation forms of data

Phil Factor identifies the main reason for this problem as:

“Confusing how data will be presented with how it will be stored”.

I agree. Here are some reasons Analysts choose inappropriate data-types:

  • We approach problems from the outside and take the users’ point of view. We should consider presentation. Inside the database, data should be stored in an appropriate form.
  • The same argument applies for interfaces and interchange formats. Interface requirements should not determine the way data is stored internally. Interfaces are still important. Where possible, standard interchange formats should be used.
  • Spreadsheets have made us lazy. You don’t have to think about the “data-type” when you key something into a cell. Format and validation are often added afterwards.
  • There are “folk memories” about problems with data in old file-based systems. These systems did not have the rich range of data-types of modern databases and languages.

Consequences of using inappropriate data-types

Inappropriate data types can have serious consequences for a system we are building. Some of these problems are not obvious. Many of these problems apply to all systems. Some of these problems become more important with larger databases.

  • Having the wrong format makes validation harder. It prevents the database engine from checking the content and increases the risk of “garbage data” getting into the system.
  • The possible presence of garbage data makes error handling throughout the system harder.
  • Storing data in “display format” imbeds that format deep inside the system.
  • “Dates” are associated with useful functions which simplify program design.
  • Inappropriate data-types can change how data is sorted. This influences how indexes work and cause performance issues.
  • The “correct” data-types are usually very space efficient. Using alternatives can waste space in the database for no benefit.

Let’s look at some specific examples:

Inappropriate data-types for Numbers, especially currency

SQL: Storing numbers as character strings - inappropriate data-types
SQL: Storing numbers as character strings

It is possible to present numbers as strings, even including the decimal and thousands separators and any related currency symbols.

Interchange files can contain numbers as text, because it is convenient.

Numbers stored as strings are harder to validate.

Numbers stored as strings are sorted differently to numbers stored as numbers. If you doubt me, then try the experiment of illustrated in the Figure: “Number versus Character sorting” in your favourite spreadsheet.

SQL: Number versus Character sorting. Consequences of inappropriate data-types
SQL: Number versus Character sorting

Inappropriate data-types for Dates

"<yoastmark

Some early databases did not handle dates very well. This encouraged designers to do-it-themselves with varying degrees of success.

It is possible to represent a date as an integer. Such a “date” will sort as you expect, but needs its own validation and will not help you with date arithmetic.

It is also possible (and unfortunately common) to store dates in character fields. In most cases this is simply “an accident waiting to happen”!

All these do-it-yourself options are vulnerable to the problem that Americans tend to specify dates “mm-dd-yy” and Europeans (including the British) tend to specify dates “dd-mm-yy”. There is nothing we as analysts can do about this except to make sure that the test data for any system always includes a date with “13th of the Month”!

Benefits of using the appropriate data-types

The benefits of using the appropriate data-types far outweigh any perceived costs. Most of the “cost” is simply being aware that there are options and then not choosing the inappropriate data-types!

Using the appropriate data-types will:

  • Help protect you from “garbage data” (The database will reject an incorrect leap year 29th of February!)
  • Sort as you (and the business) expect without the need to work out the details.
  • Allow you to specify the presentation separately from the storage. Many languages and presentation frameworks have these facilities built in.
  • Take up less storage space.
  • Make your system perform better!

How to prevent appropriate data-types

Choosing the Appropriate Data-type - Problem Prevention. Preventing inappropriate data-types
Choosing the Appropriate Data-type – Problem Prevention

People choose inappropriate data-types in the transition from the “Conceptual Model” to the “Logical Model” (or possibly the “Physical Model”). We have selected the entities and attributes the system needs, but we have chosen inappropriate data-types.

The solution is to separate different aspects of the data and the decisions we need to make in our minds.

Here is the approach I recommend:

In the Conceptual Model:

  • Decide what data you need (the “attribute”) and decide what kind of data it is. Do this from a “non-computer” point of view.
  • If it is a number, say what it counts or measures and what the units are.
  • Treat “money” as a unit.
  • For dates and times, label them loosely as “date”, “time” or “date-time”.
  • Record any “limits”. The database designer may use them in detailed design.
  • Say how the users of the system will expect to see it presented. You will use this information in the user-interface design.

In the Logical Model:

  • Decide what kind of “bucket” the database should put it into. A database professional may help you with this.
  • If it is a number, and a precise value, say it is an Integer (of some kind). For decimals, say how many decimal places you need.
  • Look hard at dates and times. Do you mean “date” or “time of day”? Do you need an “elapsed time” or a “point in time”?

In the Physical Model:

  • Decide exactly what SQL Data-type you are going to use. Many of the basic data-types have alternatives. There are several types of “Integer” and quite a lot of “Date and Time” types.
  • This is a good time to talk to a database professional.

Summary

There are two main reasons for choosing inappropriate data-types in SQL:

  • Concentrating too much on how data will be presented, rather than how it will be stored
  • Making decisions about the Physical Model prematurely

Using inappropriate data-types can have wide-ranging harmful effects on your database and system.

Avoid the problems by following a simple process:

  1. Concentrate on what data the system needs in the Conceptual Model.
  2. Outline how that data should be stored in the Logical Model.
  3. Confirm the exact SQL data-type in the Physical Model.

This does not have to be difficult or time consuming. It fits perfectly well with “Agile” development.

Where next?

“Inappropriate data-types” was a problem with converting a Conceptual Model into a Logical Model. In the next article I’m going to look at the SQL Smells and Requirements Smells around – “Using Hierarchies”.

Packing lists into SQL columns – SQL Smells

Packing Lists into SQL Columns creates an SQL Smell
Packing Lists into SQL Columns creates an SQL Smell

Some SQL data types are amazingly flexible. As a consequence people are tempted to put all kinds of data into character columns. In his article on SQL Smells Phil Factor identifies packing lists or complex data into a column as one of the “smells”. To be frank, it stinks! One row should contain one value for each column. That value should mean a single thing. Doing anything else is inviting problems.

Let’s look at how to recognise this particular “smell”, where it comes from, the consequences of allowing designs containing it and how to remove it. I’ll also touch on the limited circumstances when it is acceptable.

Recognising “packing lists into a column”

Examples of lists packed into SQL Columns
Examples of lists packed into SQL Columns

If you are an analyst or designer, working with the Conceptual Design or Requirements for the database then you will know when you are tempted to do this. My advice is – Don’t do it! The explanation will come later.
On the other hand, this smell can be hard to recognise if it actually gets into the database design. There will be evidence in three places:

  • There may be signs of “lists” in the database design. The names of affected columns may be plural, or something like “List_of_…”. The column is likely to be defined as a character type.
  • There will be evidence in the data. This is the easiest place to find the evidence. There will usually be a separator character between the different elements of the list, like “1,2,3,4”. Beware! Fixed length character columns, divided into fields, mimicking an ancient punched card are not entirely unknown.
  • The code will provide evidence. You will know it when you see it. The code will parse the offending column into separate values based on either a separator value or column positions.

Limitations of the packing approach

This approach is not using a relational database as intended. This will impose limitations on your system.

  • You will not be able to search on the individual fields without unpacking them.
  • You won’t be able to update the fields without unpacking them and then re-packing them.
  • Changes to the implied record structure will mean fundamental changes to the database and the associated code.
  • All the packing and unpacking will cause poor performance.

Excuses for packing a list into a column

There are two reasons you may be tempted to pack a list in this way:

  • You think it will somehow be “more efficient”. Take Phil Factor’s word for it. It won’t!
  • Another system expects something in this form. In this case you would be better to “do things properly” and do the conversion close to the interface with the other system. That way you limit the effect the other system is having on you.

The only acceptable excuse is that the data in the column is going to be treated as a black box. All your system is doing is storing it.

How to avoid packing a list into a column

There is one main strategy for avoiding packing a list (or other complex data) in a column. You should aim to understand the list or data you are packing into the column. Consider breaking the complex data out into a new entity. Apply the techniques of data modelling or normalization.

Summary

A column containing complex data indicates problem with the Conceptual Model of the database. You should review the data model and apply the rules of normalization.

Where next?

Having dealt with a problem poor data modelling, in the next article I’m going to look at a general problem of data design and an associated SQL Smell – “Using inappropriate data types”.

SQL Naming Conventions prevent SQL Smells

SQL Naming Conventions prevent bad SQL Smells

SQL Naming Conventions prevent bad SQL Smells

Anything written about Information Technology has to mention naming conventions. Phil Factor’s article on SQL Smells describes several smells which can be caused by poor naming of objects in the database. In this article, I am going to concentrate on naming tables and columns.

What makes a good naming convention?

 

I don’t think I can do better than quote Phil Factor in his original article:

“Identifiers should help to make SQL readable as if it were English.”

By “Identifiers” Phil means the names of the objects. The name of the object is to distinguish it from all the other things in the database. It does not need to fully describe the object. It is good if it tells people what the object contains, but it does not have to describe it in detail. There are better ways to do that.

By the way, SQL was originally meant to be called “Structured English Query Language” but the name was altered because of a trademark clash.

Short, meaningful names are good. Nobody will thank you for meaningless names like “T1231237” or “TEMP000”. I’ve actually seen both of those! Similarly, nobody who has to type it in is going to like really long names.

Any standards, naming conventions, or guidelines should help you to choose names which are useful where you are working. I’m not going to tell you what to do, but I am going to suggest some things to avoid, or at least to watch out for.

Maximum size in characters of Table and Column names in SQL Server, Oracle and MySQL
Maximum size in characters of Table and Column names in SQL Server, Oracle and MySQL

Naming things well is an art. Naming linking tables is especially hard. It is worth spending a little (but not too much) time on naming things. The names you choose will be referenced throughout your system and will be with with you for a long time.

Over the years, the maximum size permitted for table names has increased. That has made life easier. Really short table names practically demanded structured, coded names which were a pain. On the other hand, please don’t get carried away with long names.

Naming Conventions by exception

Rather than provide detailed guidance on a naming convention, other than the “Goldilocks – Meaningful, not too short, not too long, just right!”, I’m going to give you a list of naming SQL Smells which should make your nostrils twitch:

  • Avoid using reserved words in names: I did it inadvertently once by calling a column “Desc” (for “Description”) which clashed with DESC for “descending” in SQL. It caused all sorts of trouble and confusion.
  • Avoid special characters (including spaces) in names: This is allowed if you enclose the name in square brackets ([Table$name]), but this really is looking for trouble and making work. It is far better to keep things simple. In the same spirit, don’t use square brackets where they are not necessary ([Table_Name]). It’s ugly and harder to read.
  • Be suspicious of numbers in object names: Why would a table-name need to contain a number? There are occasional cases where a column name might contain a number. By convention “Address_Line_1” etc. is fairly common. But they are rather rare. Numbers in object names should make you suspect the data model.
  • Avoid system-generated object names: SQL Server (and other database managers) will provide default names for some objects, like indexes and constraints. The names it uses are ugly and sometimes give little clue what the object is doing.

Summary

When you are designing your own database, follow the local conventions and concentrate on making the names readable and easy to use.

Where next?

Having dealt with the issue of naming, in the next article I’m going to look at a specific SQL Smell – “Packing lists into a column”.

Polymorphic Association – bad SQL Smell!

Polymorphic Association creates a bad SQL Smell
Polymorphic Association creates a bad SQL Smell

When I first read Phil Factor’s article on SQL Smells I had to look up Polymorphic Association  to make sure that I understood what he was getting at. He’s right – this is a bad SQL smell! In my option it is one that can always be avoided.

But what is a “polymorphic association” in SQL?

Polymorphic Association

The reason I didn’t recognise Phil Factor’s polymorphic association, is that I knew by the name “alternative parent”. Polymorphic Association is probably known by other names too!

Relational Databases don’t recognise the idea the idea of sub-typing. The construction Phil Factor doesn’t like is the result of a flawed attempt to provide sub-typing.

We need to deal with some real situations like:

  • “Customer” – has the sub-types:
    • Person
    • Corporate Entity (or “Business”)
  • “Something we sell” (perhaps “Saleable item”?) – has the sub-types:
    • Product
    • Service

One way to do this is to define entities with sub-types. Doing sub-typing like this is a decision in the “Conceptual Model”. I need to point out that this is contentious because it builds the sub-types into the structure of the database itself. They become “hard-coded”.

In the rest of this article I’m going to use the “Customer: Person/Business” example.

What are we trying to achieve with “polymorphic association”?

Two ways of representing sub-typing
Two ways of representing sub-typing

In the example, someone has decided that we need to be able to record that Orders. Every Order is placed by either a Person or a Business. They have decided to have separate entities for Person and Business because they have different attributes. I have seen this conceptual requirement represented in several different ways, including:

  • “Person” and “Business” surrounded by a box representing “Customer”
  • A bar (often curved) drawn across the lines representing the relationships between Person or Business and Order.
  • A Customer super-type with Person and Business sub-types.

The requirements are fairly simple:

  • An Order is always placed by a Customer
  • A Customer has to be a Person or a Business (as far as we are concerned, it can’t be both)
  • So every Order is placed by either a Person or a Business.

What’s the problem with “polymorphic association”?

The problem of polymorphic association
The problem of polymorphic association

Phil Factor identifies a problem with “polymorphic association” when the database has reached the Physical Design stage. We can spot the potential for this problem to occur much earlier, when we are gathering Requirements and creating the Conceptual Design.

The problem in the Physical Design is that the Foreign Key in Order can refer to more than one table. The Cust_Type column tells use which table to use, but the database manager will not understand that. There are also further clues in the names of the various columns – If you look, you will notice a lot of use of the “super-type” in the names!

There are several problems with this Physical Design (Phil Factor’s article gives a more technical description):

  • It is not easy to spot in the database, and that can be a problem in itself.
  • It will result in complex procedural code.
  • The Order table cannot use constraints to check that the foreign keys are valid.
  • JOINs will be complicated
  • This is construction is likely to perform poorly.

Polymorphic Associations – One solution

One solution to this problem is to recognise the “super-type” and make it explicit.

Super-type and sub-types, highlighting the relationship decorations
Super-type and sub-types, highlighting the relationship decorations

This slightly different construction with a Customer entity or table, enables us to solve the difficulties with the polymorphic association or alternative parent design. Notice how the ends of the relationships nearest the Person and Business tables are not the familiar “crow’s feet” but indicate that there must be either one or zero rows.

When we use this design all the Foreign Keys in the Order table now reference rows in the Customer table.

A solution to polymorphic association with a super-type
A solution to polymorphic association with a super-type

Spotting potential Polymorphic Associations

Deal with Polymorphic Associations during the Conceptual Design of the database. The indications that the Requirements create a situation where someone might try to use a Polymorphic Association are:

  • There are several entities which are sufficiently similar that you consider having one entity, but are sufficiently different that you think multiple entities are justified.
  • Something (like an order) needs to have an association with one (but only one) from among this group.
  • Finally, watch for the word “OR” in your thinking. A Person OR a Business places and Order.

If you encounter these conditions, first you should challenge whether sub-typing is the correct solution. If you think it is, then use the single super-type and multiple sub-types form and avoid the polymorphic association.

Summary

I agree with Phil Factor. If you encounter the polymorphic association, stop! It is very unlikely to the the best solution in SQL. Don’t try to implement a polymorphic association. Create an explicit table for the super-type and then a separate table for each of the sub-types instead.

Where next?

In the next article I’m going to look at something which is responsible for a lot of bad “SQL Smells”. This multiple bad smell is the naming of SQL objects.

Do “Entity Attribute Value” (EAV) Models smell bad?

Does the Entity Attribute Value (EAV) model an SQL smell?
Does the Entity Attribute Value (EAV) model have an SQL smell?

Have you heard of the “Entity Attribute Value” (EAV) Model or pattern? You may have, even if you don’t recognise the name.

The Entity Attribute Value pattern allows someone to add extra attributes to some entities. You know for sure that you have the EAV if you have entities or tables called:

  • …attribute
  • …attribute_value

You can substitute words like: property or feature instead of attribute. That’s right – you have an entity called attribute!

Phil Factor identifies the EAV as a potential “SQL Smell” in this article . I regard the Entity Attribute Value model as a “Requirements Smell” too. There are legitimate uses for EAV, but having an EAV may also indicate a problem. The problem my be with your Conceptual Model, or with the way it has been turned into a Logical Database Design.

My previous article dealt with tables which were very wide (have lots of columns). The EAV has the potential to produce a table (the “attribute_value” table) which is very narrow (typically only 3 or 4 columns) and is very long (with lots of rows).

Good Reasons to use an Entity Attribute Value (EAV) Model

The Entity Attribute Value model is not necessarily wrong. There are legitimate reasons for having an EAV:

  • Conscious modelling of data abstractions: EAVs can be used to take your requirements to a more abstract level. They are commonly found deep inside of “modelling tools”, CASE tools and other software packages which are intended to be configurable.
  • Consciously making the Conceptual Model “Extensible”: Allowing “user-defined” (or administrator defined) attributes for things is another legitimate use.
  • Anticipating very sparse data: EAV is a way of handling lots of NULLable columns. It uses space efficiently, but at the cost of more complex processing.

Bad Reasons to use an Entity Attribute Value (EAV) Model

  • It’s “cool”! I confess. I have done this. When I first discovered the EAV model, I tried to apply it everywhere. This is not a good idea.
  • Laziness is a very bad reason for using the EAV model. The argument goes something like: “we’re not sure what attributes the users need, so we will allow them to define their own”. The problem with this approach is that these “user-defined” attributes are hard to validate and process design becomes significantly harder.

Benefits of using an Entity Attribute Value (EAV) Model

  • Creates opportunities for re-use: Using the EAV model can create opportunities for reusing code. All those user-defined attributes are maintained by the same code. It can work very well with abstract Object-Oriented design.
  • Can make for very elegant and compact code: Well thought-out EAV code can be compact and elegant. This is one reason why you will find EAV models inside many packages.
  • Can make the data very compact: Using an EAV model can reduce the number of tables you need and the space the data takes up.

Disadvantages of using an Entity Attribute Value (EAV) Model

  • Data validation becomes harder: The value on the “Attribute_Value” table tends to be stored in a data-type like “varchar”. This makes validation of the data harder. Of course you can start to add validation yourself, but this is adding complexity.
  • Code becomes abstract and hard to understand: Code written to use an EAV always has to go through extra steps compared to having the column you want directly on the table.
  • Data becomes abstract and hard to understand. One solution to this is to add SQL views or a layer of code which transforms the abstract data in the EAV into something closer to what the business users are expecting.
  • The application may need “seed data” which is almost part of the code. This is what happens in some packages.
  • The application may require a complex “configuration” process. Again, this is what you find in some packages. You have to select which values
  • Performance: EAV requires 2 joins to get a value and the attribute name. This has performance implications.

Where the complexity and performance impact come from

How "user-defined" values are stored in the Entity Attribute Value (EAV) model
How “user-defined” values are stored in the Entity Attribute Value (EAV) model

We can imagine a simple EAV model where the “Entity” contains a single attribute called “Name”. Retrieving the values of “Name” is straightforward. Using the EAV model we can create user-defined attributes called: “Type, Colour, Length and Width”. We can record values for these attributes for any row in the entity table. It is hard to validate the data.

We can retrieve the value of a user-defined attributes using a JOIN. To get the “name” of attribute will require a second join. This can get messy!

SELECT will follow a different path for "user-defined" attributes in the EAV model
SELECT will follow a different path for “user-defined” attributes in the EAV model

Using the values in the EAV tables to identify rows in the Entity table is possible. This may present a challenge to the designers.

Summary – Does the Entity Attribute Value (EAV) Model smell bad?

I agree with Phil Factor. Think hard about whether you should use the EAV model. Using EAV inappropriately can have bad effects.

The Entity Attribute Value model may indicate the requirements are not understood. That is always a bad thing.

Where next?

In the next article I’m going to look at another “SQL Smell”. Phil Factor calls this one “Polymorphic Association”.