People choose inappropriate data-types. This isn’t surprising. There are lots of SQL data-types, so people make inappropriate choices. Phil Factor names “Using inappropriate data-types” as a smell in his article on SQL Smells.
I’m going to concentrate on dates and numbers in this post. I will explain why people choose inappropriate data-types. I will also describe an approach which will encourage you to choose the right ones.
This will be a superficial treatment. I’m going to look at the problem from a high level. Dates and numbers can suffer from detailed technical problems as well.
Why do people choose inappropriate data-types?
Phil Factor identifies the main reason for this problem as:
“Confusing how data will be presented with how it will be stored”.
I agree. Here are some reasons Analysts choose inappropriate data-types:
- We approach problems from the outside and take the users’ point of view. We should consider presentation. Inside the database, data should be stored in an appropriate form.
- The same argument applies for interfaces and interchange formats. Interface requirements should not determine the way data is stored internally. Interfaces are still important. Where possible, standard interchange formats should be used.
- Spreadsheets have made us lazy. You don’t have to think about the “data-type” when you key something into a cell. Format and validation are often added afterwards.
- There are “folk memories” about problems with data in old file-based systems. These systems did not have the rich range of data-types of modern databases and languages.
Consequences of using inappropriate data-types
Inappropriate data types can have serious consequences for a system we are building. Some of these problems are not obvious. Many of these problems apply to all systems. Some of these problems become more important with larger databases.
- Having the wrong format makes validation harder. It prevents the database engine from checking the content and increases the risk of “garbage data” getting into the system.
- The possible presence of garbage data makes error handling throughout the system harder.
- Storing data in “display format” imbeds that format deep inside the system.
- “Dates” are associated with useful functions which simplify program design.
- Inappropriate data-types can change how data is sorted. This influences how indexes work and cause performance issues.
- The “correct” data-types are usually very space efficient. Using alternatives can waste space in the database for no benefit.
Let’s look at some specific examples:
Inappropriate data-types for Numbers, especially currency
It is possible to present numbers as strings, even including the decimal and thousands separators and any related currency symbols.
Interchange files can contain numbers as text, because it is convenient.
Numbers stored as strings are harder to validate.
Numbers stored as strings are sorted differently to numbers stored as numbers. If you doubt me, then try the experiment of illustrated in the Figure: “Number versus Character sorting” in your favourite spreadsheet.
Inappropriate data-types for Dates
Some early databases did not handle dates very well. This encouraged designers to do-it-themselves with varying degrees of success.
It is possible to represent a date as an integer. Such a “date” will sort as you expect, but needs its own validation and will not help you with date arithmetic.
It is also possible (and unfortunately common) to store dates in character fields. In most cases this is simply “an accident waiting to happen”!
All these do-it-yourself options are vulnerable to the problem that Americans tend to specify dates “mm-dd-yy” and Europeans (including the British) tend to specify dates “dd-mm-yy”. There is nothing we as analysts can do about this except to make sure that the test data for any system always includes a date with “13th of the Month”!
Benefits of using the appropriate data-types
The benefits of using the appropriate data-types far outweigh any perceived costs. Most of the “cost” is simply being aware that there are options and then not choosing the inappropriate data-types!
Using the appropriate data-types will:
- Help protect you from “garbage data” (The database will reject an incorrect leap year 29th of February!)
- Sort as you (and the business) expect without the need to work out the details.
- Allow you to specify the presentation separately from the storage. Many languages and presentation frameworks have these facilities built in.
- Take up less storage space.
- Make your system perform better!
How to prevent appropriate data-types
People choose inappropriate data-types in the transition from the “Conceptual Model” to the “Logical Model” (or possibly the “Physical Model”). We have selected the entities and attributes the system needs, but we have chosen inappropriate data-types.
The solution is to separate different aspects of the data and the decisions we need to make in our minds.
Here is the approach I recommend:
In the Conceptual Model:
- Decide what data you need (the “attribute”) and decide what kind of data it is. Do this from a “non-computer” point of view.
- If it is a number, say what it counts or measures and what the units are.
- Treat “money” as a unit.
- For dates and times, label them loosely as “date”, “time” or “date-time”.
- Record any “limits”. The database designer may use them in detailed design.
- Say how the users of the system will expect to see it presented. You will use this information in the user-interface design.
In the Logical Model:
- Decide what kind of “bucket” the database should put it into. A database professional may help you with this.
- If it is a number, and a precise value, say it is an Integer (of some kind). For decimals, say how many decimal places you need.
- Look hard at dates and times. Do you mean “date” or “time of day”? Do you need an “elapsed time” or a “point in time”?
In the Physical Model:
- Decide exactly what SQL Data-type you are going to use. Many of the basic data-types have alternatives. There are several types of “Integer” and quite a lot of “Date and Time” types.
- This is a good time to talk to a database professional.
Summary
There are two main reasons for choosing inappropriate data-types in SQL:
- Concentrating too much on how data will be presented, rather than how it will be stored
- Making decisions about the Physical Model prematurely
Using inappropriate data-types can have wide-ranging harmful effects on your database and system.
Avoid the problems by following a simple process:
- Concentrate on what data the system needs in the Conceptual Model.
- Outline how that data should be stored in the Logical Model.
- Confirm the exact SQL data-type in the Physical Model.
This does not have to be difficult or time consuming. It fits perfectly well with “Agile” development.
Where next?
“Inappropriate data-types” was a problem with converting a Conceptual Model into a Logical Model. In the next article I’m going to look at the SQL Smells and Requirements Smells around – “Using Hierarchies”.