“SELECT *” What could possibly be wrong with that? Everybody uses SELECT *, don’t they?
Phil Factor describes this as an SQL Smell, and it is. Finding it should immediately make you suspicious. The problem is not the statement, it’s where you find it!
Interactive and Batch SQL
You can use SQL interactively, in “batch”, and inside programs. One of the good things about SQL is that it looks pretty much the same wherever you find it.
“SELECT *” is intended to be used interactively. That’s how I use it, and I expect Phil Factor does the same. Typing the statement in the figure at a command line, or inside a development environment like SSMS is completely appropriate.
Some people create queries interactively using “SELECT *” as a starting point. That’s legitimate too. It’s a matter of personal style.
Don’t use this form of SELECT in a program or when you expect to reuse it. If you save the file, you shouldn’t be using “SELECT *”.
Why Using “SELECT *” is a problem
Sometimes we want things to break! We want something to fail before something worse happens.
You can change the design of tables in a database. One way is using the ALTER statement. Columns can be added and removed.
“SELECT *” will continue to return a result even when the tables it is using have changed significantly. This is a problem because we don’t know if it is still doing what we originally intended!
Legitimate uses of SELECT *
There are a few ways you can use an asterisk in a SELECT statement without taking a risk. That is when you are checking if something exists, or counting the number of rows. In both cases the columns of the tables are irrelevant.
Phil Factor identifies “SELECT *” as an SQL Smell. It can be used interactively, but almost anywhere else it has the potential to cause problems.
Do you use “LIKE” in searches? There times when Phil Factor thinks this is an SQL Smell too. Find out why in the next article.
Phil Factor identifies several SQL Smells associated with the use of DateTime, Date and Time data-types. Using the wrong types will waste space, harm performance and create “odd” behaviour.
If you are clear about what you are recording, you will avoid these issues. I prefer to say what I want, and let an expert choose the best date-types. In other words, I prefer to separate the analyst and designer roles. That way I avoid suggesting the wrong types.
DateTime, Date or Time – Which do you need?
DateTime columns which contain only Date or Time contribute two of SQL Smells. This wastes space, both in storage and in memory (which will degrade performance).
There is something worse here. Using DateTime in this way suggests we are not clear about what we want. This lack of clarity encourages the designer to “hedge their bet” by using DateTime.
How precise do you need to be?
Business Systems need to record dates and times, but they don’t need great precision. For many business transactions, the nearest second or even minute is adequate. In some cases recording extra precision can be misleading.
For a “Period” you need to know 3 things: Start, End and Duration. You only need to store two out of the 3. The third value can always be calculated if you have the other two.
Storing the “start” and “end” is more flexible. It is straightforward to work out if an event took place within a given period.
Telling if periods overlap is easy too.
If you decide to store a duration you must specify the units you intend and the precision you need. (As Phil puts it “milliseconds? Quarters? Weeks?”). Do not be tempted to store a duration in a “Time” data-type! “Time” is intended for “Time-of-day” not duration.
Dates and Times: Choosing the right data-type – Some simple questions
Choosing appropriate data-types for dates and times is not difficult, if you go about it the right way.
Divide the job into two steps: “The Requirement” and the “The Most suitable technical data-type”. Do the two steps separately, taking into account any local standards and conventions.
Is this an “Event” (a “Point in Time”) a “Period” or a “Duration” (both have beginning and an end)?
Event: What is the best way to represent this data? Should it be a “Date” or a “Time”? Does it really need “Date-time”?
Duration: What are the units and scale of this duration?
How precise do you need the value to be? You may surprise yourself.
How long do you expect this system (and the data in it) to last?
The Most Suitable Technical Data-type
Use the table (based on Microsoft’s Documentation) to choose the best data-type for your needs. This table is for SQL Server. Other database managers will have similar tables.
Microsoft recommends using DATETIME2 in preference to DATETIME for new work (it takes up less space and is more accurate). Providing the maximum date is acceptable, I would consider SMALLDATETIME for business transactions (but you do risk creating a “Year 2000 problem” if the data turns out to have a long life).
If your system will span several time-zones, then you should definitely consider the benefits of using the DATETIMEOFFSET data-type.
The DateTime, Date and Time data-types can all cause SQL Smells when they are used inappropriately. Problems can be avoided by following some simple guidelines.
Phil Factor doesn’t like “SELECT *”. Find out why in the next article.
Here’s another of Phil Factor’s SQL Smells. “Using the same column name in different tables but with different data-types”.
At first glance this seems harmless enough, but it can cause all sorts of problems. All of them are avoidable. If you are an analyst, make sure you are not encouraging this to happen. If you creating the Physical Model or DDL for a database, “Just don’t do it!”
The problem here is not “using the same column name in different tables”. That is perfectly ok. Similarly, using “different data-types for different columns” cannot be wrong. That’s exactly what you should expect to do.
The problem is doing both at the same time. The issues are: practical and technical.
The Practical Problem of the same column name with different data-types
Any human user is likely to think that the same name refers to the same type of thing. They won’t check that the definitions of both “names”. No amount of “procedures” or “standards” will make them do anything different.
Sooner or later this will cause one of the technical problems.
The Technical Problems from the same column name with different data-types
Technical problems will occur when a value moves from one column to the other, or when comparing the two columns. Data may be truncated and those data transformations cost effort.
These problems may not manifest themselves immediately. The consequences will be data-dependent bugs and poor performance.
The Solution to this Issue
This smell and the associated problems can be avoided by following some simple rules:
If two columns refer to the same thing (like a foreign key and a primary key), make sure they are the same data type.
If two columns refer to different things, then give them distinct names. (Do not resort to prefixing every column name with the table name. That’s horrible)
Having columns with have different names and the same data-type is perfectly OK.
“Using the same column name in different tables with different data-types” in an SQL database is simply “an accident waiting to happen.” It is easily avoided. Don’t do it and don’t do anything to encourage it.
The next article is about the smells which come from dates and times.
NULLs are a perennial problem. Nobody likes them. They confuse developers and users and many analysts do not really understand them.
The concept of NULL allows us to say that there are things that we do not know.
In his article on SQL Smells, Phil Factor associates several smells with NULLs. In this post I’ll explain how to avoid using NULLs and how to use them properly when they are necessary.
Why do we need NULLs at all? What is the benefit and what is the cost?
NULL has a simple meaning with wide-ranging and surprising consequences. NULL means the value is unknown. And this in turn means that the result of any calculation or concatenation which uses this value must also be unknown!
NULL is a feature of SQL. The benefit of allowing NULL or “three valued logic” (TRUE/FALSE/UNKNOWN) is that it allows a database to record that there are things we do not know. The cost of having them is that any calculation or concatenation which uses this value must also be unknown! This confuses many people.
Reasons for needing NULL
There are many reasons why we might not have data to put into a column. Thinking about why we are considering defining a column as NULLable will encourage us to consider alternatives.
The structural NULL – Permanent sub-types
Sometimes we want to combine two entities into a single “super-type” table. There are attributes of Person will never be used for Business (and vice-versa). These missing values will need NULLs.
The structural NULL – Lifecycle subtypes
Something similar can happen if we combine all the steps of a entities lifecycle into a single table. The attributes of the later stages will always be empty (NULL) until that stage is reached. These later steps often contain dates or times.
In both cases involving sub-types it may be possible to splitting the sub-types into separate tables. Consider whether it is worth the effort and make sure you avoid the pitfalls of sub-types in SQL.
Data that will never be there
Entities like “Address” are frequently modelled with attributes like “AddressLine_”. In many cases there will never be values for the later lines. They will not be mandatory in the user interface, but do they need to be NULL? Consider whether allowing them to default to “spaces” or an empty string, would be better and whether it would have any bad effects.
There are some things which should hardly ever allow NULL. This includes all keys and identifiers. Avoid allowing short titles or descriptions to be NULL (For long descriptions allowing NULLs is understandable).
There are some attributes where allowing NULL is hard to avoid. Life insurance and pensions companies may need a “date of death” for their customers! Having a column with allows NULL is often the easiest way of handling this.
You should resist the temptation to use “magic dates” or inappropriate data-types in order to avoid allowing NULL. The consequences are far worse than the problem.
NULLs are a problem, nobody likes them but they are necessary. Many problems with NULLs can be avoided by two rules:
Remember that NULL means “unknown value” and this has consequences.
Ask “_why_ don’t we have this data?”
In many cases NULLs can be avoided by data modelling – that means the analyst has to do work in the Conceptual or Logical Model.
The next article is about another smell: having the same name for different things!
Database Indexes are something which a lot of analysts ignore as being “too technical”. This is a pity.
Several SQL Smells in Phil Factor’s article point at possible bad decisions. Thinking in the Logical Model can improve these decisions.
Choosing the correct indexes is a typical “Goldilocks Problem”: not too few, not too many, just the right number! Bad or inadequate requirements will contribute to designers making bad decisions. Phil Factor describes having the wrong indexes as an SQL Smell!
What is an Index and what does it do for us?
If you are an analyst, you may not know exactly what an index is. In non-technical terms, an index provides a quick way for the database manager to find the rows it needs in a table. There are several sorts of index.
An index is a “thing” in its own right. An index takes up space in the database. Updating an index costs effort. The main benefit of an index is that it makes select or read operations faster.
Types: Unique, Non-Unique, Clustered
There are three main kinds of index: Unique, Non-Unique and Clustered.
Imagine that we have a very simply database consisting of 3 tables:
Customer (not shown in the diagram)
In the Order table we can see three columns which might be used to identify the order
If you are not familiar with GUIDs, they are a way of assigning identifiers or “keys”. They are worth finding out about. It would be unusual to expect a human being to type in a GUID. An “OrderNum” (which paradoxically might contain letters!) would be more convenient for the users.
We expect all three: OrderId, GUID and OrderNum, to be unique. Therefore, all three are candidates for Unique Indexes. If an application attempts to create a duplicate value in a column which has a unique index, then the database manager will raise an error and reject the transaction.
Non-Unique Indexes for Foreign Keys
In the OrderLine table you can see two columns which identify things in other tables: OrderId and ProductId. These are Foreign Keys. In this case we cannot say they are unique, but they are candidates for non-unique indexes.
We could also have used OrderNum or OrderGUID as Foreign Keys into Order.
It is good practice for the rows in the OrderLine table to have a unique identifier. There are two common ways of doing this.
We can assign an OrderLineId (which is unique across the whole table) or
we can use the combination of OrderId and OrderLineNum which together would identify a row.
In this example, both OrderLineId and the OrderId and OrderLineNum combination are (seperate) candidate unique indexes.
The order of the rows in an SQL table is specified by the Clustered Index. Each table can have only one clustered index. The clustered index must be unique.
People often make the “primary key” the clustered index, but it is worth considering other options. In the example, OrderLines can be added to an order after it has been created.
Using the OrderId, OrderLineNum index as the clustered index would make the database store all the “lines” for one order together (whenever they were added to the order). That may be more efficient for retrieval. Phil Factor identifies two smells with the choice of clustered indexes.
Non-Unique Indexes for Searching
Columns which will be used for searching should be considered candidates for a non-unique index.
The role of Analysts in choosing Indexes and Index types
Indexes are usually specified in the “Physical Model”. The analyst can help the database designer make the right decisions, by applying a little thought. The analyst should not try to pre-empt the designers decisions. They should aim to assist by identifying relevant “candidates”.
Indexes enforce business rules like uniqueness in an SQL database. They influence database performance. Considering candidate indexes in the Logical Model and even the Conceptual Model will help database designers make better decisions.
The next article is about the smell of nothing, or “Nulls”. Nulls present problems for developers and database designers.