“SELECT *” What could possibly be wrong with that? Everybody uses SELECT *, don’t they?
Phil Factor describes this as an SQL Smell, and it is. Finding it should immediately make you suspicious. The problem is not the statement, it’s where you find it!
Interactive and Batch SQL
You can use SQL interactively, in “batch”, and inside programs. One of the good things about SQL is that it looks pretty much the same wherever you find it.
“SELECT *” is intended to be used interactively. That’s how I use it, and I expect Phil Factor does the same. Typing the statement in the figure at a command line, or inside a development environment like SSMS is completely appropriate.
Some people create queries interactively using “SELECT *” as a starting point. That’s legitimate too. It’s a matter of personal style.
Don’t use this form of SELECT in a program or when you expect to reuse it. If you save the file, you shouldn’t be using “SELECT *”.
Why Using “SELECT *” is a problem
Sometimes we want things to break! We want something to fail before something worse happens.
You can change the design of tables in a database. One way is using the ALTER statement. Columns can be added and removed.
“SELECT *” will continue to return a result even when the tables it is using have changed significantly. This is a problem because we don’t know if it is still doing what we originally intended!
Legitimate uses of SELECT *
There are a few ways you can use an asterisk in a SELECT statement without taking a risk. That is when you are checking if something exists, or counting the number of rows. In both cases the columns of the tables are irrelevant.
Phil Factor identifies “SELECT *” as an SQL Smell. It can be used interactively, but almost anywhere else it has the potential to cause problems.
Do you use “LIKE” in searches? There times when Phil Factor thinks this is an SQL Smell too. Find out why in the next article.
Phil Factor identifies several SQL Smells associated with the use of DateTime, Date and Time data-types. Using the wrong types will waste space, harm performance and create “odd” behaviour.
If you are clear about what you are recording, you will avoid these issues. I prefer to say what I want, and let an expert choose the best date-types. In other words, I prefer to separate the analyst and designer roles. That way I avoid suggesting the wrong types.
DateTime, Date or Time – Which do you need?
DateTime columns which contain only Date or Time contribute two of SQL Smells. This wastes space, both in storage and in memory (which will degrade performance).
There is something worse here. Using DateTime in this way suggests we are not clear about what we want. This lack of clarity encourages the designer to “hedge their bet” by using DateTime.
How precise do you need to be?
Business Systems need to record dates and times, but they don’t need great precision. For many business transactions, the nearest second or even minute is adequate. In some cases recording extra precision can be misleading.
For a “Period” you need to know 3 things: Start, End and Duration. You only need to store two out of the 3. The third value can always be calculated if you have the other two.
Storing the “start” and “end” is more flexible. It is straightforward to work out if an event took place within a given period.
Telling if periods overlap is easy too.
If you decide to store a duration you must specify the units you intend and the precision you need. (As Phil puts it “milliseconds? Quarters? Weeks?”). Do not be tempted to store a duration in a “Time” data-type! “Time” is intended for “Time-of-day” not duration.
Dates and Times: Choosing the right data-type – Some simple questions
Choosing appropriate data-types for dates and times is not difficult, if you go about it the right way.
Divide the job into two steps: “The Requirement” and the “The Most suitable technical data-type”. Do the two steps separately, taking into account any local standards and conventions.
Is this an “Event” (a “Point in Time”) a “Period” or a “Duration” (both have beginning and an end)?
Event: What is the best way to represent this data? Should it be a “Date” or a “Time”? Does it really need “Date-time”?
Duration: What are the units and scale of this duration?
How precise do you need the value to be? You may surprise yourself.
How long do you expect this system (and the data in it) to last?
The Most Suitable Technical Data-type
Use the table (based on Microsoft’s Documentation) to choose the best data-type for your needs. This table is for SQL Server. Other database managers will have similar tables.
Microsoft recommends using DATETIME2 in preference to DATETIME for new work (it takes up less space and is more accurate). Providing the maximum date is acceptable, I would consider SMALLDATETIME for business transactions (but you do risk creating a “Year 2000 problem” if the data turns out to have a long life).
If your system will span several time-zones, then you should definitely consider the benefits of using the DATETIMEOFFSET data-type.
The DateTime, Date and Time data-types can all cause SQL Smells when they are used inappropriately. Problems can be avoided by following some simple guidelines.
Phil Factor doesn’t like “SELECT *”. Find out why in the next article.
One of the “SQL smells” Phil Factor identifies in his article is the presence of “God Objects” in your Database or design. I agree with him, except that I would call them “very wide tables”. If you find them, then you may have a problem with the Conceptual Model you are using, or possibly t you should be considering using a different tool. In other words, you have a problem with your requirements. You have a “Requirements Smell”.
How many columns make a “God Object” or wide table?
Let’s start with the obvious question: How many columns make a “God Object” or wide table? The maximum number of columns you are allowed to have in a table varies with database manager. For example:
What the actual numbers are can depend on a lot of technical things. One hundred is still a big number.
Database management software will handle wide tables up to their limits. As with most things, when you approach the limit you will start to encounter difficulties, but that is missing the point. Even 100 columns may indicate a problem.
Why are “God Objects” or wide tables a problem?
The reasons with “God Objects” or wide tables cause an SQL Smell are technical, practical and what you might term business, or even philosophical problems. I’m a Business Analyst, so I’m going to start from the “Conceptual” end, with the Requirements for the database, and then look at the problems which these tables may cause in Development and then when the system is in operation. Also remember, that if we eliminate problems at the conceptual end, then we’re not going to encounter them further on. Wide tables are most certainly a problem with starts at the “Conceptual Model” stage.
”Conceptual Model” or philosophical problems
Each row in a relational table is supposed to represent something. The “something” may be a concrete object in the real world, or it may be something abstract like a contract or a transaction. Would you be able to explain to the users of your system, or your business owners what a single row represents? If not, you are likely to encounter problems.
Thinking about the columns in this wide table, each column is contains a value. How are you going to present or update those values? 1000 fields would make for a very busy screen. Even some sort of graphical representation is likely to be complex. Do your users really need to see all this data together? While there isn’t a rule which says that the whole of an entity has to be presented on a single screen, or as a single report, it has to represent something. Finally, every column in a row provides one value for one thing at one time. Is that really so in your wide table?
Problems during development
“God objects” or wide tables encourage handling one big lump of data. That in turn is going to encourage the creation of complicated code. Maybe life would be easier for everyone if the data and the process descriptions were much more focused.
If you are in an Analyst role, then think about how you are going to explain what should (and should not) be happening with all these columns.
Remember, SQL tables have no concept of “grouping” of the columns. The columns have an order, but it is not something you should be relying on. If you can form columns into groups, then you should probably consider “normalizing” them into other tables.
Problems in operation
“God objects” or wide tables can cause problems when the system is being used. The volume of data each row contains may cause performance problems when rows are read from the table, when rows are updated and when new rows are created.
Why do we get “God objects”?
Wide tables often start from trying to convert large and complex paper forms or spreadsheets straight into table designs. It seems like a good idea at first, but it can get bogged down in unexpected complexity.
Think about your least favourite paper form, especially if it runs to several pages – maybe it’s a tax return or something similar. Obviously the physical form represents something. If you were specifying a system to work with it, then you would be tempted to have a single table where each row represented a single form, there was a column for every question and each cell contained one person’s answer to a question. It would be just like an enormous spreadsheet. Some early commercial computer systems were like that. They worked but they were inflexible.
One clue that something is going wrong (apart from the number of columns) is the number of columns which need to allow “NULL” values. How many times does “Not Applicable” appear when you are filling in the paper form?
How do we solve the problem of the wide table?
The answer is to think about what all these columns mean and then start applying Data Modelling or normalization techniques to break the data into more manageable and useable chunks. If you can from groups of columns then those groups may be candidate entities and therefore candidate tables.
If you need to use the order of similar columns then maybe you should be considering a different table design like the “Entity Attribute Value” (MVP) Pattern. But beware, because that can give rise to a bad smell too!
Excuses for “God Objects” and wide tables
Nothing in Information Technology is ever clear-cut. There are usually grey areas. One person may regard a table as too wide and another may regard it as OK. There is always room for some discussion. There are times when using a table that is a little wider than we would normally like is acceptable. Here are some of the reasons (or maybe that should be excuses) that you may here for wide tables.
It gets all the work done in one place, so that other programs can use the data. I don’t really buy this one. I suspect that someone is guessing what these other programs need. If the guess is wrong then someone is going to have to re-design the big, wide table. I continue to maintain that having discrete data and performing discrete actions is better.
Here is a specific case I found where someone wanted to retrieve data from 2000 sensors. This is a case where using something other than a relational database might be better in the first instance. Depending on the details it might also be a case where using the Entity Attribute Value (EAV) model is appropriate as well.
We are being given the data in the wide form from another system. This excuse I will accept, because it is really being imposed as an external requirement. But! If you need to do this, then you will need to do the work of working out what all those many columns mean, and you may have to break the wide row down into constituent parts.
That’s addressed the “God Object” or “Wide table” smell. I’ve already mentioned the “Entity Attribute Value” (EAV) model a couple of times. I’m going to address why that may be give rise to a bad smell in the next post.
…I know I have. If you read the blurb being written by all and sundry (and now including me) you would be forgiven for imagining that the entire world either lives with its head in the cloud, or is considering doing so in the near future.
As I allow myself a limited budget of both time and money for “education” I’m careful what I spend it on. Last Thursday (24th October 2013) I went to the “Cloud Success Roadshow” in Limerick. It was well worth my investment in time.
The roadshow is run by a company called Let’s Operate (http://www.letsoperate.com/) and although they show you their products, they definitely didn’t go in for the hard sell.
The roadshow reminded me of the factors I need to consider about my “IT strategy” as a whole. In a lot of cases the issue is not finding out what the characteristics of the “Cloud Option” will be, but finding out matching characteristics of the alternative. For example:
“The Cloud will cost x” (per user/month), but how much does my server actually cost me?
For that matter, how much is all that data worth (to me)?
Thinking properly about the Cloud will almost certainly make you think very hard about the speed and quality of your broadband connection and your dependence on it. If, like me, you live out in the country and work from home some of the time then this is important. Never mind the quality, a year ago, I lost my broadband when some “eejit” demolished a telegraph pole just down the road! I was sent scurrying down the road to borrow a connection from an acquaintance in order to send a vital eMail reply. I’ve done something about that.
There are still a couple of stops planned on the roadshow. If you live in Dublin or Belfast I suggest you consider spending an afternoon there if you have the time.
I’ve been busy with a project, I’ve finally got round to writing this a week later than I intended…
In a recent conversation, someone pointed out that people sometimes remove “constraints” from a database in order to improve performance. This made me ask myself:
Is this a good thing, or a bad thing?
I have to admit that this is a technical change that I have considered in the past. Never-the-less, I have mixed feelings about it.
After some thought, my opinion is:
For many situations a constraint is redundant. The fundamental structure of many applications means they are unlikely to create orphan rows.
The cost of the constraint is in the extra processing it causes during update operations. This cost is incurred every time a value in the constrained column is updated.
The benefit of a constraint is that it absolutely protects the constrained column from rogue values. This may be particularly relevant if the system has components (such as load utilities or interfaces with other systems) which by-pass the normal business transactions.
Other benefits of constraints are that they unequivocally state the “intention” of a relationship between tables and they allow diagramming tools which navigate the relationships to “do their thing”. Constraints provide good documentation, which is securely integrated with the database itself.
The costs of constraints are small, but constant and in the immediate term.
The benefits of constraints are avoiding a potentially large cost, but all in the future.
It’s the old “insurance” argument. Make the decision honestly based on a proper assessment of the real risk and your attitude to taking risks. Be lucky!
More Detailed Argument
For those who don’t just want to take my word for it. Here is a more detailed argument.
Let’s take the “business data model” of a pretty normal “selling” application.
When we perform the activities “Take Order” (maybe that should be “Take ORDER”), or “Update Order”
we create or update the ORDER and ORDER_LINE entities, and
in addition we refer to PRODUCT (to get availability and Price) and presumably to the CUSTOMER entity which isn’t shown on the diagram.
When I translate this into a Logical data model, I impose an additional rule “Every ORDER must contain at least 1 ORDER_LINE”. The original business model doesn’t impose this restriction.
Remember some people do allow ORDERs with no ORDER_LINES. They usually do it as part of a “reservation” or “priority process” which we are not going to try and have here.
When the transaction which creates the ORDER and ORDER_LINE makes it’s updates, then it will have read CUSTOMER and ORDER, so it is unlikely to produce orphan records, with or without constraints.
On the other hand, by having the constraints we can document the relationships in the database (so that a diagramming tool can produce the ERD diagram (really I suppose that should be “Table Relationship Diagram”)).
I am left wondering whether it would be possible or desirable to enforce my “Every ORDER must contain at least 1 ORDER_LINE” rule. I’ll think about that further. (Note to self: Can this be represented as a constraint which does not impose unnecessary and unintended restrictions on creating an ORDER?)
If we don’t have constraints and we have something other than our transaction which is allowed to create ORDERs and/or ORDER_LINEs (As I said, typically this would be an interface with another system or some kind of bulk load), we have no way of knowing how reliably it does it’s checking, and we might be allowing things we really do not want into our system. Constraints would reject faulty records and the errors they created (or “threw”) could be trapped by the interface.