Requirements Smells make SQL Smells!

 

Your SQL smells because your Requirements smell!

Recently I read an article by Phil Factor on the subject of “SQL Smells”. Phil (apparently not his real name) identifies a number of “smells” which he thinks indicate that a database design or SQL code needs to be reviewed. He classifies some of these as “Problems with Database Design”. I would go further and say some of them are problems with database requirements! In other words, your SQL smells because your Requirements smell!

“Requirements Smells cause SQL smells!”

I no longer claim to be a “Developer” and I have never claimed to be a DBA (Database Administrator), though I have found myself in the position of being an “accidental DBA”. The thought that Requirements could smell bad concerned me.

This realisation made me think about problems with Requirements in general and problems with databases in particular. It is better to avoid a problem rather than cure it, so I’m writing a series of blog posts on how to recognise problems in Requirements and prevent them from becoming “SQL Smells”.

Database design and SQL smells

Any computer system contains a “model” of the world it works with. This model forms the foundations of the system. If the system does not contain a concept, then it cannot work with it!

A simplified database design process

When people start to create a system they have to decide what concepts their system needs. This is the “Conceptual Model”. This model is transformed through a “Logical Model” until it finally becomes the “Physical Model”, which is the design for the database. The Conceptual and Logical models are not just first-cut versions of the Physical Model: different design decisions and compromises are made at each stage.
This has nothing to do with “Waterfall”, “Agile” or any other specific development process. In fact, this approach is pretty universal, whether it is followed formally or not. Some people combine the different stages, but there are risks in doing that.

A simple way of looking at the Conceptual Model is to say that it is concerned with finding out:

  • What “things” the business and the system need: at the conceptual stage these are known as “Entities”
  • What we need to know about those things: these are the “Attributes” of the Entities
  • We also need to document “Business Rules”: some of these will be represented as “Relationships”.

During the design and development process:

  • Entities will tend to become table definitions
  • Attributes will become the columns within those tables
  • Business Rules may become so-called “constraints”.
Different Requirements affect different aspects of the database
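To make that mapping concrete, here is a minimal sketch in SQL. The names (CUSTOMER, credit_limit and so on) are purely illustrative, not taken from any real system; your Physical Model will make its own choices.

    -- One Entity becomes a table, its Attributes become columns,
    -- and a Business Rule becomes a constraint.
    CREATE TABLE CUSTOMER (                      -- Entity: CUSTOMER
        customer_id   INTEGER PRIMARY KEY,       -- identifying Attribute
        customer_name VARCHAR(100) NOT NULL,     -- Attribute
        credit_limit  DECIMAL(10,2)              -- Attribute
            CHECK (credit_limit >= 0)            -- Business Rule: "a credit limit cannot be negative"
    );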

A poor Conceptual Model or bad design decisions can lead to systems which are difficult to build, maintain and use, and which do not perform well either. Once again,

“Requirements Smells will cause SQL Smells”

The idea of “smells” can help us address potential problems earlier and more cheaply.

Where are these “Requirements smells”?

I’m going to group my bad smells in a slightly different way to Phil Factor’s. I primarily work as a Business Analyst, so I am going to concentrate on “smells” to look for at the Conceptual and Logical stages of specifying the Requirements for a database, starting with the smell that Phil describes as “The God Object”!

Will losing constraints set you free?

I’ve been busy with a project, so I’ve finally got round to writing this a week later than I intended…
In a recent conversation, someone pointed out that people sometimes remove “constraints” from a database in order to improve performance. This made me ask myself:
Is this a good thing, or a bad thing?
I have to admit that this is a technical change that I have considered in the past. Nevertheless, I have mixed feelings about it.
After some thought, my opinion is:
  • For many situations a constraint is redundant. The fundamental structure of many applications means they are unlikely to create orphan rows.
  • The cost of the constraint is in the extra processing it causes during update operations. This cost is incurred every time a value in the constrained column is updated.
  • The benefit of a constraint is that it absolutely protects the constrained column from rogue values. This may be particularly relevant if the system has components (such as load utilities or interfaces with other systems) which by-pass the normal business transactions.
  • Other benefits of constraints are that they unequivocally state the “intention” of a relationship between tables and they allow diagramming tools which navigate the relationships to “do their thing”. Constraints provide good documentation, which is securely integrated with the database itself.
In short:
  • The costs of constraints are small, but constant and in the immediate term.
  • The benefits of constraints are avoiding a potentially large cost, but all in the future.
It’s the old “insurance” argument. Make the decision honestly based on a proper assessment of the real risk and your attitude to taking risks. Be lucky!

More Detailed Argument

For those who don’t just want to take my word for it, here is a more detailed argument.
Let’s take the “business data model” of a pretty normal “selling” application.
When we perform the activities “Take Order” (maybe that should be “Take ORDER”) or “Update Order”:
  • we create or update the ORDER and ORDER_LINE entities, and
  • in addition we refer to PRODUCT (to get availability and Price) and presumably to the CUSTOMER entity which isn’t shown on the diagram.
When I translate this into a Logical data model, I impose an additional rule “Every ORDER must contain at least 1 ORDER_LINE”. The original business model doesn’t impose this restriction.
Remember, some people do allow ORDERs with no ORDER_LINEs. They usually do it as part of a “reservation” or “priority” process, which we are not going to have here.
When the transaction which creates the ORDER and ORDER_LINE makes its updates, it will already have read CUSTOMER and PRODUCT, and will have created or read the parent ORDER, so it is unlikely to produce orphan records, with or without constraints.
On the other hand, by having the constraints we can document the relationships in the database, so that a diagramming tool can produce the ERD (really, I suppose, that should be a “Table Relationship Diagram”).
I am left wondering whether it would be possible or desirable to enforce my “Every ORDER must contain at least 1 ORDER_LINE” rule. I’ll think about that further. (Note to self: can this be represented as a constraint which does not impose unnecessary and unintended restrictions on creating an ORDER?)
If we don’t have constraints, and we have something other than our transaction which is allowed to create ORDERs and/or ORDER_LINEs (as I said, typically an interface with another system or some kind of bulk load), we have no way of knowing how reliably it does its checking, and we might be allowing things we really do not want into our system. Constraints would reject faulty records, and the errors they raised (or “threw”) could be trapped by the interface.
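To make the argument concrete, here is a hedged sketch of what those constraints might look like. The table and column names are illustrative only (“ORDER” is a reserved word in SQL, so I’ve called the table CUSTOMER_ORDER), and it assumes CUSTOMER and PRODUCT tables already exist.

    -- Foreign keys state the intention of the relationships and reject
    -- orphan rows, however they arrive (transaction, interface or bulk load).
    CREATE TABLE CUSTOMER_ORDER (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES CUSTOMER (customer_id)
    );

    CREATE TABLE ORDER_LINE (
        order_id   INTEGER NOT NULL REFERENCES CUSTOMER_ORDER (order_id),
        line_no    INTEGER NOT NULL,
        product_id INTEGER NOT NULL REFERENCES PRODUCT (product_id),
        quantity   INTEGER NOT NULL CHECK (quantity > 0),
        PRIMARY KEY (order_id, line_no)
    );

    -- A load utility or interface that tries to insert a line for an
    -- ORDER which does not exist gets an error, which it can trap:
    INSERT INTO ORDER_LINE (order_id, line_no, product_id, quantity)
    VALUES (999999, 1, 42, 6);   -- rejected if order 999999 is not there

Note that these foreign keys say nothing about my “at least 1 ORDER_LINE” rule; a plain foreign key cannot enforce that, which is partly why I am still pondering it.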

 

How to “dig into a large database”?

This posting was prompted by a question on one of the BA forums on Linked In. The original question was:

 

I gave an answer in the forum, but here is an expansion, and some ponderings.

First of all, let’s set out the terms of reference. The question asks about a “large unfamiliar” database. I think we can assume that “unfamiliar” is “one that we haven’t encountered before”, but what is “LARGE”? To me “large” could be:  
  •  Lots of tables
  • Many terror-bytes 😉
  • Lots of transactions
  • Lots of users
  • There may be other interpretations

I’m going to go with “Lots of tables”, with the definition of “lots of” being:

“more than I can conveniently hold in my head at one time”

I’ve also assumed that we are working with a “transactional database” rather than a “data warehouse”.
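As a quick sanity check on “lots of tables”, most relational databases will simply tell you how many base tables they contain. Here is a sketch, assuming a database with an information_schema catalog (PostgreSQL, SQL Server, MySQL and others have one):

    -- Roughly how big is "large"? Count the base tables.
    SELECT COUNT(*) AS table_count
    FROM   information_schema.tables
    WHERE  table_type = 'BASE TABLE';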

Preparation

Gilian, the questioner, was given some good suggestions, which I summarised as “Collecting Information” or perhaps “Preparation”:
  • Understand objectives of “The Business”
  • Understand the objectives of “This Project” (Digging into the Database)
  • Collect relevant Organisation charts and find out who is responsible for doing what
  • Collect relevant Process Models for the business processes which use the database
  • Get hold of, generate, or otherwise create a data model (Entity Relationship Diagram or similar)

Of these, the one which is specific to working with a Database is the ERD. Having a diagram is an enormous help in visualising how the bits of the database interact.
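If there is no ERD and no tool to generate one, the declared foreign keys (where they exist!) give you the raw material for drawing one yourself. Here is a sketch, assuming a PostgreSQL-style information_schema; other databases expose similar catalog views under different names:

    -- List the parent/child relationships declared as foreign keys.
    SELECT tc.table_name   AS child_table,
           kcu.column_name AS child_column,
           ccu.table_name  AS parent_table,
           ccu.column_name AS parent_column
    FROM   information_schema.table_constraints tc
    JOIN   information_schema.key_column_usage kcu
           ON kcu.constraint_name = tc.constraint_name
    JOIN   information_schema.constraint_column_usage ccu
           ON ccu.constraint_name = tc.constraint_name
    WHERE  tc.constraint_type = 'FOREIGN KEY'
    ORDER  BY child_table, child_column;

Of course, this only works if the foreign key constraints were actually declared, which is another argument for keeping them (see “Will losing constraints set you free?” above).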

Chunking

For me, the next step is to divide the model into “chunks” containing groups of entities (or tables). This allows you to:  
  • Focus – on one chunk
  • Prioritise – one chunk may be more important or more interesting than another, or may need to be done before another
  • Estimate – chunks are different sizes
  • Delegate – you do that chunk, I’ll do this one
  • And generally “manage” the work: do whatever the project objectives require.

I would use several techniques to divide the database or model up into chunks. These techniques work equally well with logical and physical data models. It can be quite a lot of work if you have a large model. None of the techniques are particularly complicated, but they are a little tricky to explain in words.

Here is a list of techniques: 
  • Layering
  • Group around Focal Entities
  • Process Impact Groups
  • Realigning

Organise the Data Model

I cannot over-emphasise how important it is to have a well laid-out diagram. Some tools do it well, some do it less well. My preference is to have “independent things” at the top.

 

I’ve invented a business.
  • We take ORDERs from CUSTOMERs. 
  • Each ORDER consists of one or more ORDER_LINES and each line is for a PRODUCT.
  • We Deliver what the customer wants as DELIVERY CONSIGNMENTS. 
  • Each CONSIGNMENT contains one or more Batches of product (I haven’t got a snappy name for that).
  • We know where to take the consignment by magic, because we don’t have an Address for the Customer!
  • We reconcile quantities delivered against quantities ordered, because we sometimes have to split an order across several deliveries.
  • That’s it!

Layering

“Layering” involves classifying the entities or groups of entities as being about:
  • Classifications
  • Things
  • Transactions
  • Reconciliations

Things

Let’s start with “Things”. Things can be concrete or abstract. We usually record a “Thing” because it is useful in doing our business. Examples of Things are:
  • People
  • Organisations
  • Products
  • Places
  • Organisation Units (within our organisation, or somebody else’s)

Classifications

Every business has endless ways of classifying “Things” or organising them into hierarchies. I just think of them as fancy attributes of the “Things” unless I’m studying them in their own right.
Note: “Transactions” can have classifications too (in fact almost anything can and does), I’ve just omitted them from the diagram!
Note: The same structure of “Classification” can apply to more than one thing. This makes sense if, for example, the classification is a hierarchy of “geographic area”. Put it in an arbitrary place, note that it belongs in other places as well, and move on!

Transactions

Transactions  are what the business is really interested in. They are often the focus of Business Processes.
  • Order
  • Delivery
  • Booking

 

Where there are parts of Transactions (e.g. Order_Line), keep the child with the parent.

Reconciliations

“Reconciliations” (between Transactions) occur when something is “checked against something else”. In this case we are recording that “6 widgets have been ordered” and that “3 (or 6) have been delivered”.
If you use these “layers”, arranged as in the diagram, you will very likely find that the “one-to-manys” point from the top of the page (the “one” end) down the page (to the “many” end).

Groups around Focal Entities

To do this, pick an entity which is a “Thing” or a “Transaction” then bring together the entities which describe it, or give more detail about it. Draw a line round it, give it a name, even if only in your head!
  • “Customer and associated classifications” and
  • “Order and Order_line” are candidate groups.

 

Process Impact Groups

To create a “Process Impact Group”
  • Select a business process
  • Draw lines around the entities which it creates, updates and refers to as part of doing its work.
  • You should get a sort of contour map on the data model. 

 

In my example the processes are: 
  • Place Order
  • Assemble Delivery Consignment
  • Confirm Delivery (has taken place)

 

It is normal for there to be similarities between “Process Impact Groups” and “Focal Entity Groups”.  In fact, it would be unusual if there were not similarities!

Realigning

Try moving parts under headers (so, Order_line under Order) and reconciliations under the transaction which causes them. In the diagram, I’ve moved “Delivered Order Line” under “Delivery”, because it’s created by “Delivery related processes” rather than when the Order is created.

Finally, “Chunking”

Based on the insights you have gained from the above, draw a boundary around your “chunks”.
The various techniques are mutually supportive, not mutually exclusive. The chunks are of arbitrary size. If it is useful, you can: 
  • combine neighbouring chunks together, or
  • use the techniques (especially “Focal Entities” and “Process Impact Groups”) to break them down until you reach a single table/entity.

 

Tools

My preferred tools for doing this are: a quiet conference room, late at night; the largest whiteboard I can find; lots of sticky Post-its or file cards (several colours); a pack of whiteboard pens; black coffee on tap and the prospect of beer when I’ve finished. For a large database (several hundred tables) it can take several days!
Once you’ve done all this, all you have to do is do the work!

 

I hope you have as much fun doing it as I had writing about it! 🙂

Another video – Splitting an Access database

When I’m making something I sometimes learn new things. They say “you should never stop learning”. I agree with “them”, whoever they may be.

While I was working on the SOPAG project, I investigated “splitting” the Access database into two components:

  • Logic and presentation, and
  • Data (database definitions and data values)

I knew this could be done, but I had not spent much effort on it before.

The splitting itself was a straightforward enough exercise. Most of the work is done by Access itself. However, you might want to confirm that all the decisions it has made are sensible!

I decided to document the results for my own benefit, and then decided to convert a scrappy Powerpoint presentation into something a little more presentable to upload to YouTube.

Here it is: Splitting an Access database

I had fun doing the work to find out how it worked, and fun making the video. I hope you get something from watching it.