Problems with “God Objects” or very wide tables


One of the “SQL smells” Phil Factor identifies in his article is the presence of “God Objects” in your database design. I agree with him, except that I would call them “very wide tables”. If you find them, then you may have a problem with the Conceptual Model you are using, or possibly you should be considering a different tool. In other words, you have a problem with your requirements. You have a “Requirements Smell”.

How many columns make a “God Object” or wide table?


Let’s start with the obvious question: how many columns make a “God Object” or wide table? The maximum number of columns you are allowed to have in a table varies with the database manager. For example:

  • SQL Server allows 1,024 columns in an ordinary table (30,000 in a “wide table”)
  • Oracle allows 1,000
  • PostgreSQL allows up to 1,600
  • MySQL has a hard limit of 4,096, though storage engines may impose lower limits (InnoDB allows about 1,000)

What the actual numbers are can depend on a lot of technical things. One hundred is still a big number.

Database management software will handle wide tables up to their limits. As with most things, when you approach the limit you will start to encounter difficulties, but that is missing the point. Even 100 columns may indicate a problem.

Why are “God Objects” or wide tables a problem?

Wide tables or "God objects" are a problem which originates in the Conceptual Model

The reasons why “God Objects” or wide tables cause an SQL Smell are technical, practical and what you might term business, or even philosophical, problems. I’m a Business Analyst, so I’m going to start from the “Conceptual” end, with the Requirements for the database, and then look at the problems which these tables may cause in Development and then when the system is in operation. Also remember that if we eliminate problems at the conceptual end, then we’re not going to encounter them further on. Wide tables are most certainly a problem which starts at the “Conceptual Model” stage.

“Conceptual Model” or philosophical problems

Each row in a relational table is supposed to represent something. The “something” may be a concrete object in the real world, or it may be something abstract like a contract or a transaction. Would you be able to explain to the users of your system, or your business owners what a single row represents? If not, you are likely to encounter problems.

Thinking about the columns in this wide table, each column contains a value. How are you going to present or update those values? 1000 fields would make for a very busy screen. Even some sort of graphical representation is likely to be complex. Do your users really need to see all this data together? While there isn’t a rule which says that the whole of an entity has to be presented on a single screen, or as a single report, it has to represent something. Finally, every column in a row provides one value for one thing at one time. Is that really so in your wide table?

Problems during development

“God objects” or wide tables encourage handling one big lump of data. That in turn is going to encourage the creation of complicated code. Maybe life would be easier for everyone if the data and the process descriptions were much more focused.

If you are in an Analyst role, then think about how you are going to explain what should (and should not) be happening with all these columns.

Remember, SQL tables have no concept of “grouping” of the columns. The columns have an order, but it is not something you should be relying on. If you can form columns into groups, then you should probably consider “normalizing” them into other tables.
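To illustrate (a minimal sketch, with invented table and column names), a repeating group of telephone columns might be “normalized” out of a wide customer table like this:

-- Before: a wide table with a repeating group of columns
create table CUSTOMER_WIDE (
    customer_id  int primary key,
    name         varchar(100) not null,
    phone_home   varchar(20) null,
    phone_work   varchar(20) null,
    phone_mobile varchar(20) null
);

-- After: the repeating group normalized into a child table
create table CUSTOMER (
    customer_id  int primary key,
    name         varchar(100) not null
);

create table CUSTOMER_PHONE (
    customer_id  int not null references CUSTOMER (customer_id),
    phone_type   varchar(10) not null,   -- 'HOME', 'WORK', 'MOBILE'
    phone_number varchar(20) not null,
    primary key (customer_id, phone_type)
);

Notice that the child table can hold as many (or as few) phone numbers as each customer actually has, with no NULL padding.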

Problems in operation

“God objects” or wide tables can cause problems when the system is being used. The volume of data each row contains may cause performance problems when rows are read from the table, when rows are updated and when new rows are created.

Why do we get “God objects”?

Wide tables often start from trying to convert large and complex paper forms or spreadsheets straight into table designs. It seems like a good idea at first, but it can get bogged down in unexpected complexity.

Think about your least favourite paper form, especially if it runs to several pages – maybe it’s a tax return or something similar. Obviously the physical form represents something. If you were specifying a system to work with it, then you would be tempted to have a single table where each row represented a single form, there was a column for every question and each cell contained one person’s answer to a question. It would be just like an enormous spreadsheet. Some early commercial computer systems were like that. They worked but they were inflexible.

One clue that something is going wrong (apart from the number of columns) is the number of columns which need to allow “NULL” values. How many times does “Not Applicable” appear when you are filling in the paper form?
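If you want to sniff for this in an existing database, here is a rough query (using SQL Server’s system views; the names vary in other database managers) which counts the columns, and the nullable columns, in each table:

select tab.name as table_name,
       count(*) as column_count,
       sum(case when col.is_nullable = 1 then 1 else 0 end) as nullable_columns
from sys.tables as tab
join sys.columns as col on col.object_id = tab.object_id
group by tab.name
order by column_count desc;

Tables which appear near the top of the result, with a high proportion of nullable columns, are worth a closer look.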

How do we solve the problem of the wide table?

The answer is to think about what all these columns mean and then start applying Data Modelling or normalization techniques to break the data into more manageable and useable chunks. If you can form groups of columns, then those groups may be candidate entities and therefore candidate tables.

If you find yourself relying on the order or repetition of similar columns, then maybe you should be considering a different table design like the “Entity Attribute Value” (EAV) pattern. But beware, because that can give rise to a bad smell too!
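For reference, the EAV shape looks something like this (a minimal sketch, with invented names):

create table ENTITY_ATTRIBUTE_VALUE (
    entity_id  int not null,          -- which "thing" this row describes
    attribute  varchar(50) not null,  -- the name of one property
    value      varchar(100) null,     -- that property's value, stored as text
    primary key (entity_id, attribute)
);

Notice that every value ends up in a single, loosely-typed column – a hint of the smell to come.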

Excuses for “God Objects” and wide tables

Nothing in Information Technology is ever clear-cut. There are usually grey areas. One person may regard a table as too wide and another may regard it as OK. There is always room for some discussion. There are times when using a table that is a little wider than we would normally like is acceptable. Here are some of the reasons (or maybe that should be excuses) that you may hear for wide tables.

  • It gets all the work done in one place, so that other programs can use the data. I don’t really buy this one. I suspect that someone is guessing what these other programs need. If the guess is wrong then someone is going to have to re-design the big, wide table. I continue to maintain that having discrete data and performing discrete actions is better.
  • Here is a specific case I found where someone wanted to retrieve data from 2000 sensors. This is a case where using something other than a relational database might be better in the first instance. Depending on the details it might also be a case where using the Entity Attribute Value (EAV) model is appropriate as well.
  • We are being given the data in the wide form from another system. This excuse I will accept, because it is really being imposed as an external requirement. But! If you need to do this, then you will still need to work out what all those many columns mean, and you may have to break the wide row down into constituent parts.

Where next?

That’s addressed the “God Object” or “Wide table” smell. I’ve already mentioned the “Entity Attribute Value” (EAV) model a couple of times. I’m going to address why that may give rise to a bad smell in the next post.

Requirements Smells make SQL Smells!

 

Your SQL smells because your Requirements smell!

Recently I read an article by Phil Factor on the subject of “SQL Smells”. Phil (apparently not his real name) identifies a number of “smells” which he thinks indicate that a database design or SQL code needs to be reviewed. He classifies some of these as “Problems with Database Design”. I would go further and say some of them are problems with database requirements! In other words, your SQL smells because your Requirements smell!

“Requirements Smells cause SQL smells!”

I no longer claim to be a “Developer” and I have never claimed to be a DBA (Database Administrator), though I have found myself in the position of being an “accidental DBA”. The thought that Requirements could smell bad concerned me.

This realisation made me think about problems with Requirements in general and problems with databases in particular. It is better to avoid a problem rather than cure it, so I’m writing a series of blog posts on how to recognise problems in Requirements and prevent them from becoming “SQL Smells”.

Database design and SQL smells

Any computer system contains a “model” of the world it works with. This model forms the foundations of the system. If the system does not contain a concept, then it cannot work with it!

A simplified database design process

When people start to create a system they have to decide what concepts their system needs. This is the “Conceptual Model”. This model is transformed through a “Logical Model” until it finally becomes the “Physical Model”, which is the design for the database. The Conceptual and Logical models are not just first-cut versions of the Physical Model; different design decisions and compromises are made at each stage.
This has nothing to do with “Waterfall”, “Agile” or any specific development process. In fact, this approach is pretty universal, whether it is applied formally or not. Some people combine the different stages, but there are risks in doing that.

A simple way of looking at the Conceptual Model is to say that it is concerned with finding out:

  • What things the business and the system need to deal with: at the conceptual stage these are known as “Entities”
  • What we need to know about those things: these are the “Attributes” of the Entities
  • We also need to document “Business Rules”: some of these will be represented as “Relationships”.

During the design and development process:

  • Entities will tend to become table definitions
  • Attributes will become the columns within those tables
  • Business Rules may become so-called “constraints”.
Different Requirements affect different aspects of the database
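As a small illustration of that mapping (a sketch only, with invented names), an Entity, its Attributes and a Business Rule might come out like this in SQL:

-- The CUSTOMER entity becomes a table; its attributes become columns
create table CUSTOMER (
    customer_id  int primary key,
    name         varchar(100) not null
);

-- The business rule "every order is placed by a customer"
-- becomes a foreign key constraint
create table CUSTOMER_ORDER (
    order_id     int primary key,
    customer_id  int not null,
    order_date   date not null,
    constraint FK_Order_Customer
        foreign key (customer_id) references CUSTOMER (customer_id)
);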

A poor Conceptual Model or bad design decisions can lead to systems which are difficult to build, maintain and use, and which do not perform well either. Once again,

“Requirements Smells will cause SQL Smells”

The idea of “smells” can help us address potential problems earlier and more cheaply.

Where are these “Requirements smells”?

I’m going to group my bad smells in a slightly different way to Phil Factor. I primarily work as a Business Analyst, so I am going to concentrate on “smells” to look for at the Conceptual and Logical Stages of specifying the Requirements for a database, starting with the smell that Phil describes as “The God Object”!

Will losing constraints set you free?

I’ve been busy with a project, so I’ve finally got round to writing this a week later than I intended…
In a recent conversation, someone pointed out that people sometimes remove “constraints” from a database in order to improve performance. This made me ask myself:
Is this a good thing, or a bad thing?
I have to admit that this is a technical change that I have considered in the past. Nevertheless, I have mixed feelings about it.
After some thought, my opinion is:
  • For many situations a constraint is redundant. The fundamental structure of many applications means they are unlikely to create orphan rows.
  • The cost of the constraint is in the extra processing it causes during update operations. This cost is incurred every time a value in the constrained column is updated.
  • The benefit of a constraint is that it absolutely protects the constrained column from rogue values. This may be particularly relevant if the system has components (such as load utilities or interfaces with other systems) which by-pass the normal business transactions.
  • Other benefits of constraints are that they unequivocally state the “intention” of a relationship between tables and they allow diagramming tools which navigate the relationships to “do their thing”. Constraints provide good documentation, which is securely integrated with the database itself.
In short:
  • The costs of constraints are small, but constant and in the immediate term.
  • The benefits of constraints are avoiding a potentially large cost, but all in the future.
It’s the old “insurance” argument. Make the decision honestly based on a proper assessment of the real risk and your attitude to taking risks. Be lucky!

More Detailed Argument

For those who don’t just want to take my word for it, here is a more detailed argument.
Let’s take the “business data model” of a pretty normal “selling” application.
When we perform the activities “Take Order” (maybe that should be “Take ORDER”), or “Update Order”
  • we create or update the ORDER and ORDER_LINE entities, and
  • in addition we refer to PRODUCT (to get availability and Price) and presumably to the CUSTOMER entity which isn’t shown on the diagram.
When I translate this into a Logical data model, I impose an additional rule “Every ORDER must contain at least 1 ORDER_LINE”. The original business model doesn’t impose this restriction.
Remember, some people do allow ORDERs with no ORDER_LINEs. They usually do it as part of a “reservation” or “priority” process, which we are not going to have here.
When the transaction which creates the ORDER and ORDER_LINE makes its updates, it will have read CUSTOMER and ORDER, so it is unlikely to produce orphan records, with or without constraints.
On the other hand, by having the constraints we can document the relationships in the database (so that a diagramming tool can produce the ERD diagram (really I suppose that should be “Table Relationship Diagram”)).
I am left wondering whether it would be possible or desirable to enforce my “Every ORDER must contain at least 1 ORDER_LINE” rule. I’ll think about that further. (Note to self: Can this be represented as a constraint which does not impose unnecessary and unintended restrictions on creating an ORDER?)
If we don’t have constraints and we have something other than our transaction which is allowed to create ORDERs and/or ORDER_LINEs (as I said, typically this would be an interface with another system or some kind of bulk load), we have no way of knowing how reliably it does its checking, and we might be allowing things we really do not want into our system. Constraints would reject faulty records and the errors they created (or “threw”) could be trapped by the interface.
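To make the trade-off concrete, here is a sketch of the constraint in question (SQL Server syntax, using the invented names from the example; ORDER has to be written in brackets because it is a reserved word):

-- The constraint which protects ORDER_LINE from orphan rows
alter table ORDER_LINE
    add constraint FK_OrderLine_Order
        foreign key (order_id) references [ORDER] (order_id);

-- Dropping it buys slightly cheaper updates, at the price of that protection
alter table ORDER_LINE
    drop constraint FK_OrderLine_Order;

-- Re-adding it later; WITH CHECK makes SQL Server validate existing rows,
-- so any orphans which crept in while it was gone will be reported
alter table ORDER_LINE
    with check add constraint FK_OrderLine_Order
        foreign key (order_id) references [ORDER] (order_id);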

 

What if you don’t have a Data Model?

My previous post got me thinking. One thing leads to another as they say. The whole of my process really requires that you have a Data Model (aka Entity Relationship Diagram/Model and several other names). But what do you do if you don’t have one, and can’t easily create one?

What’s the problem?

Suppose you have a database which has been defined without foreign key constraints. The data modelling tools use these constraints to identify the relationships they need to draw. The result is that any modelling tool is likely to produce a data model which looks like the figure below. This is not very useful!
Faced with this, some people will despair or run away! This is not necessary. Unless the database has been constructed in a deliberately obscure and perverse way (much rarer than some developers would have you believe) then it is usually possible to make sense of what is there. Remember, you have the database itself, and that is very well documented! Steve McConnell would point out that the code is the one thing that you always have (and in the case of a database, it always matches what is actually there!).
To do what I propose you will need to use the “system tables” which document the design of the database. You will need to know (or learn) how to write some queries, or find an assistant who understands SQL and Relational Databases. I’ve used MS SQL Server in my examples, but the actual names vary between database managers. For example, I seem to remember that in IBM Db2 it’s SYSIBM.SYSTABLES. You will have to use the names appropriate for you.

The method

“Method” makes this sound more scientific than it is, but it still works!
  1. Preparation: Collect any documentation
  2. “Brainstorm:” Try to guess the tables/entities you will find in the database.
  3. List the actual Tables
  4. Group the Tables: based on name
  5. For each “chunk”, do the following:
    1. Identify “keys” for tables: Look for Unique indexes.
    2. Identify candidate relationships: Based on attribute names, and non-unique indexes.
    3. Draw your relationships.
    4. “Push out” any tables that don’t fit.
    5. Move on to the next group.
  6. When you’ve done all the groups, look for relationships from a table in one group to a table in another.
  7. Now try and bring in tables that were “pushed out”, or are in the “Miscellaneous” bucket.
  8. Repeat until you have accounted for all the tables.

 

At this point you are probably ready to apply the techniques I described in my previous post (if you haven’t been using them already). You might also consider entering what you have produced into your favourite database modelling tool.

The method stages in more detail.

Preparation:

Collect whatever documentation you have for the system as a whole: Use Cases, Menu structures, anything! The important thing is not detail, but to get an overview of what the system is supposed to do.

“Brainstorm:”

Based on the material above, try to guess the tables/entities you will find in the database. Concentrate on the “Things” and “Transactions” categories described in my previous post.
Don’t spend ages doing this. Just long enough so you have an expectation of what you are looking for.
Remember that people may use different names for the same thing e.g. ORDER may be PURCHASE_ORDER, SALES_ORDER or SALE.
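A quick way to test those guesses against the real database is to search the table names for likely fragments, for example:

select name
from sys.tables
where name like '%ORDER%'
   or name like '%SALE%';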

List the tables:

select name, object_id, type, type_desc, create_date, modify_date
from sys.tables;
(try sys.views as well)

Group the tables

Group the tables based on name: ORDER, ORDER_ITEM, SALES_ORDER and PURCHASE_ORDER would all go together.
Break the whole database into a number of “chunks”. Aim for each chunk to have say 10 members, but do what seems natural, rather than forcing a particular number. Expect to have a number of tables left over at the end. Put them in a “Miscellaneous” bucket.
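If the tables use underscore-separated names, a rough starting point for grouping (a sketch – adjust it to your naming convention) is to sort the tables by the first segment of each name:

select name,
       left(name, charindex('_', name + '_') - 1) as name_prefix
from sys.tables
order by name_prefix, name;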

Identify the candidate keys and foreign keys from the indexes

select tab.name, idx.index_id, idx.name, idx.is_unique
from sys.indexes as idx
join sys.tables as tab on tab.object_id = idx.object_id
where tab.name like '%site%';
 
select tab.name, col.column_id, col.name
from sys.columns as col
join sys.tables as tab on col.object_id = tab.object_id
where tab.name like '%PhoneBook%'
order by 1, 2;
 

From attribute names and indexes, identify relationships

 

  • Sometimes the index names are a give-away (FK_…)
  • Sometimes you have to look for similar column names (ORDER_LINE.ORDER_ID → ORDER.ID) – see the query sketch after this list
  • Multi-part indexes sometimes indicate hierarchies.
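Here is that rough query (a sketch only – it assumes the common convention of naming a foreign key column after the parent table plus “_ID”, and the usual case-insensitive collation):

select tab.name as referencing_table,
       col.name as referencing_column,
       ref.name as probable_parent_table
from sys.columns as col
join sys.tables as tab on tab.object_id = col.object_id
join sys.tables as ref on col.name = ref.name + '_ID'
where tab.object_id <> ref.object_id   -- skip a table's own key column
order by ref.name, tab.name;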

 

Push Out

Look for “Inter-group” Relationships

 

How to “dig into a large database”?

This posting was prompted by a question on one of the BA forums on Linked In. The original question was:

 

I gave an answer in the forum, but here is an expansion, and some ponderings.

First of all, let’s set out the terms of reference. The question asks about a “large unfamiliar” database. I think we can assume that “unfamiliar” is “one that we haven’t encountered before”, but what is “LARGE”? To me “large” could be:  
  •  Lots of tables
  • Many terror-bytes 😉
  • Lots of transactions
  • Lots of users
  • There may be other interpretations

I’m going to go with “Lots of tables”, with the definition of “lots of” being:

“more than I can conveniently hold in my head at one time”

I’ve also assumed that we are working with a “transactional database” rather than a “data warehouse”.

Preparation

Gilian, the questioner, was given some good suggestions, which I summarised as “Collecting Information” or perhaps “Preparation”:
  • Understand objectives of “The Business”
  • Understand the objectives of “This Project” (Digging into the Database)
  • Collect relevant Organisation charts and find out who is responsible for doing what
  • Collect relevant Process Models for the business processes which use the database
  • Get hold of, generate, or otherwise create a data model (Entity Relationship Diagram or similar)

Of these, the one which is specific to working with a Database is the ERD. Having a diagram is an enormous help in visualising how the bits of the database interact.

Chunking

For me, the next step is to divide the model into “chunks” containing groups of entities (or tables). This allows you to:  
  • Focus – on one chunk
  • Prioritise – one chunk may be more important or more interesting than another, or may need to be done first
  • Estimate – chunks are different sizes
  • Delegate – you do that chunk, I’ll do this one
  • And generally “manage” the work: do whatever the project objectives require.

I would use several techniques to divide the database or model up into chunks. These techniques work equally well with logical and physical data models. It can be quite a lot of work if you have a large model. None of the techniques are particularly complicated, but they are a little tricky to explain in words.

Here is a list of techniques: 
  • Layering
  • Group around Focal Entities
  • Process Impact Groups
  • Realigning

Organise the Data Model

I cannot over-emphasise how important it is to have a well-laid-out diagram. Some tools do it well, some do it less well. My preference is to have “independent things” at the top.

 

I’ve invented a business.
  • We take ORDERs from CUSTOMERs. 
  • Each ORDER consists of one or more ORDER_LINES and each line is for a PRODUCT.
  • We Deliver what the customer wants as DELIVERY CONSIGNMENTS. 
  • Each CONSIGNMENT contains one or more Batches of product (I haven’t got a snappy name for that).
  • We know where to take the consignment by magic, because we don’t have an Address for the Customer!
  • We reconcile quantities delivered against quantities ordered, because we sometimes have to split an order across several deliveries.
  • That’s it!
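If it helps to see the invented business as tables, here is a sketch (the names are invented to match the description above, and ORDER is bracketed because it is a reserved word):

create table CUSTOMER (customer_id int primary key);
create table PRODUCT  (product_id  int primary key);

create table [ORDER] (
    order_id    int primary key,
    customer_id int not null references CUSTOMER (customer_id)
);

create table ORDER_LINE (
    order_id   int not null references [ORDER] (order_id),
    line_no    int not null,
    product_id int not null references PRODUCT (product_id),
    quantity   int not null,
    primary key (order_id, line_no)
);

create table CONSIGNMENT (
    consignment_id int primary key,
    customer_id    int not null references CUSTOMER (customer_id)
);

-- the reconciliation: what has been delivered against each order line
create table DELIVERED_ORDER_LINE (
    consignment_id     int not null references CONSIGNMENT (consignment_id),
    order_id           int not null,
    line_no            int not null,
    quantity_delivered int not null,
    primary key (consignment_id, order_id, line_no),
    foreign key (order_id, line_no) references ORDER_LINE (order_id, line_no)
);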

Layering

“Layering” involves classifying the entities or groups of entities as being about:
  • Classifications
  • Things
  • Transactions
  • Reconciliations

Things

Let’s start with “Things”. Things can be concrete or they can be abstract. We usually record a “Thing” because it is useful in doing our business. Examples of Things are:
  • People
  • Organisations
  • Products
  • Places
  • Organisation Units (within our organisation, or somebody else’s)

Classifications

Every business has endless ways of classifying “Things” or organising them into hierarchies. I just think of them as fancy attributes of the “Things” unless I’m studying them in their own right.
Note: “Transactions” can have classifications too (in fact almost anything can and does); I’ve just omitted them from the diagram!
Note: The same structure of “Classification” can apply to more than one thing. This makes sense if, for example, the classification is a hierarchy of “geographic area”. Put it in an arbitrary place, note that it belongs in other places as well, and move on!

Transactions

Transactions are what the business is really interested in. They are often the focus of Business Processes.
  • Order
  • Delivery
  • Booking

 

Where there are parts of Transactions (eg Order_Line), keep the child with the parent.

Reconciliations

“Reconciliations” (between Transactions) occur when something is “checked against something else”. In this case we are recording that “6 widgets have been ordered” and that “3 (or 6) have been delivered”.
If you use these “layers”, arranged as in the diagram, you will very likely find that the “one-to-manys” point from the top of the page (one) downwards (many).

Groups around Focal Entities

To do this, pick an entity which is a “Thing” or a “Transaction” then bring together the entities which describe it, or give more detail about it. Draw a line round it, give it a name, even if only in your head!
  • “Customer and associated classifications” and
  • “Order and Order_line” are candidate groups.

 

Process Impact Groups

To create a “Process Impact Group”
  • Select a business process
  • Draw lines around the entities which it creates, updates and refers to as part of doing its work.
  • You should get a sort of contour map on the data model. 

 

In my example the processes are: 
  • Place Order
  • Assemble Delivery Consignment
  • Confirm Delivery (has taken place)

 

It is normal for there to be similarities between “Process Impact Groups” and “Focal Entity Groups”.  In fact, it would be unusual if there were not similarities!

Realigning

Try moving parts under headers (so, Order_line under Order) and reconciliations under the transaction which causes them. In the diagram, I’ve moved “Delivered Order Line” under “Delivery”, because it’s created by “Delivery related processes” rather than when the Order is created.

Finally, “Chunking”

Based on the insights you have gained from the above, draw a boundary around your “chunks”.
The various techniques are mutually supportive, not mutually exclusive. The chunks are of arbitrary size. If it is useful, you can: 
  • combine neighbouring chunks together or
  • you can use the techniques (especially “Focal entities” and “Process Entity Groups”) to break them down until you reach a single table/entity.

 

Tools

My preferred tools for doing this are: a quiet conference room, late at night; the largest whiteboard I can find; lots of sticky Post-its or file cards (several colours); a pack of whiteboard pens; black coffee on tap and the prospect of beer when I’ve finished. For a large database (several hundred tables) it can take several days!
Once you’ve done all this, then all you have to do is do the work!

 

I hope you have as much fun doing it as I had writing about it! 🙂