September 2013 - Duhallow Grey Geek

What if you don’t have a Data Model?

My previous post got me thinking. One thing leads to another as they say. The whole of my process really requires that you have a Data Model (aka Entity Relationship Diagram/Model and several other names). But what do you do if you don’t have one, and can’t easily create one?

What’s the problem?

Suppose you have a database which has been defined without foreign key constraints. The data modelling tools use these constraints to identify the relationships they need to draw. The result is that any modelling tool is likely to produce a data model which looks like the figure below. This is not very useful!

Faced with this, some people will despair or run away! This is not necessary. Unless the database has been constructed in a deliberately obscure and perverse way (much rarer than some developers would have you believe) then it is usually possible to make sense of what is there. Remember, you have the database itself, and that is very well documented! Steve McConnellwould point out that the code is the one thing that you always have (and in the case of a database, it always matches what is actually there!).

To do what I propose you will need to use the “system tables” which document the design of the database. You will need to know (or learn) how to write some queries, or find an assistant who understands SQL and Relational Databases. I’ve used MS SQL Server in my examples, but the actual names vary between different database managers. For example: I seem to remember that in IBM DB/2 that’s Sysibm.systables. You will have to use the names appropriate for you.

The method

“Method” makes this sound more scientific than it is, but it still works!

Preparation: Collect any documentation
“Brainstorm:” Try to guess the tables/entities you will find in the database.
List the actual Tables
Group the Tables: based on name
For each “chunk”, do the following:
1. Identify “keys” for tables: Look for Unique indexes.
2. Identify candidate relationships: Based on attribute names, and non-unique indexes.
3. Draw your relationships.
4. “Push out” any tables that don’t fit.
5. Move on to the next group.
When you’ve done all the groups, look for relationships from a table in one group to a table in another.
Now try and bring in tables that were “pushed out”, or are in the “Miscellaneous” bucket.
Repeat until you have accounted for all the tables.

At this point you are probably ready to apply the techniques I described in my previous post (if you haven’t been using them already). You might also consider entering what you have produced into your favourite database modelling tool.

The method stages in more detail.

Preparation:

Collect whatever documentation you have for the system as a whole: Use Cases, Menu structures, anything! The important thing is not detail, but to get an overview of what the system is supposed to do.

“Brainstorm:”

Based on the material above, try to guess the tables/entities you will find in the database. Concentrate on the “Things” and “Transactions” categories described in my previous post.

Don’t spend ages doing this. Just long enough so you have an expectation of what you are looking for.

Remember that people may use different names for the same thing e.g. ORDER may be PURCHASE_ORDER, SALES_ORDER or SALE.

List the tables:

select name, object_id, type, type_desc, create_date, modify_date

from sys.tables

(try sys.Views as well)

Group the tables

Group the tables based on name: ORDER, ORDER_ITEM, SALES_ORDER, PURCHASE_ORDER and ORDER_ITEM would all go together.

Break the whole database into a number of “chunks”. Aim for each chunk to have say 10 members, but do what seems natural, rather than forcing a particular number. Expect to have a number of tables left over at the end. Put them in a “Miscellaneous” bucket.

Identify the candidate keys, and foreign keys from the indexes

select tab.name, idx.index_id, idx.name , idx.is_unique

from

sys.indexes as idx

join sys.tables as tab on tab.object_id = idx.object_id

where

tab.name like ‘%site%’;

select tab.name, col.column_id, col.name

from

sys.columns as col

Join sys.tables as tab on col.object_id = tab.object_id

where

tab.name like ‘%PhoneBook%’

order by 1, 2 ;

From attribute names and indexes, identify relationships

Sometimes the index names are a give-way (FK_…)
Sometimes you have to look for similar column names (ORDER_LINE.ORDER_ID à ORDER.ID)
Multi-part indexes sometimes indicate hierarchies.

Push Out

Look for “Inter.group” Relationships

How to “dig into a large database”?

This posting was prompted by a question on one of the BA forums on Linked In. The original question was:

“So as a business analyst, whats your procedure to begin investigating a very large, unfamiliar database and then draw your understanding?”

I gave an answer in the forum, but here is an expansion, and some ponderings.

First of all, let’s set out the terms of reference. The question asks about a “large unfamiliar” database. I think we can assume that “unfamiliar” is “one that we haven’t encountered before”, but what is “LARGE”? To me “large” could be:

Lots of tables
Many terror-bytes 😉
Lots of transactions
Lots of users
There may be other interpretations

I’m going to go with “Lots of tables” with the definition or “lots of” being:

“more than I can conveniently hold in my head at one time”

I’ve also assumed that we are working with a “transactional database” rather than a “data warehouse”.

Preparation

Gilian, the questioner was given some good suggestions, which I summarised as “Collecting Information” or perhaps “Preparation”:

Understand objectives of “The Business”
Understand the objectives of “This Project” (Digging into the Database)
Collect relevant Organisation charts and find out who is responsible for doing what
Collect relevant Process Models for the business processes which use the database
Get hold of, generate, or otherwise create a data model (Entity Relationship Diagram or similar)

Of these, the one which is specific to working with a Database is the ERD. Having a diagram is an enormous help in visualising how the bits of the database interact.

Chunking

For me, the next step is to divide the model into “chunks” containing groups of entities (or tables). This allows you to:

Focus – on one chunk
Prioritise – one chunk is more important, interesting or will be done before, another
Estimate – chunks are different sizes
Delegate – you do that chunk, I’ll do this one
And generally “Manage” the work; do whatever are the project objectives.

I would use several techniques to divide the database or model up into chunks. These techniques work equally well with logical and physical data models. It can be quite a lot of work if you have a large model. None of the techniques are particularly complicated, but they are a little tricky to explain in words.

Here is a list of techniques:

Layering
Group around Focal Entities
Process Impact Groups
Realigning

Organise the Data Model

I cannot over-emphasis how important it is to have a well-laid out diagram. Some tools do it well, some do it less well. My preference is to have “independent things” at the top.

I’ve invented a business.

We take ORDERs from CUSTOMERs.
Each ORDER consists of one or more ORDER_LINES and each line is for a PRODUCT.
We Deliver what the customer wants as DELIVERY CONSIGNMENTS.
Each CONSIGNMENT contains one or more Batches of product (I’ve haven’t got a snappy name for that).
We know where to take the consignment by magic, because we don’t have an Address for the Customer!
We reconcile quantities delivered against quantities ordered, because we sometimes have to split an order across several deliveries.
That’s it!

Layering

“Layering” involves classifying the entities or groups of entities as being about:

Classifications
Things
Transactions
Reconciliations

Things

Let’s start with “Things”. Things are can be concrete or they can be abstract. We usually record a “Thing” because it is useful in doing our business. Examples of Things are:

People
Organisations
Products
Places
Organisation Units (within our organisation, or somebody elses)

Classifications

Every business has endless ways of classifying “Things” or organising them into hierarchies. I just think of them as fancy attributes of the “Things” unless I’m studying them in their own right.

Note: “Transactions” can have classifications too (in fact almost anything can and does), I’ve just omitted them from the diagram!

Note: The same structure of “Classification” can apply to more than one thing. This makes sense if, for example, the classification is a hierarchy of “geographic area”. Put it in an arbitrary place, note that it belongs in other places as well, and move on!

Transactions

Transactions are what the business is really interested in. They are often the focus of Business Processes.

Order
Delivery
Booking

Where there are parts of Transactions (eg Order_Line) keep the child with the parent.

Reconciliations

Reconciliations” (between Transactions) occur when something is “checked against something else”. In this case we are recording that “6 widgits have been ordered” and that “3 (or 6) have been delivered”.

If you use these “layers”, arranged as in the diagram, you will very likely find that the “One-to-manys” point from the top (one) down (many) the page.

Groups around Focal Entities

To do this, pick an entity which is a “Thing” or a “Transaction” then bring together the entities which describe it, or give more detail about it. Draw a line round it, give it a name, even if only in your head!

“Customer and associated classifications” and
“Order and Order_line” are candidate groups.

Process Impact Groups

To create a “Process Impact Group”

Select a business process
Draw lines around the entities which it: creates, updates and refers to as part of doing its work.
You should get a sort of contour map on the data model.

In my example the processes are:

Place Order
Assemble Delivery Consignment
Confirm Delivery (has taken place)

It is normal for there to be similarities between “Process Impact Groups” and “Focal Entity Groups”. In fact, it would be unusual if there were not similarities!

Realigning

Try moving parts under headers (so, Order_line under Order) and reconciliations under the transaction which causes them. In the diagram, I’ve moved “Delivered Order Line” under “Delivery”, because it’s created by “Delivery related processes” rather than when the Order is created.

Finally, “Chunking”

Based on the insights you have gained from the above, draw a boundary around your “chunks”.

The various techniques are mutually supportive, not mutually exclusive. The chunks are of arbitrary size. If it is useful, you can:

combine neighbouring chunks together or
you can use the techniques (especially “Focal entities” and “Process Entity Groups”) to break them down until you reach a single table/entity.

Tools

My preferred tools for doing this are: a quiet conference room, late at night; the largest whiteboard I can find; lots of sticky Post-its or file cards (several colours); a pack of whiteboard pens; black coffee on tap and the prospect of beer when I’ve finished. For a large database (several hundred tables) it can take several days!

Once you’ve done all this, then all you have to do, is do the work!

I hope you have as much fun doing it as I had writing about it! J

Security – Using Server Side security with MS SQL Server

There are times when I want to do something, and it gives rise to questions. Often I have to put the questions to one side in order to get on with the thing which is my immediate priority. When that happens, I put the questions to one side with the intention of coming back to them in the future. This is one of the occasions when I have had the opportunity to go back and answer some of the questions.

The questions in this case were:

How does Server-side security work on MS SQL Server? And
How can I control the access different users have to my database?

I’m not (have never been, and do not really intend to become) a DBA (Database Administrator). I understand a bit about “privileges” but in the past I’ve always had someone else “doing it for me”, and in any case , I’ve worked on other databases such as DB/2 or Oracle.

After a little bit of research and a bit of experimentation (fiddling around), I found the answer to my questions. Although it was hardly earth-shattering, the understanding is satisfying.

The key is to understand that in MS SQL Server the “Server Instance” and the “Database” are separate entities. It is possible to have several databases inside the same Server instance. In fact I do it all the time when I’m experimenting. This means that there are two separate things to define:

A LOGIN, which gives access to the Server Instance, and with Server-side authentication, provides the security, and
A USER, which belongs to the Database and is granted the Database privileges (including CONNECT).

Figure 1 Server-side security for MS SQL Server

Summary of the stages:

Ensure that SQL Server is set up to allow Server Authentication
Create the LOGIN (in the SQL Server Instance)
Create a USER corresponding with the LOGIN (in the Database)
Grant the USER CONNECT privilege (happens by default, but can be revoked)
Grant the USER the appropriate privileges on the Database

If you’re interested in taking this a bit further, I’ve summed all this up in a video on YouTube.

What is the point of Unit Testing?

I don’t normally add two entries to my blog on the same day but something attracted my interest. Somebody asked the question: “What is the point of Unit Testing?”

Actually, they asked two questions:

Why do we perform Unit Testing, even though we are going to do System testing?
What are the benefits of Unit Testing?

To which my immediate thought responses were:

Would you deliberately make something from parts you knew were broken? Or
Would you make something from parts which you suspected were broken?

And then I thought “that might sound a little rude” and reconsidered…

Modern development methods have lots of benefits, but sometimes in flexible and rapid methods something gets lost. People forget why things are done. Or, if they’ve never been told, they wonder if they are worth bothering with.

Now, you should always question everything, but sometimes things are there for a good reason. If you plan to take something away;

Understand why it was there in the first place,
If it is no longer needed, explain why it is no longer needed.
Understand (and be prepared to live with) the consequences of taking it away.

An old-fashioned view of a System Development Process

(by the way, you’ll notice that some of this material has been re-cycled from elsewhere)

If we take a rather old fashioned view of systems development using a “waterfall” model, then we will have a number of phases (an old IBM Development Process, but does n’t really matter).

Each phase produces something, and the stage below it expands it (produces more “things”) and adds detail.
The “Requirements” specify what things the system needs to do. They also identify the things that need to be visible on the surface of the system.
For each of the things that need to be visible on the surface, we need an “External Design”
The “External Design” specifies the appearance and functional behaviour of the system.
For everything we have in the “External Design” we need a design for the “Internals”
And finally someone needs the “Build” what we have specified.

You can view this as a waterfall, down the side of a valley. The process is one of decomposition.

I don’t especially recommend “Waterfall” as a way of running a development project, but it is a simple model which is useful as an illustration.

The Testing Process

On the other side of the valley we build things up.

Units are tested.
When they work they are aggregated into “Modules” or “Assemblies” or “Subsystems”, which are tested.
These assemblies are assembled into the System which is tested as a whole.
Finally the System is tested by representatives of the Users.

The process is one of developing “bits”, testing the bits and then assembling the bits and then testing the assembly.

The assembly process (in the sense of “putting things together”, not compiling a file written in “assembler”) costs time and effort. Parts are tested as soon as practical after they are created and are not used until they conform to their specification. The benefit is that we always working with things that we think work properly.

In a well-organised world, you would like to think that the Users are testing against the original requirements!

Development and Testing should be mutually supportive

What should happen is that at every level, each component or assembly should have some sort of specification (it may be a very rudimentary specification, but it should still exist) and it should be tested against that.

In fact, there is a thoroughly respectable development approach called “Test Driven Development”. The idea here is that the (Business) Analyst writes a “Test” which can be used to demonstrate that the system, at whatever level, is doing what it is supposed to be doing. Of course, the Analyst may need help to write an automated test, but the content should come from the Analyst.

This approach is really useful all the way through the development process. It’s a really good idea if a developer writes tests for the code s/he is writing before the code! In fact, I have known places where they insisted that a test was written for a bug before the developer attempted to fix the bug. That way demonstrating the fix was easy: Run the test without the fix – Test demonstrates the bug. Apply the fix and run the test again.

The Cost of Not Doing Unit (or other low-level) Testing

All bugs are found at the topmost level, which means that they are found after the product has been assembled or “built” and then we have to work out where the error has actually originated.

The Benefits of Unit Testing

Bugs are found sooner, and they are found closer to the point at which they are created.
Unit testing lends itself to automated testing which can be integrated with the build process. Ask a professional Java developer about “JUnit” or a Python developer about “UnitTest” (one word).
Automated testing increases the chances of trapping “regression bugs” as code is enhanced and bugs are fixed.

All of the above mean that well-planned and executed Unit Testing results in:

Reduced overall cost
Improved product quality

Oracle and Courses

My personal development time last week was spent completing an online course “Oracle DBA for absolute beginners”.

I wouldn’t have described myself as an “absolute beginner”, but I found plenty to enjoy in the course and came away having learned quite a bit about what is going on inside Oracle, and I assume most other database managers.

Circumstances influence what we do in life and so far I have had much more exposure to DB/2 and MS SQL Server than to Oracle. That hasn’t been a decision on my part, simply the choices that had been made for the projects I was involved in.

In a similar way, I’ve spent much more time “dealing with users” as a Business Analyst, than I have working out how to manage the space requirements and performance of a database. It does me good to learn just a little about the things a DBA has to consider. I don’t have to let those considerations govern what I consider the requirements to be, but at least I can understand where other people are coming from.

Taking the course led me to what you might consider “meta” thinking: thinking about not the content of the course, but the way it was presented and the platform Udemy on which it was presented.

I find Udemy interesting. It seems to work well. It certainly worked for me.

Udemy seem to be aiming to be a “neutral marketplace”. The courses belong to the course instructors. Of course Udemy have standards for courses, but beyond the usual “fit to print” conditions, they are mostly technical standards (quality of video and sound) rather than subject matter related. In a similar spirit, Udemy promote the platform, but the promotion I have seen seems to be fairly neutral with regard to individual courses. On the other hand, instructors or course owners are completely free to advertise their wares elsewhere and direct potential customers into Udemy. It’s a simple model which I think I will investigate further.