[nycphp-talk] data modelling vs. db design (was: ER Diagram tool for MySQL/OS X)

Hans Zaunere lists at zaunere.com
Tue Oct 11 22:13:59 EDT 2005



Phil Duffy wrote on Tuesday, October 11, 2005 9:05 AM:
> > -----Original Message-----
> > From: talk-bounces at lists.nyphp.org
> > [mailto:talk-bounces at lists.nyphp.org] On Behalf Of Hans Zaunere
> > Sent: Monday, October 10, 2005 7:09 PM
> > To: 'NYPHP Talk'
> > Subject: Re: [nycphp-talk] data modelling vs. db design (was: ER Diagram
> > tool for MySQL/OS X)
> > 
> > Normalization is often one of the most misunderstood aspects of
> > relational databases.  Of course, it's what makes a relational database
> > relational, and we're all taught that you're not worth a cent unless you
> > only work in fifth normal form.
> > 
> > Well, perhaps if you're a mathematician, then that's correct.  The
> > reality is that over-normalization can kill a project just as much as
> > under-normalization.  Don't forget - at least from the old SQL books I
> > learned from - why normalization was being preached.  "Don't have
> > duplicate data!" was often the mantra.  If you had more than one guy
> > named "John", then you need to have a table of first names, with a
> > lookup table relating first names with the other data you're storing. 
> > Seems rediculous, right? 
> > 
> > The practice of heavy normalization was based on some old, bygone
> > concerns, namely "storage is expensive."  Add in academic
> > chest-beating, and pretty soon you have a table for every letter of the
> > alphabet and look-up tables to match.
> > 
> > As I think we generally find, with the abundance of hardware resources,
> > normalization is reduced to only an element of good design.  Sure,
> > performance can still be a factor, but let's consider the performance
> > of an entire system, and not just what's happening in the database.
> 
> <snip>
> 
> Hans,
> 
> While I agree with your general message, I have a reservation about the
> "storage is expensive" specific example.  Having been around since before
> Boole and Babbage, I can't even recall in the "old days" when developers
> were slavish about "storage is expensive".  We sought opportunities to
> reduce storage, when that was practical, but only in very specialized
> applications did we go out of our way to conserve disk storage.  The
> primary issue was data consistency and software maintainability.  Lack of
> normalization created situations which encouraged users to create
> inconsistent data.  For example, consider a pre-relational system in
> healthcare that recognized two 'pseudo-entities', physicians and patients
> (pseudo in the sense we now recognize these as roles as opposed to
> fundamental entities).  Each pseudo-entity had its own basic record to
> contain address, telephone number and other attributes.  That worked as
> long as the physician never became a patient.  However, a change to a
> physician/patient's record created an inconsistency in the changed
> attribute, e.g., the address.
> 
> The second issue was software maintenance.  There were situations where
> one record's structure might be modified while the other was not.

Good points, and I think you raise an important issue, namely, normalization
as a mechanism for separation and isolation.
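
To make that concrete, your physician/patient case is really one person
with two roles.  A minimal MySQL sketch - table and column names are my
own, just for illustration:

  -- Contact data lives in exactly one place; physician and patient
  -- are roles that reference it, not copies of it.
  CREATE TABLE person (
      person_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name      VARCHAR(100) NOT NULL,
      address   VARCHAR(255),
      phone     VARCHAR(20)
  ) ENGINE=InnoDB;

  CREATE TABLE physician (
      person_id      INT UNSIGNED NOT NULL PRIMARY KEY,
      license_number VARCHAR(20)  NOT NULL,
      FOREIGN KEY (person_id) REFERENCES person (person_id)
  ) ENGINE=InnoDB;

  CREATE TABLE patient (
      person_id     INT UNSIGNED NOT NULL PRIMARY KEY,
      record_number VARCHAR(20)  NOT NULL,
      FOREIGN KEY (person_id) REFERENCES person (person_id)
  ) ENGINE=InnoDB;

When the physician becomes a patient, you insert one row into patient; the
address is still stored once, so it can't drift out of sync.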

> I believe it was a natural progression from normalization of databases to
> object-oriented design, because the latter encouraged all interactions
> with the database to be performed in a single module.  I know of at least
> one situation in healthcare where that limitation in the existing systems
> nearly killed a patient.

Ah hah!  Separation and isolation.  There is always the consideration of
what perspective we're looking from and, in the end, of who is calling for
normalization.  I think the storage reasoning has come from - at least in
my experience when browsing academic SQL references - a largely idealistic
way of storing data.  Perhaps storage was or wasn't really expensive, but
it was neat to not store redundant data, as the first few normal forms try
to address:

http://www.datamodel.org/NormalizationRules.html

That's from the data-structures-are-cool-and-I'm-not-an-implementer
department, which begins down the road of:

http://en.wikipedia.org/wiki/Database_normalization#Sixth_normal_form
http://www.bkent.net/Doc/simple5.htm

and eventually to:

http://www.nerdbooks.com/item.php?id=1558608559

Not a bad thing, but not something I'd be ready to recommend to a customer
:)
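
To dust off the "John" example in actual DDL, this is roughly the schema
that mantra leads to if taken literally (names invented for illustration):

  -- The "never store a string twice" version: a lookup table of
  -- first names, and a join just to read somebody's name.
  CREATE TABLE first_name (
      first_name_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      first_name    VARCHAR(50)  NOT NULL UNIQUE
  ) ENGINE=InnoDB;

  CREATE TABLE employee (
      employee_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      first_name_id INT UNSIGNED NOT NULL,
      FOREIGN KEY (first_name_id) REFERENCES first_name (first_name_id)
  ) ENGINE=InnoDB;

Every SELECT now carries a join, and every INSERT is a lookup-or-insert
dance, all to avoid storing 'John' twice - versus simply declaring
first_name VARCHAR(50) on employee and moving on.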

From a developer's perspective, however, the more important issue, as you
point out, is a good, clean representation of data.  And, as hardware
resources become cheaper, that might mean it's OK to have
redundant/repeating data and still be correctly normalized and
architected.
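
For example - my own illustration, not from any of the links above -
repeating a customer's name on an invoice is textbook redundancy, and
often exactly right:

  CREATE TABLE customer (
      customer_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name        VARCHAR(100) NOT NULL
  ) ENGINE=InnoDB;

  -- customer_name duplicates customer.name on purpose: an invoice
  -- should show the name as it was when billed, and reads need no join.
  CREATE TABLE invoice (
      invoice_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      customer_id   INT UNSIGNED NOT NULL,
      customer_name VARCHAR(100) NOT NULL,
      total         DECIMAL(10,2) NOT NULL,
      FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
  ) ENGINE=InnoDB;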

> But your principal point is that database design (and object design by
> extension in my opinion) is still an art form.  Both database and object
> design should look upon the 'rules' as significant guidelines.  Understand
> when to bypass these rules, and document those reasons (conditions may
> change in the future and others will probably have to clean up the
> situation).

Exactly - it's a balance between what works in a sensible way, and how
you're supposed to do it.  As some of the links above mention, they're
discussing "relational database theory."  A lot of the work these guys do is
great, and the basis for much of the technology we have today.  But as
architects and implementers on the front lines, it's our job to find that
happy place between theory and practical, cost-effective software.

> I liked your example of normalizing the string 'John', which underlines
> the absurdity of over-normalization.  For me, appropriate normalization
> begins with the recognition of the fundamental entities in a system and a
> distinction between real world-modeled entities (e.g., a Person) as
> opposed to system-internal entities (e.g., a web page).  If I come to a
> branch in the road at which normalization or de-normalization both offer
> advantages, I am most likely to take the normalization branch, make a
> mental note, and if I later discover that it creates a significant
> performance issue that can't be otherwise corrected, I would de-normalize
> that area of the system intentionally (documenting my reasons and the
> implementation). 
> 
> It would be interesting to see if others have approached normalization
> guidelines differently.  I won't claim the above is the only way to do
> database design - it is still an art form after all.

I approach it in much the same way; however, I often take the
de-normalization branch if all other things are equal.  Recognition of
fundamental entities is really the driver behind normalization in most
environments I come across.  Beyond that, the total system - code and DB
style, maintainability, performance, and future changes by external
developers - is usually happier with a bit more data consolidation.
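
And on your point about documenting the reasons: MySQL will let you hang
the justification right on the schema, which is cheap insurance for
whoever cleans up after us (a sketch, with invented names):

  -- Denormalized on purpose: order_count caches what a COUNT(*)
  -- over orders would return.  Say why, where the next guy will look.
  CREATE TABLE customer_summary (
      customer_id INT UNSIGNED NOT NULL PRIMARY KEY,
      order_count INT UNSIGNED NOT NULL DEFAULT 0
          COMMENT 'Cached COUNT(*) from orders; maintained by app code. Reason: reporting joins were too slow.'
  ) ENGINE=InnoDB COMMENT='Intentionally denormalized - see column comments';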


Good talk, Phil,

---
Hans Zaunere / President / New York PHP
   www.nyphp.org  /  www.nyphp.com




