NYCPHP Meetup

NYPHP.org

[nycphp-talk] Cake PHP and "Active Records"

Paul Houle paul at devonianfarm.com
Thu Sep 28 22:14:00 EDT 2006


    I greatly enjoyed the talk about Cake.

    I've been thinking a lot about "Active Records" as a part of web 
frameworks.  There are quite a few ways to implement them and they all 
have some strengths and weaknesses.  I'm convinced that we need active 
records to simplify the writing of secure and maintainable code,  but 
there's still an impedance mismatch between SQL and objects.

    Cake's model has a lot of vertical integration with the rest of the 
framework,  which helps in the rapid development department.  As I 
understand it,  developers build out a number of classes that extend 
AppModel.  They can add methods to define actions on the model,  
override methods to change behaviors,  and define a few variables to set 
metadata.

    One of the interesting questions is early loading versus lazy 
loading.  The Active Record from Ruby on Rails,  for instance,  lazy 
loads everything and caches nothing.  When you ask for 'User.name',  
Ruby immediately does something like

SELECT name from user where id=66;

    This is elegant.  It may even be efficient when we can avoid 
SELECTing big values (say,  the text of a blog post).  Unfortunately,  
it takes six SELECT statements to get the values of six columns for a 
single row,  and 60 SELECT statements to get those values for ten rows 
(say,  to see the last ten blog posts).

    I've been playing around with something I call a "passive record";  
like RoR,  passive records get their behaviors determined by run-time 
introspection of the database.  You don't subclass passive_record to 
represent classes:  this is one less thing for you to maintain,  but you 
do lose a good extension point.  The code looks something like

// may be something like $user_table=$factory->table in the future
$user_table=new locust_passive_table($conn, "user");
$user=$user_table->fetch_record($user_id);

echo $user->first_name;

   $conn,  by the way,  is an instance of an OO database access library 
I've written.  It's got some nice features:  you can say

$conn->select_scalar("SELECT COUNT(*) FROM ..");
$conn->select_scalars("SELECT measurement FROM measurements");
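For what it's worth, those two methods could be sketched on top of PDO something like this (the implementation is my guess; only the two method names come from above):

```php
<?php
// Hypothetical sketch of select_scalar()/select_scalars() built on PDO.
class Conn {
    private $pdo;
    public function __construct(PDO $pdo) { $this->pdo = $pdo; }
    // First column of the first row of the result set.
    public function select_scalar($sql) {
        return $this->pdo->query($sql)->fetchColumn();
    }
    // First column of every row, as a flat array.
    public function select_scalars($sql) {
        return $this->pdo->query($sql)->fetchAll(PDO::FETCH_COLUMN, 0);
    }
}

// Demo against an in-memory SQLite database.
$pdo = new PDO('sqlite::memory:');
$pdo->exec("CREATE TABLE measurements (measurement REAL)");
$pdo->exec("INSERT INTO measurements VALUES (1.5), (2.5)");
$conn = new Conn($pdo);

echo $conn->select_scalar("SELECT COUNT(*) FROM measurements"), "\n"; // prints 2
```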

    Column values are held in a protected array;  access to variables 
like first_name goes through "magic"  __get and __set methods.  (This 
enables a ~sweet~ syntax.)  If you want to make changes to the record,  
you can do something like

$user->first_name="Bozo";
$user->last_name="Clown";
$user->update();
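A minimal sketch of that __get/__set machinery (class and method names invented here, not the actual locust code) might look like:

```php
<?php
// Minimal passive record: column values in a protected array,
// property access via magic methods.  Names are illustrative.
class PassiveRecord {
    protected $table;
    protected $key_col;
    protected $values = [];   // column => value, loaded from the row

    public function __construct($table, $key_col, array $values) {
        $this->table   = $table;
        $this->key_col = $key_col;
        $this->values  = $values;
    }
    public function __get($col)       { return $this->values[$col]; }
    public function __set($col, $val) { $this->values[$col] = $val; }

    // The "easy" update(): iterate over ALL columns to build SET.
    public function update_sql() {
        $sets = [];
        foreach ($this->values as $col => $val) {
            if ($col === $this->key_col) continue;
            $sets[] = "$col='" . addslashes($val) . "'";
        }
        return "UPDATE {$this->table} SET " . implode(', ', $sets)
             . " WHERE {$this->key_col}=" . (int)$this->values[$this->key_col];
    }
}

$user = new PassiveRecord('user', 'user_id',
    ['user_id' => 66, 'first_name' => 'Foo', 'last_name' => 'Bar']);
$user->first_name = 'Bozo';
$user->last_name  = 'Clown';
echo $user->update_sql(), "\n";
// prints: UPDATE user SET first_name='Bozo', last_name='Clown' WHERE user_id=66
```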

    Now there are interesting questions:  the easy way to implement 
update() is to do something like

UPDATE user SET first_name='Bozo',... WHERE user_id=$user_id

    iterating over ALL the columns in the row to generate the SET 
clause.  This sets you up for "lost update" problems,  however.

    Let's imagine that Bozo is updating his user record,  and that 
user.suspended="n" in the database before the above code gets called.  
Now imagine that a sysadmin uses a PHP script to suspend Bozo's access...

$user->suspended="y";
$user->update();

    ...and that the sysadmin's script runs after the first process does 
fetch_record() and before the first process does update().  Since the 
first process writes back every column it read,  its update() restores 
suspended="n".  As a result of this race condition,  "Bozo" unsuspends 
himself,  a bad outcome.

    One answer to this problem is to keep track of which variables have 
been "touched",  and flush out only the ones that have been "touched".  
This isn't very hard to do.  It doesn't eliminate every possible "lost 
update" situation,  but it eliminates many of them.
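A sketch of the touched-column idea, under the same sort of invented names as before: __set records which columns changed, and update() flushes only those.

```php
<?php
// Passive record that tracks "touched" columns so update() only
// writes back what was actually assigned.  Names are illustrative.
class TouchedRecord {
    protected $table;
    protected $key_col;
    protected $values  = [];
    protected $touched = [];  // set of columns assigned since load

    public function __construct($table, $key_col, array $values) {
        $this->table   = $table;
        $this->key_col = $key_col;
        $this->values  = $values;
    }
    public function __get($col) { return $this->values[$col]; }
    public function __set($col, $val) {
        $this->values[$col]  = $val;
        $this->touched[$col] = true;
    }
    public function update_sql() {
        $sets = [];
        foreach (array_keys($this->touched) as $col) {
            $sets[] = "$col='" . addslashes($this->values[$col]) . "'";
        }
        return "UPDATE {$this->table} SET " . implode(', ', $sets)
             . " WHERE {$this->key_col}=" . (int)$this->values[$this->key_col];
    }
}

// Bozo edits his name; "suspended" is never touched, so a concurrent
// suspension by a sysadmin is not clobbered by this UPDATE.
$user = new TouchedRecord('user', 'user_id',
    ['user_id' => 66, 'first_name' => 'Foo', 'suspended' => 'n']);
$user->first_name = 'Bozo';
echo $user->update_sql(), "\n";
// prints: UPDATE user SET first_name='Bozo' WHERE user_id=66
```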

   Since RoR does SELECTs and UPDATEs as you get and set variables,  it 
avoids much of this trouble,  but opens up another can of worms.   If 
you're not using transactions (or if you're running the READ COMMITTED 
isolation level),  other processes can see intermediate states.  
Another process might see a new first name and an old last name:  once 
again,  a source of strange,  hard to understand and fix bugs.  If you 
are using transactions,  you're forcing the database to hold the 
"transaction window" open for longer,  which slows the database down and 
increases the risk of failed transactions and deadlocks.

    It looks like Cake uses early loading quite consistently...  Looking at

http://manual.cakephp.org/chapter/models

if I do

$user = $this->User->read(null, '25');

    for a user who is linked to a comment table,  I'd get back an array 
of comments,  so I can do something like

echo $user["comments"]["5"]["body"];

    If I wanted to display a list of (selected) users,  I might want to 
display the number of comments that they've made,  and I could do that 
by doing

echo count($user["comments"]);

    This is quite intuitive.  It's even reasonably efficient if a user 
has,  say,  3 comments,  but it would be quite slow for a user who has 
800 comments.  Efficiency demands that the database do the counting work 
in this case...  I don't really care if it's done by:

(1) A COUNT(*) generated for every user row,
(2) A JOIN/GROUP BY/COUNT(*) over the user and comment rows,  or
(3) A sub-select over comment inside the select on user,

but it's essential that a framework lets me do this.  It would be nice 
if the framework did 1,  2 or 3 automatically,  but I could live with it 
if I had to do (1) manually,  either in SQL or via the framework.
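For example, options (2) and (3) might look like this in SQL (table and column names assumed, not taken from any real schema):

```sql
-- (2) JOIN / GROUP BY / COUNT(*) over user and comment
SELECT u.user_id, u.first_name, COUNT(c.comment_id) AS comment_count
  FROM user u
  LEFT JOIN comment c ON c.user_id = u.user_id
 GROUP BY u.user_id, u.first_name;

-- (3) correlated sub-select over comment inside the select on user
SELECT u.user_id, u.first_name,
       (SELECT COUNT(*) FROM comment c WHERE c.user_id = u.user_id)
           AS comment_count
  FROM user u;
```

Either way, the counting happens in the database instead of in PHP.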

It's straightforward to implement lazy loading if you're using __get and 
__set...  In that case,  you don't even need to look at related tables 
until the user asks a question about them,  say...

$user->count("comments");

or asks for $user->comments[3]->body.  In the latter case,  there are 
two places where we can lazy load:  (i) when the user asks for the 
comments[] array,  and (ii) when the user asks for the "body" property 
of $user->comments[3].  As always,  we've got choices...  We can

(A) populate the comments array with fully-formed passive_records when 
we create it (one SELECT),  or
(B) create a number of "empty" passive_records that contain a comment id 
and lazy-load the rest of the variables when needed.

    Which one is best depends on your case...  If you want them all,  
(A) makes a lot of sense.  If you want to look at the first 10 
comments,  (B) works better,  but this is still going to be slower than 
the old-fashioned way with

SELECT * FROM comments WHERE user_id=$user_id LIMIT 10;

    Although I suppose you could add some special method to do

$comments=$user->fetch_related("comments",10);
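Such a method might just generate a LIMITed SELECT for the related table; a sketch (the fetch_related() name comes from my example above, the implementation is hypothetical):

```php
<?php
// Hypothetical helper behind fetch_related(): build one LIMITed SELECT
// for a related table instead of materializing every child record.
function fetch_related_sql($table, $fk_col, $fk_value, $limit) {
    return sprintf("SELECT * FROM %s WHERE %s=%d LIMIT %d",
                   $table, $fk_col, (int)$fk_value, (int)$limit);
}

echo fetch_related_sql('comments', 'user_id', 66, 10), "\n";
// prints: SELECT * FROM comments WHERE user_id=66 LIMIT 10
```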

    My feeling about this is that I can't accept an "Active Record" 
implementation that (sometimes) uses the database in an outrageously 
inefficient way.  I might trade a factor of 2 for convenience,  but you 
can certainly get into situations where the RoR way and the PHPCake way 
could cost you a factor of 10 or more.  I have to admit that I really 
don't know the answer.

    Something I really admire from the Java Spring Framework is a JDBC 
wrapper which throws named exceptions for database errors:  see

http://www.springframework.org/docs/api/org/springframework/dao/DataAccessException.html

    My current $conn object lets me do

try {
   $conn->query("INSERT ...");
} catch (DuplicateKeySQLException $e) {
   // handle the duplicate key
}

     and that's really nice.  A fleshed-out hierarchy of exceptions for 
database errors would make error handling a snap.
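A sketch of what such a hierarchy might look like in PHP (class names and the mapping function are illustrative; 1062 and 1213 are MySQL's duplicate-key and deadlock error codes):

```php
<?php
// Named exceptions in the spirit of Spring's DataAccessException
// hierarchy.  Class names here are invented for illustration.
class SQLException extends Exception {}
class DuplicateKeySQLException extends SQLException {}
class DeadlockSQLException extends SQLException {}

// A wrapper might map driver error codes to named exceptions.
// MySQL: 1062 = ER_DUP_ENTRY, 1213 = ER_LOCK_DEADLOCK.
function throw_for_errno($errno, $msg) {
    switch ($errno) {
        case 1062: throw new DuplicateKeySQLException($msg);
        case 1213: throw new DeadlockSQLException($msg);
        default:   throw new SQLException($msg);
    }
}

try {
    throw_for_errno(1062, "Duplicate entry '66' for key 'PRIMARY'");
} catch (DuplicateKeySQLException $e) {
    echo "duplicate key: ", $e->getMessage(), "\n";
}
```

Callers can then catch the specific failure they can actually handle and let everything else propagate.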


