[nycphp-talk] Text versioning/manipulation caching architecture suggestions

dann dann at bentobox.net
Tue Jan 29 14:49:32 EST 2008


Hi Max,

I had a similar situation in the past involving a single table of  
Content Items which evolved from simple strings to elaborate objects  
with branching / staging, localization, history, and cached  
processing. The solution I implemented involved three tables (sketched roughly below):

1. Content Items: a placeholder for the item and its primary data  
(id, name, current version, access level, etc.)
2. Content Item Instances: the item value is kept here (id, item_id,  
version, value, etc.)
3. Content Items Metadata: the caching lives here (id, item_id, key,  
value, creation date)
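
Roughly, the schemas look something like this (MySQL assumed; the exact
column names and types are just illustrative, not a spec):

    <?php
    // Rough schema sketch for the three tables above.
    // Connection details are placeholders for your own.
    $pdo = new PDO('mysql:host=localhost;dbname=yourapp', 'user', 'pass');

    // 1. Content Items: the item and its primary data.
    $pdo->exec("
        CREATE TABLE content_items (
            id              INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            name            VARCHAR(255) NOT NULL,
            current_version INT UNSIGNED NOT NULL DEFAULT 1,
            access_level    TINYINT UNSIGNED NOT NULL DEFAULT 0
        ) ENGINE=InnoDB");

    // 2. Content Item Instances: one row per version, holding the value.
    $pdo->exec("
        CREATE TABLE content_item_instances (
            id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            item_id INT UNSIGNED NOT NULL,
            version INT UNSIGNED NOT NULL,
            value   MEDIUMTEXT NOT NULL,
            UNIQUE KEY item_version (item_id, version)
        ) ENGINE=InnoDB");

    // 3. Content Item Metadata: the per-item caches.
    // ('key' is reserved in MySQL, hence meta_key.)
    $pdo->exec("
        CREATE TABLE content_item_metadata (
            id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            item_id    INT UNSIGNED NOT NULL,
            meta_key   VARCHAR(64) NOT NULL,
            value      MEDIUMTEXT NOT NULL,
            created_at DATETIME NOT NULL,
            UNIQUE KEY item_key (item_id, meta_key)
        ) ENGINE=InnoDB");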

When a Content Item is updated, you'll increment its version number and
add a new instance with the new value and the new version number (be wary
of the potential race condition here). That gives you a full item history
in the Content Item Instances table.
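
A minimal sketch of that update, wrapped in a transaction so two editors
can't both read version N and then both write version N + 1 (assumes the
InnoDB tables above and PDO with exceptions turned on; the function name
is just for illustration):

    <?php
    // Sketch: add a new version of an item's value inside one transaction.
    function updateContentItem(PDO $pdo, $itemId, $newValue)
    {
        $pdo->beginTransaction();
        try {
            // Lock the item row so concurrent writers serialize on it.
            $stmt = $pdo->prepare(
                'SELECT current_version FROM content_items WHERE id = ? FOR UPDATE');
            $stmt->execute(array($itemId));
            $current = $stmt->fetchColumn();
            if ($current === false) {
                throw new Exception('Unknown content item: ' . $itemId);
            }
            $next = (int) $current + 1;

            // Keep every version: the new value becomes a new instance row.
            $stmt = $pdo->prepare(
                'INSERT INTO content_item_instances (item_id, version, value)
                 VALUES (?, ?, ?)');
            $stmt->execute(array($itemId, $next, $newValue));

            // Point the item at its new current version.
            $stmt = $pdo->prepare(
                'UPDATE content_items SET current_version = ? WHERE id = ?');
            $stmt->execute(array($next, $itemId));

            $pdo->commit();
            return $next;
        } catch (Exception $e) {
            $pdo->rollBack();
            throw $e;
        }
    }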

When an item is accessed, you'll check its Metadata to see whether a cached
value exists for the particular processing key. If it does, and its creation
date is later than the update date of the relevant processors (I use a
single site-wide update date for simplicity), show the cached value.
Otherwise run the processors, store the result in the Metadata table, and
show that.
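
In code the read path is roughly this; getSiteWideProcessorDate() and
processValue() are hypothetical stand-ins for however you track processor
changes and run the manipulations:

    <?php
    // Sketch: return the cached processed value for an item + processing key,
    // rebuilding it when missing or older than the processors themselves.
    function getProcessedValue(PDO $pdo, $itemId, $metaKey, $rawValue)
    {
        // Unix timestamp of the last time any processor changed (site-wide).
        $processorsUpdatedAt = getSiteWideProcessorDate();

        $stmt = $pdo->prepare(
            'SELECT value, created_at FROM content_item_metadata
              WHERE item_id = ? AND meta_key = ?');
        $stmt->execute(array($itemId, $metaKey));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);

        // Cache hit: only valid if built after the processors last changed.
        if ($row && strtotime($row['created_at']) >= $processorsUpdatedAt) {
            return $row['value'];
        }

        // Miss or stale: run the processors for this key and store the result.
        $processed = processValue($rawValue, $metaKey);

        // REPLACE relies on the (item_id, meta_key) unique key (MySQL-specific).
        $stmt = $pdo->prepare(
            'REPLACE INTO content_item_metadata (item_id, meta_key, value, created_at)
             VALUES (?, ?, ?, NOW())');
        $stmt->execute(array($itemId, $metaKey, $processed));

        return $processed;
    }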


Some notes:

You might be able to store the value directly in the Content Items  
table if you don't have to deal with i18n, staging, branching and the  
like. In my case I needed multiple instances per Content Item to  
begin with, and the history was a simple bonus extension to that  
implementation. You'd still need something like Table 2 above, but  
you wouldn't have to check it every time you need the raw value for a  
Content Item.

Likewise, my metadata situation is a bit more complicated, as I actually
store the caches on a per-instance basis. I still have Table 3, but I also
have a fourth table for Content Item Instance Metadata, which is where the
caches live.
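
That extra table is just the Metadata table keyed on the instance instead
of the item; roughly (names and types again only illustrative):

    <?php
    // Sketch: per-instance cache table -- same idea as content_item_metadata,
    // but keyed on the instance so every version keeps its own caches.
    $pdo = new PDO('mysql:host=localhost;dbname=yourapp', 'user', 'pass');

    $pdo->exec("
        CREATE TABLE content_item_instance_metadata (
            id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            instance_id INT UNSIGNED NOT NULL,
            meta_key    VARCHAR(64) NOT NULL,
            value       MEDIUMTEXT NOT NULL,
            created_at  DATETIME NOT NULL,
            UNIQUE KEY instance_key (instance_id, meta_key)
        ) ENGINE=InnoDB");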

You mentioned that you have many tables in your database that need  
this treatment. You may need to experiment a bit to see if it is more  
efficient to triple the number of tables in your database (ouch) or  
to add something like a 'table_name' field to Tables 2 and 3 above  
and use them for all your data. I suspect this latter route may lead
to some table size issues, however, as those tables tend to fill up
rather rapidly.
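
If you do go the shared-table route, the change is basically one
'table_name' discriminator column on Tables 2 and 3 so item_id can point
into any source table; for the instances table that might look like this
(table and column names are made up):

    <?php
    // Sketch: one shared instances table serving every source table.
    // 'table_name' says which table item_id points into.
    $pdo = new PDO('mysql:host=localhost;dbname=yourapp', 'user', 'pass');

    $pdo->exec("
        CREATE TABLE shared_item_instances (
            id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            table_name VARCHAR(64) NOT NULL,
            item_id    INT UNSIGNED NOT NULL,
            version    INT UNSIGNED NOT NULL,
            value      MEDIUMTEXT NOT NULL,
            UNIQUE KEY source_item_version (table_name, item_id, version)
        ) ENGINE=InnoDB");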


I've been meaning to write this up for a while; if the ideas seem
useful I can prepare a more rigorous treatment. Let me know what you
eventually decide to use; I'd be interested to hear how your
particular requirements shape the solution.

-- dann


> Hello buds,
>
> It's been a while since I've been active on the list but I figured I'd
> give a holler and see if anyone had any suggestions for an application
> design problem I've run into.
>
> I have a large number of text fields across many tables in a fairly
> large database, all of which can be manipulated in any number of ways.
> Some common manipulations would be scrubbing strings for display on the
> web (XHTML compliance and XSS avoidance), censoring of "bad" words,
> rich-text, etc.
>
> All in all, once you mix and match all of the various text
> manipulations, you end up with a large number of versions of the same
> chunk of text, and you need access to all of them based on a plethora
> of variables such as user options, access interface, etc. On top of
> that, some fields can be edited, and I'd like to keep copies of the
> entire revision history, which adds another level of complexity.
>
> Originally I thought of some sort of memory caching solution, but the
> main goal of this is to come up with a scalable solution, and there are
> currently a few gigabytes of text that this would apply to, so if
> anything it would probably need to expire. It's possible that I could
> have some mixture of short-term memory cache and long-term disk cache,
> as disk/database space isn't a large concern.
>
> Another issue is manipulation function versioning, e.g. when a new word
> is added to the censor function, you want to purge the cache of all of
> the censored text created by the last version.
>
> Maybe I'm just over-complicating the entire thing, but doing this sort
> of manipulation on a high-traffic site seems like a gigantic
> duplication of CPU-intensive work that could (and should) be avoided.
>
> I've come up with a lot of solutions, but none of them seem very
> elegant. I'm trying to avoid a lot of excess DB queries and SQL joins.
>
> I've done some searching around and it seems like anyone who has solved
> this problem hasn't discussed it publicly. I thought maybe someone
> dealing with locale on a large scale might have come up with a good
> solution, but since locale is mostly static, it doesn't seem to apply
> in most cases.
>
> So has anyone dealt with something similar, or is there an obvious
> solution that I'm missing? I'd be interested in hearing some of the
> more seasoned NYPHPers' opinions.
>
> Thanks for any advice in advance!
> -Max
>




