NYCPHP Meetup

NYPHP.org

[nycphp-talk] DB-independent fulltext indexing--help!

Andy Crain copyboy992000 at yahoo.com
Tue Apr 13 13:52:04 EDT 2004


Everyone,

My problem in a nutshell: How do I use, or at least
roughly emulate, fulltext indexing in a db-portable
way? My project requires near perfect portability, so
I’m already using PEAR::DB and standard ansi sql
extensively, but I need to provide the ability to
search large text fields.

It seems to me there are two possible routes, but each
has its drawbacks. I could:

1) ignore the variety of proprietary implementations
and roll my own solution. This I suspect would involve
a keyword table and a mapping table that would track
occurrences of keywords in text records. This has the
benefit of being entirely db-independent, as well as
allowing me to add soundex or synonym lookups or
whatever else I might want to do. Reading/selecting is
nice and fast, but the downside, and it seems to me to
be a big one, is on inserts/updates. I can’t conceive
of a way to do this that does not involve inserting
the large text field (one query), splitting the text
field and, for _each word_ in the text field:
-check the keyword table for that word and get its id
(1 query), and if it’s not already in the keyword
table, insert it and get its id (possibly another
query)
-insert the keyword id and the text record id into a
mapping table (another query)
So, to insert, say, a 500-word string of text, that's
1001 (best case) or 1501 (worst case) queries!
(Duplicate and junk words would make this number a bit
smaller, but you get the point.)
(Oh, and I'm hoping the only answer isn't to run a
cron job to do inserts/keyword table updates
overnight, because I need the "index" to be current in
real time.)

2) Or, since I'm already using PEAR DB, maybe I should
extend it, or maybe extend each of the drivers, to
somehow modify where clauses to have a particular
rdms's fulltext syntax (e.g. "WHERE MATCH(cols)
AGAINST('searchString')" for mysql, "WHERE
CONTAINS(cols, 'searchString')" for mssql, "WHERE
CONTAINS(cols, 'searchString') > 0" for oracle, etc.)
This seems really kludgy, though, and it also means
that, for oracle and mssql at least, there's an
additional wrinkle of having to schedule regular index
refreshes/updates.

Any ideas? I prefer option one, if only I could
somehow reduce the processing work on inserts/updates.

Thanks,
Andy


	
		
__________________________________
Do you Yahoo!?
Yahoo! Small Business $15K Web Design Giveaway 
http://promotions.yahoo.com/design_giveaway/



More information about the talk mailing list