NYCPHP Meetup

NYPHP.org

[nycphp-talk] SQL Full text searcing and storing

Paul Houle paul at devonianfarm.com
Sun Apr 15 00:24:18 EDT 2007


Ben Sgro (ProjectSkyline) wrote:
>  
> I'm curious what field types, indexes are best for this type of 
> application.
>  
> I'm also curious about benchmarks. Google can return a huge number of 
> records in a fraction of a second.
> Is this a forked process, each doing small amounts of work, or one 
> large beefy server doing the transaction?
>  
> My DB experiance is mostly mySQL and I would prefer to build using this.
>  
    Most popular relational databases have a full text search 
extension.  This includes MySQL,  Oracle,  MS SQL and Postgres -- 
unfortunately,  these implementations do not correspond to any 
standard,  so the details are different for different databases.  MySQL 
has a full-text index that works quite well,  with that caveat that it 
only works for MyISAM at the moment:

    http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

this means you can have full-text indexes or transactions,  not both.  
Assuming you have a MyISAM table,  implementing fulltext search is as 
simple as creating a FULLTEXT index and running queries with the MATCH 
operator.  You could be benchmarking MySQL with your own data in less 
than half an hour (unless it takes more than that to rebuild your 
index!)  The fulltext index will be updated automatically when you 
modify the table,  so it's a real "set-it-and-forget-it" proposition.

    Postgresql implements FT search with the "tsearch2" extension which 
involves a shared library and stored procedures.  It feels a lot like 
somebodies science project -- you'll need to write your own code to 
maintain the index,  and dump/restores of your database may be an 
adventure,  but tsearch2 is flexible and lets you do really neat things.

    Years ago I spent a weekend writing some perl scripts that create an 
inverted index in mysql tables.  It's an inefficient way to do full text 
searching,  but it lets you do things that other search engines don't 
do,  such as similarity searches.  I loaded about 200,000 short 
documents into it on a cheap PC and found I could get interactive 
responses (<10 seconds) doing some pretty fancy things.  I've used this 
to support a few little research projcts.

    There are plenty of other specialized full-text engines,  such as 
Lucene and Xerces,  that do a great job,  but would require you to do 
work to maintain the index.

    So far as Google,  here's what I can tell you:

* Google almost certainly is based on a distributed main-memory 
database.   Google keeps it's index in the RAM of a large number of 
computers...  It's too big to fit in the RAM of one machine,  so queries 
get split between several machines.

* Google's most critical 'secret' isn't pagerank,  but the use of 
implicit phrase searching.  Older methods of text retrieval don't 
consider the ordering of words when scoring -- if you want to get 
results like Google,  you really do need to score words higher when they 
are in proximity to each other,  and you need an algorithm that blends 
this well with other sources of information.

* Google's other 'secret' is that it's a got a huge amount of text to 
worth with.  The real intelligence is in the data it searches,  ~not~ in 
it's algorithms -- Google's algorithms just bring that intelligence 
out.  You have different problems when you do text retrieval at 
different scales:  if you've got 100 documents and expect 2 to be 
relevant for a query,  it's probably not acceptable to have a recall of 
50%...  For small numbers of documents,  you have to work hard to 
eliminate false negatives.  If you've got 10 billion documents,  and 1 
million are relevant to a particular topic,  you don't really care if 
you lose 90% of them -- but you do care that a few really excellent 
documents float to the top.

Academic researchers who tried applying algorithms like PageRank to 
small data sets (1 million documents) couldn't produce evidence that 
PageRank works -- because it doesn't for small data sets.



More information about the talk mailing list