[nycphp-talk] Squashing accented characters

John Campbell jcampbell1 at gmail.com
Sun Oct 24 13:24:31 EDT 2010


I use a regex and apply it to both the source and the indexed text.

It is pretty simple, something like:
$x = preg_replace('/[àâäåãá]/iu', 'a', $x);
$x = preg_replace('/[éèêë]/iu', 'e', $x);

It is a bit of a hack, but it works quite well in practice. If you do some
googling, you can find many regex variations that will do what you want.
Some get pretty involved, such as handling ligatures (ß -> ss).
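
As a rough sketch of the same idea using a lookup table instead of one regex
per letter: the function name squash_accents() and the contents of the table
are my own illustration (not code from this thread), the table is far from
complete, and the type declarations assume PHP 7 or later. Unlike the /i
regexes above, it keeps the case of each letter. strtr() with an array matches
whole UTF-8 byte sequences, so multi-character replacements such as ß -> ss
fit in the same table.

```php
<?php
// Hypothetical helper: maps a small, illustrative set of accented
// characters to unaccented forms. Extend the table as needed.
function squash_accents(string $text): string
{
    static $map = [
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a',
        'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A', 'Å' => 'A',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E',
        'ó' => 'o', 'ö' => 'o', 'Ó' => 'O', 'Ö' => 'O',
        'ü' => 'u', 'Ü' => 'U',
        'ł' => 'l', 'Ł' => 'L',   // Latin Extended-A
        'ź' => 'z', 'Ź' => 'Z',
        'ß' => 'ss',              // one character becomes two
    ];

    // With an array argument, strtr() replaces each key with its value
    // in a single pass and leaves every other byte untouched.
    return strtr($text, $map);
}

echo squash_accents('Düsseldorf'), "\n"; // Dusseldorf
```

Run the same function over the indexed text and over each incoming query so
that both sides end up in the same unaccented form.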


On Sat, Oct 23, 2010 at 2:50 AM, Paul A Houle <paul at devonianfarm.com> wrote:

>  For my site at
>
> http://ookaboo.com/
>
> I'm running into the problem that people are searching for "Dusseldorf" but
> the name of the place is "Düsseldorf",  so they don't find it.
>
> It seems to me a good answer to this is to have some function that squashes
> accented characters down to unaccented forms.  I'd index the unaccented
> forms and also squash down queries so they'd always match up.  I definitely
> need to do both ISO-Latin-1 and Latin Extended-A, because fate has
> given me a lot of place names that have the Polish dark L in them (ł <http://fileformat.info/info/unicode/char/0142/>).
> It also seems like there are a lot of characters in Latin Extended-B that
> would also map plausibly to unaccented characters.
>
> I can see how to write something like this, I'd need to parse out the
> Unicode code points from UTF-8 and run them through a lookup table, but
> it's a lot of details and I wonder if anybody has written a PHP function to
> do this already.