[nycphp-talk] List of language/country names and codes

David Krings ramons at gmx.net
Sat Jan 5 23:07:57 EST 2008


John Campbell wrote:
>> After some extensive googling I found that the ISO codes are crap omitting
>> about 80% of the languages spoken in the world. That is why SIL created a new
>> list, which is obviously way larger and uses three letter codes. Both lists
>> are published and free to use (as far as I can tell), but both fail to list
>> the languages in their native names. So for someone who doesn't speak English
>> these lists are useless. In any case, the complete list is available here:
>> http://www.ethnologue.com/codes/
> 
> What are you trying to do?  If this is for translating software, just
> use ISO 639-1 (language) + ISO 3166-1 (country).  If you attempt to
> use ISO 639-3 it will be a huge mess.  ISO 639-1 may exclude 80% of
> the "known" languages, but it covers 99.99% of the computer using
> population.  It is also the format included in the "Accept-Language"
> HTTP header.  The problem with ISO 639-3 is that it includes things
> like 60 different Arabic dialects, but there is only one "software"
> version of Arabic.

For my current project I use language-dependent tables, or better said, some 
tables that are identical in structure exist multiple times, once for each 
language currently supported in the system. Right now I only have tables that 
each have a common name extended by "_de" and "_en_us" (subject to change). 
Each user account can select its preferred language, which leaves only the 
login page up to guessing based on what the browser sends (which in my case 
is sometimes inaccurate, as I want German but my browser reports that it uses 
English). To me it just doesn't make sense to show, for example, "Farsi" in 
the list, because Farsi doesn't use the Latin alphabet. I also don't think 
that listing all languages by their English names makes much sense, because 
that way someone speaking German would need to look under "G" rather than 
"D" for "Deutsch".
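
The login-page guess described above could be sketched roughly as follows. 
This is only an illustration in Python, not the code of the actual project: 
the SUPPORTED list mirrors the "_de" and "_en_us" suffixes mentioned above, 
while the function names and the "articles" table name are made up for the 
example.

```python
# Hypothetical list of supported table suffixes, taken from the "_de" and
# "_en_us" examples in the post. The order decides ties.
SUPPORTED = ["en_us", "de"]
DEFAULT = "en_us"


def parse_accept_language(header):
    """Return the header's language tags, lowercased, best quality first."""
    entries = []
    for position, part in enumerate(header.split(",")):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower().replace("-", "_")
        if not tag:
            continue
        quality = 1.0  # a tag without a q parameter counts as q=1
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        if quality > 0:  # q=0 means "not acceptable"
            entries.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(entries)]


def guess_suffix(header, supported=SUPPORTED, default=DEFAULT):
    """Pick the first supported suffix that matches a requested tag."""
    for tag in parse_accept_language(header):
        if tag in supported:
            return tag
        primary = tag.split("_")[0]
        for candidate in supported:
            if candidate.split("_")[0] == primary:
                return candidate
    return default


def table_name(base, suffix):
    """Build a per-language table name such as 'articles_de'."""
    return f"{base}_{suffix}"
```

With a header of "de-DE,de;q=0.9,en;q=0.5" the guess is "de", so the 
hypothetical table would be "articles_de". Since the browser setting can be 
wrong, a language picker on the login page should still be able to override 
the guess.
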
I fully understand that I won't get a full and comprehensive list, as there 
seems to be none, but I would like to base it on the SIL three-letter codes 
and fill in as many native names and scripts as possible. I did find a few 
lists, such as the one on Wikipedia: 
http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
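
A list keyed by three-letter codes, shown by native name where one is known, 
might look like the sketch below (Python, for illustration only). The 
handful of entries is sample data, not a complete list, and the None entry 
stands for a native name that has not been filled in yet. The sort is a 
plain code-point sort, so it ignores locale-specific collation rules.

```python
# code -> (English name, native name, or None if not filled in yet)
LANGUAGES = {
    "deu": ("German", "Deutsch"),
    "eng": ("English", "English"),
    "fra": ("French", "français"),
    "spa": ("Spanish", "español"),
    "fas": ("Persian", "فارسی"),
    "jpn": ("Japanese", "日本語"),
    "haw": ("Hawaiian", None),  # placeholder: native name still missing
}


def display_name(code, languages=LANGUAGES):
    """Prefer the native name and fall back to the English one."""
    english, native = languages[code]
    return native if native else english


def sorted_codes(languages=LANGUAGES):
    """Order codes by the name actually shown, so 'Deutsch' sorts under D."""
    return sorted(languages, key=lambda c: display_name(c, languages).casefold())


if __name__ == "__main__":
    for code in sorted_codes():
        print(code, display_name(code))
```

Names in non-Latin scripts end up after the Latin ones with this sort key, 
which may or may not be what a selection list wants.
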

Now, you bring up a very good point. Since this will all be displayed in a 
browser in the end, choosing something that isn't supported by the HTTP 
header doesn't help much. SIL is working on getting their list to be the new 
ISO list, but that isn't the case yet and who knows when that will happen. So 
I guess I will save myself a lot of useless work, keep in mind that the SIL 
list exists, and go ahead and use ISO 639-1 for the language codes. That list 
is really kludgy, as it needs ISO 3166-1 to make sense: for example, UK 
English is different from US English and different from Indian English. Not 
so much that someone able to read one couldn't read the other, but there may 
be different words in use. For example, how many words are there for a 
grinder? There is grinder, hero, sub, sandwich, hoagie, po'boy and a bunch of 
others that I can't recall. Also, the word "sneaker" strikes me as not being 
common around the Capital Region. Likewise, I never heard of a package store 
being called a beverage store, but people generally understand what is meant, 
unlike when talking to someone from California, where at least in some 
regions a package store is something like a UPS store or the post office. 
Also, in regard to German (de), there are German dialects in the US that have 
nothing to do with each other, such as Pennsylvania German and Texas German, 
neither of which is really close to the High German spoken in most parts of 
Germany, so how would one de_us then differ from the other de_us?
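
One common way to combine an ISO 639-1 language code with an ISO 3166-1 
country code is to keep a base set of strings per language and let a 
regional set override only the words that differ. The sketch below (Python, 
illustration only) assumes hypothetical string keys and sample catalogs. It 
does not solve the de_us problem raised above, since a language plus a 
country cannot tell two dialects within the same country apart.

```python
# Sample catalogs: base language first, regional overrides only where needed.
CATALOGS = {
    "en": {"greeting": "Hello", "colour_label": "Colour"},
    "en_us": {"colour_label": "Color"},
    "de": {"greeting": "Hallo", "colour_label": "Farbe"},
}


def fallback_chain(language, country=None):
    """Most specific tag first: 'en' + 'US' gives ['en_us', 'en']."""
    chain = []
    if country:
        chain.append(f"{language.lower()}_{country.lower()}")
    chain.append(language.lower())
    return chain


def translate(key, language, country=None, catalogs=CATALOGS):
    """Look the key up along the chain; return the key itself if nothing matches."""
    for tag in fallback_chain(language, country):
        text = catalogs.get(tag, {}).get(key)
        if text is not None:
            return text
    return key
```

With these catalogs, a user with en + US gets "Color", while en + IN falls 
back to the base "Colour" because no en_in catalog exists.
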
I think I have to scale back my lofty goals quite a bit and prepare for 
expansion later, as there are apparently a few things that are not likely to 
fall into place:
- there is no list of the kind I am looking for, most likely due to the 
problems that I ran into
- ISO 639-1 seems to be the one and only widely supported language code list, 
even though this list rapidly inhales
- the chance of finding all the native names and scripts and displaying them 
properly in Unicode is slim to none (not that Unicode can't do it, but I won't 
have the input that I need), which will leave my list largely incomplete 
pretty much forever
- even if I could compile a list of languages the way I want to, and while it 
may suit others who are free to use it later, getting translations for a 
language spoken by a few hundred people on a South Pacific island is unlikely

I didn't want to base my approach on who has a computer today and who 
doesn't. That would be shortsighted and arrogant, but I guess higher powers 
created a status quo that it doesn't benefit anyone to ignore. I will need to 
think about this a bit more and come up with a solution that allows for 
adding to, editing, and changing the list in use.
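
One possible shape for such an editable list is a small registry that can be 
changed at run time and stored as JSON. This is only a sketch in Python under 
assumed names (LanguageRegistry and its methods are invented for the 
example). A database table with the same three columns would work just as 
well.

```python
import json


class LanguageRegistry:
    """Editable code -> names mapping that round-trips through JSON."""

    def __init__(self, entries=None):
        self.entries = dict(entries or {})

    def set(self, code, english_name, native_name=None):
        """Add a language, or overwrite the entry if the code already exists."""
        self.entries[code.lower()] = {
            "english": english_name,
            "native": native_name,
        }

    def remove(self, code):
        """Drop a language; unknown codes are ignored."""
        self.entries.pop(code.lower(), None)

    def to_json(self):
        # ensure_ascii=False keeps native scripts readable in the stored text.
        return json.dumps(self.entries, ensure_ascii=False, indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        return cls(json.loads(text))
```

A native name that is unknown at first can be entered as None and filled in 
later with another call to set().
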

Damn, I suck at saving the world...hehehe.

Thanks for showing me the wall I was about to run into.

David


