I wrote this about a year and a half ago; I'm cleaning out my drafts folder.
There are reasons why one might think that Turkey should not be admitted to the European Union, but surely the silliest must be that Turkish is not an Indo-European language. Following Phersu, I can just imagine the consequences of taking this seriously. First, the Basque-speaking provinces of France and Spain leave the EU, along with Hungary, Finland, Estonia and Malta. But then, of course, India and Pakistan will submit rival applications to join, closely followed, no doubt, by the Iraqi Kurds. The whole idea is so stupid that I can't believe it was meant seriously, or even guess what Giscard d'Estaing thought "Indo-European" meant.
That said, Turkish does have features which are absent or attenuated in (most) Indo-European languages. (Disclaimer: I do not speak Turkish.) For instance, it's highly agglutinative, forming new words by adding suffixes to roots, and doing so recursively. (German does this too, but to nowhere near the same degree.) This leads to words like yapabilecekdiyseniz, "if you were going to be able to do". (Readers may amuse themselves by analyzing this example using the Turkish Suffix Dictionary.) Moreover, these words are not oddities, like "antidisestablishmentarianistic", but are in everyday use. I once heard a talk by a computational linguist specializing in Turkish (Gerjan van Schaaik, who oddly seems to have no web presence) where he mentioned that if one studied the corpus of Turkish daily newspapers, one could easily build a lexicon of 500,000 entries, and still cover only 95% of the words in the corpus. (I can't tell, from my notes, whether van Schaaik was talking about something that had actually been done, or just making a rough estimate.) This property of Turkish becomes very important for a number of technologies, including one without which the modern world would simply grind to a halt: spam filtering.
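To make the lexicon-coverage point concrete, here is a minimal sketch of how one might measure it: count the distinct word forms in a corpus, keep the k most frequent as the "lexicon", and ask what fraction of running words they account for. The function name and the toy corpus are my own inventions for illustration (a handful of inflected forms of "ev", "house"), not anything from van Schaaik's talk, and a real study would need proper tokenization of actual newspaper text.

```python
from collections import Counter


def coverage_by_lexicon_size(tokens, sizes):
    """Return {k: fraction of tokens covered by the k most frequent word forms}.

    Hypothetical helper for illustration; treats each distinct surface
    form as its own lexicon entry, with no morphological analysis.
    """
    if not tokens:
        raise ValueError("corpus is empty")
    counts = Counter(tokens)
    total = sum(counts.values())
    # Token counts per word form, most frequent first.
    freqs = sorted(counts.values(), reverse=True)
    return {k: sum(freqs[:k]) / total for k in sizes}


# Toy corpus (made up): 8 running words, 5 distinct surface forms of one root.
tokens = "ev ev ev evler evler evde evden evlerimizde".split()
coverage = coverage_by_lexicon_size(tokens, [1, 2, 5])
print(coverage)
```

In an agglutinative language every suffixed form counts as a new entry, so the curve this produces flattens out slowly: each additional lexicon entry buys only a little more coverage, which is the effect the 500,000-entry figure describes.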
Özgür et al. do not report on the ability of their classifiers to discriminate between spam and weirdly pseudo-learned pronouncements from former presidents of France.
Posted by crshalizi at July 16, 2006 04:59