Why is it difficult to build a Machine Translation system for Arabic Variations (Darja)?

06 Feb 2016

“Why Doesn’t Google Translate Support Arabic Variations?” [Draft]

Because. They. Are. Nothing. But. Dialects.

I’m a simplistic person and that’s my simplistic explanation, but if I were to fake being mature, I would go:

Disclaimer: I know enough to know I know nothing.

I shall be using Algerian Arabic (AKA Darja) as an example. Other Arabic variations and vernaculars are similar cases…

TALKING

Since they are just dialects, they have no agreed-upon set of rules, which makes them change at such a rapid pace that you can observe each generation having its own way of talking: different vocabulary and syntax, in addition to the ever-changing metaphors and expressions… Needless to say, Darja is way more irregular than Standard Arabic, and a great portion of it is hardly translatable.

Dialects are poor. I mean, they lack many words and meanings and cannot express them; that is why people code-switch to Standard Arabic (or French for that matter) to express some complex thoughts.

In her Master’s thesis, miss Ramdani reflects on Algerians’ habit of code-switching and heavy borrowing from French… What I’m getting at is this: in order to support Darja, you need to support both Arabic and French.

Each city, each town, each family, and each person seems to have their own tongue. Which one would Google Translate use?

Then should we support the way people talk in large cities like Algiers, Constantine, and Annaba? Supposing that all people of a given city talk the same way regardless of their socioeconomic status and level of education, would others be obligated to adopt their dialect? Will someone from the Sahara have to use Algiers’ talk so that a machine can handle it? That’s malice, tyranny! I’d rather die than be pretentious or adopt someone else’s accent! (sounds extreme~~~ but you get what I’m trying to say…)

Examine the following words. Noting that each one is linked to a distinct locality, which ones should be chosen to construct a sentence?

“Yes”: ih, ah, wah, hah, haa, hih, hey… “Mine”: Diali, ta3i, ti3i… “Where”: Wain, wiin, fain, fiin…

You cannot make a cocktail of them: Mixing random regional dialects would be weird and disturbing. You have to be consistent…

For instance, “Algerian Wikipedia” is disgusting! I find it challenging to follow because of the random regional dialects used in its articles. They are like incomprehensible, “instable”, insane, incoherent. I read a few pages before giving it up.

Arabic dialects are more or less like pidgins (mixture of langs) such as Spanglish (Spanish-English) and Singlish (Singaporean Chinese-English):

Will you next ask Google to add pidgins? If not, why support dialects?

WRITING

How would a computer “read” non-standard Arabic? Really, that’s an issue.

Arabic uses an Abjad writing system where vowels are normally omitted since they can be figured out from the context. In Standard Arabic, that’s relatively easy, you just need to know the basic grammar/syntax; but in non-standard Arabic, it’s like hell!
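To make the vowel problem concrete, here is a minimal Python sketch. It builds three fully vowelled Standard Arabic words from the same three consonants (my own illustrative picks, not taken from any corpus) and strips the diacritics, which is roughly what everyday writing does. The helper name `strip_diacritics` is made up for this example.

```python
import unicodedata

# Base letters and short-vowel marks, spelled out as code points to avoid
# any copy/paste ambiguity.
KAF, TA, BA = "\u0643", "\u062a", "\u0628"
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"

def strip_diacritics(text: str) -> str:
    """Drop combining marks (Arabic short vowels included), keep base letters."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Three different words, illustrative choices:
vowelled = {
    "kataba (he wrote)": KAF + FATHA + TA + FATHA + BA + FATHA,
    "kutiba (it was written)": KAF + DAMMA + TA + KASRA + BA + FATHA,
    "kutub (books)": KAF + DAMMA + TA + DAMMA + BA,
}

# Once the vowels are gone, all three collapse into one consonant skeleton.
skeletons = {strip_diacritics(word) for word in vowelled.values()}
print(len(skeletons), skeletons)
```

The reader (human or machine) has to put the vowels back from context. Standard Arabic grammar narrows the options; a dialect with no agreed-upon rules gives far less to lean on.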

Sure, vowels can be represented using diacritics, but really, who would use them? I mean, I have never seen anyone use them while writing dialects (as in social media and forums).

Sure, we may adopt the Latin alphabet, but then again, there’s no agreed-upon Romanization system: those who are into English would use English orthography; those who are into French would use French orthography; others may like to use numerals. (“Near”: qreeb, qrib, grib, 9riib, 9rib…)
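Here is a rough Python sketch of what "normalizing" those spellings of “near” could look like. The rules in `DIGIT_TO_LETTER` and the vowel-collapsing regex are assumptions I made up for this example; they are not any standard Romanization scheme.

```python
import re
from collections import defaultdict

# Hypothetical rules for this sketch only: numerals commonly stand in for
# Arabic letters that the Latin alphabet lacks.
DIGIT_TO_LETTER = {"9": "q", "3": "ʕ", "7": "ħ"}

def normalize(word: str) -> str:
    """Map one Latin-script spelling to a rough canonical key."""
    w = word.lower()
    w = "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in w)
    # English-style "ee" and doubled "ii" both stand for a long i.
    w = re.sub(r"ee|ii+", "i", w)
    return w

variants = ["qreeb", "qrib", "grib", "9riib", "9rib"]

groups = defaultdict(list)
for v in variants:
    groups[normalize(v)].append(v)

for key, members in groups.items():
    print(key, members)
```

Four of the five spellings fall under one key, but “grib” stays apart: it seems to reflect a different pronunciation, not just a different spelling, so no spelling rule can safely merge it. And every rule added to catch one more variant risks merging words that are actually different.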

An Algerian IT University published a study about the “Algerian Language”. It states that over 60 forms of “Insha’Allah” (“God Willing”) have been found while analyzing comments on an Algerian newspaper’s website. Over 60 ways to write the same word! Normalizing such writing is like impossible…
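For a feel of what normalizing would involve, here is a small Python sketch that groups spellings by surface similarity using the standard library's `difflib`. The spellings in `words` are my own illustrative examples, not the forms reported in that study, and the 0.75 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def clean(word: str) -> str:
    """Lowercase and drop spaces and apostrophes before comparing."""
    return "".join(ch for ch in word.lower() if ch.isalnum())

def similar(a: str, b: str, threshold: float = 0.75) -> bool:
    """True when two cleaned spellings look alike (threshold is arbitrary)."""
    return SequenceMatcher(None, clean(a), clean(b)).ratio() >= threshold

def cluster(words):
    """Greedy clustering: attach each word to the first cluster whose
    first member it resembles, else start a new cluster."""
    clusters = []
    for word in words:
        for group in clusters:
            if similar(group[0], word):
                group.append(word)
                break
        else:
            clusters.append([word])
    return clusters

# Illustrative spellings only (not the study's data), plus one unrelated word.
words = ["inchallah", "inshallah", "nchallah", "inchalah", "in sha allah", "qrib"]

for group in cluster(words):
    print(group)
```

This works on a handful of long, similar spellings. On real comments it gets ugly fast: short words that differ by one letter can be unrelated, the result depends on which spelling happens to come first, and there is no dictionary to say which form is the "right" one.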

Proper nouns are not capitalized which makes recognizing them tricky (whether the Arabic or the Latin script is used.) Okay, here are two sentences with no capitalization:

The words “george” and “orwell” don’t exist in English. Figure that out, and it’ll be plain as day that “george orwell” is a proper noun. The second sentence may be mistakenly translated to “sword studies in university.” You see, like the majority of Arabic names, “Saif” is an Arabic (and Darja) word: it means “sword” and is also used as a boys’ given name.
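The difference between the two cases can be sketched as a toy lookup in Python. Both `LEXICON` and `KNOWN_NAMES` are tiny made-up tables for illustration; a real system would need far larger resources, which is exactly what dialects lack.

```python
# Toy tables, entirely hypothetical, just to show the ambiguity.
LEXICON = {"saif": "sword"}                 # common-noun senses
KNOWN_NAMES = {"saif", "george", "orwell"}  # tokens that can be names

def readings(token: str):
    """List every reading a lowercase token could have."""
    t = token.lower()
    found = []
    if t in LEXICON:
        found.append(("common", LEXICON[t]))
    if t in KNOWN_NAMES:
        found.append(("name", t.capitalize()))
    return found or [("unknown", t)]

print(readings("george"))  # only a name reading: easy
print(readings("saif"))    # common word AND name: needs context to decide
```

With no capitalization to break the tie, a token like “saif” keeps both readings, and only the rest of the sentence can settle which one is meant.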

In short: sociolects/idiolects + spelled as pronounced = inconsistent orthography.

DATA

By “data” I mean corpora (structured records), like multilingual texts… sorta…? You know, Machine Translation uses different methods (algorithms) to translate from one lang to another:

As far as I know, unlike Standard/Classical Arabic, there are no corpora (plural of corpus) for Arabic variations. Standard/Classical Arabic has a huge literature (at least from the seventh century onwards) and many dictionaries (they are needed for Word Sense Disambiguation and stuff) while there’s no complete dictionary for dialects, AFAIK…

Crowd-sourcing is used to “train” and improve Google Translate. Arabs’ contribution is embarrassingly low. Imagine what including dialects would cause! That low percentage would be much low-low-lower. Here is why: To participate, you need to be good at more than one language. Those who fulfill this requirement are likely to be educated. And by “promoting” their dialects, that percentage needs to be divided by the number of Arab countries, I guess. That’s an awful outcome, if you ask me.

Dismiss the previous paragraph: Educated people would have no interest in working with dialects: “Why bother if information is already accessible?”, they might think… (As a side note, Arabs’ participation in Wikipedia is low, due to this logic among other things.)

YES PAIN, NO GAIN

As you can see, there is a lot of work: guessing vowels, proper nouns, translation…

Far from some people’s claims, Standard Arabic is generally understood (pay attention, I said UNDERSTOOD) by the public since it is used practically everywhere: in print (books, manuals, newspapers, etc.), in the media (traditional and online), and even in real life: mostly in school and in religious speeches/sermons. Eh, and, occasionally, people make it a game to communicate in it as it’s considered funny and Shakespearean…

In addition, I’ve talked with some Arabs from different countries (Egypt, Lebanon, Morocco, Iran…) and we understood each other most of the time (like, I speak Algerian and they speak Egyptian), and when we didn’t understand each other, we used Standard Arabic as an intermediate lang.

Dialects have no official status. Arabic is international.

I mean, why go through the trouble of adding new ones if the present one works fine?!

Conclusion

Eh, am I supposed to end my shit with some sorta conclusions? I mean, I started with the “NO TO DIALECTS” conclusion even before thinking of any reasons (or arguments) for why they should be “deserted”.

Probably there’s a counter-argument to everything I’ve said, but the Ultimate Proof that I’m wrong, is to make Google Translate process Arabic properly. Just then, we may consider adding other variations…

“Destructive Criticism” is tolerated, but being constructive would be nicer…