Learning Arabic can be hard because one word can have many meanings. For example, in Hans Wehr’s Arabic-German dictionary, the verb قامَ – يَقُومُ has over 40 different German meanings in just the basic form. However, the dictionary doesn’t tell you which meanings are most common or show you how to use the words in conversations or texts. A new tool developed by Mirko Vogel aims to address this issue and could raise the standard of how we work with modern Arabic dictionaries.
In this guest article, Mirko Vogel describes his idea and shows how Arabic learners and translators can use his tool to their advantage.
In a nutshell: What is Muraija (مُرَيْجَع) and how can it help people using Arabic?
➤ The Arabic language is vast and complex, similar to an ocean. Muraija offers a different approach to explore the language, as if you’re surfing.
➤ Grammar books and dictionaries are guides for using language correctly. Muraija, with over 50 million words, shows you how people really use Arabic.
➤ Understanding the words and the rules of a language doesn’t mean you really know the language. Muraija explains how Arabic words work together to make sense.
Motivation
Do you remember the first time you encountered the verb انتهك? Did the context provide enough information for you to deduce its meaning, as in the following sentence?
أشارت إلى أن الولايات المتحدة تنتهك القانون الدولي باستمرار وحملتها المسؤولية الكاملة لفشل المفاوضات.
Or did you look it up in a “classical” dictionary like Hans Wehr? After figuring out the root, finding the entry ن-ه-ك and jumping to the VIIIth stem, you’ll then have found a list of 22 English verbs that can be used to translate انتهك:
to waste, emaciate, enervate, exhaust (ه s.o.); to violate, abuse, defile, profane, desecrate (هـ s.th.); to infringe, violate (ه a law), trespass (هـ on rights, freedom of s.o.), offend against (ه s.o.); to rape, ravish (ها a woman); to insult, defame, malign, slander, abuse, brutalize (ه a man)
However, if translation was your primary objective, you might have opted for a translation memory like Reverso Context, which yields “violate”, “break” and “abuse”, or simply pasted the whole sentence into your favorite MT (machine translation) tool, which gives you a decent translation directly:
She noted that the United States is constantly violating international law and held it fully responsible for the failure of the negotiations.
The common drawback of these tools is that you spend most of your time focusing on English words, which is kind of suboptimal if your objective is to learn Arabic. This imbalance becomes even more pronounced if English is not your native language, leading to the need to translate the English translation once again.
So how about the obvious solution, monolingual dictionaries (Arabic <-> Arabic), at least for advanced students?
اِنْتَهَكَ – [ن هـ ك]. (ف: خما. متعد). اِنْتَهَكَ، يَنْتَهِكُ ، مص. اِنْتِهاكٌ. 1. “اِنْتَهَكَتْهُ الحُمَّى” : أَضْنَتْهُ وَجَهَدَتْهُ، أَضْعَفَتْهُ. 2. “اِنْتَهَكَ حُرْمَةَ مَنْزِلِهِ” : تَعَدَّى عَلَيْها، خَرَقَها بِما لا يَسْمَحُ بِهِ القانُونُ. 3. “اِنْتَهَكَتْ إِسْرائِيلُ الْمَسْجِدَ الأقْصَى” : دَنَّسَتْهُ. 4. “تَعَمَّدَ أَنْ يَنْتَهِكَ عِرْضَهُ أَمامَ الْمَلإِ” : أَنْ يُشينَهُ، أَنْ يَمَسَّ شَرَفَهُ. 5. “اِنْتَهَكَ عِرْضَهُ” : شَتَمَهُ.
Source: The entry “انتهك” in the Alghani Azzahir Arabic Dictionary from 2013
I have to admit that I rarely consult Arabic dictionaries, as even modern ones employ a style which I find difficult to process – and as a professional conference interpreter, I consider myself a fairly advanced learner. 🙂
However, by looking at the dictionary entry, we find many examples that show how انتهك can be combined with other words. This is relevant because words do not combine freely; there are standardized ways of expressing ideas. (For the nerds: That is what the famous corpus linguist Sinclair calls the “idiom principle”.)
You might be understood if you say كسّر قانونًا, instead of انتهك قانونًا, but it will require more effort from the listener; whatever you are going to say next is likely not to be taken seriously because linguistic competence is often equated with subject matter expertise.
Collocations. Knowing the company a word keeps
These standardized – and thus frequent – word combinations are called collocations. They are relevant not only for language production, but for language understanding as well. Part of the enormous progress conversational AI has made recently is due to the fact that it is possible to deduce the meaning of a word from its context alone, without any connection to some external world. Or, as J.R. Firth (an English linguist and a leading figure in British linguistics) already put it in the 1950s: “You shall know a word by the company it keeps.”
So, what company does the word انتهك keep? What are the objects this verb commonly takes?
The above word cloud shows the most frequent objects of انتهك, accounting for 66% of the occurrences of the verb. That is, in two out of three cases, one of the following is violated:
- a rule (yellow)
- a decision (green)
- a right (red)
- some country’s sovereignty (blue)
- something sacred (purple)
Given recent advancements in Arabic language processing, it is possible to automatically extract collocations from a large corpus and create a collocation dictionary for the Arabic language. A first version of this dictionary, called Muraija (مُرَيْجَع), is now available online.
You can either take a look immediately, or read on to understand the magic behind the scenes.
How does it work? The nerdy part
Note: This technical description is meant to be understandable by people who are neither computational linguists nor deep learning experts, and is thus necessarily fuzzy or even incorrect in the details.
Since Muraija aims to represent the Arabic language as it is used in practice, we must carefully select the corpus to serve as our starting point. The Internet contains many Arabic texts that were not originally written in Arabic but were translated, either by humans or machines, from other languages, mostly English.
These texts do not always reflect how native speakers use the language (some call it “translationese”), which is especially relevant when it comes to technical terms. For example, when preparing for an interpretation assignment on Power-to-X technologies, I read many Wikipedia articles about the subject to familiarize myself both with the technology and the terminology. The English article on carbon capture and storage links to the Arabic article التقاط وتخزين ثنائي أكسيد الكربون, but this term has almost no hits on Google. When looking at genuine Arabic sources, the expression احتجاز الكربون وتخزينه seems to be much more common.
Given the need for a substantial amount of data, using texts from Arabic newspapers seems like a good choice. As a starting point Muraija uses the el-Khair corpus, which consists of millions of articles from the following newspapers and news sites:
- Alittihad (الاتحاد الإماراتية), Emirates: http://www.alittihad.ae
- Echorouk Online (الشروق أون لاين), Algeria: http://www.echoroukonline.com/ara
- Alriyadh (الرياض), Saudi Arabia: http://www.alriyadh.com
- Alyaum (اليوم), Saudi Arabia: http://www.alyaum.com
- Tishreen (تشرين), Syria: http://www.tishreen.news.sy
- Alqabas (القبس), Kuwait: http://www.alqabas.com.kw
- Almustaqbal (المستقبل), Lebanon: http://www.almustaqbal.com
- Almasry Alyoum (المصري اليوم), Egypt: http://www.almasryalyoum.com
- youm7 (اليوم السابع), Egypt: http://www.youm7.com
- Saba News Agency (وكالة أنباء سبأ اليمنية), Yemen: http://www.sabanews.net
As a first step, every sentence undergoes morphological analysis, which includes diacritization, part-of-speech tagging and lemmatization. For the example sentence we started this blog post with, we get:
أَشارَت إِلَى أَنَّ الوِلاياتِ المُتَّحِدَةَ تَنْتَهِك القانُونَ الدُّوَلِيَّ بِاِسْتِمْرارٍ وَحَمَلَتْها المَسْؤُولِيَّةَ الكامِلَةَ لِفَشَلِ المُفاوَضاتِ .
Then a dependency parser analyses the sentence, yielding a parse tree, which specifies relations between words, e.g. that القانون is the object of the verb تنتهك. In this step, prefixes and suffixes like و and ها are treated as separate words.
For these steps, we use tools from the CAMeL Lab at New York University Abu Dhabi.
The last step is extracting collocations for predefined patterns:
- verb + (prep +) object: تَنْتَهِك القانُونَ / تَنْتَهِك بِاِسْتِمْرارٍ / حَمَلَتْ المَسْؤُولِيَّةَ
- noun + (prep +) noun: المَسْؤُولِيَّةَ لِفَشَلِ / فَشَلِ المُفاوَضاتِ
- noun + adjective: الوِلايات المُتَّحِدَةَ / القانُونَ الدُّوَلِيَّ / المَسْؤُولِيَّةَ الكامِلَةَ
The collocations are then stored in lemmatized form, allowing تَنْتَهِك القانُونَ and اِنْتَهَكُوا القَوانِين to represent the same collocation اِنْتَهَك قانُون. This has the (unwanted) side effect that gender agreement is lost for feminine nouns, e.g., مَسْؤُولِيَّة كامِل.
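For the nerds, here is a minimal Python sketch of the extraction step. It is not Muraija's actual code: the `Token` structure, the relation labels (`OBJ`, `MOD`) and the hand-written parse of a fragment of the example sentence are assumptions made for illustration. In a real pipeline these values would come from the analyzer and the parser.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Token:
    form: str   # surface form as it appears in the sentence
    lemma: str  # dictionary form produced by the analyzer
    pos: str    # coarse part of speech ("verb", "noun", "adj"); assumed labels
    head: int   # index of the governing token, -1 for the root
    rel: str    # relation to the head; the label set is assumed


# Hand-written parse of the fragment: تنتهك القانون الدولي
sentence = [
    Token("تنتهك", "اِنْتَهَك", "verb", -1, "ROOT"),
    Token("القانون", "قانُون", "noun", 0, "OBJ"),
    Token("الدولي", "دُوَلِيّ", "adj", 1, "MOD"),
]


def extract(tokens):
    """Yield (pattern, head lemma, dependent lemma) for two simple patterns."""
    for tok in tokens:
        if tok.head < 0:
            continue  # the root has no governing word
        head = tokens[tok.head]
        if head.pos == "verb" and tok.pos == "noun" and tok.rel == "OBJ":
            yield ("verb + object", head.lemma, tok.lemma)
        elif head.pos == "noun" and tok.pos == "adj" and tok.rel == "MOD":
            yield ("noun + adjective", head.lemma, tok.lemma)


# Counting lemma pairs makes differently inflected forms fall together.
counts = Counter(extract(sentence))
for (pattern, head_lemma, dep_lemma), n in counts.items():
    print(pattern, head_lemma, dep_lemma, n)
```

Because only lemmas are counted, the adjective ends up in its masculine dictionary form. That is exactly the loss of gender agreement mentioned above.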
Errors. Why Muraija is not Marja
This is a fully automatic procedure and errors can – and do – happen in every step. In our example, the word حملتها was incorrectly analyzed as وَحَمَلَتْها whereas it should be وَحَمَّلَتْها, leading to the extraction of the “verb + object” collocation حَمَل مَسْؤُولِيَّة. Even though this is a valid collocation, it is not contained in this sentence. (Unfortunately, this error is systematic, so currently you will only find تحميل مسؤولية but not حمّل مسؤولية in Muraija.)
While the parser did not make any mistake in our example, the collocation extraction algorithm failed to catch the “verb + subject” collocation تَنْتَهِك الوِلاياتِ because these words form not a verbal but a nominal phrase (اسم أن وخبر أن).
Both the analyzer and the parser use deep learning models (call them AI if you want 🙂) which have been trained mainly on Modern Standard Arabic, with a focus on news. These models perform less well on “older” Arabic or literary texts, which is why Muraija is currently based on a contemporary news corpus.
But even on this “easy” corpus, the analyzer sometimes confuses the first, the second and the fourth stem, like حَمَل مَسْؤُولِيَّة as we have seen above, or صَمْت مُطَبَّق instead of صَمْت مُطْبِق (which means “complete silence”). This happens because the difference is invisible in undiacritized texts – which constitute the large majority of the training data for the models.
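A few lines of Python make this ambiguity tangible. The sketch below only removes the diacritics in the Unicode range U+064B to U+0652 (tanwin, short vowels, shadda, sukun). The function name is my own, and this is not part of the analyzer.

```python
import re

# Tanwin, fatha, damma, kasra, shadda and sukun (U+064B to U+0652)
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Remove the diacritics that are usually omitted in newspaper text."""
    return DIACRITICS.sub("", text)


stem_1 = "حَمَلَ"   # first stem, "to carry"
stem_2 = "حَمَّلَ"  # second stem, "to burden"

# Without diacritics, both stems are written with the same three letters.
print(strip_diacritics(stem_1) == strip_diacritics(stem_2))
```

Since both verbs look identical on the page, a model has nothing but the context to decide between them.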
Considering these shortcomings, would it not be preferable to wait until we have better models? I think that the desire to build something as flawless as the sacred language of the Quran is partially responsible for the lack of tools for Arabic. And that the rumor of this language being almost impossible to learn stems less from its complex morphology than from this scarcity. Therefore, instead of waiting for the development of a perfect tool in the future, let’s create something useful now and improve upon it iteratively. And, in order to do some expectation management, let’s not pretend it’s already an awe-inspiring authority in the field of Arabic language, a real “Marja”. We call it “Muraija”, which is the diminutive (صيغة التصغير) of “Marja”, like … “baby authority”. 🙂
The Arabic language is very often referred to as a coast-less ocean, and although the sea of collocations extracted by the procedure described above definitely has coasts, it is nevertheless so large (about 11 million collocations) that going swimming on your own is not recommended. So we need a decent vessel, that is, a user interface.
The word انتهك we started this blog post with appears about 5000 times in the corpus, and is contained in about 800 collocations. Limiting ourselves to the pattern “verb + object”, we still have more than 200 collocations – too many to display them all. So we display only the ten most frequent ones, which has two drawbacks:
- There are frequent collocations which are boring like اِنْتَهَك كُلّ (e.g. ينتهك كل اتفاقيات السلام)
- There are a lot of interesting collocations which are less frequent, like انتهك جو (e.g. الطائرات تنتهك الأجواء اللبنانية) or even very infrequent, like انتهك ملكية (e.g. قانون تنتهك الملكية الفكرية)
There is nothing we can do about the fact that most collocations are infrequent (a linguistic phenomenon known as Zipf’s law), but we can do better than hiding them all. The idea is to group collocations semantically, so that all of
/ انتهك قانُون / دُسْتُور / اِتِّفاقِيَّة / اِتِّفاق / مُعاهَدَة / عُرْف
قاعِدَة / مَبْدَأ / مِيثاق / مِعْيار / لائِحَة / مادَّة / شَرْط
can be displayed together.
This grouping is fairly subjective: Here I put all “written or unwritten rules” together, whereas you might, for example, prefer to have a separate group for “written agreements”. Currently, this grouping process is purely manual, but might be partially automated in the future.
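The effect of grouping can be sketched in Python. All numbers and group titles below are invented for illustration and are not taken from the corpus. The function name `group_totals` is likewise my own and says nothing about how Muraija stores its groups.

```python
# Invented counts for some objects of انتهك (illustration only)
object_counts = {
    "قانون": 900,
    "دستور": 120,
    "اتفاقية": 80,
    "حق": 400,
    "حرية": 60,
    "سيادة": 150,
}

# A manual grouping; the titles are made up
groups = {
    "rules": ["قانون", "دستور", "اتفاقية"],
    "rights": ["حق", "حرية"],
    "sovereignty": ["سيادة"],
}


def group_totals(counts, groups):
    """Sum the member counts per group and sort the groups by total."""
    totals = {
        title: sum(counts.get(member, 0) for member in members)
        for title, members in groups.items()
    }
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


for title, total in group_totals(object_counts, groups).items():
    # Within a group, show the frequent members first.
    members = sorted(groups[title], key=lambda m: object_counts.get(m, 0), reverse=True)
    print(title, total, " / ".join(members))
```

A rare object is then displayed next to its frequent relatives instead of disappearing below a top-ten cut-off.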
Before coming back to the metaphor of the ocean and the need for a decent vessel, I’d like to share a story from when I was studying Arabic in Beirut, partially providing the motivational background for Muraija:
I was sitting in a café, reading a book, when a stranger approached me, asking me if I could really read Arabic. At first, I gave him the usual response, joking that I was only looking at the images. Of course, there were no images in the book … Then I told him that I was actually studying Arabic. After a brief moment of surprise, he started laughing. He declared that I wasn’t properly dressed for this discipline, joking that I was missing the suspenders and the thick glasses … and why was I smiling anyway? Studying Arabic is not fun!
While I agree that learning is not fun all the time, it should at least be enjoyable often. Therefore I wanted to build a tool not like a majestic tall ship, sitting low in the water, but like a surfboard, light and agile. That’s why Muraija’s database is designed to be fast. If you have a decent internet connection, collocations and examples should load within a fraction of a second. So while waiting for the bus, you can make the following maritime trip:
1st stop: انتهك
Besides قانون, what can be “violated”? حَقّ (a right), حُرِّيَّة (a freedom), قاعِدَة (a principle), قَرار (a decision), اِتِّفاقِيَّة (an agreement) – among others. Let’s have a look at اتفاقية!
2nd stop: اتفاقية
Interesting; you can use the adjective ثنائي to express the idea of a bilateral agreement. (Side track: What else can be ثنائي? Well: علاقة and تعاون take up the lion’s share.)
Which verbs can be used to express the idea of making an agreement? وَقَّع, أَبرَم, أَقَرّ and عَقَد – among others. Let’s investigate وقّع!
3rd stop: وقّع
You can وَقَّع a مذكرة, too. Let’s look into that!
4th stop: مذكرة
Ah, you can say مذكرة تفاهم for “memorandum of understanding”! And there is both مذكرة توقيف and مذكرة اعتقال for “warrant of arrest”.
Not only a surfboard
Even if surfing the Arabic language the way described above is the primary use case of Muraija, it works fairly well as a search engine, (partially) answering the question “Can I say this?”.
… where the last answer is obviously wrong. At least during the Arab Spring you could be fairly sure when reading the verb اندلعت that it was a revolution that broke out – and not a war or a fire. So a collocation unknown to Muraija might not only be used, but even be used very frequently. In case of doubt, you can check the corresponding “noun + noun” collocation: اندلاع الثورة is found more than two thousand times.
In the above examples we searched for complete collocations, but you do not have to. If you type at least two words, Muraija will list the most frequent collocations containing these words, which is especially helpful if you are unsure about prepositions: صادق قانون yields صادَق عَلَى قانُون, and التزم صمت yields both اِلْتَزَم صَمْت and اِلْتَزَم بِ صَمْت, of which the former is much more frequent.
Muraija searches both lemmas and inflected forms, so يعترضون على الاقتراحات will find the collocation اِعْتَرَض عَلَى اِقْتِراح, but it is currently unable to handle the attached prefixes ل and ب. So تقدم باقتراح will not return any result; you have to search for تقدم ب اقتراح.
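One conceivable way to deal with attached prefixes is to try additional query variants in which a leading ب or ل is split off. The Python sketch below is purely hypothetical and is not how Muraija works: the tiny `INDEX`, the function names and the matching rule are assumptions, and lemmatization of the query is left out entirely.

```python
# Tiny stand-in for the collocation index: tuples of undiacritized lemmas
INDEX = {
    ("تقدم", "ب", "اقتراح"),
    ("اعترض", "على", "اقتراح"),
    ("صادق", "على", "قانون"),
}

PROCLITICS = ("ب", "ل")


def query_variants(words):
    """Return the query as typed, plus variants with one leading ب/ل split off."""
    variants = [list(words)]
    for i, word in enumerate(words):
        if len(word) > 2 and word[0] in PROCLITICS:
            variants.append(words[:i] + [word[0], word[1:]] + words[i + 1:])
    return variants


def search(query):
    """Find all collocations that contain every word of some query variant."""
    hits = set()
    for variant in query_variants(query.split()):
        for collocation in INDEX:
            if all(word in collocation for word in variant):
                hits.add(collocation)
    return hits


print(search("تقدم باقتراح"))
print(search("صادق قانون"))
```

The splitting is deliberately naive. A word that merely begins with ب or ل is split as well, which is harmless here because the resulting variant simply finds nothing, but a real implementation would rather rely on the morphological analyzer.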
Not only for students
The question “Can I say this?” does not occur to language learners only, but to native speakers as well. I happen to ask myself this question quite often when translating from Arabic into German, which is my mother tongue. Am I expressing the idea in a natural way, like a native speaker spontaneously would do? Or am I producing some kind of translationese? Therefore, I imagine that at least in a more mature state Muraija would be useful for native speakers of Arabic, too.
Another use case is preparation for simultaneous interpretation, where you not only need to know how to express a given idea in the target language, but doing so should also require as little mental effort as possible. Otherwise you won’t have enough mental capacity to listen and talk at the same time. Both common sense and academic research (Gile’s “Gravitational Model of Linguistic Availability”) tell us that we can recall a word or an expression more easily if we have used it recently.
Thus reviewing the “linguistic material” relevant to an upcoming interpretation assignment is a common preparation technique for interpreters. So if I am to interpret a discussion about implementing international agreements on tax evasion into national law, قانُون, اِتِّفاقِيَّة and ضَرِيبَة would be good starting points. (But since Muraija does not contain any specialized legal corpora yet, it should not be your only source of preparation! 🙂)
Try it out … and contribute
In the past, only a small group of people had access to Muraija; the current version (0.9) is the first public release. You are warmly invited to try it out – with the adventurous spirit of a (maritime) explorer :-).
I would highly appreciate your feedback by email or via the feedback form, either in English or in Arabic. Furthermore, in addition to giving feedback, you can contribute by reviewing examples and – even more important – grouping collocations.
A good starting point would be to group possible objects for a given verb. So you could search for your favorite one and click on تصنيف المتواردات at the bottom of the box فعل + مفعول به. (You might want to log in with your Google account – or create a new one – by clicking on the door symbol on the top left. Then you get the credits for your contributions … and we might add a hall of fame later on 🙂)
After having picked at least one good example for a collocation, you can group it together with other collocations having the same or a similar meaning. You can create as many groups as you want, optionally give them titles and reorder them. Sometimes it makes sense to hide boring collocations by moving them to a special group and making it invisible by clicking on the eye symbol.
If you are courageous, you can pick your favorite noun and do this grouping exercise there – which is much more work because there are a lot of collocation patterns. The entry قانُون is a good example of how the result can look.
Last but not least, in addition to giving feedback and grouping collocations, please spread the word about Muraija :-)