Monday, December 21, 2015

Qur'anic Arabic Corpus - Arabic Treebank by Dr. Kais Dukes


According to Dr. Kais Dukes, at the end of 2014 the Qur'anic Arabic Corpus (http://corpus.quran.com) was being used by over 5 million people a year from 165 different countries. The map below was taken from the under-construction version 0.5 at www.arabictreebank.org, which Dr. Kais Dukes has since taken down.


World map of users of the Qur'anic Arabic Corpus, provided by Google Analytics. Countries with the highest number of users are shaded in darker blue.

At the end of 2014, I was discussing various suggestions with Dr. Kais Dukes. The ones marked in green had been implemented on www.arabictreebank.org, which hosted the under-construction version 0.5; unfortunately, that site is no longer accessible.

I list them again below, some with slight modifications, so that whenever work on version 0.5 restarts, any volunteer can try to get these suggestions implemented.

1. Also link all roots in the Qur'anic Arabic Corpus to Arabic Almanac.
http://ejtaal.net/aa/#bwq=
followed by the root in Buckwalter English-equivalent letters:
A b t v j H x d * r z s $ S D T Z E g f q k l m n h w y
ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي
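
The link construction in suggestion 1 can be sketched roughly as follows. This is only a minimal Python sketch, assuming that the Buckwalter form of the root is simply appended to the Arabic Almanac URL given above. The function names are hypothetical, and the table covers only the 28 letters listed above, so roots written with hamza variants would need extra entries.

```python
# Letter table copied from the two lines above (Buckwalter <-> Arabic).
BUCKWALTER = "A b t v j H x d * r z s $ S D T Z E g f q k l m n h w y".split()
ARABIC = "ا ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي".split()

ARABIC_TO_BUCKWALTER = dict(zip(ARABIC, BUCKWALTER))

# Base URL from the suggestion; appending the root to it is an assumption.
ALMANAC_BASE = "http://ejtaal.net/aa/#bwq="


def root_to_buckwalter(root: str) -> str:
    """Transliterate an Arabic root letter by letter (hypothetical helper)."""
    try:
        return "".join(ARABIC_TO_BUCKWALTER[letter] for letter in root)
    except KeyError as err:
        raise ValueError(f"letter not in the 28-letter table: {err}") from None


def almanac_link(root: str) -> str:
    """Build an Arabic Almanac link for a root given in Arabic letters."""
    return ALMANAC_BASE + root_to_buckwalter(root)


# Root of "furqan", as used in suggestion 6 below.
print(almanac_link("فرق"))  # prints http://ejtaal.net/aa/#bwq=frq
```

Characters such as * and $ are left unencoded here; whether the target site expects them escaped would have to be checked.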

2. Add relevant images for Al Airaab ul Mufassal by Salih with every Ayat page
https://archive.org/details/ErabMufassal

3. Add relevant images for I'raab ul Qur'an il Kareem by Dr Mahmood Sulayman Yaqoot
This book gives a word-by-word syntactic analysis.
https://archive.org/details/waq110411

4. Add images for Madinah Mushaf with tajweed coloring with every Ayat page.
https://skydrive.live.com/?cid=70F882FE8CA92D5E&id=70F882FE8CA92D5E!899
Suggested location: a joint page containing Recitation, embedded images, and translations. 

5. Add patterns data for lemmas in the 1st stage and for individual words in the 2nd stage.
Use the patterns data from https://revivearabic.blogspot.com/2015/09/quran-concordance-roots-patterns-letters-synonyms.html . In the 1st stage, a beta lemma table can be created by copying all the patterns data from there, so that the site has 2 lemma tables: the original one already present, and a 2nd one based on the concordance data, presented as a beta test version only. It can then be evaluated further before it replaces the original lemmas table.
Note: For broken plural patterns, the singular pattern must also be identified and written down, as already done in the concordance documents above. Sorting should also be by the singular, since the same broken plural pattern can correspond to multiple possible singular patterns.

6. Add quick navigation boxes:
i. For Aayaat. 
ii. For words (the quick navigation box for Aayaat should also accept word locations, e.g. inputting 1.1.1 or 1:1:1 should take the user to the detailed page for word 1:1:1)
iii. For roots. Search quickly in the box using Buckwalter-equivalent English letters or Arabic letters. To look up the root of furqan, just type frq in the root box and go to the dictionary immediately from any page of the corpus.
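
The input handling for the Aayaat and word boxes in suggestion 6 could look something like the sketch below. It is only an illustration in Python with hypothetical function names: it accepts both the dotted and the colon form, and normalises them to the colon form used in word links such as the one in suggestion 14. It does not check that the chapter, verse and word numbers actually exist in the Qur'an.

```python
import re

# Accept "1.1.1" or "1:1:1" (chapter, verse, word) and "1:1" (chapter, verse).
LOCATION = re.compile(r"^\s*(\d+)[.:](\d+)(?:[.:](\d+))?\s*$")


def parse_location(text: str):
    """Return (chapter, verse, word); word is None for a verse-only input."""
    match = LOCATION.match(text)
    if match is None:
        raise ValueError(f"not a chapter:verse[:word] location: {text!r}")
    chapter, verse, word = match.groups()
    return int(chapter), int(verse), (int(word) if word else None)


def canonical(text: str) -> str:
    """Normalise any accepted input to the colon form, e.g. '1:1:1'."""
    parts = parse_location(text)
    return ":".join(str(part) for part in parts if part is not None)


print(canonical("1.1.1"))    # prints 1:1:1
print(canonical("12:56:2"))  # prints 12:56:2
```

A navigation box could then send a three-part result to the word page and a two-part result to the Ayat page.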

7. Allow complete Qur'anic Aayaat to be shown in Arabic in the Qur'an Dictionary.

8. Add particles data. The site should allow viewing all instances and the concordance of all particles, including 1-letter particles. Particles that are composed of sub-particles should also show the sub-particles data. Just as words are broken down into roots and patterns, some particles can be broken down into sub-particles.

9. Add all proper nouns in Dictionary. 

10. Add a 4th column, Root and Pattern, in word by word.

11. For morphological search, add further search options after studying the tables given in Statistical Parsing by Machine Learning from a Classical Arabic Treebank, and also add pattern-based search. Allow Arabic input in addition to Buckwalter.

12. Add a roots column to the lemmas table and give a sort-by-root option for lemmas. In the verbs table, give a sort-by-verb-form option.

13. Check the dictionary for any missing lemmas, prefixes and suffixes, and add them after preparing a spreadsheet of any missed words or particles.

14. Check lemmas with suffixes, as some splits of segments and suffixes are causing problems, especially with tashdeed, e.g. http://arabictreebank.org/word?id=12:56:2

15. The automatic Treebank rules should allow exceptions in cases where an editor or expert feels that the computer's suggestion is wrong due to an exceptional case. The editor should be able to override the computer rules in such cases, and such cases may be highlighted for future review by other experts.

16. Alongside the Arabic terms in the Syntactic Treebank, also add English terms.

Long Term Ideas and Suggestions dependent on secondary tool development.
1. Link each Ayat to the following quick and detailed study tools once they are prepared:
Qur'anic Tafaaseer and studies tool:
 

Similar to Arabic Almanac:
1. Containing page images for all books.
2. With an index in which each page image is named after the last Ayat covered on that page, e.g. 001.001 meaning Surah 1, Ayat 1.
3. Searchable by inputs like 001.002, 104.001, 002.200 etc., with all books added in image format like Arabic Almanac.
4. Each Ayat can then be linked directly from the Qur'anic Arabic Corpus and other study sites.
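
The index scheme in points 2 and 3 can be sketched as follows. This is only a rough Python illustration with hypothetical names: since each page image is named after the last Ayat it covers, the page for a searched Ayat is the first page whose name is equal to or greater than the search key. The page names in the example are made up for illustration and do not come from any real book.

```python
from bisect import bisect_left


def ayah_key(surah: int, ayah: int) -> str:
    """Format a zero-padded key such as '001.001' for Surah 1, Ayat 1."""
    return f"{surah:03d}.{ayah:03d}"


def find_page(page_keys, surah: int, ayah: int):
    """Return the name of the page image covering the given Ayat.

    page_keys must be sorted; zero padding makes string order match
    numeric order. Returns None if the Ayat lies beyond the last page.
    """
    index = bisect_left(page_keys, ayah_key(surah, ayah))
    return page_keys[index] if index < len(page_keys) else None


# Hypothetical page names for one book, each named after its last Ayat.
pages = ["001.007", "002.005", "002.016", "002.024"]

print(find_page(pages, 1, 1))  # prints 001.007
print(find_page(pages, 2, 6))  # prints 002.016
```

With one such sorted list per book, a single Surah and Ayat input can be resolved to one page image in every book.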

i. Arabic to Arabic only tool:
An Arabic tool containing all important books dealing with Qur'anic morphology, Qur'anic syntax, Qur'an tafsir and other areas of Qur'anic studies. All of these can be studied instantly and together just by inputting the Surah number and Ayat number, which displays the relevant pages from all these books together for that Ayat, like Arabic Almanac.

ii. Other languages tool:
i. A Word for Word Meaning of The Qur'an by Muhammad Mohar Ali
ii. Urdu: Anwar ul Bayan fi Hal Lughat ul Qur'an by Ali Muhammad
iii. Tafsir al Qurtuby Urdu Translation
iv. In the shade of The Qur'an by Sayyid Qutb
v. Tafheem ul Qur'an English by Maududi
vi. Maariful Qur'an
vii. Tafsir ibn e Kathir

Qur'an Ayat Navigator PDFs help in achieving some of the targets of the Qur'anic Tafaaseer and studies tool, but tools like Arabic Almanac are the best option.

Appendix 1:
Dr. Kais Dukes has recently co-founded Hiper Fabric, and his latest public activities can be tracked at:
https://github.com/kaisdukes?tab=activity
His web address and email address can also be found via the above link.
I recommend that as many people as possible contact him and suggest that he resume work on the Qur'anic Arabic Corpus.
His Google Scholar citations page, his LinkedIn page

Appendix 2:
In case Dr. Kais Dukes cannot update the site himself, I suggested a plan B, i.e. a funded project in which full-time experts are hired and paid via visitor donations. These experts can work full time on the project to complete, verify and expand its scope. The project can easily be completed this way within a year.
1. Discuss with potential experts and estimate the cost. 3 to 5 experts can be dedicated, including Dr. Kais Dukes himself or 1 expert programmer, and 2 to 4 Classical Arabic experts or even more.
2. Set a donation target on the home page of the Qur'anic Arabic Corpus. Track it as a percentage, along with the total cost, received amount and remaining amount. Give donation options on the home page. Better still, add a donation widget displayed at the top of every page, like Wikipedia and Archive.
3. Hire the experts full time for a set work target and time.
4. Once donations start being received, start work in parallel using the full-time experts.
5. Discuss ideas and suggestions and implement the useful ones in the next version.
6. Extend the funding method to implement other useful projects for the mass benefit of Islam and Muslims.

Dr. Kais Dukes can discuss this plan with others and improve it further. The project should be completed within a year. The last update, i.e. version 0.4, was released on 1 May 2011.

For any updates until work is resumed on the current plan or plan B, you may contact Dr. Kais Dukes directly.

Appendix 3:
Keep checking tweets on https://twitter.com/asimiqbal2nd for updates on QAC related issues.

Appendix 4:

A user guide to help new volunteers and annotators understand how they can help improve the Qur'anic Arabic Corpus.

1. For now, volunteers should download Dr. Kais Dukes' PhD thesis:
Statistical Parsing by Machine Learning from a Classical Arabic Treebank
and study Chapters 3, 5 and 6 in particular detail.
Read his thesis online:
2. Study and keep the following important tables as reference:
i. PDF page number 107, printed page number 88: Part-of-speech tags for Classical Arabic

ii. PDF page number 117, printed page number 98: Morphological feature tags for Classical Arabic

iii. PDF page number 119, printed page number 100: Morphological segmentation rules for Classical Arabic

iv. PDF page number 142, printed page number 123: Dependency relations for Classical Arabic

v. PDF page number 145, printed page number 126


vi. PDF page number 154, printed page number 135



3. Study Buckwalter Extended Transliteration:
http://corpus.quran.com/java/buckwalter.jsp