Assessing Copilot’s Semantic Depth in Classical Arabic: A Mixed-Methods Evaluation Using Alfiyah ibn Malik and Nadham Al-Imrithy
DOI: https://doi.org/10.18326/lisania.v9i1.41-59
Keywords: Copilot’s performance, semantic interpretations, Arabic translations, METEOR scores, AI’s contextual depth
Abstract
Windows users have readily embraced Copilot, some even trusting it with translation projects, yet few would accept its accuracy in cross-language explanation on the strength of the developer's claims alone. This research therefore tested Copilot in a manner distinct from earlier assessments. The researchers evaluated how accurately Copilot interpreted and understood the advanced Arabic texts of two intricate works, Alfiyah ibn Malik and Nadham Al-Imrithy. The aim was to identify Copilot’s strengths and weaknesses in three areas: literal accuracy, terminological-analogical mastery, and contextual depth. Using a mixed-methods approach under the Collect-Measure-Repeat (CMR) framework of Responsible AI, the researchers combined qualitative performance assessments by three experts with quantitative evaluations using METEOR (Metric for Evaluation of Translation with Explicit Ordering). The results showed that Copilot had no difficulty comprehending and translating simple Arabic commands, especially word-for-word, but struggled with contextual understanding in many of the complex texts and produced numerous inconsistencies when the instructions were vague. Across the iterative phases, Copilot also showed signs of context saturation. The study concludes that, while Copilot is competent enough to attempt the challenging task of interpreting complex linguistic structures, it still needs human assistance and cross-referencing.
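To make the quantitative side of the evaluation more concrete, the following is a minimal sketch of how a METEOR-style score can be computed for one candidate translation against one reference. It is not the configuration used in the study: it matches exact tokens only (no stemming or synonym stages), aligns tokens greedily rather than searching for the alignment with the fewest chunks, and the function name `simple_meteor`, the parameter values `alpha=0.9`, `beta=3.0`, `gamma=0.5`, and the sample sentences are all assumptions chosen for illustration.

```python
def simple_meteor(hypothesis: str, reference: str,
                  alpha: float = 0.9, beta: float = 3.0,
                  gamma: float = 0.5) -> float:
    """Simplified METEOR-style score using exact unigram matches only.

    Assumptions (not taken from the article): whitespace tokenization,
    greedy left-to-right alignment, and illustrative parameter values.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0

    # Greedily align each hypothesis token to the first unused
    # identical reference token.
    used = [False] * len(ref)
    alignment = []  # list of (hypothesis_index, reference_index)
    for i, token in enumerate(hyp):
        for j, ref_token in enumerate(ref):
            if not used[j] and token == ref_token:
                used[j] = True
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Recall-weighted harmonic mean of precision and recall.
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # Count chunks: runs of matches that are adjacent and in the same
    # order in both the hypothesis and the reference.
    chunks = 1
    for (prev_i, prev_j), (cur_i, cur_j) in zip(alignment, alignment[1:]):
        if cur_i != prev_i + 1 or cur_j != prev_j + 1:
            chunks += 1

    # Fragmentation penalty: more chunks means worse word order.
    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


if __name__ == "__main__":
    # Hypothetical sentences, not drawn from the study's data.
    reference = "speech is a meaningful composed utterance"
    candidate = "speech is a composed meaningful utterance"
    print(round(simple_meteor(candidate, reference), 3))
```

Because the fragmentation penalty grows with the number of chunks, a translation that contains the right words in a scrambled order scores lower than one that preserves the reference ordering. Even so, a score of this kind measures surface overlap with a reference, which is why the study pairs it with expert judgments of contextual depth.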
License
Copyright (c) 2025 Nely Rahmawati Zaimah, Syamsul Hadi, Chafidloh Rizqiyah, Risty Kamila Wening Estu, Akhmad Roja Badrus Zaman

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.