Assessing Copilot’s Semantic Depth in Classical Arabic: A Mixed-Methods Evaluation Using Alfiyah ibn Malik and Nadham Al-Imrithy

Nely Rahmawati Zaimah; Syamsul Hadi; Chafidloh Rizqiyah; Risty Kamila Wening Estu; Akhmad Roja Badrus Zaman

doi:10.18326/lisania.v9i1.41-59

Authors

Nely Rahmawati Zaimah, Sekolah Tinggi Agama Islam Al-Anwar Rembang, https://orcid.org/0009-0002-4580-7222
Syamsul Hadi, Sekolah Tinggi Agama Islam Al-Anwar Rembang
Chafidloh Rizqiyah, Sekolah Tinggi Agama Islam Subang
Risty Kamila Wening Estu, Sekolah Tinggi Agama Islam Al-Anwar Rembang
Akhmad Roja Badrus Zaman, Albert-Ludwigs-Universität Freiburg

Keywords:

Copilot’s performance, semantic interpretations, Arabic translations, METEOR scores, AI’s contextual depth

Abstract

It was rather surprising that Windows users readily embraced Copilot, even trusting it with translation projects. Few users, surely, would trust the accuracy of its cross-language explanations solely on the basis of the developer’s claims. This research therefore tested Copilot in a manner distinct from other assessments: the researchers evaluated how accurately it interpreted and understood the advanced Arabic of two intricate works, Alfiyah ibn Malik and Nadham Al-Imrithy. The aim was to identify Copilot’s strengths and weaknesses in literal accuracy, terminological-analogical mastery, and contextual depth. Using a mixed-methods approach under the Collect-Measure-Repeat (CMR) framework of Responsible AI, the researchers combined qualitative performance assessments by three experts with quantitative evaluations using METEOR (Metric for Evaluation of Translation with Explicit Ordering). The results showed that although Copilot had no difficulty comprehending and translating simple Arabic commands, especially word for word, it struggled with contextual understanding in many of the complex texts and produced numerous inconsistencies when the instructions were vague. Its performance also degraded under context saturation during the iterative phases. The study concludes that, while Copilot is competent enough to attempt the challenging task of interpreting complex linguistic structures, it still needs human assistance and cross-referencing.
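For readers unfamiliar with METEOR, the following is a minimal sketch of the idea behind the metric, not the scoring pipeline used in this study. It assumes exact-match unigrams only (no stemming, synonym, or paraphrase matching), a greedy left-to-right alignment rather than the chunk-minimizing alignment of the full metric, and the classic parameter choices (recall weighted nine times precision, penalty 0.5 × (chunks / matches)³). The function name `simple_meteor` and the example sentences are hypothetical.

```python
def simple_meteor(reference: str, hypothesis: str,
                  alpha: float = 0.9, beta: float = 3.0, gamma: float = 0.5) -> float:
    """Simplified METEOR-style score: exact unigram matches plus a fragmentation penalty."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # Greedy alignment (an assumption of this sketch): each hypothesis token is
    # paired with the first still-unused identical reference token.
    used = [False] * len(ref)
    alignment = []  # list of (hypothesis_index, reference_index)
    for i, token in enumerate(hyp):
        for j, ref_token in enumerate(ref):
            if not used[j] and ref_token == token:
                used[j] = True
                alignment.append((i, j))
                break

    matches = len(alignment)
    if matches == 0:
        return 0.0

    precision = matches / len(hyp)
    recall = matches / len(ref)
    # Weighted harmonic mean; alpha = 0.9 gives 10PR / (R + 9P).
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)

    # A chunk is a run of matches that is contiguous in both sentences.
    chunks = 1
    for (i0, j0), (i1, j1) in zip(alignment, alignment[1:]):
        if not (i1 == i0 + 1 and j1 == j0 + 1):
            chunks += 1

    penalty = gamma * (chunks / matches) ** beta
    return f_mean * (1 - penalty)


if __name__ == "__main__":
    # Hypothetical English sentences, used only to exercise the function.
    reference = "the verb agrees with its subject"
    print(simple_meteor(reference, "the verb agrees with its subject"))
    print(simple_meteor(reference, "its subject agrees with the verb"))
```

Because the penalty grows with the number of chunks, a translation that contains the right words in a scrambled order scores lower than one that preserves the reference ordering, which is the "explicit ordering" part of the metric's name.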
