Chinese generative AI models (DeepSeek and Qwen) rival ChatGPT-4 in ophthalmology queries with excellent performance in Arabic and English

Authors

  • Malik Sallam: Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Amman, Jordan; Department of Clinical Laboratories and Forensic Medicine, Jordan University Hospital, Amman, Jordan. https://orcid.org/0000-0002-0165-9670
  • Israa M. Alasfoor: Section of Ophthalmology, Department of Special Surgery, School of Medicine, The University of Jordan, Amman, Jordan; Section of Ophthalmology, Department of Special Surgery, Jordan University Hospital, Amman, Jordan. https://orcid.org/0009-0003-7257-6762
  • Shahad W. Khalid: Section of Ophthalmology, Department of Special Surgery, School of Medicine, The University of Jordan, Amman, Jordan; Section of Ophthalmology, Department of Special Surgery, Jordan University Hospital, Amman, Jordan. https://orcid.org/0009-0004-3856-3340
  • Rand I. Al-Mulla: Section of Ophthalmology, Department of Special Surgery, School of Medicine, The University of Jordan, Amman, Jordan; Section of Ophthalmology, Department of Special Surgery, Jordan University Hospital, Amman, Jordan. https://orcid.org/0009-0003-4175-3965
  • Amwaj Al-Farajat: Section of Ophthalmology, Department of Special Surgery, School of Medicine, The University of Jordan, Amman, Jordan; Section of Ophthalmology, Department of Special Surgery, Jordan University Hospital, Amman, Jordan. https://orcid.org/0000-0003-2367-537X
  • Maad M. Mijwil: College of Administration and Economics, Al-Iraqia University, Baghdad, Iraq; Department of Computer Techniques Engineering, Baghdad College of Economic Sciences University, Baghdad, Iraq. https://orcid.org/0000-0002-2884-2504
  • Reem Zahrawi: Department of Ophthalmology, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates
  • Mohammed Sallam: Department of Pharmacy, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates; Department of Management, Mediclinic Parkview Hospital, Mediclinic Middle East, Dubai, United Arab Emirates; Department of Management, School of Business, International American University, Los Angeles, United States; College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences (MBRU), Dubai, United Arab Emirates. https://orcid.org/0000-0003-3273-524X
  • Jan Egger: Institute for Artificial Intelligence in Medicine (IKIM), Essen University Hospital (AöR), Girardetstraße, Germany; Center for Virtual and Extended Reality in Medicine (ZvRM), Essen University Hospital (AöR), Hufelandstraße, Germany; Cancer Research Center Cologne Essen (CCCE), University Medicine Essen (AöR), Hufelandstraße, Germany; University of Duisburg-Essen, Faculty of Computer Science, Schützenbahn, Germany. https://orcid.org/0000-0002-5225-1982
  • Ahmad S. Al-Adwan: Department of Business Technology, Al-Ahliyya Amman University, Amman, Jordan. https://orcid.org/0000-0001-5688-1503

DOI:

https://doi.org/10.52225/narra.v5i1.2371

Keywords:

LLM, OpenAI, DeepSeek, Qwen, eye disease

Abstract

The rapid evolution of generative artificial intelligence (genAI) has ushered in a new era of digital medical consultations, with patients turning to AI-driven tools for guidance. The emergence of Chinese-developed genAI models such as DeepSeek-R1 and Qwen-2.5 presented a challenge to the dominance of OpenAI’s ChatGPT. The aim of this study was to benchmark the performance of Chinese genAI models against ChatGPT-4o and to assess disparities in performance across English and Arabic. Following the METRICS checklist for genAI evaluation, Qwen-2.5, DeepSeek-R1, and ChatGPT-4o were assessed for completeness, accuracy, and relevance on common patient ophthalmology queries using the CLEAR tool. In English, Qwen-2.5 demonstrated the highest overall performance (CLEAR score: 4.43±0.28), outperforming both DeepSeek-R1 (4.31±0.43) and ChatGPT-4o (4.14±0.41), with p=0.002. A similar hierarchy emerged in Arabic, with Qwen-2.5 again leading (4.40±0.29), followed by DeepSeek-R1 (4.20±0.49) and ChatGPT-4o (4.14±0.41), with p=0.007. Each tested genAI model exhibited near-identical performance across the two languages: ChatGPT-4o demonstrated the most balanced linguistic capabilities (p=0.957), while Qwen-2.5 and DeepSeek-R1 performed marginally better in English. An in-depth examination of genAI performance across key CLEAR components revealed that Qwen-2.5 consistently excelled in content completeness, factual accuracy, and relevance in both English and Arabic, setting a new benchmark for genAI in medical inquiries. Despite minor linguistic disparities, all three models exhibited robust multilingual capabilities, challenging the long-held assumption that genAI is inherently biased toward English. These findings highlight the evolving nature of AI-driven medical assistance, with Chinese genAI models able to rival or even surpass ChatGPT-4o in ophthalmology-related queries.

Section

Original Article
