Human vs. generative artificial intelligence in writing assessment: investigating feedback alignment, score validity, and teacher agency


Anbreen T., ÖZTURAN T., Shrestha P., Maqsood A.

Educational Assessment, Evaluation and Accountability, 2026 (SSCI, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI Number: 10.1007/s11092-026-09486-z
  • Journal Name: Educational Assessment, Evaluation and Accountability
  • Indexed In: Social Sciences Citation Index (SSCI), Scopus, IBZ Online, ABI/INFORM, Education Abstracts, Educational Research Abstracts (ERA), ERIC (Education Resources Information Center)
  • Keywords: Feedback alignment, Generative artificial intelligence, L2 writing, Scoring validity, Teacher agency
  • Affiliated with Erzincan Binali Yıldırım Üniversitesi: Yes

Abstract

Recent advances in technology, such as the emergence of generative artificial intelligence (GenAI) tools, warrant careful integration into education. In particular, examining the feedback and scores produced by both human raters and GenAI tools is crucial for assessing feedback alignment and score validity in L2 writing assessment. Moreover, L2 writing teachers' agency in collaborating with these tools is a notable area of research. Given the importance of the topic, this mixed-methods study addresses three research questions: the alignment of GenAI and human scores and feedback on the same writing task responses; the justifications for scoring and feedback decisions; and teachers' agency in negotiating their roles in GenAI-supported assessment contexts. To that end, fifty essays written in response to a retired IELTS Academic Writing Task 2 prompt were rated by a human rater and by ChatGPT-5 using the IELTS Task 2 criteria. The results revealed a strong correlation between human and ChatGPT-5 scores, supporting scoring validity. The rater was then asked, and ChatGPT-5 was prompted, to justify their scoring decisions, and the findings revealed a contrast between the human rater and ChatGPT-5. These findings were interpreted through Kane's argument-based approach to validity. Lastly, a thematic analysis of semi-structured interviews exploring teachers' agency in GenAI-mediated writing assessment was conducted in line with Priestley's ecological model of agency. Overall, the findings point to the need for a hybrid model: blending GenAI-led surface-level evaluation with human-led cognitive, critical, and contextual evaluation is essential for comprehensive and valid writing assessment.