Are Large Language Models Capable of Assessing Students’ Written Products?

A Pilot Study in Higher Education

Authors

  • Daniele Agostini, Università di Trento

DOI:

https://doi.org/10.6093/2284-0184/10671

Keywords:

Large Language Models (LLMs), AI-Assisted Assessment, Technology-Enhanced Assessment, Artificial Intelligence in Education, Assessment Rubrics, Higher Education, Student Assessment, Authentic Tasks, Academic Assessment, Educational Technology

Abstract

The rapid adoption of large language models (LLMs) such as ChatGPT in higher education raises critical questions about their capabilities for assessment. This pilot study explores whether current LLMs can support university instructors in evaluating students’ written work using rubrics, even for open-ended tasks. Five prominent LLMs (ChatGPT-3.5, ChatGPT-4, Claude 2, Bing Chat, Bard) plus one open-source outsider (OpenChat 3.5) evaluated 21 anonymized group projects from an education course using a five-criterion rubric. Their scores were compared with those of two human expert raters through statistical analyses. Claude 2 and ChatGPT-4 showed the highest overall agreement with the human raters, although the open-source OpenChat 3.5 performed remarkably well for its size. Agreement varied by criterion: LLM scores aligned more closely with human ratings on basic objectives but diverged on more complex dimensions, such as evaluating assessment practices and the design of the educational project. Current LLMs show promise in supporting assessment but cannot yet score independently, especially on sophisticated rubric dimensions. Further research should refine prompting techniques and specialize models, moving towards AI-assisted rather than autonomous evaluation. The main limitations of this study are the small sample size and the limited disciplinary scope. This study provides initial evidence of the possibilities and pitfalls of LLM assessment support in higher education.
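The abstract states that LLM scores were compared with two human expert raters through statistical analyses, without naming the specific statistic. As a minimal, hypothetical sketch of how agreement on ordinal rubric scores could be computed, the snippet below uses quadratically weighted Cohen's kappa via scikit-learn; the example scores, the single criterion, and the choice of statistic are illustrative assumptions, not the paper's reported method.

```python
# Illustrative sketch (not from the paper): agreement between one model's
# rubric scores and a human rater, using quadratically weighted Cohen's kappa,
# a common statistic for ordinal rubric scales.
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores on a single 1-5 rubric criterion for 21 group projects.
human_scores = [4, 3, 5, 2, 4, 4, 3, 5, 4, 2, 3, 4, 5, 3, 4, 2, 5, 4, 3, 4, 5]
llm_scores   = [4, 3, 4, 3, 4, 5, 3, 5, 4, 2, 3, 3, 5, 4, 4, 2, 4, 4, 3, 4, 5]

# weights="quadratic" penalizes large disagreements more than near-misses,
# which suits ordered rubric levels.
kappa = cohen_kappa_score(human_scores, llm_scores, weights="quadratic")
print(f"Quadratically weighted kappa: {kappa:.2f}")
```

In practice such a statistic would be computed per model and per rubric criterion, which is consistent with the abstract's finding that agreement varied across criteria.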

Published

2024-01-16

Section

Brain Education Cognition