2023: Benchmarking the Quality of Educational Quizzes Using Large Language Models

Master / Bachelor's theses

Student
Maia Filip

Supervisor(s) / Advisor(s)

Abstract

Large Language Models (LLMs) are increasingly used to create educational content such as quizzes. While generation quality has improved, there is no standardized, reproducible benchmark for evaluating the resulting quizzes against pedagogically relevant criteria such as difficulty, fidelity to the source material, coverage, and distractor quality.

This thesis constructs and validates a modular benchmarking framework that evaluates quizzes systematically and reproducibly, using multiple LLMs as judges. By providing structured, rubric-driven scores and logging all evaluation details, the framework quantifies variance, supports robust aggregation, and produces actionable benchmarking reports.
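
As an illustration of the aggregation step described above, the following minimal sketch combines rubric scores from several judges for one quiz. It is not the thesis implementation: the 1 to 5 scale, the judge names, the criterion keys, and the `aggregate` function are all assumptions made for this example. The median stands in for "robust aggregation" and the population variance for the variance measure, which the abstract does not specify.

```python
from statistics import median, pvariance

# Rubric criteria named in the abstract; the key spellings are hypothetical.
CRITERIA = ("difficulty", "fidelity", "coverage", "distractor_quality")


def aggregate(judge_scores):
    """Combine per-judge rubric scores into one report per criterion.

    judge_scores maps a judge name to a dict of criterion -> numeric score.
    The median is used because it is less sensitive to a single outlier
    judge than the mean; the variance shows how much the judges disagree.
    """
    report = {}
    for criterion in CRITERIA:
        values = [scores[criterion] for scores in judge_scores.values()]
        report[criterion] = {
            "median": median(values),
            "variance": pvariance(values),
            "n_judges": len(values),
        }
    return report


# Hypothetical scores from three LLM judges on an assumed 1-5 scale.
example_scores = {
    "judge_a": {"difficulty": 3, "fidelity": 5, "coverage": 4, "distractor_quality": 2},
    "judge_b": {"difficulty": 4, "fidelity": 5, "coverage": 4, "distractor_quality": 3},
    "judge_c": {"difficulty": 3, "fidelity": 4, "coverage": 2, "distractor_quality": 4},
}

report = aggregate(example_scores)
for criterion, stats in report.items():
    print(criterion, stats)
```

In this toy data the judges agree closely on difficulty but diverge on coverage, so a report built this way would flag coverage as the criterion where the scores are least reliable.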