Koli Calling 2025 - Paper Accepted - LLM-Based Multi-Artifact Consistency Verification for Programming Exercise Quality Assurance - Applied Education Technologies

We are excited to share that our paper, “LLM-Based Multi-Artifact Consistency Verification for Programming Exercise Quality Assurance”, was presented at the Koli Calling International Conference on Computing Education Research in Koli, Finland (November 13–16, 2025).

Auto-graded programming exercises in platforms such as Artemis rely on several interconnected artifacts: problem statements, templates, solutions, test suites, and learning objectives. When these artifacts drift apart, students face confusing tasks and instructors spend time on manual QA. Our work introduces an ontology-driven, LLM-powered approach to keep them aligned:

Exercise ontology: models statements, templates, solutions, tests, and learning objectives together, with five inconsistency categories (Structural, Semantic, Assessment, Temporal, Scope).
Consistency checker: lets an LLM read all artifacts at once and highlight conflicting spans; it currently covers Structural and Semantic issues such as mismatched signatures, types, visibility, and naming between text and code.
PECV-bench: a benchmark of 91 perturbed Java exercise variants with 93 annotated inconsistencies, plus a reproducibility package (code, prompts, configs) so others can compare LLMs and prompt strategies.
Practical results: the reference configuration (OpenAI o4-mini) recovers about nine out of ten injected issues (recall 0.91, F1 0.75), while Grok 3 Mini halves latency and cost with a slight accuracy drop, making consistency review a short, pre-release check for teaching teams.
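
To make the reported recall and F1 figures concrete, here is a minimal Python sketch of how flagged spans could be scored against annotated inconsistencies. It is an illustration only, not the PECV-bench implementation: the `Finding` record, the `overlaps` matching rule (same artifact, overlapping line range), and the `score` function are all hypothetical names and assumptions introduced for this example. Only the five category names come from the ontology described above.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """The five inconsistency categories named in the post."""
    STRUCTURAL = "Structural"
    SEMANTIC = "Semantic"
    ASSESSMENT = "Assessment"
    TEMPORAL = "Temporal"
    SCOPE = "Scope"


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one inconsistency (flagged or annotated)."""
    artifact: str       # assumed labels, e.g. "problem_statement", "template"
    start_line: int     # first affected line (inclusive)
    end_line: int       # last affected line (inclusive)
    category: Category


def overlaps(a: Finding, b: Finding) -> bool:
    # Assumed matching rule: same artifact and intersecting line ranges.
    return (
        a.artifact == b.artifact
        and a.start_line <= b.end_line
        and b.start_line <= a.end_line
    )


def score(predicted: list[Finding], annotated: list[Finding]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) under the overlap rule above."""
    hit_pred = sum(any(overlaps(p, g) for g in annotated) for p in predicted)
    hit_gold = sum(any(overlaps(g, p) for p in predicted) for g in annotated)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(annotated) if annotated else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data, invented for illustration: two annotated issues, three flagged spans.
annotated = [
    Finding("problem_statement", 10, 12, Category.STRUCTURAL),
    Finding("template", 5, 5, Category.SEMANTIC),
]
predicted = [
    Finding("problem_statement", 11, 11, Category.STRUCTURAL),
    Finding("template", 5, 6, Category.SEMANTIC),
    Finding("tests", 1, 2, Category.STRUCTURAL),  # false positive
]
precision, recall, f1 = score(predicted, annotated)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy run both annotated issues are found, so recall is high, while the extra flagged span lowers precision and therefore F1. A recall noticeably above F1, as in the reported numbers, follows the same pattern: the checker finds most injected issues but also flags some spans that were not annotated.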

The PECV-bench benchmark and reference pipeline are released openly:

GitHub: https://github.com/ls1intum/PECV-bench
Zenodo archive: https://doi.org/10.5281/zenodo.17260263

This paper continues our AET line of work on robust, scalable assessment in Artemis. We look forward to collaborating with the Koli community on the next steps for reliable programming exercise quality assurance.

Felix T.J. Dietrich presenting 'LLM-Based Multi-Artifact Consistency Verification for Programming Exercise Quality Assurance' at Koli Calling 2025.

Northern lights above Lake Pielinen during Koli Calling 2025.

Snowy afternoon view over Koli National Park.

Felix T.J. Dietrich and Patrick Bassner enjoying the Koli trails.

Citation

LLM-Based Multi-Artifact Consistency Verification for Programming Exercise Quality Assurance
Felix T.J. Dietrich, Yuchen Zhou, Tobias Wasner, Stephan Krusche, and Maribel Acosta.
25th Koli Calling International Conference on Computing Education Research (Koli '25). Koli, Finland, November 2025. doi: 10.1145/3769994.3770042


Monday, 24 November 2025 • Felix T.J. Dietrich
