How grading exam papers is related to AI

This semester, I am serving as a TA for a Computer Science course. The final exam period has arrived, and my main task is grading the final exams. While grading, some thoughts came to me that relate grading students’ exam papers to the development of large language models (LLMs).

The exam itself, like an LLM evaluation benchmark, is imperfect. The score I assign might seem definitive, but just like with LLMs, it doesn’t always capture the full picture. It might focus on specific topics, be biased towards certain approaches, or leave room for subjectivity. This means some students might score higher by mastering “tricks” rather than demonstrating a deeper understanding, just like smaller, specialized LLMs can outperform larger models on specific tasks. However, just as larger LLMs are generally more versatile and useful, students with a broader understanding will ultimately be better equipped for success.

From another angle, grading is slow and deliberate at the beginning. I carefully consider the grading scheme, analyze each answer, and assign a score. But as I progress, a fascinating shift occurs. My brain, like a trained LLM, becomes more efficient. I assess answers more quickly and consistently, relying on an internal “grading model” developed through experience.

Also, just as evaluating LLMs often requires multiple rounds, so does grading. Proofreading my initial assessments and revisiting borderline cases ensures fairness and accuracy. This multi-layered approach helps me provide each student with a grade that reflects their true potential.
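
(To make the “multiple rounds” point concrete: when evaluating a model, running the scoring several times and taking a second look at borderline cases is a common way to reduce noise. Below is a minimal, hypothetical sketch of that idea; `grade_once`, the pass mark, and the scoring logic are placeholders I made up, not part of any real grading or evaluation framework.)

```python
import random
import statistics

def grade_once(answer: str) -> float:
    """One round of grading: a stand-in for a rubric, a judge model, or a benchmark run."""
    # The random jitter simulates the noise that any single round carries.
    base = min(10.0, len(answer.split()) * 2.0)
    return max(0.0, min(10.0, base + random.gauss(0, 0.5)))

def grade_multi_round(answer: str, rounds: int = 3,
                      pass_mark: float = 5.0, margin: float = 1.0) -> float:
    """Average several rounds, and give borderline cases one extra look."""
    scores = [grade_once(answer) for _ in range(rounds)]
    mean_score = statistics.mean(scores)
    # Like revisiting a borderline exam answer before finalizing its grade.
    if abs(mean_score - pass_mark) < margin:
        scores.append(grade_once(answer))
        mean_score = statistics.mean(scores)
    return round(mean_score, 2)

print(grade_multi_round("A short but reasonable student answer."))
```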

In the end, grading exams is a complex process with limitations. I need to develop my own LM to grade other LLMs :D By acknowledging these limitations and employing a thoughtful, multi-layered approach, I strive to provide students with fair and accurate assessments that reflect their true understanding and performance.

P.S.: A funny story: when I was halfway through the pile of exam papers, I discovered that the solution to one question was wrong! So I had to go back and re-grade that question! OMG!

Prompt to Gemini: Please help me to write up a short blog with the following outline: I am the TA of a course. In the occasion that I grade the final exam, I relate grading student’s exam paper to evaluating language models. Think the grading is similar to evaluating Large Language Model, as the score does not really reflect the capability of model. As the exam or the evaluation benchmark is always not representative, biased, subjective, … Many students can have higher score because of some tricks, just like smaller specialized language models win large models over a certain category. But overall, larger models still have higher utility and be more prevalent. Also, when grading, at the beginning, grade slow, as need to think about criteria. But later grade quickly, as we trained a model in our brain for grading! Grading sometimes need multiple round evaluation, which means need to proof-check the answer and grade. Just like evaluating LLMs, should evaluate them in multiple round!
