[AirClass Dev Log 7] AirClassGrading 2: Applying AI to Written-Response Grading

What this post covers

In the previous post, I covered the first AirClassGrading flow: automated grading for Teams assignments. This seventh post covers the second flow: applying AI to written-response grading.

Written-response grading starts from a different input. Teams assignment grading begins with files submitted by students. Written-response assessment often begins with paper answer sheets, scanned into a PDF, where the system must separate each student and each problem before grading can happen.

The core flow is:

split the scanned answer PDF into student-problem images;
define partial-credit criteria in JSON;
run AI first-pass grading per problem;
repeat grading to check instability;
let the teacher confirm final scores;
calculate class-level summaries and export results.

The overall flow looks like this.

direction: right

pdf: "scanned answer PDF"
setup: "PDF / region setup\ncover pages and problem boxes"
crop: "problem images\nstudent × problem"
criteria: "partial-credit criteria JSON\ngrading_criteria.json"
ai: "repeated AI first pass\ngrading_runs"
teacher: "teacher final confirmation\nteacher_finals"
result: "statistics / export\nclass averages and CSV"

pdf -> setup -> crop -> criteria -> ai -> teacher -> result

Why written-response grading needed its own flow

Written-response assessment is one of the hardest areas to automate. It is not enough to compare the final answer. The teacher has to read the reasoning process and assign partial credit.

A teacher often checks:

whether the key equation was set up correctly;
whether the calculation process is valid;
where an intermediate mistake happened;
whether partial credit is possible even if the final answer is wrong;
whether the graph or notation is sufficient;
whether the conclusion statement is missing;
whether the same mistake is being penalized twice.

If all of this is handed to an AI model without structure, the scores can be unstable from one run to the next. So the first design decision was to define the grading unit clearly.

In this project, the grading unit is:

one student × one problem

For 64 students and 5 problems, that means 320 grading units.

64 students × 5 problems = 320 grading units

This unit makes it possible for AI to apply problem-specific criteria, and for the teacher to review one problem across students when needed.
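As a minimal sketch, the grading units can be enumerated as plain (student, problem) pairs. The ID scheme here (class number × 100 + seat number, as in 1101) is an assumption inferred from the folder names in this post, and the variable names are illustrative rather than taken from the project code.

```python
from itertools import product

# Assumption: student IDs are class * 100 + seat number (e.g. 1101).
CLASSES = [11, 12, 13, 14]
STUDENTS_PER_CLASS = 16
PROBLEMS = [1, 2, 3, 4, 5]

student_ids = [
    class_no * 100 + seat
    for class_no in CLASSES
    for seat in range(1, STUDENTS_PER_CLASS + 1)
]

# One grading unit = one student x one problem.
grading_units = list(product(student_ids, PROBLEMS))

# Problem-by-problem review: one problem across every student.
problem_5_units = [(sid, p) for sid, p in grading_units if p == 5]

print(len(grading_units), len(problem_5_units))
```

Keeping the unit this small is what allows both per-problem criteria and per-problem teacher review later on.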

Folder structure

The written-response grading flow is organized under written_response_grader.

written_response_grader/
  core/
    grade_cli.py
    wrg_preprocess.py
    wrg_stats.py
    wrg_workflow.py
    wrg_paths.py
    ieum_chat_client.py

  gui/
    grading_gui_qt.py
    workflow_dashboard.py
    preprocessing_setup_dialog.py

  data/
    grading_criteria.json
    preprocess_config.json
    manual_regions_page002_ccw05.json
    random_mapping.json
    student_problem_images_hq/
    grading_reviews/
    teacher_finals/

  source_pdfs/
    1test.pdf
    1test_rotated_90cw.pdf
    1problem.pdf
    1answer.pdf

  docs/
    written_response_grading_workflow.md
    AUDIT_REPORT.md

Responsibilities are split like this.

Area	Role
`core`	preprocessing, AI grading CLI, statistics, paths, API client
`gui`	PySide6 GUI, workflow dashboard, PDF/region setup dialog
`data`	criteria, settings, student images, AI reviews, teacher finals
`source_pdfs`	answer sheet, problem sheet, solution PDF
`docs`	workflow documentation and code audit notes

Step 1: split the answer PDF into problem images

The first step is PDF preprocessing. The input files are:

source_pdfs/
  1test.pdf                 # original scanned student answer PDF
  1test_rotated_90cw.pdf    # orientation-corrected answer PDF
  1problem.pdf              # problem sheet
  1answer.pdf               # solution and partial-credit criteria

The scanned answer sheet cannot be graded directly. It may contain cover pages, multiple classes, multiple students, and different page orientations or slight skew. So the first GUI step is to choose the PDF and mark the regions to crop.

The settings are stored under data/.

data/
  preprocess_config.json
  manual_regions_page002_ccw05.json

In the GUI, I mark these regions on a sample page:

student number / name area
full problem area
problem 1 area
problem 2 area
problem 3 area
problem 4 area
problem 5 area

I also set class count, students per class, cover pages to skip, and the first student number. For example, if there are 4 classes, 16 students per class, and 1 cover page per class, the system can calculate which page is the first student page.
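The page arithmetic can be sketched as follows. This assumes one answer page per student and that each class block starts with its cover pages; both are simplifications for illustration, and the function names are hypothetical rather than the ones in wrg_preprocess.py.

```python
def first_student_page(class_index: int, students_per_class: int, cover_pages: int) -> int:
    """0-based page index of the first student page in a class block.

    Assumption: each class block is laid out as
    [cover pages..., one page per student...].
    """
    pages_per_class = cover_pages + students_per_class
    return class_index * pages_per_class + cover_pages


def student_page(class_index: int, seat_index: int,
                 students_per_class: int, cover_pages: int) -> int:
    """0-based page index of one student's answer page (seat_index is 0-based)."""
    return first_student_page(class_index, students_per_class, cover_pages) + seat_index


# 4 classes, 16 students per class, 1 cover page per class.
for class_index in range(4):
    print(class_index, first_student_page(class_index, 16, 1))
```

With these settings the scan would have 68 pages, and the last student of the fourth class would sit on 0-based page 67.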

After saving the regions, image generation creates this structure.

data/student_problem_images_hq/
  1101/
    1101_1.png
    1101_2.png
    1101_3.png
    1101_4.png
    1101_5.png
  1102/
    1102_1.png
    ...

Now AI can review one student's answer to one problem, instead of receiving the whole PDF.

Step 2: define partial-credit criteria as JSON

Partial-credit criteria are the core of written-response grading. If the prompt only says “read the solution and score it,” grading quality varies from problem to problem. So the problem text, answer, explanation, and scoring items are written into JSON.

The main file is:

data/grading_criteria.json

The basic structure is:

{
  "schema_version": "partial_credit_v1",
  "common_instructions": "common grading instructions",
  "notation_penalty": {
    "points": -0.5,
    "rule": "Deduct 0.5 points for each incorrect or unclear notation."
  },
  "problems": [
    {
      "problem": 1,
      "title": "problem title",
      "problem_text": "problem statement",
      "answer": "answer",
      "explanation": "solution explanation",
      "max_score": 5,
      "scoring_items": [
        {
          "id": "1-1",
          "label": "sub-criterion label",
          "points": 1,
          "criterion": "criterion for awarding points"
        }
      ],
      "partial_credit_note": "partial-credit notes"
    }
  ]
}
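A criteria file in this shape can be sanity-checked before grading. The sketch below assumes that the scoring item points of a problem are meant to add up to max_score (true for problem 1 in this post, but an assumption in general) and that item IDs are unique; check_criteria is a hypothetical helper, not part of the project.

```python
def check_criteria(criteria: dict) -> list[str]:
    """Return human-readable warnings for a partial_credit_v1-style dict."""
    warnings = []
    for problem in criteria["problems"]:
        number = problem["problem"]
        items = problem["scoring_items"]

        # Assumption: sub-criteria points should sum to the problem's max_score.
        total = sum(item["points"] for item in items)
        if abs(total - problem["max_score"]) > 1e-9:
            warnings.append(
                f"problem {number}: items sum to {total}, "
                f"max_score is {problem['max_score']}"
            )

        # Item IDs are used to match AI output back to criteria, so keep them unique.
        ids = [item["id"] for item in items]
        if len(ids) != len(set(ids)):
            warnings.append(f"problem {number}: duplicate scoring item id")
    return warnings


example = {
    "problems": [
        {
            "problem": 1,
            "max_score": 5,
            "scoring_items": [{"id": f"1-{i}", "points": 1} for i in range(1, 6)],
        }
    ]
}
print(check_criteria(example))
```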

The current score structure is:

Problem	Max score
1	5
2	6
3	6
4	6
5	7
Total	30

The criteria JSON includes ambiguous cases as explicitly as possible.

Which equivalent expressions are accepted?
How many points should be deducted for notation errors?
Is marking a point on a graph enough as a conclusion?
How should D, D/4, and D' be treated for discriminants?
What happens when the calculation is correct but the final conclusion is missing?
How can the same error avoid being penalized twice?

The more concrete the criteria are, the more stable the AI first pass becomes.

Actual criteria example: 5-point polynomial division problem

Problem 1 was a polynomial division written-response item.

[Written-response 1]
For A=3x^3-4x^2+2x-5 and B=x^2-2x+1,
find the quotient Q and remainder R when A is divided by B,
and explain the process. (5 points)

The answer is:

Q = 3x + 2
R = 3x - 7

The item was split into five 1-point criteria.

ID	Item	Points	Criterion
1-1	first term of quotient	1	correctly finds the first term `3x`
1-2	first multiplication and subtraction	1	computes `3x(x^2-2x+1)` and subtracts to get `2x^2-x-5`
1-3	second term of quotient	1	correctly finds the second term `2`
1-4	second multiplication and remainder	1	computes `2(x^2-2x+1)` and subtracts to get `3x-7`
1-5	final answer statement	1	states both `Q=3x+2` and `R=3x-7` correctly

This split makes it possible to distinguish several cases: a student with the correct final answer but weak process, a student with mostly correct process but a wrong final statement, or a student who made a calculation error but still understood the division structure.
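The reference values in the criteria can be double-checked with a short long-division routine. This is a verification sketch, not part of the grader; it assumes integer coefficients listed from the highest degree down and a monic divisor, so each quotient term is exact.

```python
def poly_divmod(dividend: list[int], divisor: list[int]):
    """Polynomial long division on coefficient lists (highest degree first).

    Returns (quotient, remainder, steps), where steps holds the partial
    remainder after each subtraction. Assumes a monic divisor.
    """
    remainder = list(dividend)
    quotient = []
    steps = []
    while len(remainder) >= len(divisor):
        factor = remainder[0] // divisor[0]  # exact because the divisor is monic
        quotient.append(factor)
        for i, coeff in enumerate(divisor):
            remainder[i] -= factor * coeff
        remainder.pop(0)  # the leading term has been cancelled
        steps.append(list(remainder))
    return quotient, remainder, steps


A = [3, -4, 2, -5]   # 3x^3 - 4x^2 + 2x - 5
B = [1, -2, 1]       # x^2 - 2x + 1
Q, R, steps = poly_divmod(A, B)
print(Q, R, steps)
```

The first partial remainder corresponds to criterion 1-2 (`2x^2-x-5`), and the final remainder to criterion 1-4 (`3x-7`).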

Common grading instructions: no guessing and no double penalty

The common instructions include rules that prevent the model from over-interpreting the answer sheet.

1. Grade only what is concretely visible in the answer sheet. Do not guess.
2. For written-response items, the final value or form must actually match, not just the style of conclusion.
3. If a criterion requires two elements and one is missing, award half credit.
4. Do not create a separate notation-penalty item.
   Instead, subtract 0.5 points directly from the most relevant scoring item.
   If the same error already caused that item to receive 0 points, do not deduct again.

These rules are meant to stop the model from awarding credit because an answer “seems roughly correct.” The double-penalty rule was especially important. If one mistake already makes a sub-criterion fail, applying an additional notation penalty for the same mistake can make the AI score too harsh compared with teacher grading.
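In this project the rule is given to the model as an instruction, but the same logic is easy to state in code, for example as a post-check. The sketch below is a simplification: it skips the deduction whenever the item already scored 0, without checking that the same error caused it, and flooring at 0 is my assumption.

```python
def apply_notation_penalty(awarded: float, penalty: float = 0.5) -> float:
    """Deduct a notation penalty from the most relevant scoring item.

    Simplified rule: if the item already received 0 points, do not deduct
    again, and never let an item go below 0 (assumption).
    """
    if awarded <= 0:
        return 0.0  # already failed; no double penalty
    return max(0.0, awarded - penalty)


print(apply_notation_penalty(1.0))  # item earned full credit, notation unclear
print(apply_notation_penalty(0.0))  # item already failed
```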

Ambiguous criteria example: maximum and minimum in problem 5

Problem 5 involved a graph and maximum/minimum conclusions. For this item, marking points on the coordinate plane alone was not enough.

For problem 5, do not accept maximum/minimum conclusions based only on points marked on the graph.
The answer must state the x-value and the corresponding maximum or minimum value in text or equation form.

But the rule was not made overly strict. A complementary rule was added.

The conclusion does not have to be written as one perfect sentence.
If the calculation or written area clearly shows the relationship among x-value, function value,
and maximum/minimum wording, accept it.
For example, if x=6 and y=5 are visible near a maximum expression, accept the maximum criterion.
If x=3 and y=-4 are visible near a minimum expression, accept the minimum criterion.

This connects directly to the observed metrics. Problem 5 had 19 teacher revisions, tied with problem 4 for the most, and its mean absolute difference between AI and teacher scores was 0.289 points, the largest among the five problems. That suggests this item especially needed precise criteria wording.

Example AI output JSON

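The AI first-pass result does not store only a score. It also stores which sub-criteria were awarded and what evidence was used. The following is an anonymized, abridged format example: only two of the problem's scoring items are shown, so the listed awarded points do not add up to the total score.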

{
  "student_id": "S-anon",
  "problem": 5,
  "score": 6.5,
  "max_score": 7,
  "scoring_results": [
    {
      "id": "5-1",
      "label": "vertex",
      "points": 2,
      "awarded_points": 2,
      "evidence": "y=(x-3)^2-4 and the vertex (3,-4) are visible",
      "rationale": "The vertex was correctly found and stated."
    },
    {
      "id": "5-5",
      "label": "maximum value",
      "points": 1,
      "awarded_points": 0.5,
      "evidence": "x=6 and y=5 are visible, but the maximum expression is unclear",
      "rationale": "The correspondence is visible, but the conclusion wording is not sufficient."
    }
  ],
  "notation_penalties": []
}

This format lets the teacher inspect why the AI gave the score. The teacher is not forced to accept a number; the teacher can review evidence and rationale per sub-criterion before confirming the final score.
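Because each sub-criterion carries its own points and evidence, the review screen can surface only the items where credit was lost. A minimal sketch, using the field names from the example above; the helper name and the trimmed sample data are illustrative.

```python
def items_needing_review(result: dict) -> list[dict]:
    """Scoring items that received less than full credit, with their evidence."""
    flagged = []
    for item in result["scoring_results"]:
        lost = item["points"] - item["awarded_points"]
        if lost > 0:
            flagged.append({"id": item["id"], "lost": lost, "evidence": item["evidence"]})
    return flagged


# Trimmed version of the anonymized example above.
sample = {
    "student_id": "S-anon",
    "problem": 5,
    "scoring_results": [
        {"id": "5-1", "points": 2, "awarded_points": 2,
         "evidence": "y=(x-3)^2-4 and the vertex (3,-4) are visible"},
        {"id": "5-5", "points": 1, "awarded_points": 0.5,
         "evidence": "x=6 and y=5 are visible, but the maximum expression is unclear"},
    ],
}
flagged = items_needing_review(sample)
print(flagged)
```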

Step 3: run AI first-pass grading

AI first-pass grading is executed through the CLI.

cd written_response_grader

# dry run
uv run wrg-grade --dry-run --students 1101 --problems 1

# grade all 64 students × 5 problems once
uv run wrg-grade --workers 4 --retries 2 --repeat 1

# repeat the whole grading 3 times
uv run wrg-grade --workers 4 --retries 2 --repeat 3

# grade one problem for all students
uv run wrg-grade --problems 4 --workers 4

Results are saved as student-problem JSON files.

data/grading_reviews/
  1101/
    1101_1.json
    1101_2.json
    1101_3.json
    1101_4.json
    1101_5.json

Each problem JSON accumulates AI grading runs in grading_runs. If the same student's problem 5 is graded three times, all three runs remain in the same JSON file.
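The accumulate-instead-of-overwrite behavior can be sketched like this. Only the grading_runs key comes from this post; the rest of the file layout and the append_run helper are assumptions for illustration, and the demo writes to a temporary directory rather than data/grading_reviews.

```python
import json
import tempfile
from pathlib import Path


def append_run(review_path: Path, run: dict) -> int:
    """Append one AI grading run to a student-problem JSON; return the run count."""
    if review_path.exists():
        review = json.loads(review_path.read_text(encoding="utf-8"))
    else:
        review = {"grading_runs": []}
    review.setdefault("grading_runs", []).append(run)
    review_path.parent.mkdir(parents=True, exist_ok=True)
    review_path.write_text(
        json.dumps(review, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(review["grading_runs"])


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "1101" / "1101_5.json"
    counts = [append_run(path, {"score": score}) for score in (6.5, 6.5, 7.0)]
    stored = json.loads(path.read_text(encoding="utf-8"))

print(counts, [run["score"] for run in stored["grading_runs"]])
```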

This matters because I do not want to trust a single AI output blindly. Repeated runs reveal where the criteria or model behavior are unstable.

Step 4: check instability through repeated grading

AI can be unstable in written-response grading for many reasons.

The criteria are ambiguous.
A student's solution can be interpreted in multiple ways.
Notation errors and conceptual errors are mixed.
The conclusion is correct but the process is weak.
The process is correct but the final calculation has an error.

So I compare the repeated runs against each other.

I check questions like:

Does the same student-problem score vary widely?
Does a specific scoring item flip between awarded and not awarded?
Are notation penalties applied too aggressively or twice?
Are clearly correct answers penalized because of wording?
Does a criteria revision move the average in an unexpected way?
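The first two checks can be sketched as two small functions over the stored runs. This assumes each run has the same shape as the AI output example (score plus scoring_results with id and awarded_points); the function names and sample runs are made up for illustration.

```python
def score_range(run_scores: list[float]) -> float:
    """Spread of the total score across repeated runs of one grading unit."""
    return max(run_scores) - min(run_scores)


def flipping_items(runs: list[dict]) -> list[str]:
    """IDs of scoring items whose awarded points differ between runs."""
    seen: dict[str, set] = {}
    for run in runs:
        for item in run["scoring_results"]:
            seen.setdefault(item["id"], set()).add(item["awarded_points"])
    return sorted(item_id for item_id, values in seen.items() if len(values) > 1)


# Illustrative runs for one student-problem unit (not real data).
runs = [
    {"score": 6.5, "scoring_results": [{"id": "5-1", "awarded_points": 2},
                                       {"id": "5-5", "awarded_points": 0.5}]},
    {"score": 7.0, "scoring_results": [{"id": "5-1", "awarded_points": 2},
                                       {"id": "5-5", "awarded_points": 1}]},
    {"score": 6.5, "scoring_results": [{"id": "5-1", "awarded_points": 2},
                                       {"id": "5-5", "awarded_points": 0.5}]},
]
spread = score_range([run["score"] for run in runs])
unstable = flipping_items(runs)
print(spread, unstable)
```

An item that flips between runs usually points at a criterion whose wording leaves room for two readings.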

When I find an issue, I revise the criteria JSON and regrade only the affected problems or students.

uv run wrg-grade --students 1101,1111,1206 --problems 5 --workers 4 --retries 2 --repeat 1

This is less about “trusting AI” and more about using AI instability to find unclear criteria.

Step 5: teacher final confirmation

AI first-pass grading is reference material. The final score is confirmed by the teacher.

The current structure separates AI results from teacher-final evaluations.

DEFAULT_REVIEW_DIR = DATA_DIR / "grading_reviews"
DEFAULT_TEACHER_FINAL_DIR = DATA_DIR / "teacher_finals"

The roles are:

Path	Role
`data/grading_reviews`	AI grading results and repeated runs
`data/teacher_finals`	teacher-final evaluations

This separation matters. If AI regrading is run again, it must not overwrite scores already confirmed by the teacher.

In the GUI, the review flow is:

view the student-problem answer image;
inspect AI grading runs;
check each scoring item decision;
enter the final teacher score;
save teacher comments;
move to the next unconfirmed problem.

For problem-by-problem review, random IDs such as R01–R64 can reduce student-identification bias. The teacher can focus more on the answer and criteria than on the student number.
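A random ID mapping of this kind can be sketched as below. The actual format of data/random_mapping.json is not shown in this post, so the dictionary layout, the seed parameter, and the function name are assumptions.

```python
import random


def build_random_mapping(student_ids: list[int], seed: int) -> dict[str, int]:
    """Map shuffled students to review IDs such as R01..R64.

    A fixed seed keeps the mapping reproducible across sessions.
    """
    shuffled = list(student_ids)
    random.Random(seed).shuffle(shuffled)
    width = max(2, len(str(len(shuffled))))
    return {f"R{i:0{width}d}": sid for i, sid in enumerate(shuffled, start=1)}


ids = [c * 100 + n for c in (11, 12, 13, 14) for n in range(1, 17)]
mapping = build_random_mapping(ids, seed=7)
print(list(mapping)[:3], len(mapping))
```

The mapping has to be stored somewhere the teacher does not see during review, so that scores can be joined back to student numbers afterwards.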

Step 6: statistics and export

Once final scores are saved, class averages and problem averages can be checked. The relevant logic is in wrg_stats.py and wrg_workflow.py.

The system can check:

student total score;
problem scores;
class average;
problem average by class;
grading completion rate;
missing problem JSON files;
final confirmation status.

The key rule is to distinguish AI scores from teacher-final scores in statistics and CSV exports. For actual grade calculation, teacher-final values must take priority.
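One way to keep the two score types apart in an export is to write both columns plus an explicit confirmation flag. The column names and the export_csv helper below are hypothetical, not the format produced by wrg_stats.py; falling back to the AI score for unconfirmed rows is a choice made for this sketch.

```python
import csv
import io

FIELDS = ["student_id", "problem", "ai_score", "teacher_final",
          "score_for_grades", "confirmed"]


def export_csv(items: list[dict]) -> str:
    """Write AI and teacher-final scores side by side; teacher-final wins."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    for item in items:
        teacher = item.get("teacher_final")
        writer.writerow({
            "student_id": item["student_id"],
            "problem": item["problem"],
            "ai_score": item["ai_score"],
            "teacher_final": "" if teacher is None else teacher,
            # Teacher-final takes priority; the AI score is only a fallback.
            "score_for_grades": item["ai_score"] if teacher is None else teacher,
            "confirmed": teacher is not None,
        })
    return buffer.getvalue()


text = export_csv([
    {"student_id": "S-anon-1", "problem": 5, "ai_score": 6.5, "teacher_final": 6.0},
    {"student_id": "S-anon-2", "problem": 5, "ai_score": 7.0, "teacher_final": None},
])
rows = list(csv.DictReader(io.StringIO(text)))
print(rows[0]["score_for_grades"], rows[1]["confirmed"])
```

The confirmed column makes it possible to filter out unconfirmed rows before any real grade is calculated.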

What the actual results show

I do not publish student names, answer images, or individual feedback text. The following uses only anonymized aggregate counts from the data/ directory.

The current written-response grading data had this scale.

Item	Count
students	64
problems	5
student-problem answer images	320
AI review JSON files	320
teacher-final JSON files	320
students with complete grading	64
missing problem items	0

In other words, all 64 students × 5 problems were split into images, reviewed by AI, and confirmed by the teacher.

Each item accumulated four AI first-pass grading runs.

Item	Value
total AI grading runs	1,280
AI runs per item	4
average score range across repeated runs	0.278 points
items where all 4 runs matched exactly	238 / 320
items with run range ≤ 0.5 points	262 / 320
items with run range ≤ 1 point	299 / 320

This does not mean AI is perfect. It means the system can identify where AI is stable and where the teacher should look more carefully. Out of 320 items, 238 had identical scores across four runs, and 299 stayed within 1 point. The remaining items became good candidates for closer teacher review or criteria refinement.

Comparing the latest AI score with the teacher-final score gave these results.

Item	Value
compared items	320
AI average score	4.570
teacher-final average score	4.583
mean difference (teacher minus AI)	+0.013
mean absolute difference	0.212
items where AI and teacher score matched	248 / 320
items with difference ≤ 0.5 points	279 / 320
items with difference ≤ 1 point	307 / 320
items revised by the teacher	72 / 320
teacher increased the score	36
teacher decreased the score	36

The teacher revisions were not one-sided. There were 36 increases and 36 decreases, which suggests the AI was not simply too strict or too generous overall.
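The comparison metrics are simple to reproduce from (AI score, teacher-final score) pairs. The sketch below uses the teacher minus AI sign convention implied by the averages in the table; the function name and the four sample pairs are made up and are not taken from the real data.

```python
def compare_scores(pairs: list[tuple[float, float]]) -> dict:
    """Summarize (ai_score, teacher_score) pairs; differences are teacher - AI."""
    diffs = [teacher - ai for ai, teacher in pairs]
    count = len(diffs)
    return {
        "mean_diff": sum(diffs) / count,
        "mean_abs_diff": sum(abs(d) for d in diffs) / count,
        "matched": sum(1 for d in diffs if d == 0),
        "within_half": sum(1 for d in diffs if abs(d) <= 0.5),
        "raised": sum(1 for d in diffs if d > 0),
        "lowered": sum(1 for d in diffs if d < 0),
    }


# Illustrative pairs only.
summary = compare_scores([(5.0, 5.0), (4.0, 4.5), (6.0, 5.0), (3.0, 3.0)])
print(summary)
```

A mean difference near zero together with a non-trivial mean absolute difference is the pattern seen in the table: revisions in both directions that largely cancel out on average.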

By problem, the comparison looked like this.

Problem	AI avg.	Teacher-final avg.	Mean diff. (teacher minus AI)	Mean abs. diff.	Revised items
1	4.008	4.211	+0.203	0.234	15
2	4.508	4.367	-0.141	0.188	10
3	5.234	5.227	-0.008	0.102	9
4	4.203	4.172	-0.031	0.250	19
5	4.898	4.938	+0.039	0.289	19

Problem 3 had the smallest difference between AI and teacher decisions. Problems 4 and 5 required more teacher revisions. This is useful because it tells me which problem criteria need more careful wording.

Class-level totals based on teacher-final scores were:

Class	Students	Average total	Minimum	Maximum
11	16	25.31	15.5	29.5
12	16	21.75	2.0	30.0
13	16	24.69	10.0	30.0
14	16	19.91	2.0	30.0
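A class-level summary like this can be computed by grouping student totals on the class part of the ID. The sketch assumes the class number is the student ID divided by 100 (an inference from IDs such as 1101), and the totals in the example are made up.

```python
from collections import defaultdict


def class_summary(totals: dict[int, float]) -> dict[int, dict]:
    """Group teacher-final totals by class and summarize each class."""
    by_class = defaultdict(list)
    for student_id, total in totals.items():
        by_class[student_id // 100].append(total)  # assumption: 1101 -> class 11
    return {
        class_no: {
            "students": len(values),
            "average": round(sum(values) / len(values), 2),
            "min": min(values),
            "max": max(values),
        }
        for class_no, values in sorted(by_class.items())
    }


# Illustrative totals only.
result = class_summary({1101: 25.0, 1102: 29.5, 1201: 2.0, 1202: 30.0})
print(result)
```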

The effects can be summarized in three points.

First, all 320 problem answers were organized into consistent image and JSON units. The teacher no longer has to navigate a full PDF to find one student's answer to one problem.

Second, the 1,280 AI grading runs separated stable items from suspicious ones. 238 items received identical scores in all four runs, while the 21 items whose run range exceeded 1 point stood out as candidates for closer review.

Third, teacher-final scores were stored separately and could be compared with AI results. The mean difference was small, but 72 items were actually revised by the teacher. AI was not the final decision maker; it produced a first-pass draft that made teacher review faster and more structured.

Workflow dashboard

gui/workflow_dashboard.py groups the written-response workflow into one screen.

1 Preprocessing
  PDF selection → sample rendering → problem box setup → student-problem images

2 Criteria / AI grading
  criteria editor → CLI grading → refresh results

3 Statistics / results
  class averages → result check → final export

The working order is:

Select the answer PDF.
Set cover-page skipping, class count, student count, and first student number.
Mark problem boxes.
Generate student-problem images.
Check the criteria JSON.
Run AI first-pass grading.
Revise unstable criteria and regrade.
Confirm final teacher scores.
Check class averages and exported results.

This dashboard reduces the fragmentation of written-response grading. Preprocessing, criteria editing, AI grading, and result checking are connected in one workflow.

Remaining code-structure tasks

docs/AUDIT_REPORT.md records several structural issues to keep in mind.

First, AI scores and teacher-final scores must not be mixed. If statistics, CSV export, or completion rates only use AI scores, the final confirmation step loses meaning.

Second, the logic shared by the CLI and the GUI should be factored out further. Prompt generation, API calls, result normalization, and saving logic should not be duplicated in multiple places.

Third, the role of problem JSON files and teacher-final evaluation files must stay clear. If the same score is stored in too many places, it becomes difficult to know which value is authoritative.

These are not failures of the idea. They are cleanup tasks that became visible after real use. Written-response grading has a complex data flow, so storage responsibility must be explicit.

Connection between AirClassGrading 1 and 2

The Teams assignment flow and the written-response grading flow use different inputs, but they address the same problem.

Category	AirClassGrading 1	AirClassGrading 2
Target	Teams assignment submissions	scanned written-response answer PDF
Input	Teams submissions and attachments	answer PDF, problem PDF, solution PDF
Grading unit	student submission / assignment	one student × one problem
Criteria	Teams instructions, rubric, snapshot	`grading_criteria.json` partial-credit criteria
Preprocessing	file download, PDF rendering, artifact normalization	PDF rendering and problem-level crop images
AI role	assignment review, feedback draft, score/grade suggestion	problem-level first-pass grading and partial-credit support
Teacher role	review confirmation, final score/comment	final score confirmation per problem
Output	Teams feedback and review DB	final scores, class averages, result files

The core idea is the same:

Let AI organize and review student work first, while keeping final judgment and criteria management in the teacher's hands.

Summary

Applying AI to written-response grading requires more than sending images to a model. The answer PDF must first be split into student-problem units. Partial-credit criteria must be structured. AI first-pass results must be repeated and checked for instability.

The important design choices were:

defining the grading unit as one student × one problem;
generating problem-level crop images from the answer PDF;
managing partial-credit criteria in grading_criteria.json;
repeating AI first-pass grading to find ambiguous criteria;
storing teacher-final scores separately from AI results;
connecting preprocessing, grading, and result checking through a workflow dashboard.

AirClassGrading 2 shows how AI can be placed inside written-response grading without handing over the final decision. The AI prepares a structured first pass, and the teacher makes the final judgment more quickly and consistently.