[AirClass Dev Log 6] AirClassGrading 1: Automated Grading for Teams Assignments

What this post covers

AirClassGrading is the grading support system I built to reduce the workload that comes after class. This sixth post focuses on the first major flow: automated grading for Microsoft Teams assignments.

The goal is not simply to ask an AI model for a score. The real goal is to turn the repetitive work around Teams assignments into a reproducible pipeline.

Fetch Teams assignments and submissions.
Download each student's attached files.
Normalize PDFs, photos, and code files into AI-readable artifacts.
Preserve the assignment instructions and rubrics as snapshots.
Run an AI-based first review.
Let the teacher confirm or revise the result in a review web UI.
Send feedback back to Teams when needed.

The overall flow looks like this.

direction: right

teams: "Microsoft Teams\nAssignments / submissions"
graph: "Graph API\nassignment and submission lookup"
prepare: "prepare step\ndownload and render files"
context: "assignment context snapshot\ninstructions and rubric"
ai: "AI review\nevidence and feedback draft"
db: "database\nreviews / snapshots / notifications"
teacher: "teacher review UI\nfinal score and comment"
feedback: "Teams feedback\nqueue / worker"

teams -> graph -> prepare -> context -> ai -> db -> teacher -> feedback

The written-response image grading flow is different enough that I separated it into the next post. This post focuses only on the Teams assignment grading flow.

Why Teams assignment grading needed automation

Teams assignments are useful. Students can upload files, and teachers can check submission status. But real grading work does not end with opening the Teams screen.

A teacher repeatedly has to do the following.

Check who submitted and who did not.
Check whether each submission is a PDF, photo, code file, or something else.
Confirm that the files open correctly.
Notice whether a student uploaded an old file or an unrelated file.
Re-read the assignment instructions.
Compare the submission against the criteria.
Write feedback for the student.

This becomes especially heavy for portfolio or performance-assessment tasks, where a single submission may contain multiple photos, PDFs, code files, and explanations. Before AI can grade anything, the system first needs to make the submissions gradeable.

So the first goal of AirClassGrading was this:

Collect the scattered Teams submissions and assignment criteria, then organize them so both AI and the teacher can review the same materials.

Code structure

The Teams assignment grading flow is organized under legacy_airclass_engine.

legacy_airclass_engine/
  core/
    engine_api.py
    review_runner.py
    review_store.py
    assignment_context.py
    assignment_loader.py
    grade_prepare_service.py
    prompt_builder.py
    rubric_loader.py
    second_pass_review_prompt_builder.py
    slim_prompt_builder.py
    web_review_app.py

  scripts/
    prepare_portfolio.py
    prepare_quiz.py
    finalize_portfolio.py
    finalize_quiz.py
    generate_grades_csv.py
    generate_quiz_review_csv.py
    run_worker.py
    auto_feedback_worker.py

  teams/
    teams_graph.py
    teams_feedback_writer.py
    teams_account_sender.py
    teams_bot_sender.py
    teams_assignment_live.py

The responsibilities are roughly split like this.

Area	Role
`core`	API, review execution, DB storage, prompt generation, assignment context management
`scripts`	submission collection, preprocessing, grading execution, CSV generation, worker execution
`teams`	Microsoft Teams / Graph API integration and feedback delivery
`docs`, `k8s`, `argocd`	operation documents and deployment configuration

Three documents in the repository's docs/ directory describe this flow in more operational detail: docs/ENGINE_API_README.md, docs/WEB_REVIEW_README.md, and docs/TEAMS_GRAPH_API.md.

engine_api.py: the center of the API and review UI

engine_api.py is the integrated FastAPI app for the Teams assignment grading flow. It is not just an API server. It also includes the prepare API, teacher review pages, snapshot management, and feedback queue registration.

Its main responsibilities are:

Teams submission preprocessing API
assignment context snapshot creation and comparison
AI review entry point
teacher review list, detail, and edit pages
review row synchronization
feedback notification queue registration
serving rendered submission artifacts such as images and PDFs

A local run looks like this.

python3 -m uvicorn engine_api:app --host 0.0.0.0 --port 8092

In Docker or k3s deployment, the app runs from /app, and /app/out is kept on a PVC or host volume. Rendered PDFs, manifests, and debug JSON files should not disappear when the container restarts.

Real assignment instructions become a prompt bundle

A key part of AirClassGrading 1 is that Teams instructions are not just reread manually. They are transformed into a bundle of prompts so AI can make separate judgments step by step.

For the actual week 1 portfolio assignment, one Teams assignment instruction became five prompt roles.

Prompt	Role
`fitPrompt`	decide whether this is a valid submission for the current assignment
`structurePrompt`	evaluate structure only
`diligencePrompt`	evaluate diligence only
`formatPrompt`	check format compliance only
`finalReviewPrompt`	synthesize the previous results into a final review

For example, fitPrompt does not assign a score. It only checks whether the submission fits the current assignment.

Assignment: Portfolio performance task, week 1 summary
Role: judge submission fit only.
Do not score structure, diligence, or format in this prompt.
Only decide whether this can be treated as a submission for the current assignment.

Return JSON only:
{
  "content_fit": "fit|mismatch|uncertain",
  "submission_verdict": "valid|ineligible|uncertain",
  "reasons": ["string"],
  "uncertainties": ["string"]
}

This step exists to prevent unrelated submissions from being graded as if they were valid. If the system evaluates structure or diligence before checking fit, the model can create plausible but misleading grading reasons.
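A minimal sketch of how the fit result could be validated before any later prompt runs. The names `FitResult`, `parse_fit_result`, and `should_continue_grading` are hypothetical and are not taken from the AirClassGrading code. Only the JSON field names and allowed values come from the prompt above. Treating a malformed reply as "uncertain" is an assumption of this sketch.

```python
import json
from dataclasses import dataclass, field
from typing import List

# Allowed values copied from the fitPrompt JSON schema.
CONTENT_FIT = {"fit", "mismatch", "uncertain"}
SUBMISSION_VERDICT = {"valid", "ineligible", "uncertain"}


@dataclass
class FitResult:
    content_fit: str
    submission_verdict: str
    reasons: List[str] = field(default_factory=list)
    uncertainties: List[str] = field(default_factory=list)


def _uncertain(note: str) -> FitResult:
    # Assumption: a malformed reply is never treated as valid or ineligible.
    return FitResult("uncertain", "uncertain", uncertainties=[note])


def parse_fit_result(raw: str) -> FitResult:
    """Parse the model reply and fall back to 'uncertain' when it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return _uncertain("reply was not valid JSON")
    if not isinstance(data, dict):
        return _uncertain("reply was not a JSON object")
    fit = data.get("content_fit")
    verdict = data.get("submission_verdict")
    if not (isinstance(fit, str) and fit in CONTENT_FIT):
        return _uncertain("unexpected content_fit value")
    if not (isinstance(verdict, str) and verdict in SUBMISSION_VERDICT):
        return _uncertain("unexpected submission_verdict value")
    return FitResult(
        fit,
        verdict,
        [str(r) for r in data.get("reasons") or []],
        [str(u) for u in data.get("uncertainties") or []],
    )


def should_continue_grading(result: FitResult) -> bool:
    """Only submissions judged valid move on to structure and diligence."""
    return result.submission_verdict == "valid"
```

With a gate like this, an "uncertain" or "ineligible" verdict can be routed to the teacher instead of being scored.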

Structure and diligence are evaluated with separate prompts

Portfolio grading uses two main dimensions: structure and diligence. I separated them because AI can otherwise borrow evidence from one dimension to justify the other.

The actual structurePrompt focuses only on structure.

Role: evaluate structure only.
Structure means the organization, separation, flow, and readability of concept notes.

Check:
- whether a concept-summary section is visibly separated
- whether key concepts are organized by item or heading
- whether titles, sections, and ordering make the flow readable
- whether the concept explanation is too empty or mixed together

Principles:
- use only organization, separation, and readability of concept notes as evidence
- do not use amount of problem solving, number of problems, page count, or ratio as structure evidence

The diligencePrompt focuses only on problem-solving effort.

Role: evaluate diligence only.
Diligence means problem numbers, number of problems, solution steps, and writing density.

Check:
- whether problem numbers or problem divisions are visible
- whether solution steps are written progressively
- whether the amount and density of writing are sufficient
- whether the required problem range appears to be reflected

Principles:
- use only problem count, solution process, and writing effort as diligence evidence
- do not use amount, ratio, or organization of concept notes as diligence evidence

This separation mattered in operation. A neatly organized concept note does not automatically prove strong problem-solving effort, and solving many problems does not automatically mean the concept summary is well structured.

Observe pages first, then produce the final review

There is also a stage where AI is not asked to grade at all. It first observes each page and records only what is visible, in a structured form.

{
  "page_type": "concept_summary|problem_solving|retry_or_correction|mixed|unclear",
  "readability": "high|medium|low",
  "observations": [
    "visible fact 1",
    "visible fact 2"
  ],
  "signals": {
    "has_concept_heading": false,
    "has_structured_flow": false,
    "has_restarted_problem_numbers": false,
    "has_problem_source": false,
    "has_step_by_step_solution": false,
    "has_answer_only_pattern": false,
    "has_retry_or_correction": false,
    "shows_diligent_organization": false,
    "shows_sparse_or_low_effort_notes": false
  },
  "uncertainties": [
    "reading limitation or uncertain point"
  ]
}

The goal of this step is evidence collection, not grading. Instead of immediately producing A/B/C/D, the model first records signals such as whether concept headings are visible, whether the solution is step-by-step, or whether the page mostly contains answers only.
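As an illustration, per-page observations in the shape above could be rolled up into a per-submission summary before any grading prompt sees them. This is a sketch only. `summarize_observations` and its output keys are hypothetical names, and the real system may aggregate differently.

```python
from collections import Counter
from typing import Dict, List


def summarize_observations(pages: List[dict]) -> Dict[str, object]:
    """Count page types and how many pages raised each boolean signal."""
    page_types = Counter(page.get("page_type", "unclear") for page in pages)
    signal_counts: Counter = Counter()
    for page in pages:
        for name, value in page.get("signals", {}).items():
            # Count only explicit True values; missing or False signals add nothing.
            if value is True:
                signal_counts[name] += 1
    low_readability = sum(1 for page in pages if page.get("readability") == "low")
    return {
        "page_count": len(pages),
        "page_types": dict(page_types),
        "signal_counts": dict(signal_counts),
        "low_readability_pages": low_readability,
    }
```

A summary like this keeps the evidence countable: "step-by-step solutions on 2 of 3 pages" is easier for a teacher to check than a bare grade.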

The final review prompt then synthesizes the earlier outputs.

Role: synthesize fit, structure, diligence, and format results.
You are the final reviewer and re-grader.
Use the first-pass results as references, but confirm the final grades and student feedback consistently.

Final review order:
1) check fit first to confirm this is a submission for the current assignment
2) review structure only through concept-note organization, separation, and readability
3) review diligence only through problem count, solution process, and writing effort
4) check whether format violations should affect the final result

Core principles:
- structure and diligence are independent areas
- do not move evidence from one area into the other area's score
- ineligible decisions must be based on fit and the main visible content of the submission

With this design, AI grading is not a single black-box answer. Fit, structure, diligence, format, and final synthesis each leave separate traces.
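The staged order can be sketched as a small orchestration function. Everything here is an assumption made for illustration: `run_prompt_bundle` is a hypothetical name, `ask_model` stands in for whatever client actually calls the model, and stopping early on a non-valid fit verdict is one possible policy rather than the confirmed behavior of review_runner.py. Only the five prompt names come from the bundle described earlier.

```python
from typing import Callable, Dict

# Prompts that run only after the fit check passes.
SCORING_STAGES = ["structurePrompt", "diligencePrompt", "formatPrompt"]


def run_prompt_bundle(
    bundle: Dict[str, str],
    ask_model: Callable[[str, dict], dict],
) -> dict:
    """Run fit first, then the independent stages, then the final synthesis.

    ask_model(prompt_text, earlier_results) is a hypothetical callable that
    returns the parsed JSON reply for one prompt.
    """
    trace: Dict[str, dict] = {}

    # 1) Fit is checked before anything is scored.
    trace["fitPrompt"] = ask_model(bundle["fitPrompt"], {})
    if trace["fitPrompt"].get("submission_verdict") != "valid":
        # Assumed policy: leave mismatched or uncertain submissions to the teacher.
        return {"status": "needs_teacher_check", "trace": trace}

    # 2) Each stage gets no earlier results, so evidence cannot leak between them.
    for name in SCORING_STAGES:
        trace[name] = ask_model(bundle[name], {})

    # 3) Only the final review sees all earlier outputs.
    trace["finalReviewPrompt"] = ask_model(bundle["finalReviewPrompt"], dict(trace))
    return {"status": "reviewed", "trace": trace}
```

The returned `trace` holds one entry per stage, which matches the idea that each judgment leaves its own record.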

prepare: turning Teams submissions into gradeable artifacts

The first step of automated grading is prepare. This is not just a download step. It converts Teams submissions into a structure that AI can read reliably and teachers can trace later.

The flow is roughly:

Select the portfolio or quiz prepare flow depending on assignment type.
Fetch submissions and attached resources through Microsoft Graph API.
Render PDF pages into images.
Normalize photo orientation and safe file names.
Create manifest.json and analysisTargets.
Create or connect an assignment context snapshot based on the Teams instructions.
Return prompt bundles, parsed instructions, and grading prompt candidates.
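One small piece of this flow, safe file names, can be sketched with the standard library alone. `safe_file_name` is a hypothetical helper, and the index prefix and underscore replacement are assumptions of this sketch rather than the project's actual naming rule.

```python
import re
import unicodedata
from pathlib import PurePosixPath


def safe_file_name(original: str, index: int) -> str:
    """Turn an uploaded file name into a predictable, path-safe local name."""
    # Drop any directory part, including Windows-style separators.
    name = PurePosixPath(original.replace("\\", "/")).name
    stem, dot, ext = name.rpartition(".")
    if not dot:
        stem, ext = name, ""
    # Normalize Unicode so visually identical names map to the same bytes.
    stem = unicodedata.normalize("NFKC", stem)
    # Keep letters, digits, and underscore (including non-ASCII letters), plus dot and hyphen.
    stem = re.sub(r"[^\w.-]+", "_", stem).strip("._")
    ext = re.sub(r"[^A-Za-z0-9]", "", ext).lower()
    # The numeric prefix preserves submission order and avoids name collisions.
    base = f"{index:03d}_{stem or 'file'}"
    return f"{base}.{ext}" if ext else base
```

For example, `safe_file_name("My Report (final).PDF", 1)` returns `"001_My_Report_final.pdf"`.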

In the Teams UI, one student's submission may look like a single item. In practice, it may contain many pieces.

Student A submission
  report.pdf
    page 1
    page 2
    page 3
  photo_1.jpg
  photo_2.jpg
  code.py

If this is sent to AI without structure, order and source tracking become unreliable. The prepare step breaks submissions into smaller analysis units.

Teams submission
  → resources / submittedResources
  → local files
  → normalized artifacts
  → analysisTargets
  → AI review

analysisTargets tracks which page or file came from which student's submission. This is the key structure that lets AI and the teacher review the same materials in the same order.
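A rough sketch of how one submission could be expanded into ordered analysis targets. The field names, the `AnalysisTarget` class, and the rendered-page naming pattern are all hypothetical. The actual manifest.json layout is not shown in this post.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AnalysisTarget:
    order: int               # position in the review sequence, starting at 1
    student_id: str
    submission_id: str
    source_file: str         # file name as collected from the submission
    page: Optional[int]      # page number for rendered PDFs, None otherwise
    artifact_path: str       # normalized artifact the AI and the teacher both see


def build_analysis_targets(
    student_id: str,
    submission_id: str,
    files: List[Tuple[str, Optional[int]]],
) -> List[AnalysisTarget]:
    """Expand (file name, page count or None) pairs into one target per unit."""
    targets: List[AnalysisTarget] = []
    for source_file, page_count in files:
        if page_count:
            # Multi-page document: one target per rendered page.
            for page in range(1, page_count + 1):
                targets.append(AnalysisTarget(
                    len(targets) + 1, student_id, submission_id,
                    source_file, page, f"{source_file}.p{page:03d}.png",
                ))
        else:
            # Single-unit file such as a photo or a code file.
            targets.append(AnalysisTarget(
                len(targets) + 1, student_id, submission_id,
                source_file, None, source_file,
            ))
    return targets
```

Applied to the Student A example (a three-page PDF, two photos, one code file), this yields six targets numbered 1 to 6, each pointing back to its source file.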

Assignment context snapshots

In automated grading, the criteria matter as much as the submitted files. Teams assignment descriptions can change, and each assignment may have different submission rules, deadlines, late policies, and rubrics.

AirClassGrading stores these criteria as snapshots.

Item	Meaning
live assignment	the current Teams assignment fetched through Graph API
instruction digest	a concise summary of the assignment instructions
parsed instructions	structured information such as deadline, format, and late policy
prompt bundle	prompts used for AI grading
assignment context snapshot	a stored version of the criteria at a specific point in time

Snapshots are necessary because the system must be able to answer: “What criteria were used when this submission was reviewed?”

In the teacher web UI, the current live assignment can be compared with stored snapshots, and snapshots can be compared with each other. This becomes important when the assignment instructions changed during operation.
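One way to detect that instructions changed is to hash a canonical form of the parsed context and list the fields that differ. This is a sketch under assumptions: `snapshot_digest` and `changed_fields` are hypothetical names, and the post does not say whether the real snapshots use hashing at all.

```python
import hashlib
import json
from typing import List

_MISSING = object()  # distinguishes "key absent" from "key present with value None"


def snapshot_digest(context: dict) -> str:
    """Return a stable SHA-256 digest of an assignment context."""
    # sort_keys makes the digest independent of key order.
    canonical = json.dumps(
        context, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def changed_fields(old: dict, new: dict) -> List[str]:
    """List top-level keys whose values differ between two contexts."""
    keys = sorted(set(old) | set(new))
    return [k for k in keys if old.get(k, _MISSING) != new.get(k, _MISSING)]
```

If the digest of the live assignment differs from the stored snapshot, `changed_fields` tells the teacher where to look, for example a moved deadline.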

AI review execution

The AI review does not simply return one score. Depending on the assignment type, it checks several aspects.

For portfolio assignments, the review usually looks at:

assignment relevance
conceptual understanding
structure of explanation or solution
effort and completeness
submission format
late policy application
student-facing feedback draft

For quiz assignments, the review can use checklist-style Pass / NP (no pass) / uncertain decisions. Submissions unrelated to the assignment can be treated as NP or ineligible.

Review execution is connected to review_runner.py. The runner groups prepared targets by student and submission, then runs AI review using the assignment context and rubric.
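The grouping step might look roughly like this. It is only a sketch: `group_targets` and the dictionary keys `student_id`, `submission_id`, and `order` are assumed names, not the actual review_runner.py interface.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_targets(targets: List[dict]) -> Dict[Tuple[str, str], List[dict]]:
    """Group prepared targets by (student, submission), keeping page order."""
    groups: Dict[Tuple[str, str], List[dict]] = defaultdict(list)
    for target in targets:
        groups[(target["student_id"], target["submission_id"])].append(target)
    for items in groups.values():
        # Sort inside each group so the AI sees pages in the prepared order.
        items.sort(key=lambda t: t["order"])
    return dict(groups)
```

Each group can then be reviewed as one unit together with the assignment context and rubric.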

The important point is to avoid a black-box “score only” workflow. The system stores reasoning and feedback drafts so the teacher can inspect the result.

DB storage and teacher review web

AirClassGrading does not end with a single output file. In operation, reviews, criteria, and notification states are stored separately.

Data	Role
`reviews`	AI review result, teacher-confirmed values, status
`assignment_context_snapshots`	assignment instructions and grading criteria snapshots
`review_notifications`	feedback queue and delivery status
`out/` artifacts	rendered PDF images, manifests, debug JSON files

The teacher review web provides these pages.

Page	Role
`/`	recent reviews, AI prediction, teacher confirmation status, feedback status
`/assignments`	Teams assignment list
`/assignments/{assignment_id}`	assignment detail, submission status, snapshots, prompt candidates
`/reviews/{review_id}`	artifact preview, AI evidence, final teacher score and comment

On the review detail page, the teacher can inspect rendered submission images, AI evidence, and the draft feedback. The teacher can then revise and save the final score and comment.
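The idea of keeping the AI result and the teacher-confirmed values side by side can be illustrated with SQLite. The column names, the status values, and the `confirm_review` helper are all hypothetical. The real `reviews` table is not shown in this post and may use a different database.

```python
import sqlite3

# Hypothetical schema: AI output and teacher decision live in separate columns.
SCHEMA = """
CREATE TABLE reviews (
    id              INTEGER PRIMARY KEY,
    submission_id   TEXT NOT NULL,
    ai_grade        TEXT,
    ai_feedback     TEXT,
    teacher_grade   TEXT,
    teacher_comment TEXT,
    status          TEXT NOT NULL DEFAULT 'ai_reviewed'
);
"""


def confirm_review(
    conn: sqlite3.Connection, review_id: int, grade: str, comment: str
) -> bool:
    """Save the teacher's final values without overwriting the AI columns."""
    cursor = conn.execute(
        "UPDATE reviews "
        "SET teacher_grade = ?, teacher_comment = ?, status = 'teacher_confirmed' "
        "WHERE id = ?",
        (grade, comment, review_id),
    )
    conn.commit()
    # rowcount is 0 when no review with that id exists.
    return cursor.rowcount == 1
```

Because the AI columns are never overwritten, it stays possible to compare what the model predicted with what the teacher decided.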

Feedback delivery is separated into queue and workers

Feedback delivery is separated from grading. A completed review is not sent to Teams immediately. It is added to a notification queue and processed by a worker.

Depending on configuration, supported channels can include:

Teams assignment feedback
Teams bot DM
delegated Teams account DM
DB placeholder
dry-run / none

This separation matters because the failure points are different. An AI review can succeed while the Teams API delivery fails. In that case, the system should retry only the feedback delivery, not the whole review.
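A minimal in-memory sketch of that retry behavior. The `Notification` fields, the status names, and `process_queue` are hypothetical. In the real system the queue lives in the `review_notifications` table and `send` would be one of the Teams senders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Notification:
    review_id: int
    channel: str
    status: str = "pending"   # pending -> sent, or failed after too many attempts
    attempts: int = 0
    last_error: str = ""


def process_queue(
    queue: List[Notification],
    send: Callable[[Notification], None],
    max_attempts: int = 3,
) -> None:
    """Try to deliver each unsent notification once; never touch the review itself."""
    for item in queue:
        if item.status == "sent" or item.attempts >= max_attempts:
            continue
        item.attempts += 1
        try:
            send(item)
        except Exception as exc:  # delivery errors are recorded, not raised
            item.last_error = str(exc)
            item.status = "failed" if item.attempts >= max_attempts else "pending"
        else:
            item.status = "sent"
            item.last_error = ""
```

Calling `process_queue` again retries only the items still marked pending, so a Teams outage costs delivery attempts, not AI reviews.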

Worker and deployment structure

The repository root contains Kubernetes (k8s), Argo CD, and Jenkins files.

k8s/
  deployment.yaml
  grading-worker-deployment.yaml
  feedback-worker-deployment.yaml
  retry-worker-deployment.yaml
  regrade-job.yaml
  retry-late-job.yaml
  retry-pending-job.yaml
  service.yaml
  pvc.yaml

argocd/
  airclass-grading-application.yaml

jenkins/
  job-config.xml
Jenkinsfile

Operationally, the components are split like this.

Component	Role
API server	prepare requests, review web, artifact serving
grading worker	AI review execution
feedback worker	student feedback delivery
retry worker/job	retry failed or delayed tasks
PVC / volume	preserve `out/` artifacts

This shows that AirClassGrading is not just a local script. It is designed as an automated grading system with separate API and worker processes.

What the actual artifacts show

I do not include student names, accounts, original file names, or submitted images here. Instead, I only use anonymized aggregate counts from local artifacts.

From the portfolio performance-assessment artifacts under out/, the counts at the time of writing were:

Item	Count	Meaning
portfolio weekly output folders	4	outputs were generated for weeks 1–4
`review_debug.json` files	221	AI reviews ran and stored traceable evidence
prepare `manifest.json` files	16	submission collection and preprocessing ran repeatedly
top-level manifest `analysisTargets`	42	submitted files were broken into AI-readable analysis units

The weekly distribution of review_debug.json was:

Week	Review debug records
Portfolio week 1	47
Portfolio week 2	76
Portfolio week 3	95
Portfolio week 4	3
Total	221
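Counts like these can be reproduced with a short script. This is a sketch that assumes each week has its own folder directly under the output root. The actual layout of out/ is not shown here, and `count_review_debug` is a hypothetical helper.

```python
from collections import Counter
from pathlib import Path
from typing import Dict


def count_review_debug(out_root: Path) -> Dict[str, int]:
    """Count review_debug.json files per first-level folder under out_root."""
    counts: Counter = Counter()
    for path in out_root.rglob("review_debug.json"):
        relative = path.relative_to(out_root)
        # Files directly in out_root have no week folder; group them under ".".
        week = relative.parts[0] if len(relative.parts) > 1 else "."
        counts[week] += 1
    return dict(sorted(counts.items()))
```

Summing the returned values gives the total, which for the table above is 47 + 76 + 95 + 3 = 221.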

The key point is not the score itself. The value is that the workflow leaves reproducible artifacts:

submission collection and normalization are recorded through manifests and analysisTargets;
AI evidence is traceable through review_debug.json;
weekly distributions make it easier to see whether a problem was assignment-specific or systemic;
review artifacts can be reused by feedback workers and retry flows.

In other words, the practical effect of AirClassGrading 1 is that submission collection → preprocessing → AI review → teacher confirmation becomes a data flow that can be inspected and rerun.

What this flow achieved

The biggest result of the Teams grading flow was not just “AI grading.” It was the structure around it.

A reliable way to collect submissions.
A prepare step that turns files into AI-readable artifacts.
A snapshot mechanism for assignment criteria.
A DB structure that separates AI review and teacher confirmation.
A feedback queue and worker model for operation.

Changing the AI model is relatively easy. But without a stable flow for submissions, criteria, results, and feedback, the system cannot be used in a real classroom. AirClassGrading 1 was the attempt to organize that flow around Teams assignments.

The next problem

Teams assignment grading works well for file-submission tasks. But school assessment has another major format: written-response answer sheets.

Written-response sheets are different from Teams submissions.

One PDF can contain many students' answers.
One student page can contain several problems.
Each problem has its own partial-credit criteria.
AI first-pass results need repeated checks for stability.
Final scores must be confirmed per problem by the teacher.

The next post covers the second AirClassGrading flow: applying AI to written-response grading. It explains PDF preprocessing, problem-level image crops, partial-credit criteria JSON, repeated AI first-pass grading, and final teacher confirmation.