[AirClass Dev Log 6] AirClassGrading 1: Automated Grading for Teams Assignments

What this post covers

AirClassGrading is the grading support system I built to reduce the workload that comes after class. This sixth post focuses on the first major flow: automated grading for Microsoft Teams assignments.

The goal is not simply to ask an AI model for a score. The real goal is to turn the repetitive work around Teams assignments into a reproducible pipeline.

  • Fetch Teams assignments and submissions.
  • Download each student's attached files.
  • Normalize PDFs, photos, and code files into AI-readable artifacts.
  • Preserve the assignment instructions and rubrics as snapshots.
  • Run an AI-based first review.
  • Let the teacher confirm or revise the result in a review web UI.
  • Send feedback back to Teams when needed.

The overall flow looks like this.

direction: right

teams: "Microsoft Teams\nAssignments / submissions"
graph: "Graph API\nassignment and submission lookup"
prepare: "prepare step\ndownload and render files"
context: "assignment context snapshot\ninstructions and rubric"
ai: "AI review\nevidence and feedback draft"
db: "database\nreviews / snapshots / notifications"
teacher: "teacher review UI\nfinal score and comment"
feedback: "Teams feedback\nqueue / worker"

teams -> graph -> prepare -> context -> ai -> db -> teacher -> feedback

The written-response image grading flow is different enough that I separated it into the next post. This post focuses only on the Teams assignment grading flow.

Why Teams assignment grading needed automation

Teams assignments are useful. Students can upload files, and teachers can check submission status. But real grading work does not end with opening the Teams screen.

A teacher repeatedly has to do the following.

  • Check who submitted and who did not.
  • Check whether each submission is a PDF, photo, code file, or something else.
  • Confirm that the files open correctly.
  • Notice whether a student uploaded an old file or an unrelated file.
  • Re-read the assignment instructions.
  • Compare the submission against the criteria.
  • Write feedback for the student.

This becomes especially heavy for portfolio or performance-assessment tasks, where a single submission may contain multiple photos, PDFs, code files, and explanations. Before AI can grade anything, the system first needs to make the submissions gradeable.

So the first goal of AirClassGrading was this:

Collect the scattered Teams submissions and assignment criteria, then organize them so both AI and the teacher can review the same materials.

Code structure

The Teams assignment grading flow is organized under legacy_airclass_engine.

legacy_airclass_engine/
  core/
    engine_api.py
    review_runner.py
    review_store.py
    assignment_context.py
    assignment_loader.py
    grade_prepare_service.py
    prompt_builder.py
    rubric_loader.py
    second_pass_review_prompt_builder.py
    slim_prompt_builder.py
    web_review_app.py

  scripts/
    prepare_portfolio.py
    prepare_quiz.py
    finalize_portfolio.py
    finalize_quiz.py
    generate_grades_csv.py
    generate_quiz_review_csv.py
    run_worker.py
    auto_feedback_worker.py

  teams/
    teams_graph.py
    teams_feedback_writer.py
    teams_account_sender.py
    teams_bot_sender.py
    teams_assignment_live.py

The responsibilities are roughly split like this.

Area | Role
core | API, review execution, DB storage, prompt generation, assignment context management
scripts | submission collection, preprocessing, grading execution, CSV generation, worker execution
teams | Microsoft Teams / Graph API integration and feedback delivery
docs, k8s, argocd | operation documents and deployment configuration

Three documents under the root docs/ folder describe this flow in more operational detail: docs/ENGINE_API_README.md, docs/WEB_REVIEW_README.md, and docs/TEAMS_GRAPH_API.md.

engine_api.py: the center of the API and review UI

engine_api.py is the integrated FastAPI app for the Teams assignment grading flow. It is not just an API server. It also includes the prepare API, teacher review pages, snapshot management, and feedback queue registration.

Its main responsibilities are:

  • Teams submission preprocessing API
  • assignment context snapshot creation and comparison
  • AI review entry point
  • teacher review list, detail, and edit pages
  • review row synchronization
  • feedback notification queue registration
  • serving rendered submission artifacts such as images and PDFs

A local run looks like this.

python3 -m uvicorn engine_api:app --host 0.0.0.0 --port 8092

In a Docker or k3s deployment, the app runs from /app, and /app/out is kept on a PVC or host volume so that rendered PDFs, manifests, and debug JSON files survive container restarts.

Real assignment instructions become a prompt bundle

A key part of AirClassGrading 1 is that Teams instructions are not just reread manually. They are transformed into a bundle of prompts so AI can make separate judgments step by step.

For the actual week 1 portfolio assignment, one Teams assignment instruction became five prompt roles.

Prompt | Role
fitPrompt | decide whether this is a valid submission for the current assignment
structurePrompt | evaluate structure only
diligencePrompt | evaluate diligence only
formatPrompt | check format compliance only
finalReviewPrompt | synthesize the previous results into a final review

For example, fitPrompt does not assign a score. It only checks whether the submission fits the current assignment.

Assignment: Portfolio performance task, week 1 summary
Role: judge submission fit only.
Do not score structure, diligence, or format in this prompt.
Only decide whether this can be treated as a submission for the current assignment.

Return JSON only:
{
  "content_fit": "fit|mismatch|uncertain",
  "submission_verdict": "valid|ineligible|uncertain",
  "reasons": ["string"],
  "uncertainties": ["string"]
}

This step exists to prevent unrelated submissions from being graded as if they were valid. If the system evaluates structure or diligence before checking fit, the model can create plausible but misleading grading reasons.
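
As a minimal sketch of how that gate could work, assume the model's reply arrives as a JSON string in the shape shown above. The names FitResult and parse_fit_result are hypothetical and are not taken from the AirClassGrading code. Malformed or unexpected replies are treated as uncertain rather than trusted.

```python
import json
from dataclasses import dataclass, field

# Allowed values, copied from the fitPrompt JSON contract above.
CONTENT_FIT = ("fit", "mismatch", "uncertain")
VERDICTS = ("valid", "ineligible", "uncertain")


@dataclass
class FitResult:
    content_fit: str
    submission_verdict: str
    reasons: list = field(default_factory=list)
    uncertainties: list = field(default_factory=list)

    @property
    def should_score(self) -> bool:
        # Assumption: only clearly valid submissions move on to scoring.
        return self.submission_verdict == "valid"


def _uncertain(note: str) -> FitResult:
    return FitResult("uncertain", "uncertain", uncertainties=[note])


def _string_list(value) -> list:
    # Accept only real lists; anything else becomes an empty list.
    return [str(item) for item in value] if isinstance(value, list) else []


def parse_fit_result(raw: str) -> FitResult:
    """Parse a fitPrompt reply, falling back to 'uncertain' on bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return _uncertain("reply was not valid JSON")
    if not isinstance(data, dict):
        return _uncertain("reply was not a JSON object")
    fit = data.get("content_fit")
    verdict = data.get("submission_verdict")
    if fit not in CONTENT_FIT or verdict not in VERDICTS:
        return _uncertain("unexpected content_fit or submission_verdict value")
    return FitResult(
        fit,
        verdict,
        _string_list(data.get("reasons")),
        _string_list(data.get("uncertainties")),
    )
```

In this sketch, an uncertain result does not proceed to scoring automatically. Whether it should go to the teacher instead is a policy choice that the post does not specify.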

Structure and diligence are evaluated with separate prompts

Portfolio grading uses two main dimensions: structure and diligence. I separated them because AI can otherwise borrow evidence from one dimension to justify the other.

The actual structurePrompt focuses only on structure.

Role: evaluate structure only.
Structure means the organization, separation, flow, and readability of concept notes.

Check:
- whether a concept-summary section is visibly separated
- whether key concepts are organized by item or heading
- whether titles, sections, and ordering make the flow readable
- whether the concept explanation is too empty or mixed together

Principles:
- use only organization, separation, and readability of concept notes as evidence
- do not use amount of problem solving, number of problems, page count, or ratio as structure evidence

The diligencePrompt focuses only on problem-solving effort.

Role: evaluate diligence only.
Diligence means problem numbers, number of problems, solution steps, and writing density.

Check:
- whether problem numbers or problem divisions are visible
- whether solution steps are written progressively
- whether the amount and density of writing are sufficient
- whether the required problem range appears to be reflected

Principles:
- use only problem count, solution process, and writing effort as diligence evidence
- do not use amount, ratio, or organization of concept notes as diligence evidence

This separation mattered in operation. A neatly organized concept note does not automatically prove strong problem-solving effort, and solving many problems does not automatically mean the concept summary is well structured.

Observe pages first, then produce the final review

There is also a stage where AI is not asked to grade at all. It first observes each page and structures only what is visible.

{
  "page_type": "concept_summary|problem_solving|retry_or_correction|mixed|unclear",
  "readability": "high|medium|low",
  "observations": [
    "visible fact 1",
    "visible fact 2"
  ],
  "signals": {
    "has_concept_heading": false,
    "has_structured_flow": false,
    "has_restarted_problem_numbers": false,
    "has_problem_source": false,
    "has_step_by_step_solution": false,
    "has_answer_only_pattern": false,
    "has_retry_or_correction": false,
    "shows_diligent_organization": false,
    "shows_sparse_or_low_effort_notes": false
  },
  "uncertainties": [
    "reading limitation or uncertain point"
  ]
}

The goal of this step is evidence collection, not grading. Instead of immediately producing A/B/C/D, the model first records signals such as whether concept headings are visible, whether the solution is step-by-step, or whether the page mostly contains answers only.
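
To illustrate how per-page observations might be rolled up before any grading prompt runs, here is a small sketch. It assumes each page record is a dict in the JSON shape above. The function name summarize_pages and the shape of its return value are hypothetical.

```python
from collections import Counter


def summarize_pages(pages):
    """Count page types and how many pages set each boolean signal to true."""
    page_types = Counter()
    signal_counts = Counter()
    for page in pages:
        page_types[page.get("page_type", "unclear")] += 1
        for name, value in page.get("signals", {}).items():
            if value is True:  # ignore missing or non-boolean values
                signal_counts[name] += 1
    return {
        "page_types": dict(page_types),
        "signal_counts": dict(signal_counts),
    }
```

A summary like this contains only counts of visible facts, so it can be handed to the structure and diligence prompts without carrying a grade with it.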

The final review prompt then synthesizes the earlier outputs.

Role: synthesize fit, structure, diligence, and format results.
You are the final reviewer and re-grader.
Use the first-pass results as references, but confirm the final grades and student feedback consistently.

Final review order:
1) check fit first to confirm this is a submission for the current assignment
2) review structure only through concept-note organization, separation, and readability
3) review diligence only through problem count, solution process, and writing effort
4) check whether format violations should affect the final result

Core principles:
- structure and diligence are independent areas
- do not move evidence from one area into the other area's score
- ineligible decisions must be based on fit and the main visible content of the submission

With this design, AI grading is not a single black-box answer. Fit, structure, diligence, format, and final synthesis each leave separate traces.
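
The following sketch shows one way the five prompts could be run in order while keeping a separate trace per step. It is an illustration under assumptions: ask_model is a hypothetical callable that sends a prompt and returns parsed JSON, and stopping early on an ineligible verdict is my assumption, not something the post confirms.

```python
import json
from typing import Callable, Dict

# Steps that run only after the fit check; names match the prompt bundle table.
SCORING_STEPS = ("structurePrompt", "diligencePrompt", "formatPrompt")


def run_prompt_bundle(bundle: Dict[str, str], ask_model: Callable[[str], dict]) -> dict:
    """Run fit first, then the independent checks, then the final synthesis."""
    traces = {"fitPrompt": ask_model(bundle["fitPrompt"])}

    # Assumption: an ineligible submission is not scored any further.
    if traces["fitPrompt"].get("submission_verdict") == "ineligible":
        return {"traces": traces, "final": None}

    # Each dimension gets its own call, so its evidence stays separate.
    for step in SCORING_STEPS:
        traces[step] = ask_model(bundle[step])

    # The final reviewer sees the first-pass results as reference material.
    first_pass = json.dumps(traces, ensure_ascii=False, indent=2)
    final = ask_model(bundle["finalReviewPrompt"] + "\n\nFirst-pass results:\n" + first_pass)
    return {"traces": traces, "final": final}
```

Because ask_model is injected, the same function can be exercised with a stub in tests and with a real model client in operation.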

prepare: turning Teams submissions into gradeable artifacts

The first step of automated grading is prepare. This is not just a download step. It converts Teams submissions into a structure that AI can read reliably and teachers can trace later.

The flow is roughly:

  1. Select the portfolio or quiz prepare flow depending on assignment type.
  2. Fetch submissions and attached resources through Microsoft Graph API.
  3. Render PDF pages into images.
  4. Normalize photo orientation and safe file names.
  5. Create manifest.json and analysisTargets.
  6. Create or connect an assignment context snapshot based on the Teams instructions.
  7. Return prompt bundles, parsed instructions, and grading prompt candidates.
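
Step 4 mentions safe file names. As a rough sketch of what that normalization could look like, the function below strips directory parts, replaces unsafe characters, and prefixes an index so names stay unique and ordered. The function name safe_file_name and the exact naming scheme are assumptions for illustration, not the project's actual rules.

```python
import re
import unicodedata
from pathlib import PurePosixPath


def safe_file_name(original: str, index: int) -> str:
    """Return a filesystem-safe name that keeps the extension and upload order."""
    # Drop any directory components, including Windows-style separators.
    name = PurePosixPath(original.replace("\\", "/")).name
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    # Normalize Unicode forms, then replace runs of unsafe characters.
    stem = unicodedata.normalize("NFKC", stem)
    stem = re.sub(r"[^\w.-]+", "_", stem).strip("._")
    ext = re.sub(r"[^A-Za-z0-9]", "", ext).lower()
    base = f"{index:03d}_{stem or 'file'}"
    return f"{base}.{ext}" if ext else base
```

Since \w matches Unicode letters in Python 3, non-English file names keep their readable characters instead of collapsing into underscores.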

In the Teams UI, one student's submission may look like a single item. In practice, it may contain many pieces.

Student A submission
  report.pdf
    page 1
    page 2
    page 3
  photo_1.jpg
  photo_2.jpg
  code.py

If this is sent to AI without structure, order and source tracking become unreliable. The prepare step breaks submissions into smaller analysis units.

Teams submission
  → resources / submittedResources
  → local files
  → normalized artifacts
  → analysisTargets
  → AI review

analysisTargets tracks which page or file came from which student's submission. This is the key structure that lets AI and the teacher review the same materials in the same order.
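
The real manifest schema is not shown in this post, so the sketch below only illustrates the idea: each page or file becomes one ordered unit that remembers where it came from. The AnalysisTarget fields and the build_analysis_targets function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AnalysisTarget:
    student_key: str        # anonymized student identifier (assumed field)
    submission_id: str
    source_file: str
    page: Optional[int]     # 1-based page for rendered PDFs, None otherwise
    order: int              # global review order within the submission


def build_analysis_targets(
    student_key: str,
    submission_id: str,
    files: List[Tuple[str, Optional[int]]],
) -> List[AnalysisTarget]:
    """Expand (file_name, page_count) pairs into ordered analysis units."""
    targets = []
    order = 0
    for file_name, page_count in files:
        # Paged files yield one target per page; other files yield one target.
        pages = range(1, page_count + 1) if page_count else [None]
        for page in pages:
            order += 1
            targets.append(
                AnalysisTarget(student_key, submission_id, file_name, page, order)
            )
    return targets
```

For the Student A example above (a three-page PDF, two photos, and one code file), this produces six targets numbered 1 through 6.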

Assignment context snapshots

In automated grading, the criteria matter as much as the submitted files. Teams assignment descriptions can change, and each assignment may have different submission rules, deadlines, late policies, and rubrics.

AirClassGrading stores these criteria as snapshots.

Item | Meaning
live assignment | the current Teams assignment fetched through Graph API
instruction digest | a concise summary of the assignment instructions
parsed instructions | structured information such as deadline, format, and late policy
prompt bundle | prompts used for AI grading
assignment context snapshot | a stored version of the criteria at a specific point in time

Snapshots are necessary because the system must be able to answer: “What criteria were used when this submission was reviewed?”

In the teacher web UI, the current live assignment can be compared with stored snapshots, and snapshots can be compared with each other. This becomes important when the assignment instructions changed during operation.
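
One simple way to detect and show such changes is sketched below, using only the Python standard library. It assumes a snapshot can be represented as a JSON-serializable dict with an instructions string. The function names are hypothetical. The fingerprint here is a content hash and is not the same thing as the "instruction digest" in the table, which is a summary.

```python
import difflib
import hashlib
import json


def snapshot_fingerprint(context: dict) -> str:
    """Hash a snapshot in a key-order-independent way to detect changes."""
    canonical = json.dumps(
        context, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_instructions(old: str, new: str) -> list:
    """Return unified-diff lines between stored and live instruction text."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="snapshot",
            tofile="live",
            lineterm="",
        )
    )
```

Comparing fingerprints is enough to decide whether a new snapshot is needed. The diff is what a teacher would actually read.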

AI review execution

The AI review does not simply return one score. Depending on the assignment type, it checks several aspects.

For portfolio assignments, the review usually looks at:

  • assignment relevance
  • conceptual understanding
  • structure of explanation or solution
  • effort and completeness
  • submission format
  • late policy application
  • student-facing feedback draft

For quiz assignments, the review can use checklist-style Pass / NP / uncertain decisions. Submissions unrelated to the assignment can be treated as NP or ineligible.

Review execution is connected to review_runner.py. The runner groups prepared targets by student and submission, then runs AI review using the assignment context and rubric.

The important point is to avoid a black-box “score only” workflow. The system stores reasoning and feedback drafts so the teacher can inspect the result.
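
The grouping step described above can be sketched as follows. The internals of review_runner.py are not shown in this post, so the dict keys (student_key, submission_id, order) and the function name group_targets are assumptions.

```python
from collections import defaultdict


def group_targets(targets):
    """Group prepared targets by (student, submission), keeping review order."""
    groups = defaultdict(list)
    for target in targets:
        key = (target["student_key"], target["submission_id"])
        groups[key].append(target)
    # Sort inside each group so AI and teacher see pages in the same order.
    for items in groups.values():
        items.sort(key=lambda t: t["order"])
    return dict(groups)
```

Each group then corresponds to one review: one set of artifacts, one assignment context, one stored result.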

DB storage and teacher review web

AirClassGrading does not end with a single output file. In operation, reviews, criteria, and notification states are stored separately.

Data | Role
reviews | AI review result, teacher-confirmed values, status
assignment_context_snapshots | assignment instructions and grading criteria snapshots
review_notifications | feedback queue and delivery status
out/ artifacts | rendered PDF images, manifests, debug JSON files

The teacher review web provides these pages.

Page | Role
/ | recent reviews, AI prediction, teacher confirmation status, feedback status
/assignments | Teams assignment list
/assignments/{assignment_id} | assignment detail, submission status, snapshots, prompt candidates
/reviews/{review_id} | artifact preview, AI evidence, final teacher score and comment

On the review detail page, the teacher can inspect rendered submission images, AI evidence, and the draft feedback. The teacher can then revise and save the final score and comment.
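
To make the separation between AI output and teacher confirmation concrete, here is a small sqlite3 sketch. The column names and status values are assumptions for illustration. The actual reviews table and database engine are not described in this post.

```python
import sqlite3

# Hypothetical schema: AI values and teacher values live in separate columns.
SCHEMA = """
CREATE TABLE reviews (
    id INTEGER PRIMARY KEY,
    submission_id TEXT NOT NULL,
    ai_grade TEXT,
    ai_feedback TEXT,
    teacher_grade TEXT,
    teacher_comment TEXT,
    status TEXT NOT NULL DEFAULT 'ai_reviewed'
);
"""


def confirm_review(conn: sqlite3.Connection, review_id: int, grade: str, comment: str) -> bool:
    """Store the teacher's final values without overwriting the AI columns."""
    cur = conn.execute(
        "UPDATE reviews SET teacher_grade = ?, teacher_comment = ?, "
        "status = 'confirmed' WHERE id = ?",
        (grade, comment, review_id),
    )
    conn.commit()
    return cur.rowcount == 1  # False when the review id does not exist
```

Keeping both sets of columns means the AI prediction can still be compared with the teacher's decision after confirmation.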

Feedback delivery is separated into queue and workers

Feedback delivery is separated from grading. A completed review is not sent to Teams immediately. It is added to a notification queue and processed by a worker.

Depending on configuration, supported channels can include:

  • Teams assignment feedback
  • Teams bot DM
  • delegated Teams account DM
  • DB placeholder
  • dry-run / none

This separation matters because the failure points are different. An AI review can succeed while the Teams API delivery fails. In that case, the system should retry only the feedback delivery, not the whole review.
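
A minimal in-memory sketch of that retry behavior follows. The real system stores the queue in review_notifications and runs it in a worker. Here the Notification fields, the process_queue function, and the max_attempts limit are all hypothetical, and send stands in for whichever delivery channel is configured.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Notification:
    review_id: int
    channel: str
    status: str = "pending"   # pending -> sent, or failed after max_attempts
    attempts: int = 0
    last_error: str = ""


def process_queue(
    queue: List[Notification],
    send: Callable[[Notification], None],
    max_attempts: int = 3,
) -> None:
    """Try each pending notification once; the review itself is never rerun."""
    for item in queue:
        if item.status != "pending":
            continue
        item.attempts += 1
        try:
            send(item)
        except Exception as exc:  # delivery failed, keep the item for a retry
            item.last_error = str(exc)
            if item.attempts >= max_attempts:
                item.status = "failed"
        else:
            item.status = "sent"
```

Calling process_queue again later retries only the items still marked pending, which is the behavior the separation is meant to allow.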

Worker and deployment structure

The repository root contains Kubernetes (k8s), Argo CD, and Jenkins files.

k8s/
  deployment.yaml
  grading-worker-deployment.yaml
  feedback-worker-deployment.yaml
  retry-worker-deployment.yaml
  regrade-job.yaml
  retry-late-job.yaml
  retry-pending-job.yaml
  service.yaml
  pvc.yaml

argocd/
  airclass-grading-application.yaml

jenkins/
  job-config.xml
Jenkinsfile

Operationally, the components are split like this.

Component | Role
API server | prepare requests, review web, artifact serving
grading worker | AI review execution
feedback worker | student feedback delivery
retry worker/job | retry failed or delayed tasks
PVC / volume | preserve out/ artifacts

This shows that AirClassGrading is not just a local script. It is designed as an automated grading system with separate API and worker processes.

What the actual artifacts show

I do not include student names, accounts, original file names, or submitted images here. Instead, I only use anonymized aggregate counts from local artifacts.

From the portfolio performance-assessment artifacts under out/, the counts at the time of writing were:

Item | Count | Meaning
portfolio weekly output folders | 4 | outputs were generated for weeks 1–4
review_debug.json files | 221 | AI reviews ran and stored traceable evidence
prepare manifest.json files | 16 | submission collection and preprocessing ran repeatedly
top-level manifest analysisTargets | 42 | submitted files were broken into AI-readable analysis units

The weekly distribution of review_debug.json was:

Week | Review debug records
Portfolio week 1 | 47
Portfolio week 2 | 76
Portfolio week 3 | 95
Portfolio week 4 | 3
Total | 221
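
Counts like these can be gathered without opening any student data. The sketch below assumes that each weekly output folder sits directly under out/ and that review_debug.json files are nested somewhere inside it. That layout and the function name count_review_debug are assumptions.

```python
from collections import Counter
from pathlib import Path


def count_review_debug(out_dir) -> dict:
    """Count review_debug.json files per top-level folder under out_dir."""
    root = Path(out_dir)
    counts = Counter()
    for path in root.rglob("review_debug.json"):
        # The first path component below out/ is treated as the week folder.
        week_folder = path.relative_to(root).parts[0]
        counts[week_folder] += 1
    return dict(counts)
```

Only file paths are read, so the result is an aggregate that can be published without exposing names or submissions.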

The key point is not the score itself. The value is that the workflow leaves reproducible artifacts:

  • submission collection and normalization are recorded through manifests and analysisTargets;
  • AI evidence is traceable through review_debug.json;
  • weekly distributions make it easier to see whether a problem was assignment-specific or systemic;
  • review artifacts can be reused by feedback workers and retry flows.

In other words, the practical effect of AirClassGrading 1 is that submission collection → preprocessing → AI review → teacher confirmation becomes a data flow that can be inspected and rerun.

What this flow achieved

The biggest result of the Teams grading flow was not just “AI grading.” It was the structure around it.

  • A reliable way to collect submissions.
  • A prepare step that turns files into AI-readable artifacts.
  • A snapshot mechanism for assignment criteria.
  • A DB structure that separates AI review and teacher confirmation.
  • A feedback queue and worker model for operation.

Changing the AI model is relatively easy. But without a stable flow for submissions, criteria, results, and feedback, the system cannot be used in a real classroom. AirClassGrading 1 was the attempt to organize that flow around Teams assignments.

The next problem

Teams assignment grading works well for file-submission tasks. But school assessment has another major format: written-response answer sheets.

Written-response sheets are different from Teams submissions.

  • One PDF can contain many students' answers.
  • One student page can contain several problems.
  • Each problem has its own partial-credit criteria.
  • AI first-pass results need repeated checks for stability.
  • Final scores must be confirmed per problem by the teacher.

The next post covers the second AirClassGrading flow: applying AI to written-response grading. It explains PDF preprocessing, problem-level image crops, partial-credit criteria JSON, repeated AI first-pass grading, and final teacher confirmation.
