What this post covers
AirClassGrading is the grading support system I built to reduce the workload that comes after class. This sixth post focuses on the first major flow: automated grading for Microsoft Teams assignments.
The goal is not simply to ask an AI model for a score. The real goal is to turn the repetitive work around Teams assignments into a reproducible pipeline.
- Fetch Teams assignments and submissions.
- Download each student's attached files.
- Normalize PDFs, photos, and code files into AI-readable artifacts.
- Preserve the assignment instructions and rubrics as snapshots.
- Run an AI-based first review.
- Let the teacher confirm or revise the result in a review web UI.
- Send feedback back to Teams when needed.
The overall flow looks like this.
direction: right
teams: "Microsoft Teams\nAssignments / submissions"
graph: "Graph API\nassignment and submission lookup"
prepare: "prepare step\ndownload and render files"
context: "assignment context snapshot\ninstructions and rubric"
ai: "AI review\nevidence and feedback draft"
db: "database\nreviews / snapshots / notifications"
teacher: "teacher review UI\nfinal score and comment"
feedback: "Teams feedback\nqueue / worker"
teams -> graph -> prepare -> context -> ai -> db -> teacher -> feedback
The written-response image grading flow is different enough that I separated it into the next post. This post focuses only on the Teams assignment grading flow.
Why Teams assignment grading needed automation
Teams assignments are useful. Students can upload files, and teachers can check submission status. But real grading work does not end with opening the Teams screen.
A teacher repeatedly has to do the following.
- Check who submitted and who did not.
- Check whether each submission is a PDF, photo, code file, or something else.
- Confirm that the files open correctly.
- Notice whether a student uploaded an old file or an unrelated file.
- Re-read the assignment instructions.
- Compare the submission against the criteria.
- Write feedback for the student.
This becomes especially heavy for portfolio or performance-assessment tasks, where a single submission may contain multiple photos, PDFs, code files, and explanations. Before AI can grade anything, the system first needs to make the submissions gradeable.
So the first goal of AirClassGrading was this:
Collect the scattered Teams submissions and assignment criteria, then organize them so both AI and the teacher can review the same materials.
Code structure
The Teams assignment grading flow is organized under legacy_airclass_engine.
legacy_airclass_engine/
  core/
    engine_api.py
    review_runner.py
    review_store.py
    assignment_context.py
    assignment_loader.py
    grade_prepare_service.py
    prompt_builder.py
    rubric_loader.py
    second_pass_review_prompt_builder.py
    slim_prompt_builder.py
    web_review_app.py
  scripts/
    prepare_portfolio.py
    prepare_quiz.py
    finalize_portfolio.py
    finalize_quiz.py
    generate_grades_csv.py
    generate_quiz_review_csv.py
    run_worker.py
    auto_feedback_worker.py
  teams/
    teams_graph.py
    teams_feedback_writer.py
    teams_account_sender.py
    teams_bot_sender.py
    teams_assignment_live.py
The responsibilities are roughly split like this.
| Area | Role |
|---|---|
| core | API, review execution, DB storage, prompt generation, assignment context management |
| scripts | submission collection, preprocessing, grading execution, CSV generation, worker execution |
| teams | Microsoft Teams / Graph API integration and feedback delivery |
| docs, k8s, argocd | operation documents and deployment configuration |
The root documents docs/ENGINE_API_README.md, docs/WEB_REVIEW_README.md, and docs/TEAMS_GRAPH_API.md describe this flow in more operational detail.
engine_api.py: the center of the API and review UI
engine_api.py is the integrated FastAPI app for the Teams assignment grading flow. It is not just an API server. It also includes the prepare API, teacher review pages, snapshot management, and feedback queue registration.
Its main responsibilities are:
- Teams submission preprocessing API
- assignment context snapshot creation and comparison
- AI review entry point
- teacher review list, detail, and edit pages
- review row synchronization
- feedback notification queue registration
- serving rendered submission artifacts such as images and PDFs
A local run looks like this.
python3 -m uvicorn engine_api:app --host 0.0.0.0 --port 8092
In a Docker or k3s deployment, the app runs from /app, and /app/out is kept on a PVC or host volume so that rendered PDFs, manifests, and debug JSON files survive container restarts.
Real assignment instructions become a prompt bundle
A key part of AirClassGrading 1 is that Teams instructions are not just reread manually. They are transformed into a bundle of prompts so AI can make separate judgments step by step.
For the actual week 1 portfolio assignment, one Teams assignment instruction became five prompt roles.
| Prompt | Role |
|---|---|
| fitPrompt | decide whether this is a valid submission for the current assignment |
| structurePrompt | evaluate structure only |
| diligencePrompt | evaluate diligence only |
| formatPrompt | check format compliance only |
| finalReviewPrompt | synthesize the previous results into a final review |
For example, fitPrompt does not assign a score. It only checks whether the submission fits the current assignment.
Assignment: Portfolio performance task, week 1 summary
Role: judge submission fit only.
Do not score structure, diligence, or format in this prompt.
Only decide whether this can be treated as a submission for the current assignment.
Return JSON only:
{
  "content_fit": "fit|mismatch|uncertain",
  "submission_verdict": "valid|ineligible|uncertain",
  "reasons": ["string"],
  "uncertainties": ["string"]
}
This step exists to prevent unrelated submissions from being graded as if they were valid. If the system evaluates structure or diligence before checking fit, the model can create plausible but misleading grading reasons.
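To make the gate concrete, here is a minimal sketch of how a reply to fitPrompt could be validated before any scoring prompt runs. The names `FitResult`, `parse_fit_response`, and `should_run_scoring` are hypothetical and not taken from the AirClassGrading code. Only the JSON field names and allowed values come from the prompt above. The sketch assumes that any malformed reply is downgraded to "uncertain" rather than trusted.

```python
import json
from dataclasses import dataclass, field

# Allowed values copied from the fitPrompt schema above.
CONTENT_FIT = {"fit", "mismatch", "uncertain"}
SUBMISSION_VERDICT = {"valid", "ineligible", "uncertain"}


@dataclass
class FitResult:
    content_fit: str
    submission_verdict: str
    reasons: list = field(default_factory=list)
    uncertainties: list = field(default_factory=list)


def _uncertain(note: str) -> FitResult:
    # Assumption: a reply we cannot trust is treated as "uncertain", never as valid.
    return FitResult("uncertain", "uncertain", uncertainties=[note])


def _str_list(value) -> list:
    return [str(item) for item in value] if isinstance(value, list) else []


def parse_fit_response(raw: str) -> FitResult:
    """Parse the model's JSON reply and reject anything outside the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return _uncertain("reply was not valid JSON")
    if not isinstance(data, dict):
        return _uncertain("reply was not a JSON object")
    fit = data.get("content_fit")
    verdict = data.get("submission_verdict")
    if not isinstance(fit, str) or fit not in CONTENT_FIT:
        return _uncertain("unexpected content_fit value")
    if not isinstance(verdict, str) or verdict not in SUBMISSION_VERDICT:
        return _uncertain("unexpected submission_verdict value")
    return FitResult(
        fit,
        verdict,
        _str_list(data.get("reasons")),
        _str_list(data.get("uncertainties")),
    )


def should_run_scoring(result: FitResult) -> bool:
    # Structure and diligence prompts run only for clearly valid submissions.
    return result.submission_verdict == "valid"
```

With a gate like this, an "uncertain" or "ineligible" verdict can go straight to the teacher instead of receiving structure and diligence scores.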
Structure and diligence are evaluated with separate prompts
Portfolio grading uses two main dimensions: structure and diligence. I separated them because AI can otherwise borrow evidence from one dimension to justify the other.
The actual structurePrompt focuses only on structure.
Role: evaluate structure only.
Structure means the organization, separation, flow, and readability of concept notes.
Check:
- whether a concept-summary section is visibly separated
- whether key concepts are organized by item or heading
- whether titles, sections, and ordering make the flow readable
- whether the concept explanation is too empty or mixed together
Principles:
- use only organization, separation, and readability of concept notes as evidence
- do not use amount of problem solving, number of problems, page count, or ratio as structure evidence
The diligencePrompt focuses only on problem-solving effort.
Role: evaluate diligence only.
Diligence means problem numbers, number of problems, solution steps, and writing density.
Check:
- whether problem numbers or problem divisions are visible
- whether solution steps are written progressively
- whether the amount and density of writing are sufficient
- whether the required problem range appears to be reflected
Principles:
- use only problem count, solution process, and writing effort as diligence evidence
- do not use amount, ratio, or organization of concept notes as diligence evidence
This separation mattered in operation. A neatly organized concept note does not automatically prove strong problem-solving effort, and solving many problems does not automatically mean the concept summary is well structured.
Observe pages first, then produce the final review
There is also a stage where AI is not asked to grade at all. It first observes each page and structures only what is visible.
{
  "page_type": "concept_summary|problem_solving|retry_or_correction|mixed|unclear",
  "readability": "high|medium|low",
  "observations": [
    "visible fact 1",
    "visible fact 2"
  ],
  "signals": {
    "has_concept_heading": false,
    "has_structured_flow": false,
    "has_restarted_problem_numbers": false,
    "has_problem_source": false,
    "has_step_by_step_solution": false,
    "has_answer_only_pattern": false,
    "has_retry_or_correction": false,
    "shows_diligent_organization": false,
    "shows_sparse_or_low_effort_notes": false
  },
  "uncertainties": [
    "reading limitation or uncertain point"
  ]
}
The goal of this step is evidence collection, not grading. Instead of immediately producing A/B/C/D, the model first records signals such as whether concept headings are visible, whether the solution is step-by-step, or whether the page mostly contains answers only.
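As an illustration of how per-page observations might be rolled up before the scoring prompts see them, the sketch below counts page types and true signals across one submission. `summarize_observations` is a hypothetical helper, not a function from the project. It assumes each page record follows the JSON shape shown above.

```python
from collections import Counter


def summarize_observations(pages: list) -> dict:
    """Aggregate page-level observations without assigning any grade."""
    page_types = Counter(page.get("page_type", "unclear") for page in pages)
    signal_counts = Counter()
    for page in pages:
        for name, value in page.get("signals", {}).items():
            if value is True:  # count only signals the model marked as visible
                signal_counts[name] += 1
    low_readability = [
        number
        for number, page in enumerate(pages, start=1)
        if page.get("readability") == "low"
    ]
    return {
        "page_count": len(pages),
        "page_types": dict(page_types),
        "signal_counts": dict(signal_counts),
        "low_readability_pages": low_readability,
    }
```

The summary stays descriptive on purpose: it reports how many pages showed a signal, and leaves the judgment to the later prompts and the teacher.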
The final review prompt then synthesizes the earlier outputs.
Role: synthesize fit, structure, diligence, and format results.
You are the final reviewer and re-grader.
Use the first-pass results as references, but confirm the final grades and student feedback consistently.
Final review order:
1) check fit first to confirm this is a submission for the current assignment
2) review structure only through concept-note organization, separation, and readability
3) review diligence only through problem count, solution process, and writing effort
4) check whether format violations should affect the final result
Core principles:
- structure and diligence are independent areas
- do not move evidence from one area into the other area's score
- ineligible decisions must be based on fit and the main visible content of the submission
With this design, AI grading is not a single black-box answer. Fit, structure, diligence, format, and final synthesis each leave separate traces.
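The review order can also be enforced in code rather than left to the prompt alone. The following is only a sketch under assumptions: the `grade` and `violations` keys and the `synthesize` function are invented for illustration, and the real final review is produced by finalReviewPrompt, not by a rule like this.

```python
def synthesize(fit: dict, structure: dict, diligence: dict, format_check: dict) -> dict:
    """Combine first-pass results while keeping each dimension's trace."""
    trace = {
        "fit": fit,
        "structure": structure,
        "diligence": diligence,
        "format": format_check,
    }
    # 1) Fit comes first: anything not clearly valid gets no grades at all.
    if fit.get("submission_verdict") != "valid":
        return {
            "status": fit.get("submission_verdict", "uncertain"),
            "structure_grade": None,
            "diligence_grade": None,
            "needs_teacher_attention": True,
            "trace": trace,
        }
    # 2) and 3) Structure and diligence are copied independently, never merged.
    # 4) Format violations do not change grades here; they only flag the review.
    return {
        "status": "valid",
        "structure_grade": structure.get("grade"),
        "diligence_grade": diligence.get("grade"),
        "needs_teacher_attention": bool(format_check.get("violations")),
        "trace": trace,
    }
```

Keeping the `trace` dictionary next to the result is what lets the teacher see which stage produced which claim.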
prepare: turning Teams submissions into gradeable artifacts
The first step of automated grading is prepare. This is not just a download step. It converts Teams submissions into a structure that AI can read reliably and teachers can trace later.
The flow is roughly:
- Select the portfolio or quiz prepare flow depending on assignment type.
- Fetch submissions and attached resources through Microsoft Graph API.
- Render PDF pages into images.
- Normalize photo orientation and sanitize file names.
- Create manifest.json and analysisTargets.
- Create or connect an assignment context snapshot based on the Teams instructions.
- Return prompt bundles, parsed instructions, and grading prompt candidates.
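One small piece of the prepare flow, file name normalization, can be sketched as follows. This is an assumption about what "safe" means here, not the project's actual rule: `safe_file_name` is a hypothetical helper that drops directory parts, replaces unusual characters, and prefixes an index so ordering is stable.

```python
import re
import unicodedata
from pathlib import PurePosixPath


def safe_file_name(original: str, index: int) -> str:
    """Return a predictable local file name for a downloaded attachment."""
    # Drop any directory component, including Windows-style separators.
    name = PurePosixPath(original.replace("\\", "/")).name
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension at all
        stem, ext = name, ""
    stem = unicodedata.normalize("NFKC", stem)
    # Keep word characters (including non-ASCII letters), dots, and hyphens.
    stem = re.sub(r"[^\w.-]+", "_", stem).strip("._")
    ext = re.sub(r"[^A-Za-z0-9]", "", ext).lower()
    base = f"{index:03d}_{stem or 'file'}"
    return f"{base}.{ext}" if ext else base


print(safe_file_name("../My Report (final).PDF", 1))  # 001_My_Report_final.pdf
```

The index prefix means two attachments with the same original name cannot overwrite each other on disk.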
In the Teams UI, one student's submission may look like a single item. In practice, it may contain many pieces.
Student A submission
report.pdf
page 1
page 2
page 3
photo_1.jpg
photo_2.jpg
code.py
If this is sent to AI without structure, order and source tracking become unreliable. The prepare step breaks submissions into smaller analysis units.
Teams submission
→ resources / submittedResources
→ local files
→ normalized artifacts
→ analysisTargets
→ AI review
analysisTargets tracks which page or file came from which student's submission. This is the key structure that lets AI and the teacher review the same materials in the same order.
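A minimal sketch of that idea, using the Student A example above: every rendered page or file becomes one ordered record that points back to its source. The field names and the `build_analysis_targets` helper are hypothetical; the real manifest schema is not shown in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AnalysisTarget:
    order: int                 # review order shared by AI and teacher
    student_id: str
    submission_id: str
    source_file: str           # file as submitted in Teams
    page: Optional[int]        # page number for rendered PDFs, else None
    artifact_path: str         # normalized artifact that is actually reviewed


def build_analysis_targets(student_id: str, submission_id: str, files: list) -> list:
    """Flatten one submission into ordered, traceable analysis units."""
    targets = []
    for entry in files:
        paged = entry.get("paged", False)
        for index, artifact in enumerate(entry["artifacts"], start=1):
            targets.append(
                AnalysisTarget(
                    order=len(targets) + 1,
                    student_id=student_id,
                    submission_id=submission_id,
                    source_file=entry["name"],
                    page=index if paged else None,
                    artifact_path=artifact,
                )
            )
    return targets


files = [
    {"name": "report.pdf", "paged": True,
     "artifacts": ["report_p1.png", "report_p2.png", "report_p3.png"]},
    {"name": "photo_1.jpg", "artifacts": ["photo_1.jpg"]},
    {"name": "photo_2.jpg", "artifacts": ["photo_2.jpg"]},
    {"name": "code.py", "artifacts": ["code.py"]},
]
targets = build_analysis_targets("student-a", "submission-1", files)
```

Here one Teams submission becomes six targets: three PDF pages, two photos, and one code file, each with a fixed position in the review order.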
Assignment context snapshots
In automated grading, the criteria matter as much as the submitted files. Teams assignment descriptions can change, and each assignment may have different submission rules, deadlines, late policies, and rubrics.
AirClassGrading stores these criteria as snapshots.
| Item | Meaning |
|---|---|
| live assignment | the current Teams assignment fetched through Graph API |
| instruction digest | a concise summary of the assignment instructions |
| parsed instructions | structured information such as deadline, format, and late policy |
| prompt bundle | prompts used for AI grading |
| assignment context snapshot | a stored version of the criteria at a specific point in time |
Snapshots are necessary because the system must be able to answer: “What criteria were used when this submission was reviewed?”
In the teacher web UI, the current live assignment can be compared with stored snapshots, and snapshots can be compared with each other. This becomes important when the assignment instructions changed during operation.
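One way to detect and display such changes is sketched below, using only the Python standard library. The functions `snapshot_fingerprint` and `diff_instructions` are hypothetical names; how AirClassGrading actually stores and compares snapshots is not specified in this post.

```python
import difflib
import hashlib
import json


def snapshot_fingerprint(context: dict) -> str:
    """Hash a canonical JSON form so equal criteria always give the same value."""
    canonical = json.dumps(
        context, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_instructions(old: str, new: str) -> list:
    """Line-level diff between a stored snapshot and the live instructions."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="snapshot",
            tofile="live",
            lineterm="",
        )
    )
```

Storing the fingerprint with each review would make it cheap to answer which criteria version a review used, while the diff is what a teacher would read on the comparison page.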
AI review execution
The AI review does not simply return one score. Depending on the assignment type, it checks several aspects.
For portfolio assignments, the review usually looks at:
- assignment relevance
- conceptual understanding
- structure of explanation or solution
- effort and completeness
- submission format
- late policy application
- student-facing feedback draft
For quiz assignments, the review can use checklist-style Pass / NP / uncertain decisions. Submissions unrelated to the assignment can be treated as NP or ineligible.
Review execution is connected to review_runner.py. The runner groups prepared targets by student and submission, then runs AI review using the assignment context and rubric.
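The grouping step can be pictured with a short sketch. This is not the code in review_runner.py; `group_targets` and the dictionary keys are assumptions chosen to match the description above.

```python
from collections import defaultdict


def group_targets(targets: list) -> dict:
    """Group prepared targets by (student, submission), keeping review order."""
    groups = defaultdict(list)
    for target in targets:
        key = (target["student_id"], target["submission_id"])
        groups[key].append(target)
    for items in groups.values():
        items.sort(key=lambda target: target["order"])
    return dict(groups)
```

Each group can then be reviewed as one unit, so a model call never mixes pages from two students.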
The important point is to avoid a black-box “score only” workflow. The system stores reasoning and feedback drafts so the teacher can inspect the result.
DB storage and teacher review web
AirClassGrading does not end with a single output file. In operation, reviews, criteria, and notification states are stored separately.
| Data | Role |
|---|---|
| reviews | AI review result, teacher-confirmed values, status |
| assignment_context_snapshots | assignment instructions and grading criteria snapshots |
| review_notifications | feedback queue and delivery status |
| out/ artifacts | rendered PDF images, manifests, debug JSON files |
The teacher review web provides these pages.
| Page | Role |
|---|---|
| / | recent reviews, AI prediction, teacher confirmation status, feedback status |
| /assignments | Teams assignment list |
| /assignments/{assignment_id} | assignment detail, submission status, snapshots, prompt candidates |
| /reviews/{review_id} | artifact preview, AI evidence, final teacher score and comment |
On the review detail page, the teacher can inspect rendered submission images, AI evidence, and the draft feedback. The teacher can then revise and save the final score and comment.
Feedback delivery is separated into queue and workers
Feedback delivery is separated from grading. A completed review is not sent to Teams immediately. It is added to a notification queue and processed by a worker.
Depending on configuration, supported channels can include:
- Teams assignment feedback
- Teams bot DM
- delegated Teams account DM
- DB placeholder
- dry-run / none
This separation matters because the failure points are different. An AI review can succeed while the Teams API delivery fails. In that case, the system should retry only the feedback delivery, not the whole review.
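The retry behavior can be sketched with an in-memory queue. Everything here is illustrative: the `Notification` fields, the status values, and `process_queue` are assumptions, and the real system keeps this state in the review_notifications table and sends through the Teams integration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Notification:
    review_id: str
    channel: str
    status: str = "pending"   # pending -> sent | failed
    attempts: int = 0
    last_error: str = ""


def process_queue(
    queue: list,
    send: Callable[[Notification], None],
    max_attempts: int = 3,
) -> None:
    """Deliver pending or failed notifications; the review itself is never rerun."""
    for item in queue:
        if item.status == "sent" or item.attempts >= max_attempts:
            continue
        item.attempts += 1
        try:
            send(item)
        except Exception as exc:  # delivery failure stays local to this item
            item.status = "failed"
            item.last_error = str(exc)
        else:
            item.status = "sent"
            item.last_error = ""
```

Because the worker only touches delivery state, a failed send leaves the stored AI review and the teacher's confirmation untouched, and the next pass simply tries the same item again.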
Worker and deployment structure
The root contains k8s, Argo CD, and Jenkins files.
k8s/
deployment.yaml
grading-worker-deployment.yaml
feedback-worker-deployment.yaml
retry-worker-deployment.yaml
regrade-job.yaml
retry-late-job.yaml
retry-pending-job.yaml
service.yaml
pvc.yaml
argocd/
airclass-grading-application.yaml
jenkins/
job-config.xml
Jenkinsfile
Operationally, the components are split like this.
| Component | Role |
|---|---|
| API server | prepare requests, review web, artifact serving |
| grading worker | AI review execution |
| feedback worker | student feedback delivery |
| retry worker/job | retry failed or delayed tasks |
| PVC / volume | preserve out/ artifacts |
This shows that AirClassGrading is not just a local script. It is designed as an automated grading system with separate API and worker processes.
What the actual artifacts show
I do not include student names, accounts, original file names, or submitted images here. Instead, I only use anonymized aggregate counts from local artifacts.
From the portfolio performance-assessment artifacts under out/, the current counts were:
| Item | Count | Meaning |
|---|---|---|
| portfolio weekly output folders | 4 | outputs were generated for weeks 1–4 |
| review_debug.json files | 221 | AI reviews ran and stored traceable evidence |
| prepare manifest.json files | 16 | submission collection and preprocessing ran repeatedly |
| top-level manifest analysisTargets | 42 | submitted files were broken into AI-readable analysis units |
The weekly distribution of review_debug.json was:
| Week | Review debug records |
|---|---|
| Portfolio week 1 | 47 |
| Portfolio week 2 | 76 |
| Portfolio week 3 | 95 |
| Portfolio week 4 | 3 |
| Total | 221 |
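Counts like these can be produced without opening any student data. The sketch below assumes a layout in which each week has its own folder directly under the output directory; that layout and the `count_review_debug` helper are assumptions for illustration, not the project's documented structure.

```python
from collections import Counter
from pathlib import Path


def count_review_debug(out_dir: Path) -> Counter:
    """Count review_debug.json files per top-level folder under out_dir."""
    counts = Counter()
    for path in out_dir.rglob("review_debug.json"):
        # The first path component below out_dir is treated as the week folder.
        week = path.relative_to(out_dir).parts[0]
        counts[week] += 1
    return counts
```

Calling `count_review_debug(Path("out"))` on a layout like that returns one count per week folder, and only file paths are read, never file contents.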
The key point is not the score itself. The value is that the workflow leaves reproducible artifacts:
- submission collection and normalization are recorded through manifests and analysisTargets;
- AI evidence is traceable through review_debug.json;
- weekly distributions make it easier to see whether a problem was assignment-specific or systemic;
- review artifacts can be reused by feedback workers and retry flows.
In other words, the practical effect of AirClassGrading 1 is that submission collection → preprocessing → AI review → teacher confirmation becomes a data flow that can be inspected and rerun.
What this flow achieved
The biggest result of the Teams grading flow was not just “AI grading.” It was the structure around it.
- A reliable way to collect submissions.
- A prepare step that turns files into AI-readable artifacts.
- A snapshot mechanism for assignment criteria.
- A DB structure that separates AI review and teacher confirmation.
- A feedback queue and worker model for operation.
Changing the AI model is relatively easy. But without a stable flow for submissions, criteria, results, and feedback, the system cannot be used in a real classroom. AirClassGrading 1 was the attempt to organize that flow around Teams assignments.
The next problem
Teams assignment grading works well for file-submission tasks. But school assessment has another major format: written-response answer sheets.
Written-response sheets are different from Teams submissions.
- One PDF can contain many students' answers.
- One student page can contain several problems.
- Each problem has its own partial-credit criteria.
- AI first-pass results need repeated checks for stability.
- Final scores must be confirmed per problem by the teacher.
The next post covers the second AirClassGrading flow: applying AI to written-response grading. It explains PDF preprocessing, problem-level image crops, partial-credit criteria JSON, repeated AI first-pass grading, and final teacher confirmation.