[MAICE Dev Log 4] QAC checklist development: making educational quality measurable

1. How can we recognize a good conversation?

By the end of Post 3, the basic MAICE flow was in place. Students could enter formulas, the system could classify questions, ask back when needed, and generate answers.

But that immediately raised another problem. More conversations do not automatically mean better learning. A long and kind answer is not always good learning support. Sometimes one short follow-up question can move a student's thinking more effectively.

MAICE therefore needed a way to judge a "good conversation." Instead of saying that a session simply looked fine, we needed a framework that could separate the question, the answer, and the surrounding context.

That framework was the QAC checklist. QAC stands for Question, Answer, and Context. It looks at conversation quality by examining the student's question, the AI's answer, and the learning flow around them.

2. Why QAC was needed

The quality of an educational AI cannot be reduced to factual correctness alone. Even a correct answer may not help much if it does not fit the student's level. Even a friendly explanation can miss the point if it does not respond to what the student was actually asking.

MAICE needed to examine three things.

Is the student's question mathematically meaningful and specific?
Does the AI answer fit the student's level and context?
Does the whole conversation support the student's thinking process?

Early work also used student persona tests. Asking questions from several imagined student types helped find edge cases. But personas are exploratory tools. To interpret research results, we needed a more consistent evaluation standard.

3. QAC checklist structure

QAC divides a conversation into three parts.

Q, Question: whether the student question is mathematically meaningful and specific
A, Answer: whether the answer is structured for the learner
C, Context: whether the dialogue flow supports the learning process

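The QAC checklist used in the thesis had a 40-point structure. Its three areas are labeled A, B, and C in order, so area A scores the question, area B the answer, and area C the context. The area letters are not the Q/A/C initials.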

Area	Points	What it examines
A area, question	15	Mathematical expertise, question structuring, learning context
B area, answer	15	Learner fit, explanation structure, learning expansion
C area, context	10	Dialogue coherence, learning-process support

This makes it possible to say more than "MAICE was good." We can ask which part improved: whether questions became more specific, answers fit learners better, or the conversation supported thinking more effectively.

4. How it was used in the study

The study used QAC in two ways.

First, LLM evaluation was used to inspect all 284 sessions. Human raters could not evaluate every session in full detail, so LLM evaluation helped identify broad patterns.

Second, two external mathematics teachers evaluated a sample of 100 sessions. Their ratings were used to check whether the LLM evaluation pointed in an educationally meaningful direction.

The total-score correlation between LLM evaluation and teacher evaluation was r=0.754, p<0.001. However, the LLM tended to award higher scores than the teachers did for comparable sessions. A strong correlation shows that the two evaluations rank sessions similarly, not that they agree on the level of the score. So LLM scores should not be treated as absolute grades. They are better interpreted as tools for relative comparison and pattern exploration.
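The difference between "ranks sessions similarly" and "gives the same score" can be sketched in a few lines. This is a minimal illustration, not the thesis analysis: the function names pearson_r and mean_gap are hypothetical, the numbers are invented, and the gap calculation assumes both score lists are on the same scale.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_gap(llm_scores, teacher_scores):
    """Average of (LLM score - teacher score); positive means the LLM is more generous."""
    if len(llm_scores) != len(teacher_scores) or not llm_scores:
        raise ValueError("need two non-empty lists of equal length")
    return sum(l - t for l, t in zip(llm_scores, teacher_scores)) / len(llm_scores)


# Invented numbers for illustration only; these are NOT data from the study.
teacher_totals = [20, 24, 28, 32]
llm_totals = [26, 30, 34, 38]  # same ordering, every score shifted upward

r = pearson_r(llm_totals, teacher_totals)
gap = mean_gap(llm_totals, teacher_totals)
```

In this toy data the LLM totals are the teacher totals shifted up by a constant, so the correlation is perfect even though every LLM score is higher. That is why a high r supports relative comparison but says nothing about whether the absolute scores can be trusted.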

5. The code contains two kinds of evaluation tools

The implementation contains two evaluation tools, not one. They are easy to confuse, so it is worth separating them here.

The first is the manual teacher-evaluation tool. The teacher page uses the QAC v4.3 rubric, with A1/A2/A3, B1/B2/B3, and C1/C2 items. It calculates a total of 40 points. The thesis QAC evaluation is connected to this 40-point rubric.

The second is the automatic LLM-evaluation tool. The backend EvaluationService sends the full session conversation to a Gemini model and stores scores for three question criteria and three answer criteria, each scored from 0 to 5, for a total of 30 points. In other words, this is a separate 3+3 scoring system, not a direct copy of the 40-point teacher rubric. This tool is closer to a way of quickly scanning patterns across many sessions.
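The shape of that 3+3 score can be sketched as a small record type. This is an assumption-laden illustration, not the code in EvaluationService: the class name AutoEvaluation and the field names question_1 through answer_3 are placeholders, because the actual criterion names are not given here. Only the 0 to 5 range and the 30-point total come from the description above.

```python
from dataclasses import dataclass, fields

MAX_PER_CRITERION = 5  # each criterion is scored from 0 to 5


@dataclass(frozen=True)
class AutoEvaluation:
    # Placeholder field names; the real criterion names are not stated in this post.
    question_1: int
    question_2: int
    question_3: int
    answer_1: int
    answer_2: int
    answer_3: int

    def __post_init__(self):
        # Reject any criterion outside the 0..5 range before it is stored.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= MAX_PER_CRITERION:
                raise ValueError(f"{f.name}={value} is outside 0..{MAX_PER_CRITERION}")

    @property
    def total(self) -> int:
        """Sum of the six criteria, so the maximum is 6 * 5 = 30."""
        return sum(getattr(self, f.name) for f in fields(self))


full_marks = AutoEvaluation(5, 5, 5, 5, 5, 5)
```

Validating the range at construction time keeps a malformed model response from silently producing a total above 30.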

As a development note, the teacher-facing 40-point rubric can be seen in front/src/routes/teacher/+page.svelte, while the automatic 30-point evaluation is implemented in back/app/services/evaluation_service.py.

The important point is that these two totals should not be mixed as if they were the same score. The teacher 40-point rubric is closer to educational judgment and feedback. The automatic 30-point score is closer to operational support for reviewing many sessions. If the two are compared, either the difference in scale must be stated explicitly or a separate normalization rule must be defined first.
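One possible normalization rule, shown only as a sketch, is to express each total as a percentage of its own maximum. The function name to_percent and the constant names are hypothetical; MAICE does not necessarily use this rule.

```python
TEACHER_MAX = 40  # manual teacher rubric total
AUTO_MAX = 30     # automatic LLM evaluation total


def to_percent(score: float, max_points: float) -> float:
    """Rescale a raw total to 0..100 relative to its own maximum."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    if not 0 <= score <= max_points:
        raise ValueError(f"score {score} is outside 0..{max_points}")
    return 100.0 * score / max_points


# The same raw number means different things on the two scales.
teacher_pct = to_percent(30, TEACHER_MAX)
auto_pct = to_percent(30, AUTO_MAX)
```

Rescaling only puts the two totals on a common range. It does not make them equivalent, because the two tools score different criteria, so any comparison still needs that caveat stated alongside it.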

6. Evaluation standards make improvement possible

After QAC was defined, the direction for improving MAICE became clearer. We no longer had to ask only whether the answer became longer or friendlier. We could ask which item was weak.

For example, if C2 learning-process support improves, that suggests more conversations checked students' thinking and guided their next step. If A3 learning context is low, it may mean that the clarification process did not preserve enough of the student's personal learning context.
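Finding the weak item can be sketched as averaging each rubric item across sessions and sorting. The item codes come from the teacher rubric described earlier; the function names and the sample scores are invented for illustration, and the sketch assumes every session record contains all eight items.

```python
from statistics import mean

# Item codes from the teacher rubric: three question items, three answer items, two context items.
ITEMS = ("A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2")


def item_means(sessions):
    """Average score per rubric item across a list of per-session score dicts."""
    if not sessions:
        raise ValueError("need at least one session")
    return {item: mean(session[item] for session in sessions) for item in ITEMS}


def weakest_items(sessions, n=2):
    """Return the n items with the lowest average score, lowest first."""
    means = item_means(sessions)
    return sorted(means, key=means.get)[:n]


# Invented scores for illustration only; these are NOT data from the study.
sample_sessions = [
    {"A1": 4, "A2": 3, "A3": 2, "B1": 4, "B2": 4, "B3": 3, "C1": 4, "C2": 3},
    {"A1": 5, "A2": 4, "A3": 1, "B1": 3, "B2": 4, "B3": 3, "C1": 4, "C2": 2},
]

weak = weakest_items(sample_sessions, n=2)
```

Comparing raw means like this is only fair if every item shares the same maximum. If the items have different maxima, each mean should first be divided by its item's maximum.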

In that sense, QAC was not a score sheet for decorating the result. It was closer to a map for finding the next improvement.

7. The next problem was stability

Once an evaluation standard exists, the next requirement is collecting stable data. If the service stops while students are asking questions, the conversation log is cut off, and the data that QAC would evaluate is lost.

So the next post turns to an operational issue that may not look educational at first: Blue-Green deployment and service stability.

Next post: [MAICE Dev Log 5] Zero-downtime deployment for heavy AI containers

Source

Kyu-Bong Kim, Development and Effectiveness Analysis of an AI Agent Supporting Question Clarification in High School Mathematics Learning, master's thesis, Pusan National University Graduate School of Education.