[MAICE Dev Log 6] How we validated educational impact: thesis-based summary

1. Did anything actually change?

The previous five posts covered why MAICE was needed, how question clarification became a system, and what evaluation and operational tools were prepared for the experiment.

The final question is more direct.

When MAICE helped students clarify their questions, did the quality of the actual dialogue change?

This post does not argue that the demo looked good. It summarizes which signals appeared in the experimental data and how cautiously those signals should be interpreted. Code is discussed only as the operating, collection, and evaluation environment behind the data; the educational effects themselves are interpreted from the thesis data.

2. First, separate the units of N

The first thing to watch in this study is the unit of N. Student count, session count, and evaluation sample count appear together.

Assigned students: 58 grade-2 students (Agent 28 / Freepass 30)
Valid participants: 55 students
Period: 3 weeks (2025-10-20 to 2025-11-08)
Valid sessions: 284
LLM evaluation: all 284 sessions
Teacher evaluation: 100 sampled sessions
Evaluation frame: QAC 40-point checklist

LLM evaluation was useful for seeing broad patterns across many sessions. Teacher evaluation was important for checking educational validity. Both were used because either one alone would have been insufficient.

However, the LLM tended to score more generously than teachers. Therefore, LLM scores should be read as support for relative comparison and pattern exploration, not as absolute grades.

3. Look at actual questions before the numbers

Statistics can feel distant if we see only tables. So before the results, let us look at actual collected questions. The examples below are question scenes from question-answer and teacher-evaluation data with identifying information removed. They are not full multi-turn session transcripts.

The main experiment took place in the high-school grade 2 Math I unit on mathematical induction. For that reason, the examples here were selected only from questions related to induction, inductive proof, or proof strategy.

In the teacher-evaluation and analysis data, questions with little context were often very short.

근데 어떻게 증명한거야? ("But how did you prove it?")

This question shows that the student is stuck, but the object of the proof is missing. Without the previous conversation, both a teacher and an AI would first need to ask, "Which statement are you trying to prove?" or "Are you stuck at the base case or the induction step?"

Another example looks like a mathematical induction problem at first.

2^n+3^n이 항상 5의 배수임을 증명하라 ("Prove that 2^n + 3^n is always a multiple of 5.")

This example is kept as written because the statement is not true for all natural numbers. When $n=2$, $2^2+3^2=13$, which is not a multiple of 5. In this kind of question, the system should not rush into an induction proof. It should first check small cases and ask whether the intended domain is all natural numbers or only odd natural numbers. A corrected version would be: "For every odd natural number $n$, $2^n+3^n$ is a multiple of 5." Question clarification is not just polishing language. It can also involve checking whether the statement to be proved is correct.
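The "check small cases first" step can be sketched in a few lines of Python. The function names are illustrative and are not taken from the MAICE code base.

```python
def is_multiple_of_5(n: int) -> bool:
    """Return True when 2^n + 3^n is divisible by 5."""
    return (2**n + 3**n) % 5 == 0


def first_counterexample(limit: int):
    """Return the smallest n in 1..limit where the claim fails, or None."""
    for n in range(1, limit + 1):
        if not is_multiple_of_5(n):
            return n
    return None


# Check n = 1..10 before attempting any induction proof.
failing = [n for n in range(1, 11) if not is_multiple_of_5(n)]
print(first_counterexample(10))  # 2
print(failing)                   # [2, 4, 6, 8, 10]
```

Every even $n$ in this range fails and every odd $n$ passes, which is what points toward the "odd natural number" condition as the missing piece.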

There were also clearer conceptual questions within the unit.

귀납법이랑 수학적 귀납법이랑 차이가 있어? 내가 알고있는건 그냥 귀납법이라서 좀 궁금한데 ("Is there a difference between induction and mathematical induction? I only know ordinary induction, so I am curious.")

This question shows where the student is confused. It asks not merely for an explanation of mathematical induction, but for the difference between everyday inductive reasoning and mathematical induction. The answer can naturally narrow down to empirical generalization, mathematical proof, the base case, and the induction step.

Another example was:

수학적 귀납법을 설명해줘 예시를 들어서 ("Explain mathematical induction with an example.")

This question does not contain a detailed personal learning context, but it does state that the student wants a definition and an example. MAICE could help one step further by asking, "Should we use an equality proof such as $1+2+\cdots+n=\frac{n(n+1)}{2}$?", "Should we use an inequality proof?", or "Would you like to write the base case and induction step yourself?"
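For reference, the base case and induction step for the equality example can be written out as follows. This is the standard textbook argument, shown only to make the two steps concrete; it is not a transcript of a MAICE answer.

```latex
\begin{align*}
\text{Claim } P(n):\quad & 1 + 2 + \cdots + n = \frac{n(n+1)}{2} \\
\text{Base case } (n = 1):\quad & 1 = \frac{1 \cdot 2}{2} \\
\text{Induction step, assuming } P(k):\quad & 1 + 2 + \cdots + k + (k+1) = \frac{k(k+1)}{2} + (k+1) \\
& = \frac{(k+1)(k+2)}{2}
\end{align*}
```

The last line is exactly $P(k+1)$, so the claim holds for every natural number $n$.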

These examples are not independent proof of the study's effects. The same limits still apply: one school, one unit, a short period, LLM scoring generosity, and multiple sub-indicators. But they show what kinds of dialogue differences may be connected to C2 learning-process support, B3 learning expansion, and the stronger lower-quartile signal discussed below.

The examples were drawn from induction- and proof-related questions in the MAICEAnalysis question-answer and teacher-evaluation data and in the MAICEFIND analysis data. That is why they are presented as question scenes rather than as full session transcripts.

4. Where did the effect appear?

The first results to examine are C2, learning-process support, and B3, learning expansion. In the LLM evaluation, Agent mode scored higher than Freepass mode on both.

C2 learning-process support: +0.28 points, p=0.004, d=0.353
B3 learning expansion: +0.22 points, p=0.041, d=0.245

Here, the effect size d is computed in the Agent minus Freepass direction. A positive value favors Agent; a negative value favors Freepass.
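The sign convention can be illustrated with a minimal sketch. It assumes d is Cohen's d with a pooled standard deviation, which this summary does not confirm, and the scores are made up for illustration; they are not thesis data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(agent, freepass):
    """Cohen's d with pooled SD, signed as Agent minus Freepass (assumed formula)."""
    n1, n2 = len(agent), len(freepass)
    pooled_var = (
        (n1 - 1) * variance(agent) + (n2 - 1) * variance(freepass)
    ) / (n1 + n2 - 2)
    return (mean(agent) - mean(freepass)) / sqrt(pooled_var)


# Hypothetical item scores, NOT data from the study.
agent_scores = [4, 3, 5, 4, 4]
freepass_scores = [3, 3, 4, 2, 3]

d = cohens_d(agent_scores, freepass_scores)
print(d > 0)  # True: the Agent mean is higher, so d is positive
print(cohens_d(freepass_scores, agent_scores) < 0)  # True: swapping the groups flips the sign
```

Fixing the argument order in one place keeps a negative value, such as a Freepass advantage, from being misread as an Agent advantage.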

This result suggests that the clarification flow may have helped with thinking guidance and understanding checks more than immediate answer generation did. However, the mean difference itself was not large. Because multiple sub-indicators were examined together, the p-values should not be overinterpreted in isolation.

5. The clearest signal appeared among lower-quartile learners

The stronger signal appeared in the lower-quartile group (Q1), rather than in the overall average.

LLM C2: p<0.001, d=0.855
LLM total score: +2.26 points, p=0.032, d=0.499
Teacher total score: +6.32 points, p=0.013, d=0.992

This suggests that MAICE may be especially helpful for students who struggle with how to start a question. For these students, "first clarify where you are stuck" may have been more useful than receiving an immediate answer.

However, subgroup results require caution. The sample was small, and the result came from a specific school and a specific unit. This signal is better read as a hypothesis to be checked again in later research.

6. Agent mode was not always better

The results did not all point in one direction. On A3, learning context, Freepass scored higher.

A3 learning context: Freepass advantage, d=-0.425, p=0.001

This is important. Question clarification may make a student's question more mathematically structured, but it can also reduce some of the personal context the student originally had.

For example, if a student says, "I got this far and got stuck here," but the clarification flow focuses only on mathematical conditions, the student's personal learning context can become weaker. Future improvement needs to preserve both mathematical clarity and the learner's own context.

7. Changes seen in repeated use

Among students who participated in multiple sessions, several items improved in Agent mode.

A1 mathematical expertise: +0.57 (p=0.006)
A2 question structuring: +0.71 (p=0.003)
B1 learner fit: +0.93 (p=0.001)
B2 explanation structure: +0.93 (p=0.015)
C1 dialogue coherence: +0.64 (p=0.010)
Total score: +3.45 (p=0.016)

This suggests that repeated use may have gradually organized both student questions and conversation quality. Still, these were multiple sub-indicators, and no single p-value should be treated as a definitive conclusion. It is safer to treat the overall pattern as exploratory.
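One way to see why individual p-values deserve caution is to apply a multiple-comparison adjustment. The sketch below applies the Holm step-down method to the six p-values listed above. It assumes those six tests form the whole family, which this summary does not state; the thesis may have used a different family or no correction at all, and a larger family would push the adjusted values higher.

```python
def holm_adjust(p_values: dict) -> dict:
    """Return Holm step-down adjusted p-values keyed by indicator name."""
    m = len(p_values)
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    adjusted = {}
    running_max = 0.0
    for rank, (name, p) in enumerate(ordered):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p)
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[name] = running_max
    return adjusted


# p-values as reported in the list above; the family of six is an assumption.
reported = {"A1": 0.006, "A2": 0.003, "B1": 0.001,
            "B2": 0.015, "C1": 0.010, "Total": 0.016}

adjusted = holm_adjust(reported)
for name, p in adjusted.items():
    print(f"{name}: reported p={reported[name]:.3f}, Holm-adjusted p={p:.3f}")
```

Every adjusted value is at least as large as the reported one, which is the practical meaning of not reading each p-value in isolation.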

8. Teacher evaluation and student perception

In teacher evaluation, the overall average difference was not statistically significant.

Total score: Agent 21.73 / Freepass 19.48
Difference: +2.25, p=0.085, d=0.349

However, Agent mode was significantly higher in the answer domain (B).

B domain (answer): +1.28, p=0.017, d=0.488

The student survey also showed positive signals. Interaction quality, question ability, conceptual understanding, and system satisfaction all had high averages, and students also showed a preference for clarification. But survey data is self-reported. It can be affected by the novelty of the system and by the research setting itself.

9. What the code and analysis tools contributed

The actual code was not a magic box that produced the results. It provided the infrastructure for recording conversations, evaluating them, and organizing them into analyzable data.

The operating system used a SvelteKit frontend, a FastAPI backend, Redis Streams, and an agent worker to preserve the question-answer flow at the session level. The teacher interface supported evaluations on the 40-point QAC rubric, and the backend automatic evaluation service recorded a Gemini-based 30-point evaluation with three question criteria and three answer criteria.

Separate tools and datasets such as MAICEAnalysis, MAICEFIND, and MAICEsurvay were used in the analysis stage. But those tools themselves do not prove the effect. The effect must be interpreted through the research design, evaluation data, and statistical analysis together.

10. Limits are also part of the result

This study showed a possibility, but it should not be generalized too quickly.

A single school, specifically a software meister high school
Grade 2 Math I, mathematical induction unit
A short three-week experimental period
Possible score inflation in LLM evaluation
Limited teacher-evaluation sample
Possible response bias in surveys
OCR and UI improvements were not separated as independent treatments

Therefore, the result does not mean that the same effect will appear in every school. A more accurate conclusion is that a math AI designed around question clarification showed the possibility of improving learning-process support under specific conditions.

11. Conclusion: not an answer machine, but a system that returns students to their questions

The most important lesson from MAICE was that a math question is not just an input value. A question is where students reveal their current understanding, and it is also where learning begins.

MAICE was not meant to be a system that gives answers faster. It was an attempt to help students look again at where they were stuck, make the question clearer, and receive an answer based on that clarified question.

The effects were limited, and the limitations were clear. Still, the signal among lower-quartile learners and the improvement in learning-process support suggest that helping students refine questions before giving answers can have educational meaning.

Source

Kyu-Bong Kim, Development and Effectiveness Analysis of an AI Agent Supporting Question Clarification in High School Mathematics Learning, master's thesis, Pusan National University Graduate School of Education.