[MAICE Dev Log 7] How we validated educational impact: thesis-based summary
1. Research questions and validation targets
This post focuses on three questions:
- Does question clarification (Agent mode) improve learning-process support?
- Is the effect stronger for specific groups (especially lower-quartile learners)?
- Can LLM scoring be interpreted meaningfully when anchored with teacher evaluation?
The goal is not a system overview, but evidence-based interpretation of educational effects.
2. Evaluation design
- Participants: 58 students (Agent 28 / Freepass 30)
- Period: 3 weeks (2025-10-20 to 2025-11-08)
- Valid sessions: 284
- Evaluation frame: QAC (40 points)
- LLM evaluation: N=284 (Gemini / Claude / GPT-5-mini)
- Teacher evaluation: N=100 (2 external math teachers)
Why dual evaluation:
- LLM scoring is strong for large-scale pattern discovery
- Teacher scoring is strong for educational validity
Using both improves interpretive reliability.
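For concreteness, here is a minimal sketch of how a QAC-style session record could be represented. The item labels follow the dimensions referenced later in this post (A1-A3, B1-B3, C1-C2); the equal 5-point weighting is an illustrative assumption, not the thesis's actual rubric.

```python
from dataclasses import dataclass, field

# Item labels follow this post's dimensions; the equal weighting
# (8 items x 5 points = 40) is an illustrative assumption.
QAC_ITEMS = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]
MAX_PER_ITEM = 5

@dataclass
class SessionScore:
    session_id: str
    mode: str        # "Agent" or "Freepass"
    rater: str       # e.g. "gemini", "claude", "gpt-5-mini", "teacher"
    items: dict = field(default_factory=dict)  # item label -> points

    @property
    def total(self) -> float:
        return sum(self.items.values())

score = SessionScore("s001", "Agent", "teacher", {k: 4 for k in QAC_ITEMS})
print(score.total)  # 32 out of an assumed 40-point maximum
```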
3. Quantitative results: where effects appeared
3.1 Learning-support effects (C2)
- Agent mode outperformed on C2 in LLM evaluation: +0.28, p=0.004, d=0.353
- B3 (learning expansion) also showed significance: +0.22, p=0.041, d=0.245
This suggests that clarification improved support for thinking and checks for understanding, going beyond simple answer delivery.
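As a rough illustration of how such a between-mode item comparison could be computed, here is a sketch using Welch's t-test and a pooled-SD Cohen's d. The data are random placeholders, and the exact test settings are assumptions; this post does not spell them out.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data standing in for per-session C2 scores (not the thesis data).
agent_c2 = rng.normal(3.6, 0.8, 140)
freepass_c2 = rng.normal(3.3, 0.8, 144)

# Welch's t-test: does not assume equal variances across the two modes.
t, p = stats.ttest_ind(agent_c2, freepass_c2, equal_var=False)

# Cohen's d with a pooled standard deviation.
n1, n2 = len(agent_c2), len(freepass_c2)
pooled_sd = np.sqrt(((n1 - 1) * agent_c2.var(ddof=1) +
                     (n2 - 1) * freepass_c2.var(ddof=1)) / (n1 + n2 - 2))
d = (agent_c2.mean() - freepass_c2.mean()) / pooled_sd
print(f"diff={agent_c2.mean() - freepass_c2.mean():+.2f}, p={p:.3f}, d={d:.3f}")
```

Welch's variant is the safer default when group variances may differ, which is why the sketch uses it.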
3.2 Lower-quartile (Q1) effects
- LLM C2 (Q1): p<0.001, d=0.855
- LLM total score (Q1): +2.26, p=0.032, d=0.499
- Teacher total score (Q1): +6.32, p=0.013, d=0.992
The strongest effect signal in this study appears among lower-performing learners.
For Q2-Q4, effects were smaller or non-significant, so the impact was asymmetric across ability groups.
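A sketch of the subgroup step: split sessions into quartiles on a baseline measure, then run the same between-mode test within one quartile. The quartile criterion (a hypothetical pre-test column) and all column names here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
n = 284
df = pd.DataFrame({
    "mode": rng.choice(["Agent", "Freepass"], n),
    "pretest": rng.normal(60, 12, n),  # placeholder baseline measure
    "total": rng.normal(28, 4, n),     # placeholder QAC total
})
# Quartiles from the baseline measure (assumed criterion; the thesis's may differ).
df["quartile"] = pd.qcut(df["pretest"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

q1 = df[df["quartile"] == "Q1"]
t, p = stats.ttest_ind(q1.loc[q1["mode"] == "Agent", "total"],
                       q1.loc[q1["mode"] == "Freepass", "total"],
                       equal_var=False)
print(f"Q1 total: p={p:.3f}")
```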
3.3 An opposite-direction signal also exists
- A3 (learning context) favored Freepass: d=-0.425, p=0.001
So clarification does not improve every dimension simultaneously. The data show a trade-off between learning-process support and explicit context retention.
3.4 Repeated-session (longitudinal) changes
Among repeated users, Agent mode showed significant gains across multiple items:
- A1 +0.57 (p=0.006)
- A2 +0.71 (p=0.003)
- B1 +0.93 (p=0.001)
- B2 +0.93 (p=0.015)
- C1 +0.64 (p=0.010)
- total +3.45 (p=0.016)
Freepass showed fewer significant item-level changes and no significant total-score gain.
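The natural sketch for this longitudinal comparison is a paired test within each repeated user, as below. The data are placeholders, and the thesis may use more measurement points or a different model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
first = rng.normal(3.0, 0.7, 25)          # placeholder first-session scores
later = first + rng.normal(0.6, 0.7, 25)  # placeholder later-session scores

# Paired t-test: each learner is compared against their own earlier session.
t, p = stats.ttest_rel(later, first)
print(f"mean change={np.mean(later - first):+.2f}, p={p:.3f}")
```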
4. Teacher-side results and learner-perceived outcomes
4.1 Teacher-side findings
In teacher evaluation (N=100):
- total score: +2.25 (p=0.085, non-significant)
- answer domain (B): +1.28 (p=0.017, d=0.488, significant)
- Q1 total effect: +6.32 (p=0.013, d=0.992, very large)
Teacher-side evidence points to the strongest impact in the lower quartile and in answer quality.
4.2 Learner-perceived outcomes
Post survey (N=47):
- interaction quality: 4.37/5.0
- concept understanding: 4.39/5.0
- system satisfaction: 4.62/5.0
- preference for clarification mode (among clear A/B responses): 68.4%
Qualitative responses repeatedly mentioned: “I could identify what I did not understand” and “my questions became more specific.”
5. Educational mechanism from the qualitative logs (N=1,589)
A repeated pattern in high-quality sessions:
- a vague initial question
- clarification turns that re-define the problem
- a K2 -> K3 -> K4 transition
- explicit verbalization of where the learner is stuck
In other words, the main gain was not just “better final answers,” but better structuring of student thinking.
Common traits of high-scoring sessions:
- 2-3 clarification turns before the solution phase
- a conceptual -> procedural -> metacognitive progression
- more cause- and strategy-level feedback during error correction
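The first of these traits is directly countable if the logs are annotated by turn type. A sketch, where the labels "clarify" and "solve" are a hypothetical annotation schema rather than the thesis's actual one:

```python
def clarification_turns_before_solution(turns: list[dict]) -> int:
    """Count clarify turns preceding the first solve turn in a session."""
    count = 0
    for turn in turns:
        if turn["type"] == "solve":
            break
        if turn["type"] == "clarify":
            count += 1
    return count

session = [{"type": "ask"}, {"type": "clarify"}, {"type": "clarify"},
           {"type": "solve"}, {"type": "feedback"}]
print(clarification_turns_before_solution(session))  # 2
```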
6. Meaning of dual evaluation (LLM + teacher)
- LLM-teacher correlation: r=0.754 (p<0.001)
- LLM score inflation relative to teacher scores: +5.46 points on average
Interpretation rules used in this study:
- do not treat LLM scoring as an absolute replacement for grading
- use LLM scoring for scalable pattern detection and relative comparison
- keep the final educational interpretation anchored to teacher evaluation
Reliability indicators were also acceptable:
- LLM ICC(3,k)=0.872
- teacher ICC(3,k)=0.739
So LLM scoring is best treated as a scalable evaluation-support layer.
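Both reliability figures are standard computations. Here is a sketch using scipy for Pearson r and pingouin for ICC(3,k), on synthetic long-format data; the rater labels and the +5.46 offset baked into the placeholders are illustrative, mirroring the inflation noted above.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(3)
base = rng.normal(28, 4, 100)  # placeholder per-session "true" quality
# Long-format table: one row per (session, rater); rater labels are illustrative.
rows = []
for rater, bias in [("llm_mean", 5.46), ("teacher_mean", 0.0)]:
    for i, b in enumerate(base):
        rows.append({"session_id": i, "rater": rater,
                     "total": b + bias + rng.normal(0, 2)})
scores = pd.DataFrame(rows)

# Pearson r between LLM and teacher totals on the jointly scored sessions.
wide = scores.pivot_table(index="session_id", columns="rater", values="total")
r, p = stats.pearsonr(wide["llm_mean"], wide["teacher_mean"])
print(f"r={r:.3f}, p={p:.3g}")

# ICC(3,k): consistency of a fixed rater set, averaged ratings.
icc = pg.intraclass_corr(data=scores, targets="session_id",
                         raters="rater", ratings="total")
print(icc.loc[icc["Type"] == "ICC3k", ["Type", "ICC"]])
```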
7. Limitations
- single-context setting limits generalization
- baseline differences were small but not zero
- single-turn sessions reduce the observability of clarification effects
- LLM evaluation should be interpreted with teacher-anchored validation
- survey response bias cannot be fully excluded
- the teacher-evaluated sample (N=100 sessions from 2 raters) limits broad generalization
8. Conclusion (effect-focused)
This study does not claim “smarter AI” in general. It shows that question clarification, used as an instructional intervention, can improve learning-process support, with stronger effects for lower-quartile students.
The practical value of MAICE is not model showmanship, but learning-process design that helps students structure their own thinking.
One-line operational takeaway:
Clarification is not a UX add-on for answer delivery; it functions as an educational intervention that can raise learning-process quality, especially for lower-performing learners.
Source
- Master’s thesis by Kim Kyubong (Graduate School of Education, Pusan National University, 2026)