[MAICE Dev Log 7] How we validated educational impact: thesis-based summary
1. Research questions and validation targets
This post focuses on three questions:
- Does question clarification (Agent mode) improve learning-process support?
- Is the effect stronger for specific groups (especially lower-quartile learners)?
- Can LLM scoring be interpreted meaningfully when anchored with teacher evaluation?
The goal is not a system overview, but evidence-based interpretation of educational effects.
2. Evaluation design
- Participants: 58 students (Agent 28 / Freepass 30)
- Period: 3 weeks (2025-10-20 to 2025-11-08)
- Valid sessions: 284
- Evaluation frame: QAC (40 points)
- LLM evaluation: N=284 (Gemini / Claude / GPT-5-mini)
- Teacher evaluation: N=100 (2 external math teachers)
Why dual evaluation:
- LLM scoring is strong for large-scale pattern discovery
- Teacher scoring is strong for educational validity
Using both improves interpretive reliability.
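For concreteness, here is a minimal sketch of how a QAC-style session record could be represented. The item labels follow the dimensions referenced later in this post (A1-A3, B1-B3, C1-C2); the equal 5-point weighting is an illustrative assumption, not the thesis's actual rubric.

```python
from dataclasses import dataclass, field

# Item labels follow this post's dimensions; the equal weighting
# (8 items x 5 points = 40) is an illustrative assumption.
QAC_ITEMS = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]
MAX_PER_ITEM = 5

@dataclass
class SessionScore:
    session_id: str
    mode: str        # "Agent" or "Freepass"
    rater: str       # e.g. "gemini", "claude", "gpt-5-mini", "teacher"
    items: dict = field(default_factory=dict)  # item label -> points

    @property
    def total(self) -> float:
        return sum(self.items.values())

score = SessionScore("s001", "Agent", "teacher", {k: 4 for k in QAC_ITEMS})
print(score.total)  # 32 out of an assumed 40-point maximum
```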
3. Quantitative results: where effects appeared
3.1 Learning-support effects (C2)
- Agent mode outperformed on C2 in LLM evaluation: +0.28, p=0.004, d=0.353
- B3 (learning expansion) also showed significance: +0.22, p=0.041, d=0.245
This suggests that clarification improved support for thinking and checks for understanding, going beyond simple answer delivery.
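As a rough illustration of how such a between-mode item comparison could be computed, here is a sketch using Welch's t-test and a pooled-SD Cohen's d. The data are random placeholders, and the exact test settings are assumptions; this post does not spell them out.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder data standing in for per-session C2 scores (not the thesis data).
agent_c2 = rng.normal(3.6, 0.8, 140)
freepass_c2 = rng.normal(3.3, 0.8, 144)

# Welch's t-test: does not assume equal variances across the two modes.
t, p = stats.ttest_ind(agent_c2, freepass_c2, equal_var=False)

# Cohen's d with a pooled standard deviation.
n1, n2 = len(agent_c2), len(freepass_c2)
pooled_sd = np.sqrt(((n1 - 1) * agent_c2.var(ddof=1) +
                     (n2 - 1) * freepass_c2.var(ddof=1)) / (n1 + n2 - 2))
d = (agent_c2.mean() - freepass_c2.mean()) / pooled_sd
print(f"diff={agent_c2.mean() - freepass_c2.mean():+.2f}, p={p:.3f}, d={d:.3f}")
```

Welch's variant is the safer default when group variances may differ, which is why the sketch uses it.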
3.2 Lower-quartile (Q1) effects
- LLM C2 (Q1): p<0.001, d=0.855
- LLM total score (Q1): +2.26, p=0.032, d=0.499
- Teacher total score (Q1): +6.32, p=0.013, d=0.992
The strongest effect signal in this study appears among lower-performing learners.
For Q2-Q4, effects were smaller or non-significant, so the impact was asymmetric across ability groups.
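A sketch of the subgroup step: split sessions into quartiles on a baseline measure, then run the same between-mode test within one quartile. The quartile criterion (a hypothetical pre-test column) and all column names here are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
n = 284
df = pd.DataFrame({
    "mode": rng.choice(["Agent", "Freepass"], n),
    "pretest": rng.normal(60, 12, n),  # placeholder baseline measure
    "total": rng.normal(28, 4, n),     # placeholder QAC total
})
# Quartiles from the baseline measure (assumed criterion; the thesis's may differ).
df["quartile"] = pd.qcut(df["pretest"], 4, labels=["Q1", "Q2", "Q3", "Q4"])

q1 = df[df["quartile"] == "Q1"]
t, p = stats.ttest_ind(q1.loc[q1["mode"] == "Agent", "total"],
                       q1.loc[q1["mode"] == "Freepass", "total"],
                       equal_var=False)
print(f"Q1 total: p={p:.3f}")
```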
3.3 An opposite-direction signal also exists
- A3 (learning context) favored Freepass: d=-0.425, p=0.001
So clarification does not improve every dimension simultaneously. The data show a trade-off between learning-process support and explicit context retention.
3.4 Repeated-session (longitudinal) changes
Among repeated users, Agent mode showed significant gains across multiple items:
- A1 +0.57 (p=0.006)
- A2 +0.71 (p=0.003)
- B1 +0.93 (p=0.001)
- B2 +0.93 (p=0.015)
- C1 +0.64 (p=0.010)
- total +3.45 (p=0.016)
Freepass showed fewer significant item-level changes and no significant total-score gain.
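The natural sketch for this longitudinal comparison is a paired test within each repeated user, as below. The data are placeholders, and the thesis may use more measurement points or a different model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
first = rng.normal(3.0, 0.7, 25)          # placeholder first-session scores
later = first + rng.normal(0.6, 0.7, 25)  # placeholder later-session scores

# Paired t-test: each learner is compared against their own earlier session.
t, p = stats.ttest_rel(later, first)
print(f"mean change={np.mean(later - first):+.2f}, p={p:.3f}")
```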
4. Teacher-side results and learner-perceived outcomes
4.1 Teacher-side findings
In teacher evaluation (N=100):
- total score: +2.25 (p=0.085, non-significant)
- answer domain (B): +1.28 (p=0.017, d=0.488, significant)
- Q1 total effect: +6.32 (p=0.013, d=0.992, very large)
Teacher-side evidence points to the strongest impact in the lower quartile and in answer quality.
4.2 Learner-perceived outcomes
Post survey (N=47):
- interaction quality: 4.37/5.0
- concept understanding: 4.39/5.0
- system satisfaction: 4.62/5.0
- preference for clarification mode (among clear A/B responses): 68.4%
Qualitative responses repeatedly mentioned: “I could identify what I did not understand” and “my questions became more specific.”
5. Educational mechanism from the qualitative logs (N=1,589)
A repeated pattern in high-quality sessions:
- a vague initial question
- clarification turns that re-define the problem
- a K2 -> K3 -> K4 transition
- explicit verbalization of where the learner is stuck
In other words, the main gain was not just “better final answers,” but better structuring of student thinking.
Common traits of high-scoring sessions:
- 2-3 clarification turns before the solution phase
- a conceptual -> procedural -> metacognitive progression
- more cause- and strategy-level feedback during error correction
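The first of these traits is directly countable if the logs are annotated by turn type. A sketch, where the labels "clarify" and "solve" are a hypothetical annotation schema rather than the thesis's actual one:

```python
def clarification_turns_before_solution(turns: list[dict]) -> int:
    """Count clarify turns preceding the first solve turn in a session."""
    count = 0
    for turn in turns:
        if turn["type"] == "solve":
            break
        if turn["type"] == "clarify":
            count += 1
    return count

session = [{"type": "ask"}, {"type": "clarify"}, {"type": "clarify"},
           {"type": "solve"}, {"type": "feedback"}]
print(clarification_turns_before_solution(session))  # 2
```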
6. Meaning of dual evaluation (LLM + teacher)
- LLM-teacher correlation: r=0.754 (p<0.001)
- LLM score inflation relative to teacher scores: +5.46 points on average
Interpretation rules used in this study:
- do not treat LLM scoring as an absolute replacement for grading
- use LLM scoring for scalable pattern detection and relative comparison
- keep the final educational interpretation anchored to teacher evaluation
Reliability indicators were also acceptable:
- LLM ICC(3,k)=0.872
- teacher ICC(3,k)=0.739
So LLM scoring is best treated as a scalable evaluation-support layer.
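Both reliability figures are standard computations. Here is a sketch using scipy for Pearson r and pingouin for ICC(3,k), on synthetic long-format data; the rater labels and the +5.46 offset baked into the placeholders are illustrative, mirroring the inflation noted above.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

rng = np.random.default_rng(3)
base = rng.normal(28, 4, 100)  # placeholder per-session "true" quality
# Long-format table: one row per (session, rater); rater labels are illustrative.
rows = []
for rater, bias in [("llm_mean", 5.46), ("teacher_mean", 0.0)]:
    for i, b in enumerate(base):
        rows.append({"session_id": i, "rater": rater,
                     "total": b + bias + rng.normal(0, 2)})
scores = pd.DataFrame(rows)

# Pearson r between LLM and teacher totals on the jointly scored sessions.
wide = scores.pivot_table(index="session_id", columns="rater", values="total")
r, p = stats.pearsonr(wide["llm_mean"], wide["teacher_mean"])
print(f"r={r:.3f}, p={p:.3g}")

# ICC(3,k): consistency of a fixed rater set, averaged ratings.
icc = pg.intraclass_corr(data=scores, targets="session_id",
                         raters="rater", ratings="total")
print(icc.loc[icc["Type"] == "ICC3k", ["Type", "ICC"]])
```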
7. Limitations
- single-context setting limits generalization
- baseline differences were small but not zero
- single-turn sessions reduce the observability of clarification effects
- LLM evaluation should be interpreted with teacher-anchored validation
- survey response bias cannot be fully excluded
- the teacher-evaluated sample (N=100 sessions from 2 raters) limits broad generalization
8. Conclusion (effect-focused)
This study does not claim “smarter AI” in general. It shows that question clarification, used as an instructional intervention, can improve learning-process support, with stronger effects for lower-quartile students.
The practical value of MAICE is not model showmanship, but learning-process design that helps students structure their own thinking.
One-line operational takeaway:
Clarification is not a UX add-on for answer delivery; it functions as an educational intervention that can raise learning-process quality, especially for lower-performing learners.
Source
- Master’s thesis by Kim Kyubong (Graduate School of Education, Pusan National University, 2026)