[MAICE Dev Log 7] How we validated educational impact: thesis-based summary

#agent #education #research

1. Research questions and validation targets

This post focuses on three questions:

  1. Does question clarification (Agent mode) improve learning-process support?
  2. Is the effect stronger for specific groups (especially lower-quartile learners)?
  3. Can LLM scoring be interpreted meaningfully when anchored with teacher evaluation?

The goal is not a system overview, but evidence-based interpretation of educational effects.


2. Evaluation design

  • Participants: 58 students (Agent 28 / Freepass 30)
  • Period: 3 weeks (2025-10-20 to 2025-11-08)
  • Valid sessions: 284
  • Evaluation frame: QAC (40 points)
    • LLM evaluation: N=284 (Gemini / Claude / GPT-5-mini)
    • Teacher evaluation: N=100 (2 external math teachers)

Why dual evaluation:

  • LLM scoring scales to large-N pattern discovery
  • teacher scoring anchors educational validity

Using both improves interpretive reliability.
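
For concreteness, here is a minimal sketch of how a dual-evaluation record could be organized. The field names and the 8-item split of the 40-point QAC frame are assumptions for illustration; the thesis does not publish its data schema.

```python
from dataclasses import dataclass

# Assumed 8-item split of the 40-point QAC frame (A1-A3, B1-B3, C1-C2)
QAC_ITEMS = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2"]

@dataclass
class SessionScore:
    session_id: str
    mode: str                        # "agent" or "freepass"
    llm_scores: dict                 # per-item mean across the 3 LLM scorers
    teacher_scores: dict | None = None  # present only for the N=100 subset

    def llm_total(self) -> float:
        # Sum per-item LLM scores into the 40-point QAC total
        return sum(self.llm_scores.get(k, 0.0) for k in QAC_ITEMS)
```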


3. Quantitative results: where effects appeared

3.1 Learning-support effects (C2)

  • Agent mode outperformed on C2 in LLM evaluation: +0.28, p=0.004, d=0.353
  • B3 (learning expansion) also showed significance: +0.22, p=0.041, d=0.245

This suggests clarification improved thinking support and understanding checks, not just answer delivery.
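
The comparisons above can be reproduced with a standard two-sample test plus an effect size. A minimal sketch, assuming Welch's t-test and Cohen's d on per-session item scores (the thesis may use a different test variant):

```python
import numpy as np
from scipy import stats

def compare_groups(agent: np.ndarray, freepass: np.ndarray):
    """Return (mean difference, p-value, Cohen's d) for Agent vs Freepass scores."""
    _, p = stats.ttest_ind(agent, freepass, equal_var=False)  # Welch's t-test
    n1, n2 = len(agent), len(freepass)
    pooled_sd = np.sqrt(((n1 - 1) * agent.var(ddof=1)
                         + (n2 - 1) * freepass.var(ddof=1)) / (n1 + n2 - 2))
    d = (agent.mean() - freepass.mean()) / pooled_sd          # Cohen's d
    return agent.mean() - freepass.mean(), p, d
```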

3.2 Lower-quartile (Q1) effects

  • LLM C2 (Q1): p<0.001, d=0.855
  • LLM total score (Q1): +2.26, p=0.032, d=0.499
  • Teacher total score (Q1): +6.32, p=0.013, d=0.992

The strongest effect signal in this study appears in lower-performing learners.

For Q2-Q4, effects were smaller or non-significant, so the impact was asymmetric across groups.
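
One way to reproduce the subgroup breakdown, reusing compare_groups from the sketch in 3.1. The quartile assignment from a baseline measure and the column names are assumptions; the exact quartile criterion in the thesis is not shown here.

```python
import pandas as pd

def quartile_effects(df: pd.DataFrame, item: str = "c2"):
    """df columns (assumed): 'baseline', 'mode', plus one column per QAC item."""
    df = df.assign(quartile=pd.qcut(df["baseline"], 4,
                                    labels=["Q1", "Q2", "Q3", "Q4"]))
    results = {}
    for q, grp in df.groupby("quartile", observed=True):
        agent = grp.loc[grp["mode"] == "agent", item].to_numpy()
        free = grp.loc[grp["mode"] == "freepass", item].to_numpy()
        results[q] = compare_groups(agent, free)  # (diff, p, d) per quartile
    return results
```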

3.3 An opposite-direction signal also exists

  • A3 (learning context) favored Freepass: d=-0.425, p=0.001

So clarification does not improve every dimension simultaneously. The data show a trade-off between learning-process support and explicit context retention.

3.4 Repeated-session (longitudinal) changes

In repeated users, Agent mode showed significant gains in multiple areas:

  • A1 +0.57 (p=0.006)
  • A2 +0.71 (p=0.003)
  • B1 +0.93 (p=0.001)
  • B2 +0.93 (p=0.015)
  • C1 +0.64 (p=0.010)
  • total +3.45 (p=0.016)

Freepass showed fewer significant item-level changes and no significant total-score gain.
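
A sketch of one way to run the repeated-session comparison: paired first-versus-last scores per user. Column names are assumptions, and the thesis may instead use a longitudinal model.

```python
import pandas as pd
from scipy import stats

def first_last_gain(sessions: pd.DataFrame, item: str):
    """sessions columns (assumed): 'user_id', 'timestamp', plus one column per QAC item."""
    ordered = sessions.sort_values("timestamp")
    grouped = ordered.groupby("user_id")[item]
    first, last = grouped.first(), grouped.last()
    counts = ordered.groupby("user_id").size()
    repeat = counts[counts >= 2].index                   # repeated users only
    _, p = stats.ttest_rel(last[repeat], first[repeat])  # paired t-test
    return (last[repeat] - first[repeat]).mean(), p
```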


4. Teacher-side results and learner-perceived outcomes

4.1 Teacher-side findings

In teacher evaluation (N=100):

  • total score: +2.25 (p=0.085, non-significant)
  • answer domain (B): +1.28 (p=0.017, d=0.488, significant)
  • Q1 total effect: +6.32 (p=0.013, d=0.992, very large)

Teacher-side evidence points to the strongest impact in the lower quartile and in answer quality.

4.2 Learner-perceived outcomes

Post-study survey (N=47):

  • interaction quality: 4.37/5.0
  • concept understanding: 4.39/5.0
  • system satisfaction: 4.62/5.0
  • clarification-mode preference (among clear A/B responses): 68.4%

Qualitative responses repeatedly mentioned: “I could identify what I did not understand” and “my questions became more specific.”


5. Educational mechanism from qualitative logs (N=1,589)

Repeated pattern in high-quality sessions:

  1. vague question
  2. clarification turns for problem re-definition
  3. K2 -> K3 -> K4 transition
  4. explicit verbalization of where the learner is stuck

In other words, the main gain was not just “better final answers,” but better structuring of student thinking.

Common high-score session traits (a detection sketch follows this list):

  • 2-3 clarification turns before solution phase
  • conceptual -> procedural -> metacognitive progression
  • more cause/strategy feedback in error correction
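
The first trait is easy to operationalize. A toy detector, assuming a hypothetical log format where each turn carries a phase label (the real logs were coded qualitatively; this only illustrates the idea):

```python
def clarification_turns_before_solution(turns: list[dict]) -> int:
    """turns: [{'role': ..., 'phase': 'clarify' | 'solve' | ...}, ...] (assumed schema)."""
    count = 0
    for turn in turns:
        if turn.get("phase") == "solve":
            break  # stop at the start of the solution phase
        if turn.get("role") == "assistant" and turn.get("phase") == "clarify":
            count += 1
    return count
```

High-score sessions typically scored 2-3 on this count.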

6. Meaning of dual evaluation (LLM + teacher)

  • LLM-teacher correlation: r=0.754 (p<0.001)
  • LLM score inflation vs teachers: +5.46 points on average

Interpretation rule used in this study:

  • do not treat LLM scoring as an absolute replacement for human grading
  • use LLM for scalable pattern detection and relative comparison
  • keep final educational interpretation anchored by teacher evaluation

Reliability indicators were also acceptable:

  • LLM ICC(3,k)=0.872
  • teacher ICC(3,k)=0.739
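
Both agreement statistics are standard and straightforward to compute. A sketch using pingouin's ICC implementation, with the long-format layout as an assumption about how the ratings were laid out:

```python
import pandas as pd
import pingouin as pg
from scipy import stats

def agreement(long_df: pd.DataFrame, llm_totals, teacher_totals):
    """long_df columns (assumed): 'session', 'rater', 'score' (one row per rating)."""
    icc = pg.intraclass_corr(data=long_df, targets="session",
                             raters="rater", ratings="score")
    icc3k = icc.loc[icc["Type"] == "ICC3k", "ICC"].item()  # ICC(3,k): fixed raters, averaged
    r, p = stats.pearsonr(llm_totals, teacher_totals)       # LLM-teacher Pearson r
    return icc3k, r, p
```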

So LLM scoring is best treated as a scalable evaluation-support layer.

7. Limitations

  • single-context setting limits generalization
  • baseline differences were small but not zero
  • one-turn sessions reduce clarification observability
  • LLM evaluation should be interpreted with teacher-anchored validation
  • survey response bias cannot be fully excluded
  • the teacher-evaluated subset (N=100 sessions, scored by 2 raters) limits broad generalization

8. Conclusion (effect-focused)

This study does not claim “smarter AI” in general. It shows that question clarification as an instructional intervention can improve learning-process support, with stronger effects for lower-quartile students.

The practical value of MAICE is not model showmanship, but learning-process design that helps students structure their own thinking.

One-line operational takeaway:

Clarification is not a UX add-on for answer delivery; it functions as an educational intervention that can raise learning-process quality, especially for lower-performing learners.

Source

  • Master’s thesis by Kim Kyubong (Graduate School of Education, Pusan National University, 2026)
