[MAICE Dev Log 1] Building a Teaching AI, not just an Answer Bot (Project Overview)

1. Opening: look at the question before the answer

Imagine a student asking an AI system, "How do I do this?" The AI can immediately produce a solution. But the question itself is missing the information that matters most. We do not know the unit, how far the student has gotten, or exactly where the student is stuck.

The core idea of this post is simple. MAICE was not built to give answers faster. It was built to help students make their own questions clearer first.

If the system answers this kind of question immediately, the conversation may look fast. But learning does not necessarily improve. The student does not get a chance to rethink where the difficulty is, and the AI gives a generic explanation without knowing the student's current understanding.

The pilot study showed a similar signal. In a pilot dataset of N = 385 question-answer records, 72.3% of student questions lacked explicit learning context as defined by the study criteria: current progress, attempted solution, or the exact point of difficulty. Question quality and answer quality were also positively correlated (r = 0.691, p < 0.001).

Of course, that number does not prove that "better questions always cause better answers." But it was enough to make one design problem clear: when building a math AI, we should not look only at answer generation. We also need to look at how the question is formed.

That is why MAICE (Mathematical AI Chatbot for Education) began not as a chatbot that gives faster answers, but as a math AI that helps students clarify their questions.

2. Classifying a question, then asking back

Two educational ideas shaped MAICE's design. One was Bloom's knowledge dimension. The other was Dewey's reflective thinking.

Bloom's knowledge dimension helps us ask what kind of knowing a student's question requires.

K1 factual knowledge: asking for a formula, definition, or name
K2 conceptual knowledge: asking why something is true or how concepts relate
K3 procedural knowledge: asking about a solution process or proof procedure
K4 metacognitive knowledge: asking to check one's own solution or thinking process

For example, "Tell me the mathematical induction formula" is close to K1. But "Why is it enough to prove the case $n=k+1$ after assuming the case $n=k$?" involves both conceptual (K2) and procedural (K3) knowledge. "Check which step in my solution is wrong" is closer to K4, a question about monitoring one's own thinking.
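As a rough illustration, the four knowledge types and the three example questions above could be written down as hand-labeled data. This is only a sketch: the names `KnowledgeType`, `LabeledQuestion`, and `EXAMPLES` are hypothetical and are not taken from the MAICE repository.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class KnowledgeType(Enum):
    FACTUAL = "K1"        # formula, definition, or name
    CONCEPTUAL = "K2"     # why something is true, how concepts relate
    PROCEDURAL = "K3"     # solution process or proof procedure
    METACOGNITIVE = "K4"  # checking one's own solution or thinking


@dataclass(frozen=True)
class LabeledQuestion:
    text: str
    types: FrozenSet[KnowledgeType]  # one question can involve several types


# Hand-labeled versions of the three example questions in the text.
EXAMPLES = [
    LabeledQuestion(
        "Tell me the mathematical induction formula",
        frozenset({KnowledgeType.FACTUAL}),
    ),
    LabeledQuestion(
        "Why is it enough to prove the case n=k+1 after assuming n=k?",
        frozenset({KnowledgeType.CONCEPTUAL, KnowledgeType.PROCEDURAL}),
    ),
    LabeledQuestion(
        "Check which step in my solution is wrong",
        frozenset({KnowledgeType.METACOGNITIVE}),
    ),
]

for q in EXAMPLES:
    labels = ", ".join(sorted(t.value for t in q.types))
    print(f"[{labels}] {q.text}")
```

Storing the labels as a set rather than a single value reflects the point that one question can span more than one knowledge type.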

In Dewey's reflective thinking, MAICE focused especially on the stage of defining what the problem is. When a student says "I don't get this," the system should not immediately show a solution. Instead, it should help the student specify the problem: "Which part is confusing?", "What have you tried?", or "Are you stuck at the base case or the induction step?"
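One way to picture this problem-definition step is as a check for which pieces of context are still missing. The sketch below is an assumption for illustration only: `ProblemContext`, its fields, and `follow_up_questions` are hypothetical names, and the real system relies on an LLM agent rather than fixed rules like these.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProblemContext:
    # Hypothetical fields; None means the student has not said it yet.
    confusing_part: Optional[str] = None
    attempt: Optional[str] = None
    stuck_step: Optional[str] = None  # e.g. "base case" or "induction step"


def follow_up_questions(ctx: ProblemContext) -> List[str]:
    """Return one clarifying question per piece of missing context."""
    questions = []
    if ctx.confusing_part is None:
        questions.append("Which part is confusing?")
    if ctx.attempt is None:
        questions.append("What have you tried?")
    if ctx.stuck_step is None:
        questions.append("Are you stuck at the base case or the induction step?")
    return questions


# "I don't get this" carries no context, so every question is still open.
print(follow_up_questions(ProblemContext()))

# Once the student explains the attempt, only the other gaps remain.
print(follow_up_questions(ProblemContext(attempt="I checked n=1 and it works")))
```

When the returned list is empty, the problem is specified well enough that answering makes more sense than asking again.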

3. Not one chatbot, but a system with separated roles

At first, it looked as if this process could be handled with one prompt: read the question, classify the knowledge dimension, ask back if needed, and generate an answer.

But real conversations were not that simple. Judging a question, asking a clarification question, and generating an answer are different kinds of decisions. If one model handles everything at once, it may answer when it should ask back, or ask unnecessary follow-up questions when the original question is already clear enough.

So MAICE was designed not as one giant prompt, but as an agent system with separated roles. A student's question is classified first. If it is ambiguous, it goes through clarification. Once it is answerable, it moves to answer generation.
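The classify, clarify, answer flow can be sketched as a small routing function that takes each role as a separate callable. Everything here is a simplified assumption: `handle_question`, `Reply`, and the `toy_*` helpers are hypothetical stand-ins, not the interfaces used in the MAICE code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Reply:
    kind: str  # "clarify" or "answer"
    text: str


def handle_question(
    question: str,
    is_answerable: Callable[[str], bool],
    clarify: Callable[[str], str],
    answer: Callable[[str], str],
) -> Reply:
    """Classify first; ask back if ambiguous, answer otherwise."""
    if not is_answerable(question):
        return Reply("clarify", clarify(question))
    return Reply("answer", answer(question))


# Toy stand-ins for the real agents, for illustration only.
def toy_is_answerable(q: str) -> bool:
    # Crude proxy: treat very short questions as ambiguous.
    return len(q.split()) > 5


def toy_clarify(q: str) -> str:
    return "Which part is confusing?"


def toy_answer(q: str) -> str:
    return f"(answer to: {q})"


for q in [
    "How do I do this?",
    "Why is it enough to prove the case n=k+1 after assuming n=k?",
]:
    reply = handle_question(q, toy_is_answerable, toy_clarify, toy_answer)
    print(reply.kind, "->", reply.text)
```

Because each role is passed in separately, the decision of whether to ask back can be changed or tested without touching answer generation, which is the main reason for splitting the roles.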

As a development note, the actual MAICE repository divides this flow across a SvelteKit frontend, a FastAPI backend, Redis/PostgreSQL, and a separate agent worker. The core agents launched in agent/worker.py are QuestionClassifier, QuestionImprovement, AnswerGenerator, Observer, and FreeTalker.

There is one important caution. In the thesis, Agent mode meant the clarification-centered flow, while Freepass mode meant an immediate-answer flow without the same clarification step. In the current code path, the mode assigned to each user is still kept as a reference value, but runtime processing is fixed to the Agent flow. The A/B comparison should therefore be read in the context of the experiment design and analysis, while the current implementation should be understood as running only the question-clarification-centered Agent flow.
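The difference between the assigned mode and the runtime mode can be shown with a minimal sketch. This is not the repository's code: `UserSession`, `runtime_mode`, and the mode constants are hypothetical names chosen to show the idea that the assignment is stored but not used for routing.

```python
from dataclasses import dataclass

AGENT = "agent"
FREEPASS = "freepass"


@dataclass(frozen=True)
class UserSession:
    user_id: str
    assigned_mode: str  # experiment assignment, stored for reference only


def runtime_mode(session: UserSession) -> str:
    """Every session is processed with the Agent flow, whatever was assigned."""
    return AGENT


session = UserSession(user_id="student-01", assigned_mode=FREEPASS)
print(session.assigned_mode)  # the assignment is still available for analysis
print(runtime_mode(session))  # but processing follows the Agent flow
```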

4. How the experiment was set up

To evaluate MAICE, we ran a three-week study in a high-school grade 2 Math I unit on mathematical induction.

Participants: 58 assigned grade-2 students, 55 valid participants
Comparison: Agent mode vs Freepass mode
Valid sessions: 284
Evaluation frame: QAC 40-point checklist
Additional data: 1,589 dialogue logs, post-survey data, and a teacher-evaluated sample

This post will not interpret the numbers in detail. That is the role of Post 6. Here, the more important goal is to understand what MAICE tried to do and why question clarification was built into the system.

5. A short preview of the results

If the overall result is reduced to one sentence, it would be this: MAICE's Agent mode was not uniformly better on every indicator, but it showed meaningful signals in learning-process support and among lower-performing students.

For example, in the LLM evaluation, Agent mode scored higher than Freepass on the C2 learning-support item, and the effect size was larger in the lower-quartile group (Q1). The teacher evaluation also showed a large difference in the lower-quartile sample.

However, these results came from a single school, one unit, and a short experimental period. Implementation elements such as OCR, UI improvements, and deployment stability supported the experimental environment, but they should not be interpreted as independently causing the learning effects.

6. What this series covers

This series is not meant to be only a thesis summary. It follows the problems we encountered, the structures we built, and the educational meaning of those structures.

Post	Topic	Core question
1	Project Overview	Why focus on question clarification before answers?
2	Multi-Agent System	Why split roles instead of using one prompt?
3	SvelteKit Interface	What is needed for students to actually enter math questions?
4	QAC Checklist	How can a good learning conversation be evaluated?
5	Blue-Green Deployment	What was needed to keep the service running during the experiment?
6	Educational Impact	What effects and limits appeared in the actual data?

The next post takes up the second question: why did MAICE become a system of collaborating agents instead of one smart chatbot?