This is the story of UniMind, a student-facing mental-health chatbot we built at the University of Northampton as part of our second-year group project module.
It began as a rule-driven RASA assistant and grew into a safety-first, multi-agent AI that plans the conversation before it speaks.
University counselling teams work incredibly hard — and they’re also at capacity. Our goal with UniMind was not to replace clinicians but to offer safe, 24/7 first-line support that complements existing services and seamlessly routes students to real help. The final system hits those targets across multiple metrics: layered crisis detection at ~99.99% reliability in internal testing (100% on our scenario set; ≥99.9% with a single-provider outage; ~99% in the worst-case local fallback), 8.2/10 user satisfaction, and an ~89% cost reduction.
We began with a conventional RASA stack and built a deep intent hierarchy (200+ intents) with stories covering crisis indicators, academic stress, finances, relationships and campus services. In controlled tests we reached 100% story-level accuracy, but intent classification on real user input averaged only ~5.24% accuracy, a critical limitation in this domain.
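The gap between story-level and real-input accuracy comes from how each is measured. A minimal harness of the kind behind the real-input figure could look like this (the `classify` callable and the labelled sample are illustrative, not the project's actual model or dataset):

```python
# Hypothetical evaluation harness: label a sample of live messages,
# run the NLU model on each, and compare predicted top intents.
def intent_accuracy(labelled, classify):
    """Fraction of messages whose predicted top intent matches the label."""
    hits = sum(classify(msg) == intent for msg, intent in labelled)
    return hits / len(labelled)

# Tiny illustrative sample of student-style messages.
sample = [
    ("my loan hasn't arrived", "finance_issue"),
    ("i can't sleep before exams", "academic_stress"),
    ("my flatmates ignore me", "relationships"),
]
```

Story-level tests follow scripted paths, so they can score perfectly while this kind of per-message check exposes how brittle a 200+ intent taxonomy is against unscripted input.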
We rebuilt UniMind as a multi-agent therapeutic architecture, with three core agents plus a university-resources assistant:
| Agent | Primary Function | Technology | Input | Output |
|---|---|---|---|---|
| PSY-OVERSEER-1 | Strategic Planning | Gemini 2.0 Flash | User conversation + context | Therapeutic plan steps |
| PSY-OVERSEER-2 | Tactical Guidance | Integrated | Plan step + user state | Response guidance |
| PSY_mini | Conversation Execution | Psychotherapy-LLM (8B) | Guidance + user message | Empathetic response |
| RESOURCES_assist | University Integration | OpenAI Assistant API | Conversation content | UON resources |
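A single turn flows through the agents in the table above: the planner produces plan steps, the tactical layer turns the current step into guidance, and the executor writes the reply. Here is a minimal sketch of that pipeline; the three `call_*` functions are stand-ins for the real model calls (Gemini 2.0 Flash, the integrated guidance layer, and the 8B Psychotherapy-LLM), and their return values are placeholders:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    user_message: str
    history: list[str]

def call_overseer_1(turn: Turn) -> list[str]:
    # PSY-OVERSEER-1 (strategic planning): ordered therapeutic plan steps.
    return ["validate feelings", "reflect", "ask one open question"]

def call_overseer_2(step: str, turn: Turn) -> str:
    # PSY-OVERSEER-2 (tactical guidance): plan step -> response guidance.
    return f"Guidance: {step}; keep it warm and brief."

def call_psy_mini(guidance: str, turn: Turn) -> str:
    # PSY_mini (conversation execution): guidance -> empathetic reply.
    return f"That sounds really tough. ({guidance})"

def respond(turn: Turn) -> str:
    plan = call_overseer_1(turn)               # plan the conversation
    guidance = call_overseer_2(plan[0], turn)  # guide the current step
    return call_psy_mini(guidance, turn)       # then speak

print(respond(Turn("I'm stressed about exams", [])))
```

The key property is ordering: the system plans before it speaks, so the executor never improvises strategy on its own.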
We curated therapist dialogues and mental-health Q&A from public datasets, cleaned for student themes. Cleaned CSVs drive prompt design, evaluation scenarios and guidance structure.
A strict system prompt set a warm persona, kept replies short and added a safety override for crisis language. An action threshold delayed concrete skills/advice until sufficient understanding had been established.
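Both mechanisms can be sketched concretely. The following is an illustrative design, not the production code: the first safety layer is a deterministic pattern check that short-circuits to fixed crisis copy before any LLM is called (a model-based layer would sit behind it), and the action threshold gates advice on turn count and an understanding score. The patterns, copy, and threshold values here are assumptions:

```python
import re

# Layer 1: deterministic crisis detection (illustrative patterns only).
CRISIS_PATTERNS = [
    r"\b(kill myself|suicide|end it all)\b",
    r"\bhurt (myself|me)\b",
]

# Fixed crisis copy, never paraphrased by a model (example wording).
CRISIS_COPY = (
    "I'm really glad you told me. You deserve support right now. "
    "Please contact Samaritans on 116 123 or your university crisis line."
)

def crisis_check(message: str) -> bool:
    text = message.lower()
    return any(re.search(p, text) for p in CRISIS_PATTERNS)

def safe_respond(message: str, llm_respond) -> str:
    if crisis_check(message):        # deterministic layer fires first
        return CRISIS_COPY
    return llm_respond(message)      # otherwise fall through to the model

def should_give_advice(turn_count: int, understanding_score: float,
                       min_turns: int = 3, min_score: float = 0.7) -> bool:
    # Action threshold: hold back concrete skills/advice until enough
    # context has been gathered (threshold values are illustrative).
    return turn_count >= min_turns and understanding_score >= min_score
```

Keeping the crisis copy deterministic means the most safety-critical output is never subject to model variability.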
Evaluation covered quantitative and qualitative signals across internal sessions, a public demo, and external participants. Reliability and performance summaries are detailed in the technical report.
We first analysed Kaggle‑curated conversations and used an LLM scorer to rank micro‑elements (validation, reflection, open question, psychoeducation→application, tone, list‑avoidance). Those insights seeded our initial system prompt and micro‑structure before iterative evaluation and tuning.
We chose this quantifiable route so changes could be defended with data rather than handcrafted rules. Using base data comparisons (original therapy vs student responses) and the exported metrics (results.json, student_results.json), we targeted the upper quartile for human‑likeness and tone, then encoded the highest‑impact micro‑elements into guidance. See analysis assets in the repo’s Analysis_of_Test_Cycles.
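The scoring loop behind those rankings can be sketched as follows. This is a simplified stand-in: `scorer` represents the LLM call that rates one micro-element on a 0–1 scale, and the export mirrors the spirit of results.json / student_results.json rather than their exact schema:

```python
import json

MICRO_ELEMENTS = ["validation", "reflection", "open_question",
                  "psychoeducation_to_application", "tone", "list_avoidance"]

def score_response(response: str, scorer) -> dict[str, float]:
    # One score per micro-element for a single response.
    return {elem: scorer(response, elem) for elem in MICRO_ELEMENTS}

def rank_elements(scored: list[dict[str, float]]) -> list[tuple[str, float]]:
    # Average each micro-element across responses and sort descending,
    # so the highest-impact elements surface first for guidance design.
    means = {e: sum(s[e] for s in scored) / len(scored)
             for e in MICRO_ELEMENTS}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)

def export(scored: list[dict[str, float]], path: str) -> None:
    # Persist raw per-response scores for later comparison runs.
    with open(path, "w") as f:
        json.dump(scored, f, indent=2)
```

Ranking averages rather than single scores keeps the comparison stable across the original-therapy and student-response datasets.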
Short therapy‑style test sessions provided qualitative checks on warmth, brevity and progression, plus safety behaviour under crisis language. We used these logs and reports to refine style guards, the action threshold and plan transitions. Session artefacts live in experiments/therapy_sessions and the framework notes in psychocounsel_testing_framework.md.
| Type | Detail |
|---|---|
| Strength | Multi‑layer crisis safety (deterministic crisis copy + layered detection); warm, concise tone with structured micro‑moves (validate → reflect → single open question → optional psychoeducation‑to‑application); clear plan stages; UON resource integration. |
| Strength | Neutral, jargon‑free language aiding inclusivity; consistent structure reduces ad‑hoc bias; explicit safety overrides. |
| Weakness | UK/UON‑centric resources by default; needs broader localisation for non‑UK contexts and international students (including crisis lines and services). |
| Weakness | Prompts lack explicit cultural‑sensitivity cues; missing gentle checks for cultural/identity context when relevant. |
| Weakness | Occasional edge‑case tone misreads (e.g., “disrespectful” perception) and handling of abrupt topic shifts; plan‑alignment smoothing needed. |
See supporting analyses: Model_Bias_Analysis.md, Model_Inclusivity_Analysis.md, and Therapy_Session_Summary.md.
Earlier demo-day numbers (~9 s average response time, alongside the RASA accuracy limits) explain the pivot. The current stack responds in ~2.5 s on warm paths while preserving safety behaviour.