A hybrid deterministic and AI-driven backend that provides real-time, clinically grounded feedback to nursing students in a Virtual Reality wound care simulation — reducing dependence on human supervisors.
Final year undergraduate researchers from the Department of Computer Engineering, University of Peradeniya.
This system bridges the gap between VR simulation technology and intelligent clinical education — serving multiple stakeholders in nursing education.
Practice clinical wound care procedures in a safe, immersive VR environment and receive instant, detailed feedback on both clinical knowledge and communication quality — without waiting for a human supervisor.
Monitor student performance across sessions through the Teacher Portal — view per-student session logs, critical safety flags, step-by-step action timelines, and manage clinical scenarios with no code changes required.
Deploy scalable clinical training that augments or supplements human supervision. Reduce bottlenecks caused by limited supervisor availability while maintaining consistent, evidence-based assessment standards.
Explore a replicable multi-agent LLM architecture with a clear hybrid deterministic-AI design pattern — applicable to any procedural clinical skill beyond wound care, including medication administration and IV cannulation.
What makes this system stand out from conventional nursing simulation tools.
Safety-critical pass/fail decisions are governed by hardcoded deterministic logic — the LLM cannot hallucinate a verdict. No student passes with a skipped prerequisite because of an AI error.
Students receive rapid voice-driven feedback during the simulation — spoken patient responses within seconds (7.09 s P50 end-to-end, LLM + TTS) and action validation under 2 s — maintaining immersion without breaking the VR experience.
All feedback is grounded in authoritative clinical guidelines via RAG — not general AI training knowledge. Uploaded guidelines instantly enrich the knowledge base for all agents.
Clinical educators manage scenarios, update guidelines, and review student performance through the Teacher Portal — no developers needed. New scenarios go live instantly.
All six agents share a common BaseAgent interface. New agents, scenarios, or clinical procedures can be added without restructuring the core system architecture.
Three distinct TTS voices — patient, staff nurse, feedback narrator — create a believable, multi-character VR environment powered by Groq Orpheus v1 English.
Nursing education today has critical gaps that prevent scalable, intelligent training at the quality required for clinical safety.
Traditional clinical training relies entirely on human supervisors, whose limited availability and scalability prevent training large student cohorts effectively.
VR nursing simulations exist but lack intelligent real-time feedback. Students finish simulations with no understanding of what was done correctly or incorrectly.
Students receive generic, delayed, or no feedback after simulation sessions — missing the critical window for learning correction that formative assessment requires.
Four core research objectives drove the design of this system.
Design a multi-agent LLM framework that evaluates multiple dimensions of nursing competence simultaneously using six specialised AI agents.
Integrate Retrieval-Augmented Generation (RAG) to ground all feedback in evidence-based clinical guidelines rather than the model's training knowledge alone.
Build a complete VR-connected backend with real-time voice interaction, automated step evaluation, and a teacher management portal for runtime content updates.
Evaluate the system rigorously across six complementary pillars: deterministic logic, integration, AI agent quality, performance, fault tolerance, and speech accuracy.
Two wound care scenarios are implemented as structured JSON documents in Firebase Firestore. The system is scenario-agnostic — new scenarios can be added at runtime via the Teacher Portal.
Mr. Sunil Perera, 52-year-old male with hypertension. Post-operative clean surgical wound on the left forearm. Known allergies to Penicillin and Latex. Standard wound healing risk profile.
Patient with Type 2 Diabetes Mellitus. Elevated infection risk, impaired wound healing. Additional expected clinical reasoning — blood sugar control, HbA1c assessment, and diabetes-specific risk factor evaluation during history taking.
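As an illustration of the scenario-agnostic design described above, a minimal sketch of how one of these scenarios might be written to Firestore follows. Field names and the collection layout are assumptions for illustration, not the project's actual schema.

```python
# Illustrative sketch only — field names and collection layout are
# assumptions, not the project's actual Firestore schema.
import firebase_admin
from firebase_admin import credentials, firestore

firebase_admin.initialize_app(credentials.ApplicationDefault())
db = firestore.client()

scenario = {
    "title": "Post-operative surgical wound (standard)",
    "patient": {
        "name": "Sunil Perera",
        "age": 52,
        "sex": "male",
        "comorbidities": ["hypertension"],
        "allergies": ["Penicillin", "Latex"],
    },
    "wound": {"type": "clean surgical", "location": "left forearm"},
    "risk_factors_required": False,  # True for the diabetic scenario
}

# The Teacher Portal performs an equivalent write at runtime —
# no code changes or redeployments required.
db.collection("scenarios").document("standard_wound").set(scenario)
```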
The student's simulation is governed by a linear state machine — steps must be completed in strict order before the session reaches COMPLETED.
Student interviews the virtual AI patient using voice or text. Must confirm identity, check allergies, assess pain, take medical history, and explain the procedure. Diabetic patients require additional risk factor questions.
Student answers Multiple Choice Questions about the wound shown in VR — wound type, anatomical location, exudate amount/type, tissue colour, and signs of infection. Evaluated deterministically.
Student performs nine sequential physical actions in VR: hand hygiene, trolley cleaning, solution and dressing selection and verification, materials arrangement, and trolley transport to patient.
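A minimal sketch of the linear state machine governing these steps — the step names are abbreviated and the real transition rules are richer, but the core invariant is the same: steps cannot be skipped or reordered.

```python
# Minimal sketch of the linear session state machine; step names
# are illustrative abbreviations of the stages described above.
from enum import Enum, auto

class Step(Enum):
    HISTORY_TAKING = auto()
    WOUND_ASSESSMENT_MCQ = auto()
    PHYSICAL_PREPARATION = auto()
    COMPLETED = auto()

ORDER = list(Step)

def advance(current: Step) -> Step:
    """Move to the next step; steps cannot be skipped or reordered."""
    if current is Step.COMPLETED:
        raise ValueError("Session already completed")
    return ORDER[ORDER.index(current) + 1]
```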
Each agent extends a shared BaseAgent class wrapping the OpenAI Responses API. All agents have a non-LLM fallback to prevent cascading failures.
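A hedged sketch of that shared interface, assuming method and field names — only the pattern (one LLM call path plus a deterministic non-LLM fallback) is taken from the text; the model name is a placeholder.

```python
# Sketch of the shared agent interface; method and field names are
# assumptions. The pattern: try the LLM, fall back deterministically.
from abc import ABC, abstractmethod
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

class BaseAgent(ABC):
    model = "gpt-4o"   # placeholder model name
    temperature = 0.0

    @abstractmethod
    def fallback(self, prompt: str) -> str:
        """Deterministic, non-LLM answer used when the API call fails."""

    def run(self, prompt: str) -> str:
        try:
            response = client.responses.create(
                model=self.model,
                input=prompt,
                temperature=self.temperature,
            )
            return response.output_text
        except Exception:
            # Prevent one failing external service from cascading
            # through the rest of the session.
            return self.fallback(prompt)
```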
Simulates the virtual patient. Responses strictly grounded in scenario data — no invented facts. Temperature 0.0 for full determinism. Conditionally discloses sensitive information only when explicitly asked.
Conversational supervising nurse. Operates in two modes: guidance mode (step explanations when asked) and verification mode (triggered by keywords — returns approved / rejected / incomplete verdict).
RAG-grounded evaluation of the full history transcript. Returns boolean checklist: identity, allergies, pain, medical history, procedure explained, and (for diabetic scenarios) risk factors assessed.
Evaluates communication quality and style — self-introduction, empathy, open vs. closed questions, jargon avoidance, and turn count. Heuristic: ≥4 turns = Appropriate, ≥2 = Partially Appropriate.
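The turn-count heuristic is simple enough to show directly. A sketch follows; the label for fewer than two turns is an assumption, not from the source.

```python
def rate_turn_count(turns: int) -> str:
    # Thresholds from the heuristic above; the final label below
    # is a hypothetical placeholder for the < 2 case.
    if turns >= 4:
        return "Appropriate"
    if turns >= 2:
        return "Partially Appropriate"
    return "Not Appropriate"  # assumed label, not confirmed by the source
```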
Hybrid architecture. Pass/fail is 100% deterministic via a hardcoded prerequisite map — the LLM cannot override the verdict. The LLM is invoked only to explain why a skipped step matters, personalised to patient risk profile.
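A minimal sketch of that hybrid split, assuming illustrative action names — the verdict comes entirely from a lookup, and the LLM is consulted only after the deterministic decision is made.

```python
# Sketch of the hardcoded prerequisite map — action names are illustrative.
PREREQUISITES = {
    "trolley_cleaning": ["hand_hygiene"],
    "solution_selection": ["hand_hygiene", "trolley_cleaning"],
}

def validate(action: str, completed: set[str]) -> tuple[bool, list[str]]:
    """Deterministic pass/fail; the LLM never sees or overrides this verdict."""
    missing = [p for p in PREREQUISITES.get(action, []) if p not in completed]
    return (not missing, missing)

# Only on failure is the LLM asked to *explain* why the missing steps
# matter, personalised to the patient's risk profile.
```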
Synthesises all raw agent outputs into one supportive student-facing paragraph. Acknowledges strengths first, embeds the score naturally, closes with encouragement. Explicitly avoids punitive language.
| Criterion | Weight (with risk factors) | Weight (without) |
|---|---|---|
| Identity confirmed | 15% | 15% |
| Allergies checked | 25% | 30% |
| Pain assessed | 20% | 20% |
| Medical history taken | 20% | 20% |
| Procedure explained | 10% | 15% |
| Risk factors assessed | 10% | — |
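These weights map directly onto the boolean checklist returned by the History Evaluation agent. A sketch of the weighted scoring, assuming checklist key names:

```python
# Weights taken from the table above; checklist key names are assumptions.
WEIGHTS = {
    True:  {"identity": 0.15, "allergies": 0.25, "pain": 0.20,
            "medical_history": 0.20, "procedure_explained": 0.10,
            "risk_factors": 0.10},
    False: {"identity": 0.15, "allergies": 0.30, "pain": 0.20,
            "medical_history": 0.20, "procedure_explained": 0.15},
}

def history_score(checklist: dict[str, bool], has_risk_factors: bool) -> float:
    """Sum the weights of every criterion the student satisfied."""
    weights = WEIGHTS[has_risk_factors]
    return sum(w for criterion, w in weights.items() if checklist.get(criterion))
```

Under these weights, a diabetic-scenario student who misses only the risk-factor questions scores 0.90.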
A six-pillar evaluation framework designed for AI-enabled educational systems. All evaluations implemented as automated scripts for full reproducibility.
Deterministic logic correctness via pytest — state machine, MCQ evaluator, scoring engine, and Clinical Agent prerequisite validation.
Full API lifecycle and session flow via FastAPI TestClient — REST endpoints, WebSockets, RAG pipeline, student log persistence.
Golden dataset approach with known ground truth labels. LLM-as-judge rubric for the Feedback Narrator Agent across five quality criteria.
P50/P95 latency profiling via time.perf_counter() across 20 iterations per operation for all major system endpoints.
Fault injection by mocking all four external services — OpenAI LLM API, Vector Store, Groq STT/TTS, and Firebase Firestore.
STT accuracy via Word Error Rate on nursing dialogue samples. TTS round-trip intelligibility test via re-transcription and comparison.
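The performance pillar above is concrete enough to sketch: a small harness computing P50/P95 over 20 iterations with time.perf_counter(). Here `call_endpoint` is a stand-in for any measured operation, not a real function in the codebase.

```python
# Sketch of the P50/P95 profiling harness (20 iterations, perf_counter).
import statistics
import time

def profile(call_endpoint, iterations: int = 20) -> tuple[float, float]:
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        call_endpoint()
        samples.append(time.perf_counter() - start)
    p50 = statistics.median(samples)
    p95 = statistics.quantiles(samples, n=100)[94]  # 95th percentile cut point
    return p50, p95
```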
| Operation | P50 Latency | P95 Latency | Notes |
|---|---|---|---|
| Patient response (LLM + TTS) | 7.09 s | — | Real-time VR interaction |
| Action validation — Clinical Agent | ~1.2 s | ~1.7 s | Deterministic verdict — no LLM in the pass/fail path |
| Nurse verification response | 4.89 s | — | Acceptable for step pacing |
| History evaluation (RAG + 2 agents + narrator) | 77.54 s | — | End-of-step batch — sequential API calls |
| MCQ evaluation (deterministic) | 0.02 s | 0.05 s | Near-instantaneous |
| STT transcription (Groq Whisper) | 0.67 s | 0.89 s | Suitable for real-time VR |
| TTS synthesis (Groq Orpheus) | 0.82 s | 0.98 s | Scales linearly with text length |
Model: Groq Whisper Large v3. Evaluated on recorded nursing dialogue samples with known ground truth transcripts using Word Error Rate (WER).
Model: Groq Orpheus v1 English. Round-trip intelligibility test: text → TTS → audio → STT → compare. Three distinct character voices.
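The round-trip test above reduces to a short loop: synthesise, re-transcribe, score with WER. A sketch follows; `synthesize` and `transcribe` are hypothetical wrappers around the Groq TTS/STT endpoints, not functions from this codebase.

```python
# Round-trip intelligibility sketch: text -> TTS -> audio -> STT -> WER.
from jiwer import wer

def synthesize(text: str, voice: str) -> bytes:
    """Hypothetical wrapper around the Groq Orpheus TTS endpoint."""
    raise NotImplementedError

def transcribe(audio: bytes) -> str:
    """Hypothetical wrapper around Groq Whisper Large v3 STT."""
    raise NotImplementedError

def round_trip_wer(text: str, voice: str) -> float:
    hypothesis = transcribe(synthesize(text, voice))
    return wer(text, hypothesis)  # 0.0 = a perfect round trip
```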
A transparent assessment of what this work achieves, where it falls short, and where it goes next.
Dive into the full codebase, explore the VR Unity assets, or reach out to the team to learn more about the system and collaboration opportunities.