Team Members

Final-year undergraduate researchers from the Department of Computer Engineering, University of Peradeniya.

Ms. Ramya Ekanayake
Project Supervisor · Faculty of Allied Health Sciences
Prof. Damayanthi Dassanayake
Project Supervisor · Faculty of Allied Health Sciences

Abstract

Automated evaluation in virtual reality (VR)-based nursing education presents significant challenges in delivering clinically accurate and pedagogically sound feedback comparable to expert nurse assessments. Existing AI-driven systems often lack reliability, transparency, and alignment with established nursing competency frameworks, limiting their effectiveness in high-stakes educational settings. This paper presents a multi-agent Large Language Model (LLM) framework that addresses these limitations through a hybrid deterministic and LLM architecture deployed in a VR wound care nursing simulation. Six specialized agents evaluate distinct dimensions of nursing competence simultaneously: patient interaction fidelity, procedural prerequisite compliance, clinical knowledge coverage, communication quality, material verification, and formative feedback synthesis. Safety-critical pass/fail decisions are governed exclusively by deterministic logic to prevent hallucination, while the LLM is used for educational explanation and feedback narration. All agent feedback is grounded in verified clinical guidelines through a Retrieval-Augmented Generation (RAG) pipeline. Results demonstrate that the hybrid multi-agent approach successfully decomposes complex nursing competence into evaluable dimensions while maintaining clinical safety guarantees.

Who Is This For?

This system bridges the gap between VR simulation technology and intelligent clinical education — serving multiple stakeholders in nursing education.

Nursing Students

Practice clinical wound care procedures in a safe, immersive VR environment and receive instant, detailed feedback on both clinical knowledge and communication quality — without waiting for a human supervisor.

Clinical Educators

Monitor student performance across sessions through the Teacher Portal — view per-student session logs, critical safety flags, step-by-step action timelines, and manage clinical scenarios with no code changes required.

Nursing Schools & Hospitals

Deploy scalable clinical training that supplements human supervision. Reduce bottlenecks caused by limited supervisor availability while maintaining consistent, evidence-based assessment standards.

AI & Healthcare Researchers

Explore a replicable multi-agent LLM architecture with a clear hybrid deterministic-AI design pattern — applicable to any procedural clinical skill beyond wound care, including medication administration and IV cannulation.

Key Benefits

What makes this system stand out from conventional nursing simulation tools.

Clinically Safe by Design

Safety-critical pass/fail decisions are governed by hardcoded deterministic logic — the LLM cannot hallucinate a verdict. No student passes a skipped prerequisite because of AI error.

Real-Time Feedback

Students receive instant voice-driven feedback during the simulation — patient responses at ~3s, action validation under 2s — maintaining immersion without breaking the VR experience.

Evidence-Based Assessment

All feedback is grounded in authoritative clinical guidelines via RAG — not general AI training knowledge. Uploaded guidelines instantly enrich the knowledge base for all agents.

Educator-Controlled Content

Clinical educators manage scenarios, update guidelines, and review student performance through the Teacher Portal — no developers needed. New scenarios go live instantly.

Modular & Extensible

All six agents share a common BaseAgent interface. New agents, scenarios, or clinical procedures can be added without restructuring the core system architecture.

Multi-Voice Immersion

Three distinct TTS voices — patient, staff nurse, feedback narrator — create a believable, multi-character VR environment powered by Groq Orpheus v1 English.

The Problem We Solve

Nursing education today has critical gaps that prevent scalable, intelligent training at the quality required for clinical safety.

No Scalable Feedback

Traditional clinical training relies entirely on human supervisors, whose limited availability prevents large student cohorts from being trained effectively.

Passive VR Simulations

VR nursing simulations exist but lack intelligent real-time feedback. Students finish simulations with no understanding of what was done correctly or incorrectly.

Generic or Delayed Feedback

Students receive generic, delayed, or no feedback after simulation sessions — missing the critical window for learning correction that formative assessment requires.

Research Question: How can we automate meaningful, clinically grounded feedback in a VR nursing training environment using Large Language Models — while maintaining the safety guarantees required in clinical education?

Methodology

Four core research objectives drove the design of this system.

01

Design a multi-agent LLM framework that evaluates multiple dimensions of nursing competence simultaneously using six specialised AI agents.

02

Integrate Retrieval-Augmented Generation (RAG) to ground all feedback in evidence-based clinical guidelines rather than the model's training knowledge alone.

03

Build a complete VR-connected backend with real-time voice interaction, automated step evaluation, and a teacher management portal for runtime content updates.

04

Evaluate the system rigorously across six complementary pillars: deterministic logic, integration, AI agent quality, performance, fault tolerance, and speech accuracy.

Training Scenarios

Two wound care scenarios are implemented as structured JSON documents in Firebase Firestore. The system is scenario-agnostic — new scenarios can be added at runtime via the Teacher Portal.

Scenario 001

Post-Operative Clean Wound

Mr. Sunil Perera, 52-year-old male with hypertension. Post-operative clean surgical wound on the left forearm. Known allergies to Penicillin and Latex. Standard wound healing risk profile.

Scenario 002

Diabetic Patient Wound Care

Patient with Type 2 Diabetes Mellitus. Elevated infection risk, impaired wound healing. Additional expected clinical reasoning — blood sugar control, HbA1c assessment, and diabetes-specific risk factor evaluation during history taking.

Three-Step State Machine

The student's simulation is governed by a linear state machine — steps must be completed in strict order before the session reaches COMPLETED.

Step 1

History Taking

Student interviews the virtual AI patient using voice or text. Must confirm identity, check allergies, assess pain, take medical history, and explain the procedure. Diabetic patients require additional risk factor questions.

Step 2

Wound Assessment

Student answers Multiple Choice Questions about the wound shown in VR — wound type, anatomical location, exudate amount/type, tissue colour, and signs of infection. Evaluated deterministically.

Step 3

Cleaning & Dressing

Student performs nine sequential physical actions in VR: hand hygiene, trolley cleaning, solution and dressing selection and verification, materials arrangement, and trolley transport to patient.
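The linear progression above can be sketched as a minimal state machine. This is an illustrative sketch only; the `Step` names and `Session` class are assumptions, not the production implementation.

```python
from enum import Enum

class Step(Enum):
    HISTORY_TAKING = 1
    WOUND_ASSESSMENT = 2
    CLEANING_DRESSING = 3
    COMPLETED = 4

# Linear transitions: each step unlocks only its immediate successor.
NEXT = {
    Step.HISTORY_TAKING: Step.WOUND_ASSESSMENT,
    Step.WOUND_ASSESSMENT: Step.CLEANING_DRESSING,
    Step.CLEANING_DRESSING: Step.COMPLETED,
}

class Session:
    def __init__(self):
        self.state = Step.HISTORY_TAKING

    def complete_step(self, step: Step) -> Step:
        # Strict ordering: only the current step may be completed.
        if step is not self.state:
            raise ValueError(
                f"Cannot complete {step.name}: current step is {self.state.name}"
            )
        self.state = NEXT[self.state]
        return self.state
```

Because the transition table is a plain dict, out-of-order completion is rejected before any agent or LLM is ever consulted.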

Six Specialised Agents

Each agent extends a shared BaseAgent class wrapping the OpenAI Responses API. All agents have a non-LLM fallback to prevent cascading failures.
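A minimal sketch of what such a shared interface with a non-LLM fallback might look like. Class and method names here are assumptions; the turn-count thresholds mirror the Communication Agent heuristic described under Agent 04, and the lowest band label is invented for illustration.

```python
from abc import ABC, abstractmethod

class BaseAgent(ABC):
    """Shared agent interface: an LLM call guarded by a deterministic fallback."""

    def run(self, payload: dict) -> dict:
        try:
            return self._call_llm(payload)
        except Exception:
            # Any LLM/API failure degrades to the deterministic fallback
            # instead of crashing the session.
            return self._fallback(payload)

    @abstractmethod
    def _call_llm(self, payload: dict) -> dict: ...

    @abstractmethod
    def _fallback(self, payload: dict) -> dict: ...

class CommunicationAgent(BaseAgent):
    def _call_llm(self, payload: dict) -> dict:
        raise ConnectionError("LLM unavailable")  # simulate an outage

    def _fallback(self, payload: dict) -> dict:
        # Turn-count heuristic: >=4 turns Appropriate, >=2 Partially Appropriate.
        turns = payload.get("turns", 0)
        if turns >= 4:
            label = "Appropriate"
        elif turns >= 2:
            label = "Partially Appropriate"
        else:
            label = "Needs Improvement"  # illustrative lowest band
        return {"communication": label, "source": "fallback"}
```

The point of the pattern is that `run()` always returns a structured result, so one failing external service cannot cascade through the pipeline.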

Agent 01 · History Step

Patient Agent

Simulates the virtual patient. Responses strictly grounded in scenario data — no invented facts. Temperature 0.0 for full determinism. Conditionally discloses sensitive information only when explicitly asked.

temp = 0.0 · scenario-grounded
Agent 02 · Cleaning Step

Staff Nurse Agent

Conversational supervising nurse. Operates in two modes: guidance mode (step explanations when asked) and verification mode (triggered by keywords — returns approved / rejected / incomplete verdict).

guidance + verification
Agent 03 · Post-History

Knowledge Agent

RAG-grounded evaluation of the full history transcript. Returns boolean checklist: identity, allergies, pain, medical history, procedure explained, and (for diabetic scenarios) risk factors assessed.

60% of History score
Agent 04 · Post-History

Communication Agent

Evaluates communication quality and style — self-introduction, empathy, open vs. closed questions, jargon avoidance, and turn count. Heuristic: ≥4 turns = Appropriate, ≥2 = Partially Appropriate.

40% of History score
Agent 05 · Cleaning Step · Hybrid

Clinical Agent

Hybrid architecture. Pass/fail is 100% deterministic via a hardcoded prerequisite map — the LLM cannot override the verdict. The LLM is invoked only to explain why a skipped step matters, personalised to patient risk profile.

deterministic verdict + LLM explanation
Agent 06 · Post-Step

Feedback Narrator Agent

Synthesises all raw agent outputs into one supportive student-facing paragraph. Acknowledges strengths first, embeds the score naturally, closes with encouragement. Explicitly avoids punitive language.

formative synthesis
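The Clinical Agent's deterministic gate (Agent 05) can be illustrated with a toy prerequisite map. Action names, the function signatures, and the stubbed explanation call are all hypothetical; what the sketch preserves is the design rule that the verdict is computed before, and independently of, any LLM call.

```python
# Hardcoded prerequisite map (action names illustrative): each physical
# action lists the actions that must already be completed.
PREREQUISITES = {
    "clean_trolley": {"hand_hygiene"},
    "select_solution": {"hand_hygiene", "clean_trolley"},
    "transport_trolley": {"hand_hygiene", "clean_trolley", "select_solution"},
}

def validate_action(action: str, completed: set) -> dict:
    """Deterministic pass/fail: the LLM never participates in this decision."""
    missing = PREREQUISITES.get(action, set()) - completed
    verdict = "pass" if not missing else "fail"
    result = {"action": action, "verdict": verdict, "missing": sorted(missing)}
    if missing:
        # Only now is the LLM invoked, and only to *explain* the verdict,
        # personalised to the patient's risk profile (call stubbed here).
        result["explanation"] = explain_with_llm(action, missing)
    return result

def explain_with_llm(action: str, missing: set) -> str:
    # Stub: a real implementation would call the LLM with RAG context.
    return f"Skipping {', '.join(sorted(missing))} before {action} raises infection risk."
```

Note that `explanation` is attached after the verdict is fixed, so a hallucinated explanation can never flip a fail into a pass.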

RAG Pipeline

How It Works

Student action or transcript triggers evaluation
Dynamic RAG query built from step, scenario & clinical context
HyDE generator creates hypothetical guideline paragraph
Semantic similarity search in OpenAI Vector Store
Relevant guideline passages injected into agent system prompt
Evidence-based, grounded feedback generated
📄 Diabetic Wound Care Guidelines
📄 History Taking Evaluation Guidelines
📄 Wound Cleaning & Dressing Guidelines
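The six-stage flow above can be sketched end to end as a toy pipeline. Word overlap stands in for embedding similarity, a canned sentence stands in for the LLM-generated HyDE paragraph, and all names are hypothetical; the real system queries an OpenAI Vector Store.

```python
# Tiny in-memory stand-in for the guideline vector store.
GUIDELINES = [
    "Wound cleaning guideline: perform hand hygiene before touching sterile materials.",
    "History taking guideline: confirm patient identity and check allergies first.",
    "Diabetic wound care guideline: assess blood sugar control and HbA1c.",
]

def build_query(step: str, scenario: str) -> str:
    # Stage 2: dynamic query built from step + scenario context.
    return f"{step} guidance for {scenario} patient"

def hyde(query: str) -> str:
    # Stage 3: a real system asks the LLM to draft a hypothetical guideline
    # paragraph and searches with *that* instead of the raw query.
    return f"A clinical guideline about {query} would state the required actions."

def retrieve(text: str, k: int = 1) -> list:
    # Stage 4: word overlap as a crude proxy for semantic similarity.
    words = set(text.lower().split())
    scored = sorted(GUIDELINES,
                    key=lambda g: len(words & set(g.lower().split())),
                    reverse=True)
    return scored[:k]

def grounded_prompt(step: str, scenario: str) -> str:
    # Stage 5: retrieved passages are injected into the agent system prompt.
    passages = retrieve(hyde(build_query(step, scenario)))
    return "Use only these guidelines:\n" + "\n".join(passages)
```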

History Step Rubric

Criterion · With Risk · Without Risk
Identity confirmed · 15% · 15%
Allergies checked · 25% · 30%
Pain assessed · 20% · 20%
Medical history taken · 20% · 20%
Procedure explained · 10% · 15%
Risk factors assessed · 10% · —
Knowledge Agent: 60%  ·  Communication Agent: 40%
Excellent ≥85% · Good ≥70% · Adequate ≥50% · Needs Improvement <50%
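The rubric can be expressed as a small scoring function. The weights, the 60/40 split, and the grade thresholds come from the rubric itself; the checklist keys and function names are assumptions.

```python
# Rubric weights; the True branch applies to at-risk (e.g. diabetic) scenarios.
WEIGHTS = {
    True:  {"identity": 0.15, "allergies": 0.25, "pain": 0.20,
            "history": 0.20, "procedure": 0.10, "risk_factors": 0.10},
    False: {"identity": 0.15, "allergies": 0.30, "pain": 0.20,
            "history": 0.20, "procedure": 0.15},
}

def knowledge_score(checklist: dict, with_risk: bool) -> float:
    """Sum the weights of every criterion the student satisfied."""
    w = WEIGHTS[with_risk]
    return sum(w[c] for c, done in checklist.items() if done and c in w)

def history_score(knowledge: float, communication: float) -> float:
    # Knowledge Agent contributes 60%, Communication Agent 40%.
    return 0.6 * knowledge + 0.4 * communication

def grade(score: float) -> str:
    if score >= 0.85:
        return "Excellent"
    if score >= 0.70:
        return "Good"
    if score >= 0.50:
        return "Adequate"
    return "Needs Improvement"
```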

Technology Stack

Backend
FastAPI · Python
LLM Engine
OpenAI GPT
RAG Store
OpenAI Vector Stores
Speech-to-Text
Groq Whisper v3
Text-to-Speech
Groq Orpheus v1
Database
Firebase Firestore
Real-time
WebSockets + REST
VR Client
Unity
Teacher Portal
React + Vite
Unit Tests
pytest
Integration
FastAPI TestClient
Visualisation
Matplotlib + Seaborn

Results & Analysis

A six-pillar evaluation framework designed for AI-enabled educational systems. All evaluations implemented as automated scripts for full reproducibility.

01

Unit Testing

Deterministic logic correctness via pytest — state machine, MCQ evaluator, scoring engine, and Clinical Agent prerequisite validation.

02

Integration Testing

Full API lifecycle and session flow via FastAPI TestClient — REST endpoints, WebSockets, RAG pipeline, student log persistence.

03

AI Agent Evaluation

Golden dataset approach with known ground truth labels. LLM-as-judge rubric for the Feedback Narrator Agent across five quality criteria.

04

Performance Testing

P50/P95 latency profiling via time.perf_counter() across 20 iterations per operation for all major system endpoints.

05

Reliability Testing

Fault injection by mocking all four external services — OpenAI LLM API, Vector Store, Groq STT/TTS, and Firebase Firestore.

06

Speech Interface

STT accuracy via Word Error Rate on nursing dialogue samples. TTS round-trip intelligibility test via re-transcription and comparison.
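The latency protocol from pillar 04 (P50/P95 over 20 iterations of time.perf_counter()) can be sketched as a small harness. This is a possible implementation consistent with the protocol described, not the project's actual script.

```python
import statistics
import time

def profile(fn, iterations: int = 20) -> dict:
    """P50/P95 latency for a callable, timed with time.perf_counter()."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    # quantiles(n=100) yields 99 cut points: index 49 is P50, index 94 is P95.
    q = statistics.quantiles(samples, n=100)
    return {"p50": q[49], "p95": q[94]}
```

In practice the same harness would wrap each system endpoint (patient response, action validation, MCQ evaluation, and so on) to produce the table below.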

100%
Unit Test Pass Rate
22 tests · 4 components
0.94
Knowledge Agent F1 Score
Precision 0.96 · Recall 0.93
1.00
Communication Agent Accuracy
95% consistency rate
100%
Fault Injection Recovery
0 crashes · 0 unhandled errors

Performance Latency (P50 / P95)

Operation · P50 Latency · P95 Latency · Notes
Patient response (LLM + TTS) · 7.09 s · — · Real-time VR interaction
Action validation (Clinical Agent) · ~1.2 s · ~1.7 s · Deterministic path — near-instant
Nurse verification response · 4.89 s · — · Acceptable for step pacing
History evaluation (RAG + 2 agents + narrator) · 77.54 s · — · End-of-step batch — sequential API calls
MCQ evaluation (deterministic) · 0.02 s · 0.05 s · Near-instantaneous
STT transcription (Groq Whisper) · 0.67 s · 0.89 s · Suitable for real-time VR
TTS synthesis (Groq Orpheus) · 0.82 s · 0.98 s · Scales linearly with text length

Speech Interface Evaluation

🎙️ Speech-to-Text (STT)

Model: Groq Whisper Large v3. Evaluated on recorded nursing dialogue samples with known ground truth transcripts using Word Error Rate (WER).

0.13
Average WER
(Word Error Rate)
0.67s
P50 Latency
(P95: 0.89s)
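WER as reported here is the standard metric: word-level Levenshtein (edit) distance divided by the number of reference words. A self-contained implementation, illustrative rather than the project's evaluation script:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Rolling single-row dynamic programme over the edit-distance matrix.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = curr
    return prev[-1] / len(ref)
```

For example, transcribing "do you have any allergies" as "do you have allergies" drops one of five reference words, giving a WER of 0.2.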

🔊 Text-to-Speech (TTS)

Model: Groq Orpheus v1 English. Round-trip intelligibility test: text → TTS → audio → STT → compare. Three distinct character voices.

0.13
Round-Trip WER
(intelligibility confirmed)
0.82s
P50 Latency
(P95: 0.98s)

Impact, Limitations & Future Work

A transparent assessment of what this work achieves, where it falls short, and where it goes next.

✅ Impact & Contributions

  • First multi-agent LLM framework for automated formative feedback in VR nursing education
  • Proves hybrid deterministic + LLM architecture can safely evaluate clinical competence without hallucination risk
  • RAG grounding makes feedback quality independent of model training data recency
  • Teacher Portal enables curriculum expansion at runtime — zero developer involvement
  • Architecture is transferable to other clinical procedures and domains

⚠️ Limitations

  • History evaluation pipeline at 77.54s P50 — driven by sequential external API calls
  • Communication Agent has 5% inconsistency rate on borderline transcripts (inherent LLM non-determinism)
  • Only two clinical wound care scenarios currently implemented
  • Depends entirely on external APIs — availability and cost constrain institutional deployment
  • No formal user study with real nursing students or clinical educators yet conducted

🚀 Future Work

  • Parallelise agent LLM calls to significantly reduce history evaluation latency
  • Expand to medication administration, IV cannulation, and other procedures
  • Implement adaptive difficulty that scales with student performance over time
  • Build longitudinal student progress dashboards for clinical educators
  • Conduct formal user studies with nursing students to validate pedagogical effectiveness

Explore the Project

Dive into the full codebase, explore the VR Unity assets, or reach out to the team to learn more about the system and collaboration opportunities.