VR Nursing AI · Multi-Agent LLM Framework

People

Team Members

Final-year undergraduate researchers from the Department of Computer Engineering, University of Peradeniya.

Malintha K.M.K.

E/20/243

Student Researcher

e20243@eng.pdn.ac.lk

Fernando A.I.

E/20/100

Student Researcher

e20100@eng.pdn.ac.lk

Wickramaarachchi P.A.

E/20/434

Student Researcher

e20434@eng.pdn.ac.lk

Supervisors

Mrs. Yasodha Vimukthi

Project Supervisor · Dept. of CE

yasodhav@eng.pdn.ac.lk

Dr. Upul Jayasinghe

Project Supervisor · Dept. of CE

upuljm@eng.pdn.ac.lk

Ms. Ramya Ekanayake

Project Supervisor · Faculty of Allied Health Sciences

Prof. Damayanthi Dassanayake

Project Supervisor · Faculty of Allied Health Sciences

Project Summary

Abstract

Automated evaluation in virtual reality (VR)-based nursing education presents significant challenges in delivering clinically accurate and pedagogically sound feedback comparable to expert nurse assessments. Existing AI-driven systems often lack reliability, transparency, and alignment with established nursing competency frameworks, limiting their effectiveness in high-stakes educational settings. This paper presents a multi-agent Large Language Model (LLM) framework that addresses these limitations through a hybrid deterministic and LLM architecture deployed in a VR wound care nursing simulation. Six specialised agents evaluate distinct dimensions of nursing competence simultaneously: patient interaction fidelity, procedural prerequisite compliance, clinical knowledge coverage, communication quality, material verification, and formative feedback synthesis. Safety-critical pass/fail decisions are governed exclusively by deterministic logic to prevent hallucination, while the LLM is used for educational explanation and feedback narration. All agent feedback is grounded in verified clinical guidelines through a Retrieval-Augmented Generation (RAG) pipeline. Results demonstrate that the hybrid multi-agent approach successfully decomposes complex nursing competence into evaluable dimensions while maintaining clinical safety guarantees.

Audience

Who Is This For?

This system bridges the gap between VR simulation technology and intelligent clinical education — serving multiple stakeholders in nursing education.

Nursing Students

Practice clinical wound care procedures in a safe, immersive VR environment and receive instant, detailed feedback on both clinical knowledge and communication quality — without waiting for a human supervisor.

Clinical Educators

Monitor student performance across sessions through the Teacher Portal — view per-student session logs, critical safety flags, step-by-step action timelines, and manage clinical scenarios with no code changes required.

Nursing Schools & Hospitals

Deploy scalable clinical training that augments human supervision. Reduce bottlenecks caused by limited supervisor availability while maintaining consistent, evidence-based assessment standards.

AI & Healthcare Researchers

Explore a replicable multi-agent LLM architecture with a clear hybrid deterministic-AI design pattern — applicable to any procedural clinical skill beyond wound care, including medication administration and IV cannulation.

Value Proposition

Key Benefits

What makes this system stand out from conventional nursing simulation tools.

Clinically Safe by Design

Safety-critical pass/fail decisions are governed by hardcoded deterministic logic, so the LLM cannot hallucinate a verdict. A student cannot pass a step with a skipped prerequisite because of an AI error.

Real-Time Feedback

Students receive voice-driven feedback during the simulation. Measured P50 latencies are about 1.2 s for action validation and 7.09 s for a spoken patient response (LLM + TTS), so feedback arrives without leaving the VR experience.

Evidence-Based Assessment

All feedback is grounded in authoritative clinical guidelines via RAG — not general AI training knowledge. Uploaded guidelines instantly enrich the knowledge base for all agents.

Educator-Controlled Content

Clinical educators manage scenarios, update guidelines, and review student performance through the Teacher Portal — no developers needed. New scenarios go live instantly.

Modular & Extensible

All six agents share a common BaseAgent interface. New agents, scenarios, or clinical procedures can be added without restructuring the core system architecture.

Multi-Voice Immersion

Three distinct TTS voices — patient, staff nurse, feedback narrator — create a believable, multi-character VR environment powered by Groq Orpheus v1 English.

Research Problem

The Problem We Solve

Nursing education today has critical gaps that prevent scalable, intelligent training at the quality required for clinical safety.

No Scalable Feedback

Traditional clinical training relies entirely on human supervisors, whose limited availability makes it hard to train large student cohorts effectively.

Passive VR Simulations

VR nursing simulations exist but lack intelligent real-time feedback. Students finish simulations with no understanding of what was done correctly or incorrectly.

Generic or Delayed Feedback

Students receive generic, delayed, or no feedback after simulation sessions, missing the window in which formative assessment can correct mistakes.

Research Question: How can we automate meaningful, clinically grounded feedback in a VR nursing training environment using Large Language Models — while maintaining the safety guarantees required in clinical education?

System Design

Methodology

Four core research objectives drove the design of this system.

Design a multi-agent LLM framework that evaluates multiple dimensions of nursing competence simultaneously using six specialised AI agents.

Integrate Retrieval-Augmented Generation (RAG) to ground all feedback in evidence-based clinical guidelines rather than the model's training knowledge alone.

Build a complete VR-connected backend with real-time voice interaction, automated step evaluation, and a teacher management portal for runtime content updates.

Evaluate the system rigorously across six complementary pillars: deterministic logic, integration, AI agent quality, performance, fault tolerance, and speech accuracy.

Clinical Scenarios

Training Scenarios

Two wound care scenarios are implemented as structured JSON documents in Firebase Firestore. The system is scenario-agnostic — new scenarios can be added at runtime via the Teacher Portal.

Scenario 001

Post-Operative Clean Wound

Mr. Sunil Perera, 52-year-old male with hypertension. Post-operative clean surgical wound on the left forearm. Known allergies to Penicillin and Latex. Standard wound healing risk profile.

Scenario 002

Diabetic Patient Wound Care

Patient with Type 2 Diabetes Mellitus. Elevated infection risk and impaired wound healing. Students are additionally expected to cover blood sugar control, HbA1c assessment, and diabetes-specific risk factors during history taking.

Clinical Workflow

Three-Step State Machine

The student's simulation is governed by a linear state machine — steps must be completed in strict order before the session reaches COMPLETED.

Step 1

History Taking

Student interviews the virtual AI patient using voice or text. Must confirm identity, check allergies, assess pain, take medical history, and explain the procedure. Diabetic patients require additional risk factor questions.

Step 2

Wound Assessment

Student answers Multiple Choice Questions about the wound shown in VR — wound type, anatomical location, exudate amount/type, tissue colour, and signs of infection. Evaluated deterministically.

Step 3

Cleaning & Dressing

Student performs nine sequential physical actions in VR: hand hygiene, trolley cleaning, solution and dressing selection and verification, materials arrangement, and trolley transport to patient.
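The strict ordering of the three steps can be sketched as a small linear state machine. This is a minimal illustration, not the project's implementation: the `Step` enum values and the `SessionStateMachine` class are hypothetical names chosen for the example.

```python
from enum import Enum


class Step(Enum):
    # Hypothetical identifiers for the three steps plus the terminal state.
    HISTORY = "history"
    ASSESSMENT = "assessment"
    CLEANING_AND_DRESSING = "cleaning_and_dressing"
    COMPLETED = "completed"


# Strict linear order: each step has exactly one successor.
ORDER = [Step.HISTORY, Step.ASSESSMENT, Step.CLEANING_AND_DRESSING, Step.COMPLETED]


class SessionStateMachine:
    def __init__(self) -> None:
        self.current = Step.HISTORY

    def complete(self, step: Step) -> Step:
        """Mark `step` as finished and advance; reject out-of-order requests."""
        if self.current is Step.COMPLETED:
            raise ValueError("session is already completed")
        if step is not self.current:
            raise ValueError(f"cannot complete {step.name} while in {self.current.name}")
        self.current = ORDER[ORDER.index(step) + 1]
        return self.current
```

With this shape, an attempt to finish Cleaning & Dressing before Wound Assessment raises an error instead of silently advancing the session.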

Multi-Agent Framework

Six Specialised Agents

Each agent extends a shared BaseAgent class wrapping the OpenAI Responses API. All agents have a non-LLM fallback to prevent cascading failures.
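The shared-interface-plus-fallback idea can be illustrated as follows. This is a sketch under assumptions: the real BaseAgent wraps the OpenAI Responses API, whereas here the model call is an injected function (`llm_call`, a hypothetical name) so the example runs without network access. The method names `run` and `fallback` and the fallback sentence are also invented for illustration.

```python
from abc import ABC, abstractmethod
from typing import Callable

# Hypothetical type: any function taking (system_prompt, user_input) and returning text.
# In the real system this role is played by a call to the OpenAI Responses API.
LLMCall = Callable[[str, str], str]


class BaseAgent(ABC):
    def __init__(self, llm_call: LLMCall, system_prompt: str) -> None:
        self._llm_call = llm_call
        self._system_prompt = system_prompt

    @abstractmethod
    def fallback(self, user_input: str) -> str:
        """Non-LLM response used when the model call fails."""

    def run(self, user_input: str) -> str:
        try:
            return self._llm_call(self._system_prompt, user_input)
        except Exception:
            # Contain the failure so one unavailable service does not break the session.
            return self.fallback(user_input)


class PatientAgent(BaseAgent):
    def fallback(self, user_input: str) -> str:
        # Illustrative canned reply; the actual fallback text is not given in the source.
        return "Sorry, could you please repeat that?"
```

Because every agent goes through the same `run` path, a new agent only has to supply its prompt and its own fallback behaviour.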

Agent 01 · History Step

Patient Agent

Simulates the virtual patient. Responses are strictly grounded in scenario data, with no invented facts. Temperature is set to 0.0 to minimise response variability. Sensitive information is disclosed only when the student explicitly asks for it.

temp = 0.0 · scenario-grounded

Agent 02 · Cleaning Step

Staff Nurse Agent

Conversational supervising nurse. Operates in two modes: guidance mode (step explanations when asked) and verification mode (triggered by keywords — returns approved / rejected / incomplete verdict).

guidance + verification

Agent 03 · Post-History

Knowledge Agent

RAG-grounded evaluation of the full history transcript. Returns a boolean checklist: identity, allergies, pain, medical history, procedure explained, and (for diabetic scenarios) risk factors assessed.

60% of History score

Agent 04 · Post-History

Communication Agent

Evaluates communication quality and style — self-introduction, empathy, open vs. closed questions, jargon avoidance, and turn count. Heuristic: ≥4 turns = Appropriate, ≥2 = Partially Appropriate.

40% of History score

Agent 05 · Cleaning Step · Hybrid

Clinical Agent

Hybrid architecture. Pass/fail is 100% deterministic via a hardcoded prerequisite map — the LLM cannot override the verdict. The LLM is invoked only to explain why a skipped step matters, personalised to patient risk profile.

deterministic verdict + LLM explanation
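The hybrid pattern described for the Clinical Agent can be sketched as below. The action names and the contents of `PREREQUISITES` are hypothetical; the source only states that a hardcoded prerequisite map exists. The `explain` callable stands in for the LLM explanation call and is likewise an assumption.

```python
from typing import Callable, Optional

# Hypothetical prerequisite map: action -> actions that must already be completed.
PREREQUISITES: dict[str, list[str]] = {
    "hand_hygiene": [],
    "clean_trolley": ["hand_hygiene"],
    "select_solution": ["clean_trolley"],
    "verify_solution": ["select_solution"],
}


def validate_action(action: str, completed: set[str]) -> tuple[bool, list[str]]:
    """Deterministic verdict: pass only if every prerequisite is completed.

    Raises KeyError for an action that is not in the map.
    """
    missing = [p for p in PREREQUISITES[action] if p not in completed]
    return (not missing, missing)


def evaluate(
    action: str,
    completed: set[str],
    explain: Optional[Callable[[str, list[str]], str]] = None,
) -> dict:
    # The verdict is fixed before any model is consulted.
    passed, missing = validate_action(action, completed)
    explanation = ""
    if not passed:
        # Static text is the default; the model may only replace the wording.
        explanation = f"Complete these steps first: {', '.join(missing)}."
        if explain is not None:
            try:
                explanation = explain(action, missing)
            except Exception:
                pass  # keep the static explanation if the model call fails
    return {"passed": passed, "missing": missing, "explanation": explanation}
```

The key design point is that `explain` never receives or returns the verdict: whatever text the model produces, `passed` is determined solely by the map.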

Agent 06 · Post-Step

Feedback Narrator Agent

Synthesises all raw agent outputs into one supportive student-facing paragraph. Acknowledges strengths first, embeds the score naturally, closes with encouragement. Explicitly avoids punitive language.

formative synthesis

Knowledge Retrieval

RAG Pipeline

How It Works

Student action or transcript triggers evaluation

Dynamic RAG query built from step, scenario & clinical context

HyDE generator creates hypothetical guideline paragraph

Semantic similarity search in OpenAI Vector Store

Relevant guideline passages injected into agent system prompt

Evidence-based, grounded feedback generated

📄 Diabetic Wound Care Guidelines

📄 History Taking Evaluation Guidelines

📄 Wound Cleaning & Dressing Guidelines
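The retrieval flow above can be approximated in a few lines. This sketch makes several substitutions so that it runs offline: a toy bag-of-words `embed` function and cosine similarity replace the OpenAI Vector Store search, and the HyDE generator is an injected callable (`hyde`) rather than an LLM call. All function names and the prompt layout are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_query(step: str, scenario: str, context: str) -> str:
    """Build the retrieval query from step, scenario and clinical context."""
    return f"{step} guidelines for {scenario}: {context}"


def retrieve(query: str, passages: list[str], hyde: Callable[[str], str], top_k: int = 2) -> list[str]:
    # HyDE: search with a hypothetical guideline paragraph instead of the raw query.
    hypothetical = hyde(query)
    target = embed(hypothetical)
    ranked = sorted(passages, key=lambda p: cosine(target, embed(p)), reverse=True)
    return ranked[:top_k]


def build_system_prompt(base_prompt: str, passages: list[str]) -> str:
    """Inject the retrieved passages into the agent's system prompt."""
    joined = "\n".join(f"- {p}" for p in passages)
    return f"{base_prompt}\n\nClinical guidelines:\n{joined}"
```

Searching with the hypothetical paragraph rather than the short query is the point of HyDE: the generated text tends to share vocabulary with real guideline passages, which helps similarity search.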

Scoring Model

History Step Rubric

Criterion	Weight (with risk factors)	Weight (without risk factors)
Identity confirmed	15%	15%
Allergies checked	25%	30%
Pain assessed	20%	20%
Medical history taken	20%	20%
Procedure explained	10%	15%
Risk factors assessed	10%	—

Knowledge Agent: 60% · Communication Agent: 40%
Excellent ≥85% · Good ≥70% · Adequate ≥50% · Needs Improvement <50%
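The rubric arithmetic can be worked through in code. The weights, the 60/40 split, and the band thresholds come from the table above; the dictionary keys and function names are hypothetical, and the sketch assumes the Communication Agent's result is available as a percentage between 0 and 100, which the source does not specify.

```python
# Criterion weights in percent, taken from the rubric table.
WEIGHTS_WITH_RISK = {
    "identity": 15, "allergies": 25, "pain": 20,
    "medical_history": 20, "procedure_explained": 10, "risk_factors": 10,
}
WEIGHTS_WITHOUT_RISK = {
    "identity": 15, "allergies": 30, "pain": 20,
    "medical_history": 20, "procedure_explained": 15,
}


def knowledge_score(checklist: dict[str, bool], has_risk_factors: bool) -> int:
    """Sum the weights of every criterion the checklist marks as met."""
    weights = WEIGHTS_WITH_RISK if has_risk_factors else WEIGHTS_WITHOUT_RISK
    return sum(w for name, w in weights.items() if checklist.get(name, False))


def history_score(knowledge_pct: float, communication_pct: float) -> float:
    """Combine the two agents: Knowledge 60%, Communication 40%."""
    # Assumption: communication_pct is already on a 0-100 scale.
    return (60 * knowledge_pct + 40 * communication_pct) / 100


def band(score: float) -> str:
    if score >= 85:
        return "Excellent"
    if score >= 70:
        return "Good"
    if score >= 50:
        return "Adequate"
    return "Needs Improvement"
```

For example, a student in the non-diabetic scenario who covers everything except allergies scores 70 on knowledge; with a communication score of 100 the combined result is 82, which falls in the Good band.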

Implementation

Technology Stack

Backend

FastAPI · Python

LLM Engine

OpenAI GPT

RAG Store

OpenAI Vector Stores

Speech-to-Text

Groq Whisper v3

Text-to-Speech

Groq Orpheus v1

Database

Firebase Firestore

Real-time

WebSockets + REST

VR Client

Unity

Teacher Portal

React + Vite

Unit Tests

pytest

Integration

FastAPI TestClient

Visualisation

Matplotlib + Seaborn

Evaluation

Results & Analysis

The system is evaluated with a six-pillar framework designed for AI-enabled educational systems. All evaluations are implemented as automated scripts for full reproducibility.

Unit Testing

Deterministic logic correctness via pytest — state machine, MCQ evaluator, scoring engine, and Clinical Agent prerequisite validation.

Integration Testing

Full API lifecycle and session flow via FastAPI TestClient — REST endpoints, WebSockets, RAG pipeline, student log persistence.

AI Agent Evaluation

Golden dataset approach with known ground truth labels. LLM-as-judge rubric for the Feedback Narrator Agent across five quality criteria.

Performance Testing

P50/P95 latency profiling via time.perf_counter() across 20 iterations per operation for all major system endpoints.

Reliability Testing

Fault injection by mocking all four external services — OpenAI LLM API, Vector Store, Groq STT/TTS, and Firebase Firestore.

Speech Interface

STT accuracy via Word Error Rate on nursing dialogue samples. TTS round-trip intelligibility test via re-transcription and comparison.
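The fault-injection approach described under Reliability Testing can be illustrated with `unittest.mock`. The wrapper function `transcribe_with_fallback`, the client's `transcribe` method, and the shape of the result dictionary are all hypothetical; only the idea of mocking an external service so that it fails comes from the text above.

```python
from unittest.mock import Mock


def transcribe_with_fallback(stt_client, audio: bytes) -> dict:
    """Hypothetical wrapper: return a transcript, or a degraded result if STT fails."""
    try:
        return {"ok": True, "text": stt_client.transcribe(audio)}
    except Exception as exc:
        # Report the failure in the response instead of letting it crash the session.
        return {"ok": False, "text": "", "error": type(exc).__name__}


def test_stt_outage_is_contained() -> None:
    # Inject the fault: every call to the mocked service raises.
    failing_client = Mock()
    failing_client.transcribe.side_effect = ConnectionError("STT unavailable")

    result = transcribe_with_fallback(failing_client, b"\x00\x01")

    assert result == {"ok": False, "text": "", "error": "ConnectionError"}
    failing_client.transcribe.assert_called_once()
```

A test of this kind passes only when the outage is handled, which is what a "0 crashes, 0 unhandled errors" result would require for each mocked service.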

100%

Unit Test Pass Rate
22 tests · 4 components

0.94

Knowledge Agent F1 Score
Precision 0.96 · Recall 0.93

1.00

Communication Agent Accuracy
95% consistency rate

100%

Fault Injection Recovery
0 crashes · 0 unhandled errors
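As a consistency check on the Knowledge Agent figures, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values: precision 0.96, recall 0.93.
print(round(f1_score(0.96, 0.93), 2))  # 0.94
```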

Performance Latency (P50 / P95)

Operation	P50 Latency	P95 Latency	Notes
Patient response (LLM + TTS)	7.09 s	—	Real-time VR interaction
Action validation — Clinical Agent	~1.2 s	~1.7 s	Deterministic path — near-instant
Nurse verification response	4.89 s	—	Acceptable for step pacing
History evaluation (RAG + 2 agents + narrator)	77.54 s	—	End-of-step batch — sequential API calls
MCQ evaluation (deterministic)	0.02 s	0.05 s	Near-instantaneous
STT transcription (Groq Whisper)	0.67 s	0.89 s	Suitable for real-time VR
TTS synthesis (Groq Orpheus)	0.82 s	0.98 s	Scales linearly with text length
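A latency profile of this kind can be collected with a small helper. The source names `time.perf_counter()` and 20 iterations per operation; how the percentiles were computed is not stated, so the use of `statistics.quantiles` with the inclusive method here is an assumption, as is the `profile` function name.

```python
import statistics
import time
from typing import Callable


def profile(operation: Callable[[], object], iterations: int = 20) -> dict[str, float]:
    """Time `operation` repeatedly and return P50 and P95 latency in seconds."""
    samples: list[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    # quantiles(n=100) returns 99 cut points; index 49 is P50 and index 94 is P95.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

With only 20 samples, P95 is interpolated between the two slowest runs, so it is sensitive to a single outlier; more iterations would give a steadier estimate.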

Speech Interface Evaluation

🎙️ Speech-to-Text (STT)

Model: Groq Whisper Large v3. Evaluated on recorded nursing dialogue samples with known ground truth transcripts using Word Error Rate (WER).

0.13

Average WER
(Word Error Rate)

0.67s

P50 Latency
(P95: 0.89s)

🔊 Text-to-Speech (TTS)

Model: Groq Orpheus v1 English. Round-trip intelligibility test: text → TTS → audio → STT → compare. Three distinct character voices.

0.13

Round-Trip WER
(intelligibility confirmed)

0.82s

P50 Latency
(P95: 0.98s)
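Word Error Rate is the word-level edit distance between a reference transcript and the recognised text, divided by the number of reference words. The sketch below implements that definition; the normalisation (lower-casing and whitespace splitting) is an assumption, since the source does not describe how transcripts were prepared.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For the round-trip test, the same function applies with the original text as the reference and the re-transcribed TTS audio as the hypothesis.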

Discussion

Impact, Limitations & Future Work

A transparent assessment of what this work achieves, where it falls short, and where it goes next.

✅ Impact & Contributions

To our knowledge, the first multi-agent LLM framework for automated formative feedback in VR nursing education
Demonstrates that a hybrid deterministic + LLM architecture can evaluate clinical competence while keeping pass/fail verdicts out of reach of LLM hallucination
RAG grounding makes feedback quality less dependent on the recency of the model's training data
Teacher Portal enables curriculum expansion at runtime — zero developer involvement
Architecture is transferable to other clinical procedures and domains

⚠️ Limitations

History evaluation pipeline at 77.54s P50 — driven by sequential external API calls
Communication Agent has 5% inconsistency rate on borderline transcripts (inherent LLM non-determinism)
Only two clinical wound care scenarios currently implemented
Depends entirely on external APIs — availability and cost constrain institutional deployment
No formal user study with real nursing students or clinical educators yet conducted

🚀 Future Work

Parallelise agent LLM calls to significantly reduce history evaluation latency
Expand to medication administration, IV cannulation, and other procedures
Implement adaptive difficulty that scales with student performance over time
Build longitudinal student progress dashboards for clinical educators
Conduct formal user studies with nursing students to validate pedagogical effectiveness
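The parallelisation item can be sketched with `asyncio.gather`. This assumes the Knowledge and Communication agents do not depend on each other's output, which the source suggests but does not state. The agents are simulated with `asyncio.sleep`, and the function names and delays are hypothetical.

```python
import asyncio
import time


async def fake_agent(name: str, seconds: float) -> str:
    """Stand-in for one agent's LLM call; sleeps instead of calling an API."""
    await asyncio.sleep(seconds)
    return f"{name} done"


async def evaluate_history_parallel() -> list[str]:
    # Both agents read the same transcript, so their calls can run concurrently.
    # A narrator step that needs both outputs would still run afterwards.
    return await asyncio.gather(
        fake_agent("knowledge", 0.2),
        fake_agent("communication", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(evaluate_history_parallel())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Run concurrently, the two simulated calls take roughly as long as the slower one rather than the sum of both, which is the saving this future-work item targets.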

Get Involved

Explore the Project

Dive into the full codebase, explore the VR Unity assets, or reach out to the team to learn more about the system and collaboration opportunities.


Resources

Links

Project Repository: github.com/cepdnaclk
Unity VR Assets: FYP-WoundCareSim-Unity
Project Page: cepdnaclk.github.io
Dept. of CE: ce.pdn.ac.lk
