| Program code | CAER |
| Level | Advanced (post-foundation specialisation) |
| Format | 12 weeks · part-time · cohort-based |
| Prerequisite | HEALTHCARE AI TRAINER & DATA ANNOTATION PROGRAM |
| Awarded by | Shevs Connect Institute (SCI) |
Introduction
Healthcare AI is no longer only about prediction. A new generation of large language models and multimodal clinical assistants now generates text, summarises records, answers patient questions and supports clinical reasoning. For these systems, the most valuable human contribution is no longer drawing boxes on images; it is judging the quality, safety and honesty of what the model produces, and using that judgement to shape the model’s behaviour. The Clinical AI Evaluation & RLHF Specialist Program trains learners to do exactly this.
Reinforcement Learning from Human Feedback (RLHF) and related techniques sit at the heart of modern AI alignment. Behind every well-behaved clinical assistant is a team of skilled human evaluators writing rubrics, grading responses, ranking outputs, building preference data and red-teaming the model for unsafe behaviour. This program demystifies that pipeline and turns learners into rigorous, employable evaluation specialists who understand both the craft and the clinical stakes.
Delivered by Shevs Connect Institute (SCI), the program assumes completion of the HEALTHCARE AI TRAINER & DATA ANNOTATION PROGRAM and builds on that foundation with a deep focus on generative AI, evaluation methodology, preference data and AI safety in healthcare contexts. Graduates are prepared to contribute to RLHF and model-evaluation projects on SCILabel and with global AI data partners.
| Complete this first | HEALTHCARE AI TRAINER & DATA ANNOTATION PROGRAM |
This specialist program is designed to be taken after the foundation program above. The foundation course establishes the core concepts and working practices that this program builds upon.
Learning Outcomes
On successful completion of this program, graduates will be able to:
- Explain how modern clinical AI systems and large language models work at a level sufficient to evaluate them.
- Describe the full RLHF pipeline (supervised fine-tuning, reward modelling, policy optimisation) and the human’s role at each stage.
- Design clear, operational evaluation rubrics for helpfulness, harmlessness and honesty in clinical settings.
- Grade individual model responses consistently against a guideline.
- Produce high-quality pairwise and multi-response preference data, including rationale capture.
- Recognise and mitigate annotation artifacts such as length bias, sycophancy and reward hacking.
- Measure inter-rater reliability for subjective evaluation tasks and improve calibration.
- Conduct structured red-teaming of clinical AI and score harms by severity.
- Test models for bias, fairness and equity across patient populations.
- Plan and run an end-to-end evaluation campaign and report results to inform model decisions.
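As an illustration of the inter-rater reliability outcome above, here is a minimal Python sketch of Cohen's kappa for two evaluators grading the same responses. It is not taken from the course materials: the function name, the "safe"/"unsafe" labels and the sample grades are all hypothetical, and real projects may prefer a different statistic (for example one that handles more than two raters).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement, estimated from each rater's own label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical grades from two evaluators on six model responses.
grades_a = ["safe", "safe", "unsafe", "safe", "unsafe", "safe"]
grades_b = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
kappa = cohens_kappa(grades_a, grades_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is often preferred over a simple percentage when one label dominates the data.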
Course Features
- Six in-depth modules totalling 60 structured lessons plus 5 major hands-on assignments.
- Practical exercises grading and ranking real model outputs against rubrics.
- Hands-on red-teaming labs with a severity-scoring framework for clinical harms.
- Calibration sessions and inter-rater reliability analysis that mirror professional RLHF teams.
- Coverage of current methods including RLHF, RLAIF, constitutional approaches and DPO.
- A portfolio-ready capstone: a complete evaluation campaign with a written report.
- Direct pathway into evaluation and RLHF projects on SCILabel and with partners.
- A Certificate of Completion in Clinical AI Evaluation & RLHF from Shevs Connect Institute.
Curriculum
- 6 Sections
- 60 Lessons
- 10 Weeks
- Section 1 (Lessons 1-10): Foundations of Clinical AI & Evaluation. Understand what you are evaluating: how clinical AI systems work and why human judgement is indispensable.
- 1.1 Lesson 1: From predictive to generative AI in healthcare
- 1.2 Lesson 2: Anatomy of a large language model and a clinical AI system
- 1.3 Lesson 3: Why human evaluation is essential to safe AI
- 1.4 Lesson 4: The evaluation lifecycle: an overview
- 1.5 Lesson 5: Capabilities versus safety versus alignment
- 1.6 Lesson 6: Clinical use cases: triage, documentation and decision support
- 1.7 Lesson 7: Failure modes: hallucination, omission and overconfidence
- 1.8 Lesson 8: Evaluation metrics 101: helpful, harmless and honest
- 1.9 Lesson 9: The evaluator’s mindset and common cognitive biases
- 1.10 Lesson 10: Regulatory and ethical context for clinical AI evaluation
- Section 2 (Lessons 11-20): Human Feedback & RLHF Fundamentals. See the full RLHF pipeline end to end and locate exactly where human annotators create value.
- 2.1 Lesson 11: What RLHF is and why it works
- 2.2 Lesson 12: The RLHF pipeline: SFT, reward model and policy optimisation
- 2.3 Lesson 13: Supervised fine-tuning data and writing demonstrations
- 2.4 Lesson 14: Preference data and comparison labelling
- 2.5 Lesson 15: Reward models explained for annotators
- 2.6 Lesson 16: Policy optimisation (PPO) explained conceptually
- 2.7 Lesson 17: RLAIF and constitutional approaches to feedback
- 2.8 Lesson 18: Direct Preference Optimisation (DPO) and newer methods
- 2.9 Lesson 19: The role of human annotators across the pipeline
- 2.10 Lesson 20: Why data quality has an outsized impact on model behaviour
- Section 3 (Lessons 21-30): Rubrics, Guidelines & Prompt Evaluation. Learn to turn fuzzy quality judgements into consistent, repeatable evaluation work.
- 3.1 Lesson 21: Designing effective evaluation rubrics
- 3.2 Lesson 22: Defining helpfulness, harmlessness and honesty operationally
- 3.3 Lesson 23: Likert scales, binary judgements and rankings
- 3.4 Lesson 24: Writing unambiguous annotation guidelines
- 3.5 Lesson 25: Running calibration sessions with worked examples
- 3.6 Lesson 26: Single-response grading workflows
- 3.7 Lesson 27: Prompt taxonomy and designing for coverage
- 3.8 Lesson 28: Domain-specific clinical evaluation criteria
- 3.9 Lesson 29: Handling subjectivity and legitimate disagreement
- 3.10 Lesson 30: Documentation and guideline versioning
- Section 4 (Lessons 31-40): Preference Data, Ranking & Reward Modelling. Produce the comparison data that teaches models what 'better' means, without introducing bias.
- 4.1 Lesson 31: Pairwise comparison fundamentals
- 4.2 Lesson 32: Multi-response ranking workflows
- 4.3 Lesson 33: Handling ties and “both responses are poor” cases
- 4.4 Lesson 34: Capturing rationale and chain-of-thought feedback
- 4.5 Lesson 35: Detecting and avoiding reward hacking
- 4.6 Lesson 36: Length bias, sycophancy and other annotation artifacts
- 4.7 Lesson 37: Inter-rater reliability for preference data
- 4.8 Lesson 38: Building balanced, representative preference datasets
- 4.9 Lesson 39: From human annotations to reward-model training (overview)
- 4.10 Lesson 40: Evaluating reward-model quality
- Section 5 (Lessons 41-50): Red-Teaming, Safety & Clinical Risk. Probe models for unsafe behaviour and learn to score clinical harms responsibly.
- 5.1 Lesson 41: What red-teaming for AI involves
- 5.2 Lesson 42: Adversarial prompting techniques
- 5.3 Lesson 43: Awareness of jailbreaks and prompt injection
- 5.4 Lesson 44: A taxonomy of clinical safety harms
- 5.5 Lesson 45: Medical misinformation and harm-severity scoring
- 5.6 Lesson 46: Evaluating dangerous capabilities in health contexts
- 5.7 Lesson 47: Bias, fairness and equity testing
- 5.8 Lesson 48: Sensitive edge cases: emergencies, mental health and self-harm content
- 5.9 Lesson 49: Escalation, documentation and responsible disclosure
- 5.10 Lesson 50: Building a structured red-team test suite
- Section 6 (Lessons 51-60): Production RLHF Pipelines, Metrics & Deployment. Bring it together: run a real evaluation campaign and interpret results that guide model decisions.
- 6.1 Lesson 51: Designing an evaluation campaign end to end
- 6.2 Lesson 52: Sampling strategies and sourcing prompts
- 6.3 Lesson 53: Annotation tooling for RLHF at scale
- 6.4 Lesson 54: Balancing throughput, quality and cost
- 6.5 Lesson 55: Aggregating human judgements into metrics
- 6.6 Lesson 56: A/B model comparison and win-rate analysis
- 6.7 Lesson 57: Monitoring clinical AI in deployment
- 6.8 Lesson 58: Continuous evaluation and regression testing
- 6.9 Lesson 59: Audit trails, reproducibility and governance
- 6.10 Lesson 60: Capstone planning: scoping an evaluation and feedback project
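To give a flavour of the aggregation and win-rate topics listed in Section 6, the sketch below turns a set of pairwise human judgements into a win rate with a Wilson score interval. It is an illustrative example only, not the program's own tooling: the function names, the "A"/"B"/"tie" encoding, the choice to exclude ties and the sample judgements are all assumptions.

```python
import math


def win_rate(judgements, model="A"):
    """Return (win rate, decisive count) for `model`; ties are excluded."""
    wins = sum(1 for j in judgements if j == model)
    ties = sum(1 for j in judgements if j == "tie")
    decisive = len(judgements) - ties
    if decisive == 0:
        raise ValueError("No decisive comparisons to score")
    return wins / decisive, decisive


def wilson_interval(p, n, z=1.96):
    """Wilson score interval for a proportion p observed over n trials."""
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical judgements: which model's response the evaluator preferred.
judgements = ["A", "A", "B", "tie", "A", "B", "A", "A"]
rate, decisive = win_rate(judgements, model="A")
low, high = wilson_interval(rate, decisive)
print(f"Model A win rate: {rate:.2f} over {decisive} decisive comparisons")
print(f"Approximate 95% interval: {low:.2f} to {high:.2f}")
```

With so few comparisons the interval is wide, which is the practical point: a headline win rate should be reported alongside its sample size and uncertainty before it informs a model decision.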