BERTopic Mapping of Clinical Teaching Activities in Residency and Medical Internship Training

Chujie Chen; Jun Li

doi:10.17161/sjm.v3i2.25497

Authors

Chujie Chen, The Seventh Affiliated Hospital of Sun Yat-sen University, https://orcid.org/0000-0001-8985-6654
Jun Li, https://orcid.org/0000-0001-8985-6654

DOI:

https://doi.org/10.17161/sjm.v3i2.25497

Keywords:

clinical teaching, topic modeling, BERTopic, natural language processing, Chinese text mining, medical education, residency training, medical internship, large language model annotation, curriculum monitoring

Abstract

Introduction: Electronic teaching platforms record thousands of free-text descriptions of clinical teaching activities, but these narratives are rarely analyzed at scale. We applied BERTopic to discover the topic structure of teaching narratives in two parallel training systems and subjected the resulting taxonomy to a zero-shot encoding-reproducibility check using two independent commercial large language models (LLMs).

Methods: We analyzed 4,811 de-identified records (2,890 residency, 1,921 medical internship) from the CCMTV platform at a single-center tertiary teaching hospital (2022–2025). After jieba tokenization and 46,313 PHI placeholder substitutions, BERTopic with BAAI/bge-base-zh-v1.5 embeddings was fit per system and reduced to 24 topics each. Coherence (c_v, NPMI), topic diversity, 5-seed bootstrap (ARI/NMI), and a min_cluster_size grid characterized robustness. Two LLMs (Codex GPT-5.2; Gemini 3) annotated 200 stratified-sampled records under identical zero-shot prompts; inter-LLM Cohen’s κ quantified encoding reproducibility. Cross-system topic correspondence used cosine similarity of topic centroids with Hungarian matching.
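As an illustration of the inter-LLM agreement statistic named in the Methods, the following is a minimal sketch of Cohen's κ for two annotators who each assign one topic label per record. The function name, the toy labels, and the rule for excluding -1 records (dropping a record if either annotator assigned -1) are assumptions made for this example; the abstract does not specify the authors' implementation or exclusion rule.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same records."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of records given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy topic labels (hypothetical); -1 stands for the BERTopic outlier class.
annotator_1 = [3, 3, 7, -1, 7, 12, 3, -1]
annotator_2 = [3, 7, 7, -1, 7, 12, 3, 5]

kappa_all = cohens_kappa(annotator_1, annotator_2)

# Assumed exclusion rule: drop a record if either annotator labeled it -1.
kept = [(a, b) for a, b in zip(annotator_1, annotator_2) if a != -1 and b != -1]
kappa_no_outliers = cohens_kappa([a for a, _ in kept], [b for _, b in kept])
```

In this toy data the two annotators disagree on one of the two records involving -1, so removing those records raises κ, which matches the direction of the difference reported in the Results.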

Results: Reduced models reached c_v 0.505–0.548, diversity 0.86–0.90, outlier rates 14–23%, ARI 0.60–0.67, and NMI ≥ 0.90. Inter-LLM Cohen’s κ was 0.709 [95% CI 0.64–0.78], rising to 0.878 [0.82–0.92] when -1 residuals were excluded. Cross-system Hungarian matches yielded a cosine similarity of 0.783 ± 0.132; 17 of 24 pairs (71%) reached ≥ 0.70, but a 1,000-permutation merged-relabel null showed that this count was not significantly above chance (p = 0.76), supporting descriptive interpretation only. Per-topic binomial GLMs (year as predictor, BH-FDR over 24 topics per system) identified significant year trends in 17 of 24 residency topics and 15 of 24 internship topics.
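The BH-FDR step applied to the per-topic p-values can be sketched as follows. This is a generic Benjamini-Hochberg adjustment, not the authors' code; the function name, the toy p-values, and the 0.05 threshold are assumptions for illustration, since the abstract does not state the significance level used.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and rejection flags, in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


# Toy p-values for four hypothetical topics (the study adjusts over 24 per system).
toy_p = [0.001, 0.04, 0.03, 0.20]
adjusted, rejected = benjamini_hochberg(toy_p, alpha=0.05)
```

Here only the first toy topic remains significant after adjustment, although three of the four raw p-values fall below 0.05, which shows why the correction matters when 24 topics are tested per system.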

Conclusions: BERTopic combined with an inter-LLM encoding-reproducibility check offers a scalable, descriptive framework for monitoring the topic structure of clinical teaching and screening for potential topic-share changes, though population-level overlap claims should be read cautiously given the permutation-null result.
