BERTopic Mapping of Clinical Teaching Activities in Residency and Medical Internship Training
DOI:
https://doi.org/10.17161/sjm.v3i2.25497
Keywords:
clinical teaching, topic modeling, BERTopic, natural language processing, Chinese text mining, medical education, residency training, medical internship, large language model annotation, curriculum monitoring
Abstract
Introduction: Electronic teaching platforms record thousands of free-text descriptions of clinical teaching activities, but these narratives are rarely analyzed at scale. We applied BERTopic to discover the topic structure of teaching narratives in two parallel training systems and checked the encoding reproducibility of the resulting taxonomy with zero-shot annotation by independent commercial large language models (LLMs).
Methods: We analyzed 4,811 de-identified records (2,890 residency, 1,921 medical internship) from the CCMTV platform at a single-center tertiary teaching hospital (2022–2025). After jieba tokenization and 46,313 PHI placeholder substitutions, BERTopic with BAAI/bge-base-zh-v1.5 embeddings was fit per system and reduced to 24 topics each. Coherence (c_v, NPMI), topic diversity, 5-seed bootstrap (ARI/NMI), and a min_cluster_size grid characterized robustness. Two LLMs (Codex GPT-5.2; Gemini 3) annotated 200 stratified-sampled records under identical zero-shot prompts; inter-LLM Cohen’s κ quantified encoding reproducibility. Cross-system topic correspondence used cosine similarity of topic centroids with Hungarian matching.
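The inter-LLM agreement statistic named in the Methods can be illustrated with a minimal sketch of Cohen's κ for two annotators. The label lists below are invented toy data, not study records, and the function names are hypothetical. The rule used for "excluding -1 residuals" (dropping a record when either annotator assigned -1) is an assumption; the abstract does not state how the exclusion was applied.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same records."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of records given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def drop_outliers(labels_a, labels_b, outlier=-1):
    """Assumed exclusion rule: drop a record if either annotator chose -1."""
    kept = [(a, b) for a, b in zip(labels_a, labels_b)
            if a != outlier and b != outlier]
    return [a for a, _ in kept], [b for _, b in kept]


# Toy topic assignments from two hypothetical annotators (not study data).
llm_1 = [0, 0, 1, 1, -1, 2]
llm_2 = [0, 0, 1, 2, -1, 2]

kappa_all = cohen_kappa(llm_1, llm_2)
kappa_no_outliers = cohen_kappa(*drop_outliers(llm_1, llm_2))
```

On real annotations, a library routine such as scikit-learn's `cohen_kappa_score` would normally be preferred; the hand-written version is shown only to make the observed-versus-chance agreement calculation explicit.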
Results: Reduced models reached c_v 0.505–0.548, diversity 0.86–0.90, outlier rates 14–23%, ARI 0.60–0.67, NMI ≥ 0.90. Inter-LLM Cohen’s κ = 0.709 [95% CI 0.64–0.78]; 0.878 [0.82–0.92] excluding -1 residuals. Cross-system Hungarian matches yielded cosine 0.783±0.132; 17 of 24 pairs (71%) reached ≥ 0.70, but a 1000-permutation merged-relabel null showed this count was not significantly above chance (p = 0.76), supporting descriptive interpretation only. Per-topic binomial GLM (year predictor, BH-FDR over 24 topics per system) identified significant year trends in 17 of 24 residency and 15 of 24 internship topics.
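The BH-FDR step mentioned in the Results corrects one p-value per topic across the 24 topics in a system. A minimal sketch of the Benjamini-Hochberg adjustment follows; the p-values are invented for illustration, the function name is hypothetical, and the 0.05 threshold is an assumption, since the abstract does not report the level used.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return BH-adjusted p-values and rejection flags, in input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    rejected = [q <= alpha for q in adjusted]
    return adjusted, rejected


# Illustrative per-topic p-values (made up; not results from the study).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
adjusted_p, significant = benjamini_hochberg(toy_p)
n_significant = sum(significant)
```

In this toy example three raw p-values fall between 0.039 and 0.042, yet none of them survives adjustment, which is the kind of filtering the correction applies before topics are counted as showing a year trend.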
Conclusions: BERTopic combined with a multi-LLM encoding-reproducibility check offers a scalable, descriptive framework for monitoring the topic structure of clinical teaching and screening for potential topic-share changes, though population-level overlap claims should be read cautiously given the permutation-null result.
Data Availability Statement
The aggregated topic-quality reports, topic labels (Chinese and English), cross-system mapping summaries, temporal-drift summaries, prompt templates, complete stop-word list, custom lexicon, and the de-identified 200-document validation set are available on reasonable request from the corresponding author, subject to institutional data governance restrictions on the underlying primary teaching activity records.
License
Copyright (c) 2026 Chujie Chen, Jun Li (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.