Development of a Named Entity Framework for Thyroid Cancer Staging and Risk Level Classification using Local, Lightweight Large Language Models


Abstract Description

Submission ID :

HAC228

Submission Type

Authors (including presenting author) :

Fung MMH*(1), Tang EHM*(1)(2), Wu T*(2), Luk Y(1), Au ICH(1), Liu X(1)(2), Lee VHF(1), Wong CK(1), Wei Z(2), Cheng WY(2), Tai ICY(1), Ho JWK(1)(2), Wong JWH(1), Lang BHH(1), Leung KSM(1)(2)(3)(4), Wong ZSY(5), Wu JTK†(1)(2)(3)(4), Wong CKH†(1)(2)(3)(6) *Co-first author, †Co-corresponding author

Affiliation :

(1) The University of Hong Kong, (2) Laboratory of Data Discovery for Health, Hong Kong Science Park, (3) The Hong Kong Jockey Club Global Health Institute, HKSAR, China, (4) The University of Hong Kong – Shenzhen Hospital, Shenzhen, China, (5) Graduate School of Public Health, St. Luke’s International University, (6) Department of Infectious Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, United Kingdom

Introduction :

For patients with cancer, the disease information needed to classify cancer stage and risk level is usually stored in lengthy unstructured or semi-structured clinical notes. Retrieving it from multiple notes takes considerable time and labour, which may delay healthcare delivery and lead to suboptimal patient outcomes.

Objectives :

(1) To develop a named entity (NE) framework that efficiently extracts disease information and classifies American Thyroid Association (ATA) risk and American Joint Committee on Cancer (AJCC)/TNM staging in patients with thyroid cancer; (2) to evaluate the performance of the NE framework using offline, lightweight, pre-trained large language models (LLMs), which preserve data privacy.

Methodology :

The NE framework comprises 1) an annotation guideline co-developed by clinicians and researchers, 2) annotation by two independent annotators, 3) ground-truth labelling by clinicians, 4) prompt design with various strategies, including chain-of-thought and few-shot prompting, for local LLMs to extract information from clinical notes, and 5) classification rules that determine cancer staging and risk level from the LLM outputs. Fifty representative cases with different cancer stages and one pathology report each were selected from The Cancer Genome Atlas Thyroid Cancer (TCGA-THCA) database as the development set. The remaining 289 TCGA-THCA cases, together with 35 pseudo cases that each had at least one pathology report and one operation record, were used for validation. Four common LLMs, namely Mistral-7B-Instruct-v0.3, Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct and Qwen-2.5-7B-Instruct, were used to evaluate the performance of the NE framework.
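Steps 4 and 5 above can be illustrated with a minimal Python sketch. It is not the study's implementation: the prompt wording, the few-shot example and the function names `build_prompt` and `ajcc8_stage_dtc` are hypothetical. The abstract does not state which AJCC edition or which rules were used, so the sketch assumes the AJCC 8th-edition stage grouping for differentiated thyroid cancer purely as an example of a rule applied to extracted T, N and M entities.

```python
# Hypothetical few-shot example; not taken from the study's prompts.
FEW_SHOT = [
    (
        "Papillary carcinoma, 1.5 cm, confined to thyroid. No nodes involved.",
        '{"T": "T1b", "N": "N0", "M": "M0"}',
    ),
]


def build_prompt(note: str) -> str:
    """Assemble a few-shot, chain-of-thought style extraction prompt."""
    parts = [
        "Extract the T, N and M categories from the clinical note.",
        "Think step by step, then give the answer as JSON on the last line.",
    ]
    for example_note, answer in FEW_SHOT:
        parts.append(f"Note: {example_note}\nAnswer: {answer}")
    parts.append(f"Note: {note}\nAnswer:")
    return "\n\n".join(parts)


def ajcc8_stage_dtc(age: int, t: str, n: str, m: str) -> str:
    """Map extracted entities to a stage group.

    Assumption: AJCC 8th edition, differentiated thyroid cancer only.
    """
    if age < 55:
        # Under 55, only distant metastasis changes the stage group.
        return "II" if m == "M1" else "I"
    if m == "M1":
        return "IVB"
    if t.startswith("T4b"):
        return "IVA"
    if t.startswith("T4a"):
        return "III"
    if t.startswith("T3"):
        return "II"
    if t.startswith(("T1", "T2")):
        # For T1/T2 tumours the stage depends on nodal status.
        return "II" if n.startswith("N1") else "I"
    raise ValueError(f"Unsupported T category: {t!r}")
```

Keeping the rule step as plain code, separate from the LLM call, means the model only has to extract entities; the mapping to a stage stays deterministic and can be checked by clinicians.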

Result & Outcome :

Inter-annotator agreement (kappa) was 84.3%, indicating satisfactory agreement. An ensemble-like majority-vote strategy achieved F1-scores of 94.1% for AJCC and 100.0% for ATA in the development set, and 90.4-98.1% for AJCC and 88.5-95.5% for ATA in the validation set. The proposed NE framework, using offline, lightweight, pre-trained LLMs, is secure, efficient and accurate in extracting disease information from clinical notes for the classification of risk and cancer staging.
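For readers unfamiliar with the two quantities reported here, the following Python sketch shows one way a majority vote across model outputs and a kappa statistic between two annotators could be computed. It rests on assumptions: the abstract does not say how ties between the four models were resolved or which kappa variant was used, so the sketch breaks ties by first occurrence and computes Cohen's kappa. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def majority_vote(predictions: Sequence[str]) -> str:
    """Return the most frequent label among the per-model predictions.

    Assumption: ties go to the label that appears first in the input.
    """
    if not predictions:
        raise ValueError("No predictions to vote on")
    return Counter(predictions).most_common(1)[0][0]


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("Annotations must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's label frequencies.
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


# Toy data for illustration only; not from the study.
stage = majority_vote(["II", "I", "II", "III"])
kappa = cohen_kappa(["I", "I", "II", "II"], ["I", "II", "II", "II"])
```

With an even number of models, ties are possible, so the tie-breaking rule should be fixed in advance in any real pipeline.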