Introduction
Cutaneous melanoma is the fifth most common cancer in both the United States and the Netherlands1,2. Over the last decades, the incidence rate of melanoma has consistently increased. In 2023, more than 97,000 people in the United States and more than 8000 people in the Netherlands were diagnosed with this type of skin cancer. Because of the relatively high risk of metastasis in melanoma patients, after which the prognosis significantly worsens, it is crucial to combine early detection, correct diagnosis, and fast treatment3.
Melanocytic skin lesions are clusters of pigment-producing cells called melanocytes, which range in biological behavior from benign (nevus, also known as mole) through intermediate (melanocytoma) to malignant (melanoma)4. Most of these lesions are benign and can be confidently diagnosed based only on microscopic examination of hematoxylin and eosin (H&E)-stained tissue cross-sections. However, differentiating between some subtypes of melanocytic tumors is more challenging, with studies reporting low inter- and intra-observer agreement5,6. Particularly for these ambiguous cases, immunohistochemical (IHC) staining and molecular testing can be helpful in making the correct diagnosis, at the cost of prolonging the turnaround time by days to weeks7,8. While all melanocytic lesions should ideally be examined by specialized dermatopathologists, in many centers general pathologists also have to contribute to the examination of these lesions. To make matters worse, the workload for pathologists is expected to increase further due to a growing volume of cases and the need for more comprehensive diagnoses9.
The transition to fully digital pathology departments enables the implementation of artificial intelligence (AI) models for workflow optimization to reduce the workload of pathologists and improve patient care9,10. Research on AI models for melanocytic lesions has focused on discriminating nevi from melanomas for diagnostic assistance11,12,13,14,15,16,17. Most of these models have only been developed and evaluated using between a hundred and a thousand specimens, covering a limited range of melanocytic lesion subtypes, if the subtypes were reported at all. Despite the strong discriminative performance that was shown in many studies, none have estimated or demonstrated benefit in clinical practice yet, raising the question whether other use cases could bring more clinical benefit in the short term.
One promising direction is the application of AI models for automated triaging of cases before initial examination by a pathologist. Melanocytic lesion diagnostics is an attractive domain for triaging because of the substantial caseload with lesions ranging in diagnostic complexity. Categorizing all incoming cases based on the predicted complexity or urgency can be leveraged to reduce turnaround time by alleviating several workflow bottlenecks: (1) direct referral of all high complexity cases to the pathologist with most expertise can prevent double examinations; (2) prioritizing high over low urgency cases can minimize delay for the cases for which it is most important; (3) directly ordering additional IHC staining or molecular testing for high complexity cases can shorten the time to a definitive diagnosis and therefore treatment18. Whereas the first use case may only benefit more general pathology departments, the second and third use cases can also be advantageous in specialized centers with only dermatopathologists. Moreover, the risk of diagnostic error due to the application of AI models for triaging is expected to be low, as all cases are still examined by a pathologist according to the current standard of practice, resulting in a lower threshold for clinical acceptance and integration than AI models for assisted or automated diagnosis. Furthermore, enabling pathologists to start the day with all high complexity cases may lower the risk of diagnostic error by mitigating fatigue19.
To this end, we present an AI model for triaging cutaneous melanocytic lesions based on H&E-stained whole slide images (WSIs) (Fig. 1). This study is one of the first to investigate AI-based triaging for workflow optimization in digital pathology departments. The model was developed and validated based on a retrospective cohort of 27,167 unique specimens comprising 52,202 WSIs, which is, to the best of our knowledge, the largest melanocytic lesion dataset to date. Moreover, we publicly release the code and trained model parameters, enabling other researchers to reuse and build upon our work.
Fig. 1: Overview of the methodology. Each case, usually consisting of multiple slides, was digitized to obtain whole slide images. Tissue regions were segmented in each image to guide the tessellation, resulting in one set of tiles for the entire case. All tiles were converted to feature vectors using the encoder of the Hierarchical Image Pyramid Transformer. An ensemble of five Vision Transformers was used to predict the diagnostic complexity of the case based on the set of extracted feature vectors for assignment to a general or expert pathologist.
Results
Dataset statistics
A retrospectively collected dataset from the University Medical Center Utrecht, the Netherlands, was used for development and evaluation of the AI-based triaging model. After curation, the dataset consisted of 27,167 unique specimens of cutaneous melanocytic lesions acquired from 20,707 patients. In total, 52,202 H&E-stained WSIs were collected for these specimens, which occupied 23.1 terabytes of image data. The melanocytic lesions were subdivided into a high complexity (13.4%) and low complexity category (86.6%) for triaging. The median patient age for the high complexity category was 53 (interquartile range, 30; range, 1–96) and 53.5% were female. In comparison, the median patient age for the low complexity category was 36 (interquartile range, 22; range, 0–97) and 59.8% were female. The distribution of diagnoses per category is provided in Table 1. A subset of 889 specimens (3.3%) were consultation or revision cases, of which 81.3% were of high diagnostic complexity. Preprocessing of the complete dataset resulted in a total of 1,584,976 tiles of 4096 × 4096 pixels extracted from tissue regions of the WSIs, with a median of 32 (range, 1–3726) tiles per specimen.
The dataset was split on a patient level into a set for model development and a set for evaluation. The development set contained 80% of the patients (21,730 specimens with 13.6% in the high complexity category). The remaining 20% of the patients were assigned to the test set for independent evaluation of the final model performance. The test set was divided into two parts: (1) All specimens that reflect the same distribution as the development set (4957 specimens with 13.5% in the high complexity category) for evaluation of the in-distribution performance; (2) All specimens with non-melanocytic skin pathologies in addition to a melanocytic lesion (480 specimens with 8.0% in the high complexity category), which were set aside from the start to study model robustness. Because the model was not presented with non-melanocytic skin pathologies during training, these cases are considered to be outside of the development data distribution, and the results on this set are referred to as the out-of-distribution performance.
Predictive performance and calibration
The predictive performance of the AI model was measured in terms of the area under the receiver operating characteristic curve (AUROC), the area under the precision–recall curve (AUPRC), and the specificity at thresholds resulting in a sensitivity of 0.95, 0.98, and 0.99. Model calibration was assessed using reliability diagrams and quantified using the expected calibration error (ECE), which measures to what extent the predicted probability reflects the true correctness likelihood20. Stratified bootstrapping (R = 10,000 samples) was used to calculate 95% confidence intervals (CIs). Model evaluation was performed on the test sets with data combined from two scanner types unless otherwise specified.
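The threshold-based metrics and the bootstrap procedure described above can be illustrated with a minimal, standard-library sketch. It is not the evaluation code released with this study: the function names are hypothetical, the ECE shown is one common binary variant with ten equal-width bins (the binning used in the study is not stated here), and the number of resamples is reduced for brevity.

```python
import math
import random


def specificity_at_sensitivity(labels, probs, target_sensitivity):
    """Specificity at the highest threshold whose sensitivity meets the target.

    labels: 1 = high complexity (positive), 0 = low complexity (negative).
    A case is predicted positive when its probability is >= the threshold.
    """
    positives = sorted((p for y, p in zip(labels, probs) if y == 1), reverse=True)
    negatives = [p for y, p in zip(labels, probs) if y == 0]
    # Smallest number of positives that must be detected to reach the target.
    needed = max(1, math.ceil(target_sensitivity * len(positives) - 1e-9))
    threshold = positives[needed - 1]
    true_negatives = sum(p < threshold for p in negatives)
    return true_negatives / len(negatives), threshold


def expected_calibration_error(labels, probs, n_bins=10):
    """Weighted mean gap between predicted and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        bins[min(int(p * n_bins), n_bins - 1)].append((y, p))
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(y for y, _ in members) / len(members)
            predicted = sum(p for _, p in members) / len(members)
            ece += len(members) / len(labels) * abs(observed - predicted)
    return ece


def stratified_bootstrap_ci(labels, probs, metric, n_resamples=1000,
                            alpha=0.05, seed=0):
    """Percentile CI; positives and negatives are resampled separately."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    values = []
    for _ in range(n_resamples):
        idx = rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
        values.append(metric([labels[i] for i in idx], [probs[i] for i in idx]))
    values.sort()
    lower = values[int(alpha / 2 * n_resamples)]
    upper = values[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Resampling within each class keeps the prevalence of high complexity cases fixed in every bootstrap sample, which avoids resamples without positives when the positive class is small.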
The results of the AI-based triaging model evaluated on the in-distribution test set are shown in Fig. 2. The model reached an AUROC of 0.966 (95% CI, 0.960–0.972) and an AUPRC of 0.857 (95% CI, 0.836–0.877). Furthermore, the model achieved specificity values of 0.657 (95% CI, 0.539–0.718) at a sensitivity of 0.99, 0.714 (95% CI, 0.672–0.791) at a sensitivity of 0.98, and 0.831 (95% CI, 0.797–0.869) at a sensitivity of 0.95. The AUROC and AUPRC results for the in-distribution test set partitioned per scanner period are provided in Supplementary Fig. 1. The model performed better on the WSIs scanned starting from 2016 using the Hamamatsu NanoZoomer 2.0-XR scanner than on the WSIs scanned before 2016 using the Aperio ScanScope XT scanner. Predicted probabilities by the AI model were well-calibrated based on the reliability diagram, with an ECE of 0.010 (95% CI, 0.009–0.018). At 0.99 sensitivity, the seven false negative predictions comprised two WNT-activated melanocytomas, two BAP1-inactivated melanocytomas, two lesions categorized as ambiguous, and a recurrent nevus. Visual inspection of the false positive predictions for common nevi revealed cases with intense inflammation or pigmentation, the presence of scar tissue, or an uncommon morphological appearance (e.g., ballooning or due to artifacts).
Fig. 2: Results on the in-distribution test set. a Receiver operating characteristic curve. b Precision–recall curve. c Reliability diagram. d Predicted probability histogram.
The results of the evaluation on the out-of-distribution test set are shown in Fig. 3. The AI model obtained an AUROC of 0.899 (95% CI, 0.860–0.934), an AUPRC of 0.498 (95% CI, 0.360–0.639), and an ECE of 0.160 (95% CI, 0.136–0.187). For some cases that include both a common nevus and a non-melanocytic skin pathology, which were labeled as low complexity, the AI model predicted the case to be of high complexity. These cases are considered false positive predictions in the evaluation, which is reflected by the lower AUPRC. For example, at 0.95 sensitivity on the in-distribution dataset, 48 out of 64 (75%) cases with a common nevus and basal cell carcinoma, and 4 out of 4 (100%) cases with a common nevus and squamous cell carcinoma, were predicted to be of high complexity.
Fig. 3: Results on the out-of-distribution test set. a Receiver operating characteristic curve. b Precision–recall curve. c Reliability diagram. d Predicted probability histogram.
Several example cases with corresponding attention maps and classification results are shown in Fig. 4. The attention maps indicate the weight, and therefore importance, assigned by the model to each tile for the prediction at case level. Tiles that were assigned the highest attention weight consistently showed melanocytic lesion tissue for the correctly classified examples. However, not all tiles with melanocytic lesion tissue are assigned a high attention weight. For some false positive cases, the highest attention weight was assigned to tiles with non-melanocytic tissue, such as scar tissue or squamous cell carcinoma (out-of-distribution), whereas tiles showing common nevus tissue in the same case received low weights.
Fig. 4: Example cases with attention maps and classification results. Per case from top to bottom: the most representative whole slide image for that case, the extracted tiles colored based on the attention weights assigned by the AI model, the tile with the largest attention weight at a higher magnification, and the classification result. Classification decisions were obtained using a threshold corresponding to a sensitivity of 0.95 on the in-distribution test set. Images for cases shown in the two leftmost columns were acquired using the ScanScope XT scanner (Aperio) and in the two rightmost columns using the NanoZoomer 2.0-XR scanner (Hamamatsu). a Correct predictions for cases from the low complexity category. From left to right: dermal nevus, compound nevus, dermal nevus, and (acral) junctional nevus. b Correct predictions for cases from the high complexity category. From left to right: superficial spreading melanoma, nodular melanoma, WNT-activated melanocytoma, and Spitz nevus. c Incorrect predictions. From left to right: dermal nevus and squamous cell carcinoma (out-of-distribution), compound nevus and scar tissue, dermal nevus with uncommon morphology (heavily pigmented and likely congenital), and WNT-activated melanocytoma.
Simulation experiment
To study the effect of implementing the AI model for triaging in the pathology department workflow, a simulation experiment was performed using the model predictions for the combined test sets. For the simulation, we assumed one expert pathologist and four general pathologists, which approximately reflects the ratio in most pathology departments in the Netherlands. Per iteration, 500 cases were randomly sampled with replacement from the test sets and assigned to one of the pathologists, resulting in 100 cases per pathologist. Four methods of assignment were compared: (1) Baseline: assigning each case to a random pathologist; (2) Baseline excluding consultation: assigning all consultation and revision cases directly to the expert pathologist, followed by assigning each of the remaining cases to a random pathologist; (3) AI-based triaging: ranking the cases based on the predicted probability of being a high complexity case, assigning the most complex cases to the expert pathologist, followed by distributing the remaining cases over the general pathologists; (4) AI-based triaging excluding consultation: assigning all consultation and revision cases directly to the expert pathologist, followed by the AI-based triaging method for the remaining cases. All methods were repeated for 10,000 iterations. The simulation results are reported as the mean and 95% CI of the number of high complexity cases that were assigned to the expert and general pathologists per assignment method.
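The four assignment methods can be sketched as follows. This is a simplified illustration under stated assumptions rather than the simulation code of this study: each case is represented by a hypothetical tuple (is_high_complexity, predicted_probability, is_consultation), the expert is assumed to take at most 100 consultation cases, and the general pathologists are summarized by the mean number of high complexity cases per pathologist.

```python
import random


def simulate_iteration(cases, use_ai, consultation_first, rng,
                       n_general=4, cases_per_pathologist=100):
    """Run one simulated assignment round.

    cases: list of (is_high_complexity, predicted_probability, is_consultation).
    Returns (high complexity cases for the expert,
             mean high complexity cases per general pathologist).
    """
    n_cases = cases_per_pathologist * (n_general + 1)
    sample = rng.choices(cases, k=n_cases)  # sampling with replacement

    expert, remaining = [], sample
    if consultation_first:
        consultations = [c for c in sample if c[2]]
        others = [c for c in sample if not c[2]]
        # Assumption: the expert takes at most a full share of consultations.
        expert = consultations[:cases_per_pathologist]
        remaining = consultations[cases_per_pathologist:] + others

    if use_ai:
        # Highest predicted probability of high complexity first.
        remaining = sorted(remaining, key=lambda c: c[1], reverse=True)
    else:
        remaining = list(remaining)
        rng.shuffle(remaining)

    open_slots = cases_per_pathologist - len(expert)
    expert = expert + remaining[:open_slots]
    general = remaining[open_slots:]

    expert_high = sum(c[0] for c in expert)
    general_high = sum(c[0] for c in general) / n_general
    return expert_high, general_high


def simulate(cases, use_ai, consultation_first, n_iterations=1000, seed=0):
    """Average the per-iteration counts over repeated rounds."""
    rng = random.Random(seed)
    results = [simulate_iteration(cases, use_ai, consultation_first, rng)
               for _ in range(n_iterations)]
    mean_expert = sum(r[0] for r in results) / n_iterations
    mean_general = sum(r[1] for r in results) / n_iterations
    return mean_expert, mean_general
```

Setting `use_ai` and `consultation_first` to the four combinations of True and False corresponds to the four assignment methods listed above.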
Using random case assignment as a baseline for the simulation experiment, all five pathologists were assigned 13.0 (95% CI, 7–20) high complexity cases out of 100, which reflects the prevalence in the dataset. When accounting for consultation and revision cases, the single expert pathologist and the four general pathologists received 22.6 (95% CI, 15–31) and 10.6 (95% CI, 5–17) high complexity cases out of 100, respectively. Using AI-based triaging instead, the expert pathologist was assigned 56.8 (95% CI, 45–68) high complexity cases out of 100, and the four general pathologists each received 2.0 (95% CI, 0–5) high complexity cases out of 100. Assigning consultation and revision cases to the expert pathologist before AI-based triaging showed similar results, with 57.1 (95% CI, 45–69) and 2.0 (95% CI, 0–5) high complexity cases out of 100 assigned to the expert and to the four general pathologists, respectively. Under the assumption that the general pathologists would identify all complex cases and refer these cases to the expert pathologist, 43.9 (95% CI, 36–55) initial examinations of high complexity cases by general pathologists could be prevented per 500 cases using AI-based triaging instead of random case assignment. With consultation and revision cases accounted for, a total of 34.5 (95% CI, 25–44) initial examinations could be prevented.
Discussion
In this study, we developed and validated an AI model for melanocytic lesion triaging using H&E-stained WSIs. Pathologists are facing an increasing workload due to the need for more comprehensive diagnoses and a growing volume of cases9. This problem can potentially be alleviated using AI-based triaging10. The developed AI model showed a strong predictive performance in differentiating between melanocytic lesions of high and low complexity. Using a simulation experiment, we demonstrated that implementing AI-based triaging for case assignment could substantially reduce workload and increase efficiency in daily routine pathology practice.
The AI model correctly distinguished between most melanocytic lesions of high and low complexity, as evidenced by the AUROC of 0.966 and the AUPRC of 0.857 on the in-distribution test set. We also investigated including clinical information (i.e., age, sex, and anatomical location) as part of the input to the model, but this did not show benefit in preliminary experiments. At a high sensitivity, false negative predictions were seen for WNT-activated and BAP1-inactivated melanocytomas, which might be because these lesions often co-occur with a common nevus and have a limited representation in the high complexity category. Inspection of false positive classifications with the highest predicted probability revealed a few cases that were originally diagnosed as common nevus but, in retrospect, were highly suspicious for a Spitz nevus or WNT-activated melanocytoma. Other false positives were likely caused by the presence of scar tissue. Because recurrent nevi and melanoma re-excisions usually show scar tissue close to the remaining lesion, we hypothesize that the model learned to associate scar tissue with the high complexity class, even in the absence of remaining melanocytes. Although common nevi with atypia due to strong inflammation were also seen among the false positive predictions, these lesions are often challenging to diagnose in practice as well, which can be seen as a limitation of the categorization.
For correctly classified cases, tiles with the highest attention weights consistently showed melanocytic lesion tissue, which implies that the model has learned to predict based on the relevant WSI region. Not all tiles with melanocytic lesion tissue were assigned high attention weights, however, which is a consequence of the model architecture and training procedure21. In addition to verification, the attention maps were also helpful in identifying possible causes of incorrect classification predictions, such as the presence of scar tissue without remaining melanocytic lesion tissue.
The out-of-distribution performance of the AI model was studied on an independent subset of cases with both melanocytic lesions and non-melanocytic skin pathologies. For some of these cases, which were labeled as low complexity because of the common nevus, the AI model predicted the case to be of high complexity based on the non-melanocytic pathology. These cases are considered false positive predictions in the evaluation, which explains the lower AUPRC. A higher false positive rate can be acceptable, in practice, since tissue specimens with both a melanocytic lesion and a non-melanocytic skin pathology occur infrequently (1.8% of cases in our dataset). For the purpose of triaging, maintaining a low false negative rate is generally more important.
Skin tissue specimens usually arrive at the pathology department in batches with mixed pathologies. Although preliminary clinical (differential) diagnoses by general practitioners or dermatologists are often provided, a perfect separation of melanocytic and non-melanocytic cases based on these diagnoses is impossible since the clinical impression is not always correct. In the future, effective deployment of the AI model for melanocytic lesion triaging requires either another model that separates melanocytic lesions from the rest or expanding the development set by including other common skin pathologies to improve robustness.
Along these lines, Sankarapandian et al.18 trained and validated a hierarchical skin biopsy classification system using a reference set of 7685 WSIs from 3511 specimens, of which 1079 concern melanocytic lesions. A two-stage classification model was trained to differentiate between lesions of basaloid, squamous, melanocytic (low, intermediate, and high risk), and other origin. The authors report a strong performance of the system in discriminating between lesions of different origins, but predicting the risk level for melanocytic lesions was demonstrated to be more difficult. We deliberately decided not to group the melanocytic lesions based on the risk of malignancy because differentiating between, for example, benign Spitz nevi and malignant superficial spreading melanomas based only on H&E-stained slides can be extremely challenging5. Since both of these lesion subtypes would typically require additional IHC staining or molecular testing for definitive diagnosis, in our view, it is preferable to consider both as high complexity cases for triaging.
Through simulation, we estimated that, on average, 43.9 initial examinations of high complexity cases by general pathologists could be prevented with AI-based triaging per 500 cases, assuming five pathologists, one of whom is an expert, and using random case assignment as the baseline. When accounting for consultation and revision cases by assigning these to the expert pathologist first, an average of 34.5 initial examinations could be prevented. In addition to the simulation configuration, these estimates also depend on factors such as population prevalence of melanocytic lesion subtypes, the number of consultation and revision cases, and the origin of tissue specimens. Information about where a skin biopsy or excision was performed (e.g., a general practice or dermatology department) can allow for more informed case assignment. Potential benefits from case prioritization or requesting additional diagnostic tests before initial examination were not investigated. While there are also possible risks of deploying an AI-based triaging system in clinical practice, such as lower vigilance of general pathologists and more difficulty in building expertise, these could be mitigated by regularly assigning a high complexity case to a general pathologist and determining afterwards, as quality control, whether the case was referred to an expert or whether an expert agrees with the diagnosis that was made. Impact assessment studies and cost-effectiveness analyses are important next steps before clinical implementation, as well as further validation focused on rare and easily misdiagnosed melanocytic lesions (e.g., nevoid melanomas, which were included in our dataset but not separately categorized). For broader adoption of the AI model, validation on datasets from external centers would also be necessary.
In conclusion, this work describes the development and validation of an AI model for triaging cutaneous melanocytic lesions, based on the largest melanocytic lesion dataset to date. The model achieved a strong predictive performance in differentiating between high and low complexity melanocytic lesions. The potential benefit of implementing AI-based triaging for case assignment was demonstrated using a simulation experiment.
Methods
Study design
This retrospective cohort study was performed using an internal dataset from the University Medical Center (UMC) Utrecht, the Netherlands. All pathology reports accessioned between January 1, 2013, and December 31, 2020, with any melanocytic lesion PALGA22 code attached were queried from a database at the pathology department. The study was conducted in compliance with the UMC Utrecht’s research ethics committee guidelines. All reports and images were de-identified prior to analysis and model development. Informed consent was waived due to the retrospective nature of the study. Reports for patients who opted out of the use of their data for research purposes were excluded.
Dataset curation
The specimen selection is schematically shown in Fig. 5. A total of 26,746 pathology reports were retrieved, containing text descriptions of histopathological and molecular findings, diagnoses, and treatment recommendations for one or multiple specimens. Corresponding clinical information included the sex, age, and pseudonymized identification number of the patient, as well as the anatomical location of the specimens and the presence of additional, non-melanocytic skin pathologies (e.g., basal cell carcinoma).
Fig. 5: Flowchart of the specimen selection with exclusion criteria.
The reports were manually checked and divided into a separate report for every specimen if applicable. The specimens comprised shave and punch biopsies, excisions, and amputated digits. Re-excisions were only included if sufficient tumor tissue remained for diagnosis. Specimens with mucosal or uveal melanoma were excluded. The pathology reports were made shortly after the time of accession as part of routine clinical practice by different general pathologists and more specialized dermatopathologists. Whereas the oldest specimens were diagnosed based only on H&E-stained slides, IHC stains and molecular tests were more often used for diagnosis in recent years. For all specimens, the diagnostic code was manually checked and corrected if inconsistent with the diagnosis in text. Moreover, to improve the consistency in the diagnoses over time, two pathologists (G.B. and W.B.) reviewed a subset of 2088 specimens, focused primarily on the oldest and diagnostically most challenging cases, and revised the diagnoses when necessary. Ambiguous lesions for which a definitive diagnosis could not be rendered were either assigned multiple codes to reflect the differential diagnosis or labeled as “ambiguous”. More than one diagnostic code was also assigned to specimens with combined or multiple distinct lesions of different subtypes. Specifically for Spitz lesions, only specimens confirmed by either convincing positive IHC staining or molecular analysis as being of Spitz lineage were labeled as such. In contrast, all suspected Spitz lesions without confirmation were labeled as ambiguous.
Image acquisition was performed using either a ScanScope XT scanner (Aperio, Vista, CA, USA) at 20× magnification with a resolution of 0.50 μm per pixel (slides scanned before 2016) or a NanoZoomer 2.0-XR scanner (Hamamatsu Photonics, Hamamatsu, Shizuoka, Japan) at 40× magnification with a resolution of 0.23 μm per pixel (slides scanned starting from 2016). In summary, the curated dataset consisted of 52,202 H&E-stained WSIs from 27,167 unique specimens, acquired from 20,707 patients, and occupied 23.1 terabytes of image data.
Triaging dataset
Melanocytic lesions from the curated dataset were grouped based on the diagnostic codes into a high complexity and low complexity category for the purpose of triaging. All specimens with only common nevi (i.e., junctional, dermal, and compound common nevi) were assigned to the low complexity category, as these lesions can usually be easily diagnosed by general pathologists using H&E-stained slides only. Specimens with all other melanocytic lesion subtypes (i.e., non-common nevi, melanocytomas, and melanomas) were assigned to the high complexity category, as these lesions often require expertise in dermatopathology and/or additional IHC staining or molecular testing for definitive diagnosis and optimal treatment recommendation. Specimens labeled as ambiguous or with a differential diagnosis were also assigned to the high complexity category. The distribution of diagnoses per category is provided in Table 1.
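As an illustration, the grouping rule can be written as a small function. The diagnosis strings and the `has_differential_diagnosis` flag are hypothetical stand-ins for the diagnostic codes used in the dataset, and only melanocytic diagnoses are assumed to be passed in.

```python
# Hypothetical labels standing in for the common nevus diagnostic codes.
COMMON_NEVI = frozenset({"junctional nevus", "dermal nevus", "compound nevus"})


def triage_category(diagnoses, has_differential_diagnosis=False):
    """Return "low" only if every melanocytic diagnosis is a common nevus."""
    diagnoses = set(diagnoses)
    if not diagnoses:
        raise ValueError("at least one diagnosis is required")
    if has_differential_diagnosis or not diagnoses <= COMMON_NEVI:
        return "high"
    return "low"
```

Under this rule, a specimen labeled as ambiguous falls into the high complexity category automatically, because "ambiguous" is not one of the common nevus labels.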
For each specimen, one or more slides with H&E-stained tissue cross-sections were prepared and scanned to obtain WSIs. Because triaging is intended to be performed before initial examination by a pathologist, no information is available yet about which WSI best shows the lesion. This precludes selecting the most representative slide, which is why model training and inference were performed at case level using all unique WSIs at once per specimen, including WSIs without lesion tissue.
The dataset was split on a patient level into a set for model development and a set for evaluation. The development set contained 80% of the patients, which was further subdivided into five folds for cross-validation. The remaining 20% of the patients were assigned to the test set for independent evaluation of the final model performance. The test set was divided into two parts: (1) All specimens that reflect the same distribution as the development set for evaluation of the in-distribution performance; (2) All specimens with non-melanocytic skin pathologies in addition to a melanocytic lesion, which were set aside from the start to study model robustness. These cases are considered to be outside of the development data distribution, because the model was not presented with non-melanocytic skin pathologies during training. Results on this set are referred to as the out-of-distribution performance.
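A patient-level split of this kind could be implemented along the following lines. The sketch assumes a plain random split without stratification, which may differ from the procedure used in this study, and the identifiers are hypothetical. Separating the out-of-distribution specimens from the test set is omitted.

```python
import random


def split_by_patient(specimen_to_patient, test_fraction=0.2, n_folds=5, seed=0):
    """Assign every specimen to "test" or to one of n_folds development folds.

    All specimens of a patient end up in the same partition, so no patient
    contributes to both model development and evaluation.
    """
    patients = sorted(set(specimen_to_patient.values()))
    rng = random.Random(seed)
    rng.shuffle(patients)

    n_test = round(test_fraction * len(patients))
    test_patients = set(patients[:n_test])
    development_patients = patients[n_test:]
    # Distribute the development patients over the folds in round-robin order.
    fold_of_patient = {p: i % n_folds for i, p in enumerate(development_patients)}

    assignment = {}
    for specimen, patient in specimen_to_patient.items():
        if patient in test_patients:
            assignment[specimen] = "test"
        else:
            assignment[specimen] = f"fold_{fold_of_patient[patient]}"
    return assignment
```

Splitting on patients rather than specimens prevents lesions from the same patient from appearing on both sides of the split, which could otherwise inflate the estimated performance.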
Feature representation
An overview of the methodology is shown in Fig. 1. Tissue cross-sections and pen markings were segmented in each WSI at 1.25× magnification using SlideSegmenter23. The resulting tissue segmentation map was used to guide the slide tessellation. Non-overlapping tiles of 4096 × 4096 pixels were extracted from the WSIs at 20× magnification. Tiles with less than 5% of their area covered by tissue were excluded. Because pen markings in WSIs are a potential source of bias during model development24, tiles with identified pen markings were also excluded.
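The tile selection can be sketched at segmentation resolution. Because 20× is 16 times 1.25×, a tile of 4096 × 4096 pixels at 20× corresponds to 256 × 256 pixels in the segmentation map. The function below is an illustrative sketch and not SlideSegmenter or the released pipeline: masks are plain nested lists of 0/1 values, and tiles at the image border are assumed to be measured against the full tile area.

```python
def select_tiles(tissue_mask, pen_mask, tile_size=256, min_tissue_fraction=0.05):
    """Return top-left (row, column) positions of tiles to keep.

    tissue_mask, pen_mask: 2D lists of 0/1 values at segmentation resolution.
    tile_size: tile edge length in mask pixels (4096 / 16 = 256 by default).
    A tile is kept if at least min_tissue_fraction of it is tissue and it
    contains no pen marking.
    """
    height, width = len(tissue_mask), len(tissue_mask[0])
    selected = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            rows = range(top, min(top + tile_size, height))
            cols = range(left, min(left + tile_size, width))
            tissue = sum(tissue_mask[r][c] for r in rows for c in cols)
            has_pen = any(pen_mask[r][c] for r in rows for c in cols)
            fraction = tissue / (tile_size * tile_size)
            if not has_pen and fraction >= min_tissue_fraction:
                selected.append((top, left))
    return selected
```

Multiplying the returned positions by 16 gives the corresponding pixel coordinates at 20× magnification.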
We built upon the Hierarchical Image Pyramid Transformer (HIPT) proposed by Chen et al.25. This model consists of three concatenated Vision Transformers (ViTs)26. The first two ViTs were trained consecutively on tiles extracted from a pan-cancer dataset of 10,678 WSIs from The Cancer Genome Atlas (TCGA)27 using DINO28. This self-supervised learning method trains a model to recognize image patterns in a label-agnostic manner. In brief, a student model learns by minimizing the cross-entropy loss between the representations of several augmented image crops generated by itself and a teacher model. The parameters of the teacher model are an exponential moving average of the student model parameters, making it a self-distillation framework using no labeled data. The first two ViTs form the encoder, which was used to extract 192-dimensional feature vectors for all tiles.
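The two ingredients of the self-distillation scheme, the cross-entropy between teacher and student outputs and the exponential moving average update of the teacher, can be illustrated with a small standard-library sketch. The temperatures, the centering of the teacher output, and the momentum value are assumptions taken from commonly reported DINO defaults, not settings confirmed for the encoder used in this study. The function names are hypothetical.

```python
import math


def softmax(logits, temperature):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Numerically stable log-softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum for x in scaled]


def self_distillation_loss(student_logits, teacher_logits, center,
                           student_temperature=0.1, teacher_temperature=0.04):
    """Cross-entropy between the teacher and student output distributions."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = softmax(centered, teacher_temperature)
    student_log_probs = log_softmax(student_logits, student_temperature)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter a small step toward the student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

Only the student receives gradient updates; the teacher changes solely through `ema_update`, which is why no labels are needed.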
Model training
The third ViT of the HIPT model was trained specifically for the triaging task using the extracted feature vectors for all specimens in the melanocytic lesion development set from both scanner types. Training was repeated five times, each using a different fold for validation and the remaining four folds for training. Because the number of feature vectors per specimen varies, only a single specimen was used per iteration. The model was trained by minimizing the cross-entropy loss for 1,000,000 iterations starting from randomly initialized parameters using the AdamW29 optimization algorithm (β1 = 0.9, β2 = 0.999). Gradients were accumulated over every 500 iterations. The learning rate was 0.0005 at the start and was halved after every 100,000 iterations. The validation loss was evaluated after every 10,000 iterations, and the network parameters that resulted in the smallest loss on the validation fold were saved. The model was trained with attention dropout (p = 0.5). In addition, all feature vectors for a cross-section were randomly excluded during training (p = 0.5). Hyperparameters were tuned based on the average performance on the five validation folds. The model and training procedure were implemented in the PyTorch30 framework. Probabilities predicted by the five model instances at inference time were averaged to obtain model ensemble predictions.
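The schedule described above can be made concrete with a few helper functions. This is a framework-free sketch with hypothetical names, not the released PyTorch training code, and it assumes zero-based iteration counting.

```python
def learning_rate(iteration, initial=0.0005, halve_every=100_000):
    """Step schedule: the rate is halved after every `halve_every` iterations."""
    return initial * 0.5 ** (iteration // halve_every)


def is_optimizer_step(iteration, accumulate_over=500):
    """True on the iteration that completes a gradient accumulation window."""
    return (iteration + 1) % accumulate_over == 0


def is_validation_step(iteration, validate_every=10_000):
    """True on iterations after which the validation loss is evaluated."""
    return (iteration + 1) % validate_every == 0


def ensemble_probability(member_probabilities):
    """Average the probabilities predicted by the cross-validation models."""
    return sum(member_probabilities) / len(member_probabilities)
```

With one specimen per iteration and accumulation over 500 iterations, 1,000,000 iterations correspond to 2000 parameter updates, and the learning rate during the final 100,000 iterations is 0.0005 / 2^9, roughly 9.8 × 10^-7.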
Data availability
All relevant data supporting the findings of this study are available within the paper and its Supplementary Information. Raw data that support the findings of this study are not openly available because of patient privacy reasons, but can be made available upon reasonable request. Requests for access can be directed to the corresponding author.
Code availability
Code for model development and evaluation, as well as the trained model parameters, are available at https://github.com/RTLucassen/melanocytic_lesion_triaging. A custom software tool which was developed and used for slide selection is available at https://github.com/RTLucassen/selection_tool.
References
Siegel, R. L., Miller, K. D., Wagle, N. S. & Jemal, A. Cancer statistics, 2023. CA Cancer J. Clin. 73, 17–48 (2023).
Integraal Kankercentrum Nederland. NKR cijfers. https://nkr-cijfers.iknl.nl/viewer. (Accessed Feb 2, 2024).
Gershenwald, J. E. et al. Melanoma staging: evidence-based changes in the American Joint Committee on Cancer eighth edition cancer staging manual. CA Cancer J. Clin. 67, 472–492 (2017).
Bastian, B. C. The molecular pathology of melanoma: an integrated taxonomy of melanocytic neoplasia. Annu. Rev. Pathol. 9, 239–271 (2014).
Gerami, P. et al. Histomorphologic assessment and interobserver diagnostic reproducibility of atypical Spitzoid melanocytic neoplasms with long-term follow-up. Am. J. Surg. Pathol. 38, 934–940 (2014).
Elmore, J. G. et al. Pathologists’ diagnosis of invasive melanoma and melanocytic proliferations: observer accuracy and reproducibility study. Br. Med. J. 357 (2017).
Davis, L. E., Shalin, S. C. & Tackett, A. J. Current state of melanoma diagnosis and treatment. Cancer Biol. Ther. 20, 1366–1379 (2019).
Benton, S. et al. Impact of next-generation sequencing on interobserver agreement and diagnosis of Spitzoid neoplasms. Am. J. Surg. Pathol. 45, 1597–1605 (2021).
Van der Laak, J., Litjens, G. & Ciompi, F. Deep learning in histopathology: the path to the clinic. Nat. Med. 27, 775–784 (2021).
Berbís, M. A. et al. Computational pathology in 2030: a Delphi study forecasting the role of AI in pathology within the next decade. EBioMedicine 88 (2023).
Van Zon, M. et al. Segmentation and classification of melanoma and nevus in whole slide images. In 2020 IEEE 17th International Symposium on Biomedical Imaging, 263–266 (IEEE, 2020).
Höhn, J. et al. Combining CNN-based histologic whole slide image analysis and patient data to improve skin cancer classification. Eur. J. Cancer 149, 94–101 (2021).
Ba, W. et al. Diagnostic assessment of deep learning for melanocytic lesions using whole-slide pathological images. Transl. Oncol. 14, 101161 (2021).
Li, M., Abe, M., Nakano, S. & Tsuneki, M. Deep learning approach to classify cutaneous melanoma in a whole slide image. Cancers 15, 1907 (2023).
Haggenmüller, S. et al. Federated learning for decentralized artificial intelligence in melanoma diagnostics. JAMA Dermatol. 160, 303–311 (2024).
Wies, C. et al. Evaluating deep learning-based melanoma classification using immunohistochemistry and routine histology: a three center study. PLoS ONE 19, e0297146 (2024).
Maher, N. G. et al. Weakly supervised deep learning image analysis can differentiate melanoma from naevi on haematoxylin and eosin-stained histopathology slides. J. Eur. Acad. Dermatol. Venereol. 38, 2250–2258 (2024).
Sankarapandian, S. et al. A pathology deep learning system capable of triage of melanoma specimens utilizing dermatopathologist consensus as ground truth. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 629–638 (2021).
Reiner, B. I. & Krupinski, E. The insidious problem of fatigue in medical imaging practice. J. Digit. Imaging 25, 3–6 (2012).
Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, 1321–1330 (2017).
Tourniaire, P., Ilie, M., Hofman, P., Ayache, N. & Delingette, H. MS-CLAM: mixed supervision for the classification and localization of tumors in whole slide images. Med. Image Anal. 85, 102763 (2023).
Casparie, M. et al. Pathology databanking and biobanking in The Netherlands, a central role for PALGA, the nationwide histopathology and cytopathology data network and archive. Cell. Oncol. 29, 19–24 (2007).
Lucassen, R. T., Blokx, W. A. M. & Veta, M. Tissue cross-section and pen marking segmentation in whole slide images. In Proceedings of SPIE 12933, Medical Imaging 2024: Digital and Computational Pathology, vol. 12933 (2024).
Winkler, J. K. et al. Association between surgical skin markings in dermoscopic images and diagnostic performance of a deep learning convolutional neural network for melanoma recognition. JAMA Dermatol. 155, 1135–1141 (2019).
Chen, R. J. et al. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16144–16155 (2022).
Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (2021).
Liu, J. et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell 173, 400–416 (2018).
Caron, M. et al. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650–9660 (2021).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (2019).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (2019).
Acknowledgements
This research was financially supported by the Hanarth Foundation. The funder played no role in study design, data collection, analysis and interpretation of data, or the writing of this paper.
Author information
Author notes
These authors contributed equally: Mitko Veta, Willeke A. M. Blokx.
Authors and Affiliations
Department of Pathology, University Medical Center Utrecht, Utrecht, the Netherlands
Ruben T. Lucassen, Nikolas Stathonikos, Gerben E. Breimer & Willeke A. M. Blokx
Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, the Netherlands
Ruben T. Lucassen & Mitko Veta
Authors
- Ruben T. Lucassen
- Nikolas Stathonikos
- Gerben E. Breimer
- Mitko Veta
- Willeke A. M. Blokx
Contributions
R.L., M.V., and W.B. conceptualized the study. R.L., G.B., and W.B. participated in data curation and verification. R.L., N.S., and M.V. designed the methodology. R.L. developed the model and performed the evaluation. R.L., N.S., M.V., and W.B. analyzed and interpreted the results. R.L. wrote the original draft. M.V. and W.B. supervised the project and participated in funding acquisition. All authors had full access to all the data in the study. All authors read, edited, and approved the final paper. All authors accept the final responsibility to submit for publication and take responsibility for the contents of the paper.
Corresponding author
Correspondence to Ruben T. Lucassen.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lucassen, R.T., Stathonikos, N., Breimer, G.E. et al. Artificial intelligence-based triaging of cutaneous melanocytic lesions. npj Biomed. Innov. 2, 10 (2025). https://doi.org/10.1038/s44385-025-00013-1
DOI: https://doi.org/10.1038/s44385-025-00013-1