This study developed a deep learning algorithm based on DenseNet169 with acceptable performance (AUROC 0.895 and AUPRC 0.918 on an external validation dataset) for diagnosing tongue cancer from endoscopic images (Table 2 and Figure 2). Some existing medical imaging studies have reported higher performance. However, in contrast to this study, most of them have the limitation that when an external test set was used, neither internal test results nor validation results were reported23,24. The AI model developed in our study can detect visual signs of cancer in complex oral endoscopic images. This AI-based diagnostic tool may be of clinical importance for the early diagnosis of cancer.
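For reference, the AUROC and AUPRC figures quoted above can be computed from a model's predicted scores and the true labels. A minimal pure-Python sketch on toy data (illustrative only, not the study's evaluation code):

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # fraction of positive/negative pairs ranked correctly (ties count half)
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auprc(labels, scores):
    """Area under the precision-recall curve as average precision."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            tp += 1
            ap += tp / rank  # precision at each recall step
    return ap / sum(labels)

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))  # 0.75
print(auprc(labels, scores))  # ~0.833
```

Both metrics are threshold-free, which is why they are the standard summary of a binary classifier's ranking quality on an external test set.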
Although tongue cancer should be diagnosed early, diagnosis is sometimes delayed25. In such cases, as the cancer stage progresses, the prognosis worsens and the scope of surgery expands, resulting in serious postoperative consequences such as dysarthria26. Early diagnosis is difficult, and from the patient's point of view there is a lack of knowledge and awareness of tongue cancer27. In addition, general practitioners find it difficult to diagnose cancer in primary care settings using endoscopic images28. Therefore, cancer should be diagnosed by an oncology specialist with extensive clinical experience. In a previous study, a screening system involving experienced specialists reduced head and neck cancer and oral cancer mortality29.
However, the number of specialists is small, and most of them work in large medical institutions, including university hospitals, to which access is limited. In this study, the diagnostic performance of the developed deep learning model was superior to that of general practitioners but lower than that of oncology specialists (Figure 3). This difference is plausible because oncology specialists have more clinical experience with cancer patients than general practitioners30. This suggests that AI-based diagnostic models have the potential to assist general practitioners who have little clinical experience in oncology with the interpretation of endoscopic images. Other studies have likewise reported improved accuracy of cancer diagnosis with AI assistance31. In addition, the kappa coefficient indicated good agreement between the model developed in this study and the specialists on the classification of lesions (kappa value = 0.685, 95% CI 0.606–0.763) (Table 3). Therefore, as with gastrointestinal endoscopy, the developed model could allow general practitioners to improve the accuracy of tongue cancer diagnosis by combining it with oral endoscopy, which is available in primary care facilities.
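The kappa statistic cited above measures agreement between two raters beyond what chance alone would produce. A minimal pure-Python sketch with toy labels (not the study's data):

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: inter-rater agreement corrected for chance."""
    n = len(rater_a)
    # observed agreement
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # expected chance agreement from each rater's marginal label frequencies
    labels = set(rater_a) | set(rater_b)
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)

model_calls  = [1, 1, 0, 0, 1, 0]  # hypothetical model classifications
expert_calls = [1, 1, 0, 0, 0, 0]  # hypothetical specialist classifications
print(cohens_kappa(model_calls, expert_calls))  # ~0.667
```

Values in the 0.61–0.80 range are conventionally interpreted as "substantial" agreement, which is why the reported 0.685 supports the model's consistency with specialists.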
Recently, several studies have reported the usefulness of medical image analysis based on deep learning models. A CNN model based on ResNet-50 simultaneously learned the detection and characterization of lesions on magnetic resonance imaging (MRI)32. In addition, a CNN model based on VGGNet classified lesions in image data as benign or malignant33. In this study, we retrained existing CNN models, originally developed on a large general collection of natural images, using oral endoscopic images (Figure 1). Six different models were used in this study: CNN, ResNet, EfficientNet, VGGNet, MobileNet, and DenseNet. Because the plain CNN is the most basic model for classifying images, it was used as the baseline for comparing the performance of the other models.
VGGNet, ResNet, and DenseNet are models built on deep backbones, and as the layers deepen, each model can achieve better predictive performance. We used these related models to identify trends in the data and to find a suitable model. MobileNet and VGGNet have relatively fast training speeds and adequate performance, and were used to quickly verify results when adding logic to explore data properties. ResNet, DenseNet, and EfficientNet consist of deep layers; their training speed is therefore relatively slow, but their performance is acceptable. In particular, DenseNet achieves higher performance with fewer parameters than ResNet. ResNet integrates features by summing them as they pass through layers, whereas DenseNet differs in that it concatenates features instead of adding them.
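The contrast drawn above between ResNet's additive skip connections and DenseNet's concatenation can be illustrated with a toy NumPy sketch; the `layer` here is a stand-in for a convolutional block, not the actual architectures:

```python
import numpy as np

def residual_step(x, layer):
    # ResNet-style: ADD the transformed features back to the input,
    # so the channel dimension stays the same
    return x + layer(x)

def dense_step(x, layer):
    # DenseNet-style: CONCATENATE the transformed features to the input,
    # so the channel dimension grows at every layer
    return np.concatenate([x, layer(x)], axis=0)

layer = np.tanh          # stand-in for a conv block
features = np.ones(8)    # stand-in feature map with 8 channels

print(residual_step(features, layer).shape)  # (8,)  width unchanged
print(dense_step(features, layer).shape)     # (16,) width doubled
```

Because every DenseNet layer can reuse all earlier feature maps directly, each layer needs to produce fewer new channels, which is the intuition behind DenseNet's parameter efficiency relative to ResNet.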
Unlike previous studies that used standardized CT and MRI images, this study analyzed atypical oral images using the deep learning algorithms described above. Since tongue cancer is a rare disease, we removed as much noise as possible from the images rather than increasing the amount of data. By minimizing data deviations, the gap between the selected population and the total population was narrowed. DenseNet169, which was evaluated as the most appropriate algorithm in this study, was also effective in image evaluations conducted in previous studies. In one study on the classification of pathological images, which used atypical images similar to those in this study, effective results were obtained even with a small number of images34. Similarly, DenseNet169 showed the best performance in a study of an AI model for classifying the quality of tongue images35. Therefore, the selection and optimization of AI algorithms is important in view of the characteristics of each image dataset. In particular, we believe that the model obtained in this study will be useful for atypical data with large deviations between images, including endoscopic images.
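The retraining strategy described above (a network pretrained on natural images, then adapted to endoscopic images) follows the usual transfer-learning pattern: freeze the pretrained backbone and train only a new task head. A toy NumPy sketch of that pattern, with a fixed random projection standing in for the frozen backbone; the data and every name here are illustrative assumptions, not the study's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# "pretrained backbone": a fixed projection that is never updated
W_backbone = rng.normal(size=(16, 4))
def backbone(x):
    return np.maximum(x @ W_backbone, 0.0)  # frozen ReLU features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy binary task standing in for cancer vs. normal endoscopic images
X = rng.normal(size=(64, 16))
feats = np.hstack([backbone(X), np.ones((64, 1))])  # features + bias column
true_w = rng.normal(size=feats.shape[1])
y = (feats @ true_w > np.median(feats @ true_w)).astype(float)

# fine-tune ONLY the new head with gradient descent; backbone stays frozen
w_head = np.zeros(feats.shape[1])
for _ in range(500):
    p = sigmoid(feats @ w_head)
    w_head -= 0.1 * feats.T @ (p - y) / len(y)

accuracy = ((sigmoid(feats @ w_head) > 0.5) == y).mean()
print(accuracy)
```

Freezing the backbone is what makes such retraining feasible on a small dataset of a rare disease: only the head's few parameters are estimated from the scarce medical images.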
Despite recent innovative advances in deep learning technology, a large and validated dataset remains one of the conditions for improving diagnostic performance. Drix highlighted the problem of the “Frankenstein dataset”22: a dataset assembled piecemeal from a variety of sources. If an algorithm is tested with the same data used to train the model, it appears to perform more accurately than it would on real data in practical applications. Therefore, we focused on good organization and high quality. Previous research used easily accessible smartphone and digital camera images; in this study, however, the dataset was constructed from oral endoscopic images acquired at clinical sites19,20. Poor image quality can affect the analysis of image features, directly lead to incorrect diagnoses, and hinder the development of an AI model; this makes oral endoscopic images difficult to classify. In particular, oral endoscopy performed during treatment has different characteristics depending on the examiner, as no imaging guidelines have been established.
This clinical situation can lead to bias in the dataset. To create a more consistent set of tongue images, the images were collected using a single type of endoscopic equipment. In addition, to improve the quality of the dataset, several head and neck cancer specialists from different institutions were directly involved in the data collection and review process. Data de-identification was conducted, and data verification was performed more than twice. Moreover, testing was conducted by TTA, an external institution. The radiomics approach used in previous studies involves manual ROI segmentation and the extraction of several texture features36. In this study, by contrast, the deep learning network can be trained automatically without ROI segmentation. This offers advantages in terms of reduced training time and annotation labor costs, because the method extracts features directly from the dataset without segmentation or manual processing. We performed preprocessing to eliminate areas other than the critical regions so that the model could easily identify patterns in the image data.
We preprocessed the dataset before developing the AI model. The endoscopic images had different sizes, lighting conditions, and angles. In addition, due to noise from the device itself, some artifact pixels regularly appeared in the oral endoscopic images. Some images also contained textual information, such as dates and written annotations (Figure 1). In addition to standard preprocessing steps, such as scale and exposure adjustment, we developed a new algorithm and applied it in our research. To standardize the images, we proceeded as follows: (1) we created a background image by converting the target image to grayscale; (2) we removed the text from the background image; (3) we blurred the background image with a Gaussian filter, guided by external markers; (4) lesions were detected in the background image; and (5) we cropped the uninformative parts from the original image based on the lesions found in the background image (Appendix 1). All images were then converted to JPEG format, as required by our deep learning framework. Finally, they were resized to 224 × 224 or 300 × 300, according to the input image size required by each model, before the model training process.
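A simplified, NumPy-only sketch of part of this pipeline on a toy image: grayscale conversion (step 1), Gaussian blurring (step 3), and cropping away uninformative margins (step 5). Text removal, lesion detection, and the actual resizing library are omitted, and the threshold and kernel values here are illustrative assumptions rather than the study's parameters:

```python
import numpy as np

def to_grayscale(img):
    # step (1): RGB -> grayscale with standard luminosity weights
    return img @ np.array([0.299, 0.587, 0.114])

def gaussian_blur_rows(gray, sigma=1.0, radius=2):
    # step (3): blur with a small 1-D Gaussian kernel, applied per row
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-xs**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, gray)

def crop_to_content(gray, thresh=10.0):
    # step (5): crop away near-black margins around the informative region
    mask = gray > thresh
    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    return gray[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# toy "endoscopic" frame: a bright 4x4 patch inside a black 10x10 border
img = np.zeros((10, 10, 3))
img[3:7, 3:7] = 200.0

gray = to_grayscale(img)
blurred = gaussian_blur_rows(gray)
cropped = crop_to_content(gray)
print(cropped.shape)  # (4, 4)
```

In practice an image library such as OpenCV or Pillow would handle the blurring, JPEG conversion, and resizing to the 224 × 224 or 300 × 300 model inputs; the sketch only shows the logical order of the steps.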
The current study had several limitations. First, the developed model cannot accurately diagnose benign diseases among tongue lesions, such as leukoplakia and ulcers. In future research, we plan to develop a model that can differentiate benign from malignant tumors by classifying lesions into three categories: normal, benign, and malignant. Second, the characteristics of the oral endoscopic images used in this study differed from those of conventional CT and MRI images. These data have a high degree of freedom and reflect the habits of the endoscope user, yielding atypical and non-standardized images. We used various data processing methods to compensate for these shortcomings. It would be useful to establish endoscopic imaging guidelines when collecting data for future studies. Third, developing a cancer diagnosis model using only endoscopic images is inherently limited. In future studies, higher-quality diagnostic models are expected if images are combined with other clinical data. Fourth, several medical institutions participated in this study, which resulted in differences between institutions in data volume, descriptive characteristics, and the ratio of malignant to normal cases (Table 1). In this study, data preprocessing was performed to correct for this. In future research, the ratio and amount of data should be distributed evenly across the participating institutions. Finally, lesion regions were not localized in this study. In future work, we plan to collect additional data on lesions and use it to develop an AI model that identifies suspected lesions with heat maps generated by Grad-CAM.
In conclusion, we constructed a quality-controlled dataset of oral endoscopic images from several medical institutions. The deep learning model based on this dataset showed acceptable performance for use in the diagnosis of tongue cancer. Compared with human readers, it showed lower diagnostic performance than oncology specialists and higher diagnostic performance than general practitioners. Therefore, the developed algorithm can be used as an adjunct for general practitioners to improve cancer diagnosis and screening in clinical settings.