Hepatocellular carcinoma (HCC) is a major disease that seriously threatens human health. Early screening can significantly improve the five-year survival rate of HCC patients. Cell-free DNA (cfDNA), a potential carrier of cancer signals in body fluids, can be used for early cancer detection. However, current cfDNA sequencing methods for early HCC detection require deep sequencing data, which limits their use in routine disease screening. We propose CLHCC, a foundational DNA language model that analyzes DNA sequences and methylation patterns to detect HCC at low sequencing depths.
CLHCC randomly selects 1500 DNA fragments from HCC-specific differentially methylated regions identified by cd-score. It then one-hot encodes these fragments and feeds the encoded data into a CNN combined with an LSTM neural network for classification.
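As an illustration of the sampling and encoding step, the following is a minimal sketch rather than the CLHCC implementation. The function names `one_hot` and `sample_and_encode`, the A/C/G/T channel order, the all-zero encoding of ambiguous bases, and the fixed random seed are assumptions made for the example; the abstract does not specify fragment length, how ambiguous bases are handled, or how methylation state enters the encoding.

```python
import random

BASES = "ACGT"  # assumed channel order; the abstract does not specify one


def one_hot(seq):
    """Encode a DNA string as one 4-element row per base.

    Bases outside A/C/G/T (for example N) become an all-zero row,
    which is an assumption of this sketch.
    """
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        idx = BASES.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


def sample_and_encode(fragments, n_fragments=1500, seed=0):
    """Randomly pick up to n_fragments fragments and one-hot encode each."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    k = min(n_fragments, len(fragments))
    chosen = rng.sample(fragments, k)
    return [one_hot(fragment) for fragment in chosen]
```

The encoded fragments would then be padded or truncated to a common length before being passed to a convolutional layer followed by a recurrent layer; those details are not given in the abstract and are left out here.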
We tested CLHCC on 2139 target-BS data samples, achieving an accuracy of 84.59% (precision: 83.44%, recall: 81%) under 10-fold cross-validation. This performance exceeds that of DNA language models built using a CNN or an LSTM alone.
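For readers less familiar with the reported metrics, the sketch below shows how accuracy, precision, and recall follow from confusion-matrix counts, and how samples can be split into 10 folds. It is an illustrative sketch only: the helper names `classification_metrics` and `k_fold_indices` are hypothetical, the shuffled round-robin split is an assumption (the abstract does not say whether folds were stratified), and no counts from the study are reproduced.

```python
import random


def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once and dealt round-robin into k folds, so fold
    sizes differ by at most one. Stratification is not performed.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

In a cross-validated evaluation, the metrics would be computed on each held-out fold and then averaged; with 2139 samples and 10 folds, each held-out fold contains 213 or 214 samples.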
Our study applies deep learning to analyze DNA sequences in specific methylation regions without the need for complex alignment processes. This provides new theoretical and practical guidance for clinical applications and holds promise for non-invasive early HCC screening via cfDNA.