I built this prediction model because it connects my interests in biology and computer science.
Programming can be more than a way to solve logic problems; it can also become a scientific tool
for modeling biological patterns.
Cancer research now produces large sequencing datasets, which make it possible to train programs
that search for disease-related signals. This project explores how machine learning can distinguish
colon adenocarcinoma (COAD) tumor samples from normal samples using gene expression data.
Process
From 100% Accuracy To A Cleaner TCGA-Only Model
My first model combined TCGA data with UCSC Xena data and scored 100% accuracy.
Although this looked impressive, it made the model less trustworthy: with samples drawn from two
sources, the model might have learned dataset-source differences instead of real cancer biology.
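One quick way to see this risk is to cross-tabulate sample source against class label. The sketch below is only an illustration: the sample records, the `source` and `label` keys, and the source names "A" and "B" are all hypothetical and are not the project's real sample sheet. It flags the worst case, where every source supplies only one class, so a model can score perfectly just by recognizing the source.

```python
from collections import Counter

def source_label_table(samples):
    """Count samples for every (source, label) pair."""
    return Counter((s["source"], s["label"]) for s in samples)

def is_fully_confounded(samples):
    """True if each source holds a single label while the data has several labels.

    In that situation, knowing the source is enough to know the label.
    """
    labels_per_source = {}
    for s in samples:
        labels_per_source.setdefault(s["source"], set()).add(s["label"])
    all_labels = set().union(*labels_per_source.values())
    one_label_each = all(len(v) == 1 for v in labels_per_source.values())
    return len(all_labels) > 1 and one_label_each

# Hypothetical records: tumors only from source "A", normals only from source "B".
two_sources = (
    [{"source": "A", "label": "tumor"}] * 3
    + [{"source": "B", "label": "normal"}] * 2
)
# Hypothetical records: both classes from the same source.
one_source = (
    [{"source": "A", "label": "tumor"}] * 3
    + [{"source": "A", "label": "normal"}] * 2
)

print(source_label_table(two_sources))
print("two sources confounded:", is_fully_confounded(two_sources))
print("one source confounded:", is_fully_confounded(one_source))
```

A table that is not fully confounded can still be unbalanced across sources, so this check only catches the most extreme case.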
After that, I chose to use only TCGA COAD data. This made the comparison cleaner, but it also
created a new limitation, class imbalance: the TCGA-only normal group has only 41 samples, compared
with 471 tumor samples, so normal samples make up about 8% of the 512 total.
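To show why this imbalance matters, the short calculation below scores a "model" that ignores the data and always predicts tumor. Only the two sample counts come from this page; the function name and the returned keys are my own choices for illustration.

```python
N_NORMAL = 41   # normal samples, from the text above
N_TUMOR = 471   # tumor samples, from the text above

def always_tumor_baseline(n_normal, n_tumor):
    """Score a classifier that labels every sample as tumor."""
    if n_normal <= 0 or n_tumor <= 0:
        raise ValueError("both classes need at least one sample")
    total = n_normal + n_tumor
    tumor_recall = n_tumor / n_tumor   # every tumor is called tumor
    normal_recall = 0 / n_normal       # no normal is ever called normal
    return {
        "accuracy": n_tumor / total,
        "tumor_recall": tumor_recall,
        "normal_recall": normal_recall,
        "balanced_accuracy": (tumor_recall + normal_recall) / 2,
    }

baseline = always_tumor_baseline(N_NORMAL, N_TUMOR)
for name, value in baseline.items():
    print(f"{name}: {value:.3f}")
```

The accuracy comes out at about 0.92 even though this baseline never recognizes a single normal sample, which is why accuracy alone is a weak score for this dataset.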
Terms Used Here
Machine learning: Training a program to find patterns from data.
Batch effect: A technical difference caused by data being produced in different labs or pipelines.
Overfitting: When a model learns the training data too closely and may not work well on new data.
Dataset
TCGA-Only Model Dataset
These numbers apply to the Model page only. They should not be mixed with the RNA expression report
on the separate COAD Data page.
Metrics, confusion matrices, and important features are shown together so the three machine learning
models can be compared on one page.
Perfect or near-perfect scores are not automatic proof that a model is ready for diagnosis. High scores
can happen when the test set is small, the task is too easy, or hidden data-source differences exist.
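The small-test-set point can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the project itself computed. The test-set sizes in the loop are hypothetical, and `z=1.96` approximates a 95% interval.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# Hypothetical test-set sizes, each with a perfect score (all samples correct).
for n_test in (8, 41, 400):
    low, high = wilson_interval(n_test, n_test)
    print(f"{n_test} of {n_test} correct: plausible range {low:.3f} to {high:.3f}")
```

With only a handful of test samples, a perfect score is still consistent with a noticeably lower true rate, and the range tightens as the test set grows.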
Limitations
What I Can Improve Next Time
Add more normal samples from carefully matched sources.
Test the model on an external validation dataset.
Check batch effects more carefully before trusting high scores.
Compare important genes with biological literature.
Report balanced accuracy, normal recall, and confusion matrices, not accuracy alone.
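As a minimal sketch of the last item, the code below computes a confusion matrix, normal recall, and balanced accuracy from plain label lists. The label strings "tumor" and "normal", the function names, and the toy predictions are assumptions for illustration and are not results from this project.

```python
def confusion_counts(y_true, y_pred, positive="tumor"):
    """Return (tp, fn, fp, tn), treating `positive` as the positive class."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn

def summarize(y_true, y_pred):
    """Accuracy plus the per-class recalls that accuracy can hide."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    tumor_recall = tp / (tp + fn)
    normal_recall = tn / (tn + fp)
    return {
        "confusion": {"tp": tp, "fn": fn, "fp": fp, "tn": tn},
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "tumor_recall": tumor_recall,
        "normal_recall": normal_recall,
        "balanced_accuracy": (tumor_recall + normal_recall) / 2,
    }

# Toy labels (made up): 10 tumors all correct, 1 of 2 normals misclassified.
y_true = ["tumor"] * 10 + ["normal"] * 2
y_pred = ["tumor"] * 10 + ["tumor", "normal"]

report = summarize(y_true, y_pred)
for name, value in report.items():
    print(name, value)
```

In this toy case accuracy stays above 0.9 while normal recall is only 0.5, which is the gap the extra metrics are meant to expose. The function assumes both classes appear in `y_true`; otherwise a recall would divide by zero.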