Prediction Model

I built this prediction model because it connects my interests in biology and computer science. Programming can be more than a way to solve logic problems; it can also become a scientific tool for modeling biological patterns.

Cancer research now produces large sequencing datasets, and these datasets make it possible to train programs that search for disease-related signals. This project explores how machine learning can classify colon adenocarcinoma (COAD) tumor and normal samples from gene expression data in The Cancer Genome Atlas (TCGA).

Process

From 100% Accuracy To A Cleaner TCGA-Only Model

My first model combined TCGA data with UCSC Xena data and reported 100% accuracy. That looked impressive, but it made me trust the model less: a perfect score could mean the model had learned differences between the two data sources instead of real cancer biology.
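
One way to see why mixing sources is risky is to cross-tabulate data source against label. The sketch below uses made-up metadata records; the field names `source` and `label`, the source names, and the counts are all hypothetical, and nothing here describes how the real samples were split between TCGA and UCSC Xena. It only illustrates the general check: if every source contains a single label, a model can score perfectly by recognizing the source.

```python
from collections import Counter

def source_label_table(samples):
    """Count samples for every (source, label) pair."""
    return Counter((s["source"], s["label"]) for s in samples)

def label_predictable_from_source(samples):
    """Return True if every source contains only one label.

    In that case, knowing the source alone is enough to get every
    label right, so a perfect score says little about biology.
    """
    labels_per_source = {}
    for s in samples:
        labels_per_source.setdefault(s["source"], set()).add(s["label"])
    return all(len(labels) == 1 for labels in labels_per_source.values())

# Made-up metadata for illustration only (not the project's sample sheet).
confounded = (
    [{"source": "source_A", "label": "tumor"}] * 5
    + [{"source": "source_B", "label": "normal"}] * 5
)
# Same records plus two normals from source_A, which breaks the confound.
mixed = confounded + [{"source": "source_A", "label": "normal"}] * 2

for name, samples in (("confounded", confounded), ("mixed", mixed)):
    print(name, dict(source_label_table(samples)))
    print("  label predictable from source:",
          label_predictable_from_source(samples))
```

A table like this does not prove a batch effect exists, but a fully confounded table means a high score cannot rule one out.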

After that, I chose to use only TCGA COAD data. This made the comparison cleaner, but it also created a new limitation: the TCGA-only normal group has only 41 samples, compared with 471 tumor samples.
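
The imbalance matters because of simple arithmetic. The sketch below uses only the two counts quoted above (471 tumor, 41 normal) and asks how a "model" that always answers "tumor" would score. The variable names are my own, and this is a baseline calculation, not a result from the trained models.

```python
# Class counts quoted in the text above.
n_tumor = 471
n_normal = 41
n_total = n_tumor + n_normal

# A "model" that ignores gene expression and always answers "tumor".
majority_accuracy = n_tumor / n_total

# Balanced accuracy averages the recall of each class:
# tumor recall is 1.0 (every tumor is called tumor),
# normal recall is 0.0 (no normal is ever called normal).
majority_balanced_accuracy = (1.0 + 0.0) / 2

imbalance_ratio = n_tumor / n_normal

print(f"total samples:          {n_total}")
print(f"tumor-to-normal ratio:  {imbalance_ratio:.1f} to 1")
print(f"always-tumor accuracy:  {majority_accuracy:.3f}")
print(f"always-tumor balanced:  {majority_balanced_accuracy:.3f}")
```

Always guessing "tumor" already reaches about 92% accuracy (471 out of 512), so any accuracy figure has to be read against that baseline.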

Terms Used Here

Machine learning: Training a program to find patterns in data.
Batch effect: A technical difference between samples, unrelated to biology, that arises when data are produced in different labs or processed by different pipelines.
Overfitting: When a model fits the training data so closely that it may perform poorly on new data.

Dataset

TCGA-Only Model Dataset

These numbers describe the dataset used for the model on this page. They should not be mixed with the RNA expression report on the separate COAD Data page.

Results

Three Baseline Models Compared Together

GitHub: train.py

Metrics, confusion matrices, and important features are shown together so the three machine learning models can be compared on one page.

Perfect or near-perfect scores are not automatic proof that a model is ready for diagnosis. High scores can happen when the test set is small, the task is too easy, or hidden data-source differences exist.
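
The small-test-set point can be made concrete with a confidence interval. The sketch below computes a Wilson score interval, a standard interval for a proportion, for a perfect score on test sets of different sizes. The sizes 8, 50, and 500 are hypothetical and are not taken from this project; the function name is my own.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    )
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical test-set sizes, every sample classified correctly.
for n in (8, 50, 500):
    low, high = wilson_interval(n, n)
    print(f"{n}/{n} correct -> plausible range {low:.3f} to {high:.3f}")
```

With only 8 test samples, a perfect score is still compatible with a true rate of roughly 68%, which is why a perfect result on a handful of normal samples should be read cautiously.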

Limitations

What I Can Improve Next Time

Add more normal samples from carefully matched sources.
Test the model on an external validation dataset.
Check batch effects more carefully before trusting high scores.
Compare important genes with biological literature.
Report balanced accuracy, normal recall, and confusion matrices, not accuracy alone.
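
The last item can be sketched directly from a confusion matrix. The function below is a minimal illustration, not the code in train.py, and the counts passed to it are invented to show how accuracy and normal recall can disagree; they are not project results.

```python
def binary_metrics(tp, fn, fp, tn):
    """Metrics for a tumor-vs-normal classifier, with tumor as the positive class.

    tp: tumors called tumor      fn: tumors called normal
    fp: normals called tumor     tn: normals called normal
    """
    total = tp + fn + fp + tn
    tumor_recall = tp / (tp + fn) if (tp + fn) else 0.0
    normal_recall = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "tumor_recall": tumor_recall,
        "normal_recall": normal_recall,
        "balanced_accuracy": (tumor_recall + normal_recall) / 2,
    }

# Hypothetical test-set counts, invented for illustration (not project results).
example = binary_metrics(tp=94, fn=0, fp=4, tn=4)
for name, value in example.items():
    print(f"{name:>18}: {value:.3f}")
```

In this invented case accuracy is about 96% even though half of the normal samples are misclassified; normal recall (0.5) and balanced accuracy (0.75) make that visible.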