Lung Cancer Drug Discovery
Lung cancer remains a leading cause of cancer-related deaths worldwide, with non-small cell lung cancer (NSCLC) being the most prevalent type. Despite advances in treatment, mortality remains high because current medications have limited efficacy and frequently cause adverse side effects. Discovering novel therapeutics that are more effective, have fewer adverse effects, and can overcome drug resistance is therefore crucial for improving treatment options for lung cancer patients.
Abnormal activation of the Epidermal Growth Factor Receptor (EGFR) is a significant factor in the development and progression of lung cancer. EGFR is a protein that promotes cell growth and division. In EGFR-positive lung cancer, a mutation or defect in the EGFR gene keeps the receptor continuously active, causing uncontrolled cellular proliferation and, ultimately, cancer. While chemotherapy remains one of the most effective treatments for cancer, its side effects, such as fatigue, hair loss, and appetite changes, have led to the exploration of alternative therapies. Specifically, various drugs are being tested for their efficacy in inhibiting EGFR activity. To measure how effectively a drug inhibits EGFR, we used the half-maximal inhibitory concentration (IC50), a quantitative measure of the amount of drug required to inhibit a given biological process by 50 percent. A lower IC50 indicates a more potent drug.
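As a rough illustration of what an IC50 value means, the sketch below assumes a simple one-site dose-response model with a Hill slope of 1. The model, the function name `fraction_inhibited`, and the example concentrations are illustrative assumptions and are not taken from our analysis.

```python
def fraction_inhibited(concentration_nm: float, ic50_nm: float) -> float:
    """Fraction of the biological process inhibited at a given drug concentration.

    Assumes a one-site dose-response curve with Hill slope 1 (an illustrative
    simplification): inhibition = c / (c + IC50).
    """
    if concentration_nm < 0 or ic50_nm <= 0:
        raise ValueError("concentration must be >= 0 and IC50 must be > 0")
    return concentration_nm / (concentration_nm + ic50_nm)


# Hypothetical compounds: a potent one (low IC50) and a weak one (high IC50).
potent_ic50_nm = 10.0
weak_ic50_nm = 1000.0

for dose_nm in (1.0, 10.0, 100.0, 1000.0):
    potent = fraction_inhibited(dose_nm, potent_ic50_nm)
    weak = fraction_inhibited(dose_nm, weak_ic50_nm)
    print(f"dose={dose_nm:>7.1f} nM  potent={potent:.2%}  weak={weak:.2%}")
```

Under this model, a dose equal to the IC50 inhibits exactly half of the process, and the compound with the lower IC50 reaches any given level of inhibition at a lower dose.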
Here, my team used the ChEMBL database, a chemical database of bioactive molecules, to obtain the canonical notation (canonical SMILES) of each drug's molecular structure. We then used the PubChem database to convert the canonical notation into a molecular fingerprint, a binary representation of each molecule's structural features. We implemented five machine learning models, namely a Support Vector Machine (SVM) Regressor, a K-Nearest Neighbors (KNN) Regressor, a Random Forest Regressor, an XGBoost Regressor, and a Principal Component Regressor, to predict the IC50 value for a given chemical. Our study aimed to compare the performance of these five models in predicting drug efficacy. The urgent need to improve patient outcomes and lessen the burden of this illness is the driving force behind this drug discovery effort for lung cancer. Such work offers the opportunity for substantial advances in our understanding of lung cancer biology, improved treatment choices, higher survival rates, and better quality of life for patients.
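To make the idea of predicting an activity value from a binary fingerprint concrete, here is a minimal sketch of a k-nearest-neighbors regressor written in plain Python. It is not our actual pipeline: the 8-bit fingerprints, the activity values, and the function names are hypothetical, and Tanimoto similarity is an assumed choice of similarity measure (real PubChem fingerprints are 881 bits long).

```python
from typing import List, Sequence

Fingerprint = Sequence[int]  # one 0/1 entry per structural feature


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: bits set in both / bits set in either."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def knn_predict(query: Fingerprint,
                train_fps: List[Fingerprint],
                train_values: List[float],
                k: int = 3) -> float:
    """Predict a value as the mean over the k most similar training molecules."""
    if k < 1 or len(train_fps) != len(train_values) or not train_fps:
        raise ValueError("need k >= 1 and equally sized, non-empty training data")
    ranked = sorted(zip(train_fps, train_values),
                    key=lambda pair: tanimoto(query, pair[0]),
                    reverse=True)
    nearest = ranked[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Hypothetical toy data: two structural families with different activity values.
train_fps = [
    [1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 0],
]
train_values = [20.0, 30.0, 900.0, 1100.0]  # made-up IC50 values in nM

query = [1, 1, 0, 0, 1, 0, 0, 1]
print(knn_predict(query, train_fps, train_values, k=2))
```

The query shares most of its set bits with the first two training molecules, so the prediction is driven by their values rather than by the structurally unrelated pair.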
Dataset
The ChEMBL and PubChem databases were used to obtain the data for this study. ChEMBL is a curated chemical database of bioactive molecules with drug-like properties. It provides the canonical notation of the molecular structure of each drug and integrates chemical, bioactivity, and genomic data to support the development of novel pharmaceuticals. It contains over 2 million curated bioactivity data entries, sourced from more than 76,000 documents and 1.2 million assays. The data covers 13,000 targets, 1,800 cell lines, and 33,000 indications. The version used for this study was ChEMBL version 26, as of March 25, 2020. In addition, the study uses the PubChem database, which includes data on small molecules as well as larger molecules such as nucleotides, carbohydrates, lipids, peptides, and chemically modified macromolecules.
Results and Analysis
The Random Forest model has the regression line closest to the origin, indicating the best performance. The SVM model ranks second, whereas the Principal Component Regressor performs worst among the five models. Judged by root mean squared error (RMSE) and mean absolute error (MAE), where lower values are better, Random Forest is also the best model.
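For readers unfamiliar with the two error metrics, the sketch below shows how RMSE and MAE can be computed and used to rank models. The actual and predicted values and the model names are hypothetical placeholders, not results from this study.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more heavily."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: the average size of the prediction error."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical held-out values and predictions from two placeholder models.
actual = [5.0, 6.0, 7.0, 8.0]
predictions = {
    "model_a": [5.5, 6.5, 6.5, 7.5],
    "model_b": [4.0, 6.0, 7.0, 10.0],
}

for name, predicted in predictions.items():
    print(f"{name}: RMSE={rmse(actual, predicted):.3f}  MAE={mae(actual, predicted):.3f}")

# Lower error is better, so the best model minimizes the chosen metric.
best = min(predictions, key=lambda name: rmse(actual, predictions[name]))
print("best by RMSE:", best)
```

Because RMSE squares each error before averaging, a model with a few large misses can rank worse on RMSE than on MAE, which is why it is useful to report both.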
End Note:
For the detailed analysis, please visit the following sites.