Amyotrophic lateral sclerosis (ALS) is a fatal motor neuron disease with substantial heterogeneity in both genomics and clinical features. ALS usually progresses rapidly, and survival after onset is short. Here we apply machine learning approaches to predict the rate of progression from the available clinical data for ~5000 patients.
We have ~5,000,000 lines of data covering all features, including demographics, clinical trial records, and lab test results, for ~5000 ALS patients, provided by PRO-ACT (Pooled Resource Open-Access ALS Clinical Trials Database).
Clinically, the progression rate (PR) is a very important feature for ALS patients, so our goal here is to predict PR from the available patient features.
89 Demographics Sex Male
329 Demographics Sex Female
329 Demographics Race American Indian
329 Demographics Race Asian
329 ALSFRS(R) ALSFRSDelta 189
329 ALSFRS(R) ALSFRSTotal 25
329 ALSFRS(R) ALSFRSDelta 212
329 ALSFRS(R) ALSFRSTotal 30
329 LaboratoryData LaboratoryDelta 100
329 LaboratoryData TestName Sodium
329 LaboratoryData TestResult 138
329 LaboratoryData TestUnit mmol/L

Features include static features (sex, age) and dynamic features (sodium concentration change over time).
-
First, calculate the target variable: progression rate (Delta health score / Delta time).
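As a rough illustration of this step, the sketch below computes a progression rate per patient with pandas. The column names (`subject_id`, `delta_days`, `alsfrs_total`), the toy values, and the choice of the first and last visits as the two time points are all assumptions made for the example, not details taken from the PRO-ACT files or from our pipeline.

```python
import pandas as pd

# Toy ALSFRS visits, one row per visit (values are made up for illustration).
alsfrs = pd.DataFrame({
    "subject_id": [329, 329, 329, 533, 533],
    "delta_days": [0, 189, 212, 0, 100],    # days since the reference date
    "alsfrs_total": [36, 25, 30, 40, 35],   # health score at that visit
})

def progression_rate(visits: pd.DataFrame) -> float:
    """(last score - first score) / (last day - first day) for one patient."""
    visits = visits.sort_values("delta_days")
    d_score = visits["alsfrs_total"].iloc[-1] - visits["alsfrs_total"].iloc[0]
    d_time = visits["delta_days"].iloc[-1] - visits["delta_days"].iloc[0]
    # Patients with a single visit (or no elapsed time) get NaN.
    return d_score / d_time if d_time > 0 else float("nan")

# One progression rate per patient, indexed by subject_id.
pr = pd.Series(
    {sid: progression_rate(group) for sid, group in alsfrs.groupby("subject_id")},
    name="progression_rate",
)
```

A negative value means the score declined over the observed interval.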
-
Convert time-dependent dynamic features into static ones: fit a simple linear regression to each feature over time and use the slope k and intercept b as new static features. In case there are too few data points for a fit, we also keep Max() and Min() as new features.
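A minimal sketch of this conversion, assuming NumPy's `polyfit` for the straight-line fit. The helper name `summarize_series`, the `min_points` threshold, and the sodium values are hypothetical and only illustrate the idea.

```python
import numpy as np

def summarize_series(t, y, min_points=2):
    """Reduce one time series to static features: slope k, intercept b, max, min."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    feats = {"k": np.nan, "b": np.nan, "max": np.nan, "min": np.nan}
    if y.size:
        # Max and min are available even when a fit is not possible.
        feats["max"] = y.max()
        feats["min"] = y.min()
    # A line needs at least two points at distinct times.
    if y.size >= min_points and t.max() - t.min() > 0:
        k, b = np.polyfit(t, y, deg=1)
        feats["k"], feats["b"] = k, b
    return feats

# Toy sodium measurements (mmol/L) at days 0, 100 and 200.
sodium_feats = summarize_series([0, 100, 200], [138, 139, 140])

# A single measurement cannot be fitted, so only max and min are filled in.
single = summarize_series([100], [138])
```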
-
Convert character features into numeric ones (for example, 'Sex').
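One possible way to do this with pandas is sketched below. The specific codes (Male = 0, Female = 1) and the use of one-hot columns for 'Race' are assumptions for the example; the encoding actually used is not specified here.

```python
import pandas as pd

# Toy demographics table (made-up rows).
demo = pd.DataFrame({
    "subject_id": [89, 329, 533],
    "Sex": ["Male", "Female", "Male"],
    "Race": ["Asian", "American Indian", "Asian"],
})

# Binary feature: map each label to an integer code.
demo["Sex"] = demo["Sex"].map({"Male": 0, "Female": 1})

# Multi-category feature: one 0/1 indicator column per category.
demo = pd.get_dummies(demo, columns=["Race"], dtype=int)
```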
-
Merge the multiple dataframes and drop features (columns) with NaN in >50% of their cells. Remaining NaN values are filled with the median of their column.
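The merge, filter, and imputation could look roughly like the sketch below. The three small tables, their column names, and the outer join on `subject_id` are assumptions for illustration.

```python
from functools import reduce

import numpy as np
import pandas as pd

# Toy per-patient feature tables (made-up values).
demo = pd.DataFrame({"subject_id": [1, 2, 3, 4], "age": [55.0, 61.0, np.nan, 48.0]})
labs = pd.DataFrame({"subject_id": [1, 2, 3, 4], "sodium_k": [0.01, np.nan, np.nan, np.nan]})
vitals = pd.DataFrame({"subject_id": [1, 2, 3, 4], "pulse_max": [80.0, 72.0, 90.0, np.nan]})

# Merge all tables on the patient identifier.
merged = reduce(
    lambda left, right: left.merge(right, on="subject_id", how="outer"),
    [demo, labs, vitals],
).set_index("subject_id")

# Keep only columns where at most 50% of the cells are NaN.
keep = merged.columns[merged.isna().mean() <= 0.5]
clean = merged[keep]

# Fill the remaining NaN cells with the median of each column.
clean = clean.fillna(clean.median())
```

Here `sodium_k` is dropped because 75% of its cells are missing, while the single gaps in `age` and `pulse_max` are filled with the column medians.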
-
We eventually generated a (5372 * 134) dataframe with 5372 patients and 134 features.
-
Plot for gender and age distribution.
-
Plot for ALS progression.
-
Plot for association between features and progression.
-
Feature correlation
Some features are highly correlated, for example sodium and chloride concentration, and ALT (SGPT) and AST (SGOT), two aminotransferase enzymes. Some correlated pairs are interesting, such as platelet count and pulse.
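A sketch of how strongly correlated feature pairs can be listed from a feature dataframe. The data here is synthetic (chloride is generated from sodium on purpose), and the 0.8 cutoff is an arbitrary assumption, so the output says nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
sodium = rng.normal(140, 3, 200)
features = pd.DataFrame({
    "sodium": sodium,
    "chloride": sodium - 37 + rng.normal(0, 1, 200),  # built to track sodium
    "pulse": rng.normal(75, 10, 200),                 # independent of both
})

# Pairwise Pearson correlation between all feature columns.
corr = features.corr()

# Keep the upper triangle only, so each pair appears once and the diagonal is excluded.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()

# Pairs whose absolute correlation exceeds the (assumed) threshold.
strong = pairs[pairs.abs() > 0.8]
```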
-
RandomForestRegressor
We trained a RandomForestRegressor on the training data. According to this model, the most important predictive features include onset delta, systolic blood pressure, pulse, sodium, and creatine kinase.
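For reference, feature importances can be read from a fitted scikit-learn `RandomForestRegressor` as sketched below. The data is synthetic, the feature names are borrowed only as labels, and the hyperparameters (`n_estimators=200`, `random_state=0`) are assumptions rather than the settings used in this project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["onset_delta", "pulse", "sodium", "noise"],
)
# Synthetic target: mostly driven by onset_delta, with a non-linear pulse term.
y_train = 2.0 * X_train["onset_delta"] + 0.5 * X_train["pulse"] ** 2 + rng.normal(0, 0.1, 300)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importances, one per feature, sorted from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
```

These impurity-based importances sum to 1 and rank features by how much they reduce prediction error inside the trees; they do not indicate the direction of an effect.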
-
Prediction on cross-validation and test data
The correlation coefficients between predicted and actual progression rate are 0.46 and 0.65 for the cross-validation data and the test data, respectively.
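The sketch below shows one way to obtain both correlations with scikit-learn: out-of-fold predictions from `cross_val_predict` on the training split, and ordinary predictions on a held-out test split. The synthetic data, the 80/20 split, and 5-fold cross-validation are assumptions for the example, so the resulting numbers are unrelated to the 0.46 and 0.65 reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.5, 300)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Out-of-fold predictions: each training sample is predicted by a model
# that did not see it during fitting.
cv_pred = cross_val_predict(model, X_train, y_train, cv=5)
r_cv = np.corrcoef(y_train, cv_pred)[0, 1]

# Fit on the full training split, then evaluate on the held-out test split.
model.fit(X_train, y_train)
r_test = np.corrcoef(y_test, model.predict(X_test))[0, 1]
```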
- Clinical data, especially for rare diseases like ALS, are extremely noisy, with many missing values.
- Random forests are well suited to studying non-linear features in high-dimensional data.
- The results open the door to new predictive features such as blood pressure, pulse, and creatine kinase.
- This could help reduce the number of clinical trials needed for ALS patients.






