ALS Progression Prediction Project

ALS (Amyotrophic Lateral Sclerosis) is a fatal motor neuron disease with substantial heterogeneity in both genomics and clinical features. ALS usually progresses rapidly, and survival after onset is short. Here we applied machine learning approaches to predict the rate of progression from the clinical data available for ~5000 patients.

Introduction

We have ~5,000,000 lines of data covering all features, including demographics, clinical trial records and lab test results, for ~5000 ALS patients, provided by PRO-ACT (Pooled Resource Open-Access ALS Clinical Trials Database).
Clinically, Progression Rate (PR) is a very important feature for ALS patients, so our purpose here is to predict PR from the available patient features.

Data Snippet

 89  Demographics    Sex              Male
329  Demographics    Sex              Female
329  Demographics    Race             American Indian
329  Demographics    Race             Asian
329  ALSFRS(R)       ALSFRSDelta      189
329  ALSFRS(R)       ALSFRSTotal      25
329  ALSFRS(R)       ALSFRSDelta      212
329  ALSFRS(R)       ALSFRSTotal      30
329  LaboratoryData  LaboratoryDelta  100
329  LaboratoryData  TestName         Sodium
329  LaboratoryData  TestResult       138
329  LaboratoryData  TestUnit         mmol/L
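
The long, one-value-per-row layout shown here has to be regrouped per patient before any feature can be computed. The following is a minimal sketch of that step, not the project's actual loader. It assumes the fields are tab-separated in the order (patient id, form, feature, value) and that each ALSFRSDelta row is followed by its matching ALSFRSTotal row; `parse_rows` and `group_by_patient` are hypothetical helper names.

```python
from collections import defaultdict

# A few rows from the snippet, written as tab-separated fields (assumed layout).
RAW = """\
89\tDemographics\tSex\tMale
329\tDemographics\tSex\tFemale
329\tALSFRS(R)\tALSFRSDelta\t189
329\tALSFRS(R)\tALSFRSTotal\t25
329\tALSFRS(R)\tALSFRSDelta\t212
329\tALSFRS(R)\tALSFRSTotal\t30
"""


def parse_rows(text):
    """Yield (patient_id, form, feature, value) tuples from tab-separated lines."""
    for line in text.splitlines():
        if not line.strip():
            continue
        patient_id, form, feature, value = (f.strip() for f in line.split("\t"))
        yield int(patient_id), form, feature, value


def group_by_patient(rows):
    """Collect every value per patient and feature, preserving row order."""
    patients = defaultdict(lambda: defaultdict(list))
    for patient_id, _form, feature, value in rows:
        patients[patient_id][feature].append(value)
    return patients


patients = group_by_patient(parse_rows(RAW))

# Pair each ALSFRSDelta with the ALSFRSTotal that follows it (assumed row order).
visits = list(zip(map(int, patients[329]["ALSFRSDelta"]),
                  map(int, patients[329]["ALSFRSTotal"])))
print(visits)  # [(189, 25), (212, 30)]
```

Keeping values in lists per feature preserves repeated measurements, which the dynamic features need later.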

Analysis

I. Overview

Features include static features (e.g. sex, age) and dynamic features (e.g. sodium concentration change over time).

II. Feature Engineering & Data Cleaning

First, calculate our target variable, Progression Rate: change in health score divided by change in time (Delta health score / Delta time).
Convert time-dependent dynamic features into static ones: fit a simple linear regression to each and use the slope k and intercept b as new static features. In case the data points are too few for a fit, we also keep Max() and Min() as new features.
Convert character features into numeric ones (for example, 'Sex').
Merge the multiple dataframes and drop features (columns) containing NaN in >50% of their cells. Fill the remaining NaN values with the median of that column.
We eventually generated a (5372 × 134) dataframe with 5372 patients and 134 features.
Plot for gender and age distribution.
Plot for ALS progression.
Plot for association between features and progression.
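
The feature-engineering steps listed in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the project's code: the progression rate is taken between the first and last visit, `None` stands in for NaN, and the helper names (`progression_rate`, `linear_fit`, `summarize`, `clean`) and all example values are made up.

```python
from statistics import median


def progression_rate(visits):
    """Delta score / delta time between the first and last (time, score) visit."""
    (t0, s0), (t1, s1) = visits[0], visits[-1]
    return None if t1 == t0 else (s1 - s0) / (t1 - t0)


def linear_fit(points):
    """Least-squares fit y = k*t + b; returns (None, None) when a fit is impossible."""
    n = len(points)
    if n < 2:
        return None, None
    mean_t = sum(t for t, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        return None, None
    k = sum((t - mean_t) * (y - mean_y) for t, y in points) / var_t
    return k, mean_y - k * mean_t


def summarize(name, points):
    """Collapse one time series into static features: slope, intercept, max, min."""
    k, b = linear_fit(points)
    values = [y for _, y in points]
    return {
        f"{name}_k": k,
        f"{name}_b": b,
        f"{name}_max": max(values) if values else None,
        f"{name}_min": min(values) if values else None,
    }


def clean(rows, max_missing=0.5):
    """Drop columns missing in more than max_missing of rows, median-fill the rest."""
    names = sorted({name for row in rows for name in row})
    fill = {}
    for name in names:
        present = [row[name] for row in rows if row.get(name) is not None]
        missing = len(rows) - len(present)
        if present and missing <= max_missing * len(rows):
            fill[name] = median(present)
    return [
        {name: fill[name] if row.get(name) is None else row[name] for name in fill}
        for row in rows
    ]


# Made-up example values, not PRO-ACT data.
print(progression_rate([(0, 40), (100, 30)]))  # -0.1
sodium = summarize("sodium", [(0, 138.0), (50, 140.0), (100, 142.0)])
rows = [
    {"sodium_k": 1.0, "rare": None},
    {"sodium_k": None, "rare": None},
    {"sodium_k": 3.0, "rare": 5.0},
]
cleaned = clean(rows)  # "rare" is dropped, the missing "sodium_k" becomes 2.0
```

Keeping max and min alongside k and b means a patient with a single measurement still contributes usable values even though no line can be fitted.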

III. Model Selection: Random Forest Regression

Feature correlation

Some features are highly correlated, for example sodium and chloride concentration, and also ALT (SGPT) and AST (SGOT), two aminotransferase enzymes. Some correlated pairs are less expected and therefore interesting, such as platelet count and pulse.
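
One way to surface such pairs is to compute the Pearson correlation for every pair of feature columns and keep those above a cutoff. The sketch below assumes a cutoff of 0.8 and uses made-up values; `pearson` and `correlated_pairs` are hypothetical helpers, not functions from the project.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy / sqrt(sxx * syy)


def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for pairs with |r| >= threshold, strongest first."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)


# Made-up values for five patients, not PRO-ACT data.
columns = {
    "sodium": [138.0, 140.0, 142.0, 136.0, 141.0],
    "chloride": [101.0, 103.0, 105.0, 100.0, 104.0],
    "pulse": [70.0, 82.0, 64.0, 75.0, 90.0],
}
pairs = correlated_pairs(columns)
for name_a, name_b, r in pairs:
    print(f"{name_a} ~ {name_b}: r = {r:.2f}")
```

With these toy values only the sodium and chloride pair passes the cutoff.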

RandomForestRegressor

We applied RandomForestRegressor to the training data. In this model, the most important predictive features include onset delta, systolic blood pressure, pulse, sodium and creatine kinase.
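
A ranking like this can be read from the fitted model's impurity-based `feature_importances_`. The sketch below uses scikit-learn on synthetic data. The feature labels, the target function, and the hyperparameters (`n_estimators=200`, `random_state=0`) are assumptions for illustration, since the settings actually used are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical labels on synthetic columns, not the project's 134 features.
feature_names = ["onset_delta", "systolic_bp", "pulse", "noise"]
X = rng.normal(size=(300, len(feature_names)))

# Synthetic target: depends non-linearly on the first two columns only.
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Impurity-based importances are normalized to sum to 1 across features.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Impurity-based importances can be spread across strongly correlated features, so rankings involving such pairs deserve a cautious reading.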

Prediction on cross-validation and test data

The correlation coefficients between predicted and real Progression Rate are 0.46 for the cross-validation data and 0.65 for the test data.
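
The two figures correspond to two evaluation settings: out-of-fold predictions on the training data, and predictions on a held-out test set. A minimal scikit-learn sketch of both is given below, run on synthetic data. The 75/25 split, 5 folds, and model settings are assumptions, so the printed values are unrelated to the 0.46 and 0.65 reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict, train_test_split

rng = np.random.default_rng(1)

# Synthetic features and target, standing in for the patient dataframe.
X = rng.normal(size=(400, 5))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=200, random_state=0)

# Out-of-fold predictions: each training row is predicted by a model
# that was fitted without seeing that row.
cv_pred = cross_val_predict(model, X_train, y_train, cv=5)
r_cv = np.corrcoef(y_train, cv_pred)[0, 1]

# Held-out evaluation: fit on all training rows, predict the test rows.
model.fit(X_train, y_train)
r_test = np.corrcoef(y_test, model.predict(X_test))[0, 1]

print(f"cross-validation r = {r_cv:.2f}, test r = {r_test:.2f}")
```

Pearson r measures how well predictions track the true values linearly, and it does not penalize a constant offset or scale error, so an error metric such as mean absolute error is a useful companion.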

IV. Summary and Looking Forward

Clinical data, especially for rare diseases like ALS, are extremely noisy and have many missing values.
Random Forest is well suited to studying non-linear features in high-dimensional data.
The model opens the door to new predictive features such as blood pressure, pulse and creatine kinase.
Such predictions could help reduce the number of clinical trials for ALS patients.
