Merged
Commits
54 commits
d0b9d7e
update the configuration file to take in a model arch path
kathyxchen Nov 26, 2018
de7d2d4
documentation update
kathyxchen Nov 26, 2018
df96b01
update matrix file sampler to avoid loading in a large matrix file
kathyxchen Nov 26, 2018
e6613b6
add stdout logger
kathyxchen Nov 26, 2018
c3a4baf
Merge branch 'master' into revisions-1
kathyxchen Nov 26, 2018
bb7b8be
Merge remote-tracking branch 'upstream/master'
kathyxchen Nov 26, 2018
06bbd4c
Merge branch 'master' into revisions-1
kathyxchen Nov 26, 2018
e5142cc
add more description to case READMEs
kathyxchen Nov 27, 2018
997a97b
update config examples based on updated code/documentation
kathyxchen Nov 27, 2018
83d0a29
add comments to config examples
kathyxchen Nov 28, 2018
d8f838d
update docs FAQ and overview pages
kathyxchen Nov 30, 2018
2bdbe0b
update log information for train/eval and documentation for train
kathyxchen Nov 30, 2018
557b003
update the column name for test_performance.txt
kathyxchen Nov 30, 2018
64b7ba2
adjust wording on hyperparam opt
kathyxchen Dec 1, 2018
db95cf7
typo fix in overview
kathyxchen Dec 1, 2018
58d27ad
update README with more description about tutorials vs case studies
kathyxchen Dec 1, 2018
4331978
add configuration file doc
kathyxchen Dec 1, 2018
df7e34c
fix typos in config examples
kathyxchen Dec 2, 2018
ce63d18
update manuscript case study configs
kathyxchen Dec 2, 2018
ed6848f
update trainmodel to save model file in 2 diff formats
kathyxchen Dec 3, 2018
3f161de
update READMEs with note about data processing, typo fix
kathyxchen Dec 3, 2018
b56773e
update script to create TF intervals file
kathyxchen Dec 4, 2018
801b0b4
remove saving whole model to file for now
kathyxchen Dec 4, 2018
a205d8d
bug fixes to matfilesampler
kathyxchen Dec 4, 2018
68a2abd
updated config for matfilesampler based on code update
kathyxchen Dec 4, 2018
e3d6215
update how model archs are loaded from file
kathyxchen Dec 4, 2018
69699e5
remove __exit__ method
kathyxchen Dec 4, 2018
45964de
add README of dependencies for data processing
kathyxchen Dec 4, 2018
450d55d
add favicon
kathyxchen Dec 4, 2018
6bd7bc6
update CLI based on PR feedback and train model documentation
kathyxchen Dec 5, 2018
d636986
clarify analyze sequences line
kathyxchen Dec 5, 2018
bd85ab2
clarify analyze sequences
kathyxchen Dec 5, 2018
c3a98fa
fix links in CLI doc
kathyxchen Dec 6, 2018
1861f4c
bugfix in matfilesampler and docs update in faq
kathyxchen Dec 6, 2018
6593ee8
update getting started YAML file
kathyxchen Dec 7, 2018
2fbbfba
update the getting started tutorial with new logging output and docum…
kathyxchen Dec 7, 2018
584bf92
update train model with a save_new_checkpoints_after_n_steps parameter
kathyxchen Dec 7, 2018
0e1e8c4
update get_data_and_targets order of arguments
kathyxchen Dec 7, 2018
9a7e9b5
update documentation on when the test dataset is loaded and add a par…
kathyxchen Dec 8, 2018
08596fc
update quickstart training tutorial
kathyxchen Dec 9, 2018
40c7880
resolve trainmodel merge conflict
kathyxchen Dec 9, 2018
0cfd109
update documentation with parameter
kathyxchen Dec 9, 2018
5ad9da2
update mpra example with new logging outputs
kathyxchen Dec 9, 2018
c7d45c8
update getting started with some symlinks
kathyxchen Dec 9, 2018
38c5033
update deeperdeepsea as the actual model file vs symlink
kathyxchen Dec 9, 2018
11741c2
remove TF intervals file
kathyxchen Dec 9, 2018
e06c8bf
finalize getting started tutorial
kathyxchen Dec 10, 2018
f1c4e25
remove instantiate param from tutorials
kathyxchen Dec 10, 2018
07bf0e5
update configs and documentation
kathyxchen Dec 10, 2018
3b3fd24
update documentation for tutorials and CLI docs
kathyxchen Dec 10, 2018
710e7ed
small bugfixes
kathyxchen Dec 10, 2018
d335e66
small update to log statement
kathyxchen Dec 10, 2018
2a9ede7
link update in tutorials README
kathyxchen Dec 10, 2018
2a2fc50
adjust mode and n_sample handling in online sampler
kathyxchen Dec 10, 2018
25 changes: 19 additions & 6 deletions README.md
@@ -2,11 +2,11 @@

---

You have found Selene, a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.
Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.

## Installation

Selene is a Python 3+ package. We recommend using it with Python 3.6 or above.
We recommend using Selene with Python 3.6 or above.
Package installation should only take a few minutes (typically ~2-3, and at most ~10) with any of these methods (pip, conda, source).

### Installing selene with [Anaconda](https://www.anaconda.com/download/) (for Linux):
@@ -56,16 +56,28 @@ For a more detailed overview of the components in the Selene software developmen
## Documentation

The documentation for Selene is available [here](https://selene.flatironinstitute.org/).
If you are interested in running Selene through the command-line interface (CLI), [this document](https://selene.flatironinstitute.org/overview/cli.html) describes how the configuration file format (used by the CLI) works and details all the possible configuration parameters you may need to build your own configuration file.
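As a rough sketch of the format that document describes (keys and `!obj:` tags below mirror the `config_examples` files changed in this PR; all paths and argument names are placeholders, not defaults):

```yaml
---
# `ops` selects which operations run: any of train, evaluate, analyze.
ops: [train, evaluate]
# `model` tells Selene where the architecture lives and how to construct it.
model: {
    path: /absolute/path/to/model/architecture.py,
    class: ModelArchitectureClassName,
    class_args: {
        arg1: val1,   # constructor arguments for the architecture class
        arg2: val2
    },
    non_strand_specific: mean
}
# Op-specific blocks use `!obj:` tags to instantiate selene_sdk classes:
# a sampler plus `train_model` for training, or `analyze_sequences` plus a
# prediction/variant_effect_prediction/in_silico_mutagenesis block for analysis.
sampler: !obj:selene_sdk.samplers.IntervalsSampler {
    # ... sampler arguments, as in config_examples/train.yml
}
train_model: !obj:selene_sdk.TrainModel {
    # ... training arguments, as in config_examples/train.yml
}
random_seed: 1337
output_dir: /path/to/output_dir
...
```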

## Examples

In general, we recommend that the manuscript case studies and the tutorials be run on a machine with a GPU. All examples take significantly longer when run on a CPU machine.
We provide 2 sets of examples: Jupyter notebook tutorials and case studies that we've described in our manuscript.
The Jupyter notebooks are more accessible in that they can be easily perused and run on a laptop.
We also take the opportunity to show how Selene can be used through the CLI (via configuration files) as well as through the API.
Finally, the notebooks are particularly useful for demonstrating various visualization components that Selene contains.
The API, along with the visualization functions, is much less emphasized in the manuscript's case studies.

In the case studies, we demonstrate more complex use cases (e.g. training on much larger datasets) that we could not present in a Jupyter notebook.
Further, we show how you can use the outputs of variant effect prediction in a subsequent statistical analysis (case 3).
These examples reflect how we most often use Selene in our own projects, whereas the Jupyter notebooks survey the many different ways and contexts in which we can use Selene.

In general, we recommend that the examples be run on a machine with a CUDA-enabled GPU. All examples take significantly longer when run on a CPU machine.
(See the following sections for time estimates.)

### Tutorials

Tutorials for Selene are available [here](https://github.com/FunctionLab/selene/tree/master/tutorials).

It is possible to run the tutorials (Jupyter notebook examples) on a standard CPU machine: the training examples take 2-3 days to run to completion on CPU. You can also change the training parameters (e.g. total number of steps) so that they complete in much less time.

The non-training examples (variant effect prediction, _in silico_ mutagenesis) can be run fairly quickly (variant effect prediction takes about 20-30 minutes, _in silico_ mutagenesis about 10-15 minutes).

@@ -81,7 +81,8 @@ We recommend consulting the step-by-step breakdown of each case study that we pr
The manuscript examples were only tested on GPU.
Our GPU (NVIDIA Tesla V100) time estimates:

- Case study 1 finishes in about 1 day on a GPU node.
- Case study 2 takes 6-7 days to run training (distributed the work across 4 v100s).
- Case study 1 finishes in about 1.5 days on a GPU node.
- Case study 2 takes 6-7 days to run training (work distributed across 4 V100s) and evaluation.
- Case study 3 (variant effect prediction) takes about 1 day to run.

The case studies in the manuscript focus on developing deep learning models for classification tasks. Selene does support training and evaluating sequence-based regression models, and we have provided a [tutorial to demonstrate this](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_mpra_example.ipynb).
19 changes: 9 additions & 10 deletions config_examples/evaluate_test_bed.yml
@@ -1,14 +1,13 @@
---
ops: [evaluate]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
sampler: !obj:selene_sdk.samplers.file_samplers.BedFileSampler {
filepath: /path/to/file.bed, # generated from selene_sdk training (`test_data.bed`)
@@ -18,16 +18,17 @@ sampler: !obj:selene_sdk.samplers.file_samplers.BedFileSampler {
n_samples: n_samples_in_file, # wc -l file.bed
targets_avail: True,
sequence_length: 1000,
n_features: 2 # should match `n_classes_to_predict` in `model`
n_features: 2
}
evaluate_model: !obj:selene_sdk.EvaluateModel {
batch_size: 64,
features: !obj:selene_sdk.utils.load_features_list {
input_path: /path/to/distinct_features.txt
},
trained_model_path: /path/to/trained_model.pth.tar,
use_cuda: True,
batch_size: 64,
report_gt_feature_n_positives: 50,
use_cuda: True,
output_dir: /path/to/output_dir
}
random_seed: 1337
27 changes: 15 additions & 12 deletions config_examples/evaluate_test_mat.yml
@@ -1,29 +1,32 @@
---
ops: [evaluate]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
sampler: !obj:selene_sdk.samplers.file_samplers.MatFileSampler {
filepath: /path/to/test.mat,
sequence_key: testxdata,
targets_key: testdata,
shuffle_file: False
random_seed: 456,
shuffle: False,
sequence_batch_axis: 0,
sequence_alphabet_axis: 1,
targets_batch_axis: 0
}
evaluate_model: !obj:selene_sdk.EvaluateModel {
batch_size: 64,
features: !obj:selene_sdk.utils.load_features_list {
input_path: /path/to/distinct_features.txt
input_path: /path/to/features_list.txt
},
use_cuda: True,
trained_model_path: /path/to/trained/model.pth.tar,
batch_size: 64,
report_gt_feature_n_positives: 50,
trained_model_path: /path/to/trained_model.pth.tar,
use_cuda: True
}
random_seed: 123
output_dir: /path/to/output_dir
16 changes: 8 additions & 8 deletions config_examples/get_predictions.yml
@@ -1,21 +1,21 @@
---
ops: [analyze]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
analyze_sequences: !obj:selene_sdk.predict.AnalyzeSequences {
trained_model_path: /path/to/trained/model.pth.tar,
sequence_length: 1000,
features: !obj:selene_sdk.utils.load_features_list {
input_path: /path/to/distinct_features.txt
},
trained_model_file: /path/to/trained_model.pth.tar,
batch_size: 64,
use_cuda: True
}
prediction: {
20 changes: 10 additions & 10 deletions config_examples/in_silico_mutagenesis.yml
@@ -1,21 +1,21 @@
---
ops: [analyze]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
analyze_sequences: !obj:selene_sdk.predict.AnalyzeSequences {
trained_model_path: /path/to/trained/model.pth.tar,
sequence_length: 1000,
features: !obj:selene_sdk.utils.load_features_list {
    input_path: /path/to/distinct_features.txt
},
trained_model_path: /path/to/trained/model.pth.tar,
batch_size: 64,
use_cuda: True
}
in_silico_mutagenesis: {
@@ -26,5 +26,5 @@ in_silico_mutagenesis: {
output_dir: /path/to/output_dir,
use_sequence_name: False
}
random_seed: 1000
random_seed: 123
...
30 changes: 17 additions & 13 deletions config_examples/train.yml
@@ -1,32 +1,32 @@
---
ops: [train, evaluate]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
sampler: !obj:selene_sdk.samplers.IntervalsSampler {
reference_sequence: !obj:selene_sdk.sequences.Genome {
input_path: /path/to/genome/hg.fa
input_path: /path/to/reference_sequence.fa,
blacklist_regions: hg19 # only hg19 and hg38, remove if not applicable
},
target_path: /path/to/tabix/indexed/targets.bed.gz,
features: !obj:selene_sdk.utils.load_features_list {
input_path: /path/to/distinct_features.txt
},
target_path: /path/to/tabix/indexed/targets.bed.gz,
intervals_path: /path/to/intervals.txt,
test_holdout: [chr8, chr9],
validation_holdout: [chr6, chr7],
intervals_path: /path/to/intervals.bed,
sample_negative: True, # train on samples with no targets present
seed: 127,
test_holdout: [chr8, chr9], # can also be proportional, e.g. 0.10
validation_holdout: [chr6, chr7], # can also be proportional, e.g. 0.10
sequence_length: 1000,
center_bin_to_predict: 200,
feature_thresholds: 0.5,
mode: train,
sample_negative: True, # train on samples with no targets present
save_datasets: [test]
}
train_model: !obj:selene_sdk.TrainModel {
@@ -38,6 +38,10 @@ train_model: !obj:selene_sdk.TrainModel {
cpu_n_threads: 32,
use_cuda: True,
data_parallel: True, # multiple GPUs
logging_verbosity: 2,
# if resuming training, replace `False` below with the path to the trained
# model weights file created in a previous training run with Selene
checkpoint_resume: False
}
random_seed: 133
output_dir: /path/to/output_dir
25 changes: 13 additions & 12 deletions config_examples/variant_effect_prediction.yml
@@ -1,32 +1,33 @@
---
ops: [analyze]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
analyze_sequences: !obj:selene_sdk.predict.AnalyzeSequences {
trained_model_path: /path/to/trained/model.pth.tar,
sequence_length: 1000,
features: !obj:selene_sdk.utils.load_features_list {
    input_path: /path/to/distinct_features.txt
},
trained_model_path: /path/to/trained/model.pth.tar,
batch_size: 64,
use_cuda: True,
reference_sequence: !obj:selene_sdk.sequences.Genome {
input_path: /path/to/genome/hg.fa
input_path: /path/to/reference_sequence.fa
}
}
variant_effect_prediction: {
vcf_files: [
/path/to/file1.vcf,
/path/to/file2.vcf
],
save_data: [predictions, diffs],
output_dir: /path/to/output_dir
save_data: [predictions, abs_diffs],
output_dir: /path/to/output/predicts/dir
}
random_seed: 123
...
Binary file added docs/source/_static/img/favicon.ico
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -127,6 +127,8 @@
],
}

html_favicon = "_static/img/favicon.ico"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -17,7 +17,8 @@ The Github repository is located `here <https://github.com/FunctionLab/selene>`_
overview/overview
overview/installation
overview/tutorials
overview/extending
overview/cli
overview/faq

.. toctree::
:maxdepth: 1