Merged
Commits
54 commits
d0b9d7e
update the configuration file to take in a model arch path
kathyxchen Nov 26, 2018
de7d2d4
documentation update
kathyxchen Nov 26, 2018
df96b01
update matrix file sampler to avoid loading in a large matrix file
kathyxchen Nov 26, 2018
e6613b6
add stdout logger
kathyxchen Nov 26, 2018
c3a4baf
Merge branch 'master' into revisions-1
kathyxchen Nov 26, 2018
bb7b8be
Merge remote-tracking branch 'upstream/master'
kathyxchen Nov 26, 2018
06bbd4c
Merge branch 'master' into revisions-1
kathyxchen Nov 26, 2018
e5142cc
add more description to case READMEs
kathyxchen Nov 27, 2018
997a97b
update config examples based on updated code/documentation
kathyxchen Nov 27, 2018
83d0a29
add comments to config examples
kathyxchen Nov 28, 2018
d8f838d
update docs FAQ and overview pages
kathyxchen Nov 30, 2018
2bdbe0b
update log information for train/eval and documentation for train
kathyxchen Nov 30, 2018
557b003
update the column name for test_performance.txt
kathyxchen Nov 30, 2018
64b7ba2
adjust wording on hyperparam opt
kathyxchen Dec 1, 2018
db95cf7
typo fix in overview
kathyxchen Dec 1, 2018
58d27ad
update README with more description about tutorials vs case studies
kathyxchen Dec 1, 2018
4331978
add configuration file doc
kathyxchen Dec 1, 2018
df7e34c
fix typos in config examples
kathyxchen Dec 2, 2018
ce63d18
update manuscript case study configs
kathyxchen Dec 2, 2018
ed6848f
update trainmodel to save model file in 2 diff formats
kathyxchen Dec 3, 2018
3f161de
update READMEs with note about data processing, typo fix
kathyxchen Dec 3, 2018
b56773e
update script to create TF intervals file
kathyxchen Dec 4, 2018
801b0b4
remove saving whole model to file for now
kathyxchen Dec 4, 2018
a205d8d
bug fixes to matfilesampler
kathyxchen Dec 4, 2018
68a2abd
updated config for matfilesampler based on code update
kathyxchen Dec 4, 2018
e3d6215
update how model archs are loaded from file
kathyxchen Dec 4, 2018
69699e5
remove __exit__ method
kathyxchen Dec 4, 2018
45964de
add README of dependencies for data processing
kathyxchen Dec 4, 2018
450d55d
add favicon
kathyxchen Dec 4, 2018
6bd7bc6
update CLI based on PR feedback and train model documentation
kathyxchen Dec 5, 2018
d636986
clarify analyze sequences line
kathyxchen Dec 5, 2018
bd85ab2
clarify analyze sequences
kathyxchen Dec 5, 2018
c3a98fa
fix links in CLI doc
kathyxchen Dec 6, 2018
1861f4c
bugfix in matfilesampler and docs update in faq
kathyxchen Dec 6, 2018
6593ee8
update getting started YAML file
kathyxchen Dec 7, 2018
2fbbfba
update the getting started tutorial with new logging output and docum…
kathyxchen Dec 7, 2018
584bf92
update train model with a save_new_checkpoints_after_n_steps parameter
kathyxchen Dec 7, 2018
0e1e8c4
update get_data_and_targets order of arguments
kathyxchen Dec 7, 2018
9a7e9b5
update documentation on when the test dataset is loaded and add a par…
kathyxchen Dec 8, 2018
08596fc
update quickstart training tutorial
kathyxchen Dec 9, 2018
40c7880
resolve trainmodel merge conflict
kathyxchen Dec 9, 2018
0cfd109
update documentation with parameter
kathyxchen Dec 9, 2018
5ad9da2
update mpra example with new logging outputs
kathyxchen Dec 9, 2018
c7d45c8
update getting started with some symlinks
kathyxchen Dec 9, 2018
38c5033
update deeperdeepsea as the actual model file vs symlink
kathyxchen Dec 9, 2018
11741c2
remove TF intervals file
kathyxchen Dec 9, 2018
e06c8bf
finalize getting started tutorial
kathyxchen Dec 10, 2018
f1c4e25
remove instantiate param from tutorials
kathyxchen Dec 10, 2018
07bf0e5
update configs and documentation
kathyxchen Dec 10, 2018
3b3fd24
update documentation for tutorials and CLI docs
kathyxchen Dec 10, 2018
710e7ed
small bugfixes
kathyxchen Dec 10, 2018
d335e66
small update to log statement
kathyxchen Dec 10, 2018
2a9ede7
link update in tutorials README
kathyxchen Dec 10, 2018
2a2fc50
adjust mode and n_sample handling in online sampler
kathyxchen Dec 10, 2018
25 changes: 19 additions & 6 deletions README.md
@@ -2,11 +2,11 @@

---

You have found Selene, a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.
Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.

## Installation

Selene is a Python 3+ package. We recommend using it with Python 3.6 or above.
We recommend using Selene with Python 3.6 or above.
Package installation should only take a few minutes (typically ~2-3, and at most ~10) with any of these methods (pip, conda, source).

### Installing selene with [Anaconda](https://www.anaconda.com/download/) (for Linux):
@@ -56,16 +56,28 @@ For a more detailed overview of the components in the Selene software developmen
## Documentation

The documentation for Selene is available [here](https://selene.flatironinstitute.org/).
If you are interested in running Selene through the command-line interface (CLI), [this document](https://selene.flatironinstitute.org/overview/cli.html) describes how the configuration file format (used by the CLI) works and details all the possible configuration parameters you may need to build your own configuration file.
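As a rough sketch of the format that document describes (keys and `!obj:` tags below mirror the `config_examples` files changed in this PR; all paths and argument names are placeholders, not defaults):

```yaml
---
# `ops` selects which operations run: any of train, evaluate, analyze.
ops: [train, evaluate]
# `model` tells Selene where the architecture lives and how to construct it.
model: {
    path: /absolute/path/to/model/architecture.py,
    class: ModelArchitectureClassName,
    class_args: {
        arg1: val1,   # constructor arguments for the architecture class
        arg2: val2
    },
    non_strand_specific: mean
}
# Op-specific blocks use `!obj:` tags to instantiate selene_sdk classes:
# a sampler plus `train_model` for training, or `analyze_sequences` plus a
# prediction/variant_effect_prediction/in_silico_mutagenesis block for analysis.
sampler: !obj:selene_sdk.samplers.IntervalsSampler {
    # ... sampler arguments, as in config_examples/train.yml
}
train_model: !obj:selene_sdk.TrainModel {
    # ... training arguments, as in config_examples/train.yml
}
random_seed: 1337
output_dir: /path/to/output_dir
...
```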

## Examples

In general, we recommend that the manuscript case studies and the tutorials be run on a machine with a GPU. All examples take significantly longer when run on a CPU machine.
We provide 2 sets of examples: Jupyter notebook tutorials and case studies that we've described in our manuscript.
The Jupyter notebooks are more accessible in that they can be easily perused and run on a laptop.
We also take the opportunity to show how Selene can be used through the CLI (via configuration files) as well as through the API.
Finally, the notebooks are particularly useful for demonstrating various visualization components that Selene contains.
The API, along with the visualization functions, is much less emphasized in the manuscript's case studies.

In the case studies, we demonstrate more complex use cases (e.g. training on much larger datasets) that we could not present in a Jupyter notebook.
Further, we show how you can use the outputs of variant effect prediction in a subsequent statistical analysis (case 3).
These examples reflect how we most often use Selene in our own projects, whereas the Jupyter notebooks survey the many different ways and contexts in which we can use Selene.

In general, we recommend that the examples be run on a machine with a CUDA-enabled GPU. All examples take significantly longer when run on a CPU machine.
(See the following sections for time estimates.)

### Tutorials

Tutorials for Selene are available [here](https://github.com/FunctionLab/selene/tree/master/tutorials).

It is possible to run the tutorials (Jupyter notebook examples) on a standard CPU machine: the training examples take 2-3 days to run to completion on CPU. You can also change the training parameters (e.g. total number of steps) so that they complete in much less time.

The non-training examples (variant effect prediction, _in silico_ mutagenesis) can be run fairly quickly (variant effect prediction takes about 20-30 minutes, _in silico_ mutagenesis about 10-15 minutes).

@@ -81,7 +81,8 @@ We recommend consulting the step-by-step breakdown of each case study that we pr
The manuscript examples were only tested on GPU.
Our GPU (NVIDIA Tesla V100) time estimates:

- Case study 1 finishes in about 1 day on a GPU node.
- Case study 2 takes 6-7 days to run training (distributed the work across 4 v100s).
- Case study 1 finishes in about 1.5 days on a GPU node.
- Case study 2 takes 6-7 days to run training (work distributed across 4 V100s) and evaluation.
- Case study 3 (variant effect prediction) takes about 1 day to run.

The case studies in the manuscript focus on developing deep learning models for classification tasks. Selene does support training and evaluating sequence-based regression models, and we have provided a [tutorial to demonstrate this](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_mpra_example.ipynb).
19 changes: 9 additions & 10 deletions config_examples/evaluate_test_bed.yml
@@ -1,14 +1,13 @@
---
ops: [evaluate]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
sampler: !obj:selene_sdk.samplers.file_samplers.BedFileSampler {
filepath: /path/to/file.bed, # generated from selene_sdk training (`test_data.bed`)
@@ -18,16 +18,17 @@ sampler: !obj:selene_sdk.samplers.file_samplers.BedFileSampler {
n_samples: n_samples_in_file, # wc -l file.bed
targets_avail: True,
sequence_length: 1000,
n_features: 2 # should match `n_classes_to_predict` in `model`
n_features: 2
}
evaluate_model: !obj:selene_sdk.EvaluateModel {
batch_size: 64,
features: !obj:selene_sdk.utils.load_features_list {
input_path: /path/to/distinct_features.txt
},
trained_model_path: /path/to/trained_model.pth.tar,
use_cuda: True,
batch_size: 64,
report_gt_feature_n_positives: 50,
use_cuda: True,
output_dir: /path/to/output_dir
}
random_seed: 1337
27 changes: 15 additions & 12 deletions config_examples/evaluate_test_mat.yml
@@ -1,29 +1,32 @@
---
ops: [evaluate]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
sampler: !obj:selene_sdk.samplers.file_samplers.MatFileSampler {
filepath: /path/to/test.mat,
sequence_key: testxdata,
targets_key: testdata,
shuffle_file: False
random_seed: 456,
shuffle: False,
sequence_batch_axis: 0,
sequence_alphabet_axis: 1,
targets_batch_axis: 0
}
evaluate_model: !obj:selene_sdk.EvaluateModel {
batch_size: 64,
features: !obj:selene_sdk.utils.load_features_list {
input_path: /path/to/distinct_features.txt
input_path: /path/to/features_list.txt
},
use_cuda: True,
trained_model_path: /path/to/trained/model.pth.tar,
batch_size: 64,
report_gt_feature_n_positives: 50,
trained_model_path: /path/to/trained_model.pth.tar,
use_cuda: True
}
random_seed: 123
output_dir: /path/to/output_dir
16 changes: 8 additions & 8 deletions config_examples/get_predictions.yml
@@ -1,21 +1,21 @@
---
ops: [analyze]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
analyze_sequences: !obj:selene_sdk.predict.AnalyzeSequences {
trained_model_path: /path/to/trained/model.pth.tar,
sequence_length: 1000,
features: !obj:selene_sdk.utils.load_features_list {
input_path: /path/to/distinct_features.txt
},
trained_model_file: /path/to/trained_model.pth.tar,
batch_size: 64,
use_cuda: True
}
prediction: {
20 changes: 10 additions & 10 deletions config_examples/in_silico_mutagenesis.yml
@@ -1,21 +1,21 @@
---
ops: [analyze]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
analyze_sequences: !obj:selene_sdk.predict.AnalyzeSequences {
trained_model_path: /path/to/trained/model.pth.tar,
sequence_length: 1000,
features: !obj:selene_sdk.utils.load_features_list {
    input_path: /path/to/distinct_features.txt
},
trained_model_path: /path/to/trained/model.pth.tar,
batch_size: 64,
use_cuda: True
}
in_silico_mutagenesis: {
@@ -26,5 +26,5 @@ in_silico_mutagenesis: {
output_dir: /path/to/output_dir,
use_sequence_name: False
}
random_seed: 1000
random_seed: 123
...
30 changes: 17 additions & 13 deletions config_examples/train.yml
@@ -1,32 +1,32 @@
---
ops: [train, evaluate]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
sampler: !obj:selene_sdk.samplers.IntervalsSampler {
reference_sequence: !obj:selene_sdk.sequences.Genome {
input_path: /path/to/genome/hg.fa
input_path: /path/to/reference_sequence.fa,
blacklist_regions: hg19 # only hg19 and hg38, remove if not applicable
},
target_path: /path/to/tabix/indexed/targets.bed.gz,
features: !obj:selene_sdk.utils.load_features_list {
input_path: /path/to/distinct_features.txt
},
target_path: /path/to/tabix/indexed/targets.bed.gz,
intervals_path: /path/to/intervals.txt,
test_holdout: [chr8, chr9],
validation_holdout: [chr6, chr7],
intervals_path: /path/to/intervals.bed,
sample_negative: True, # train on samples with no targets present
seed: 127,
test_holdout: [chr8, chr9], # can also be proportional, e.g. 0.10
validation_holdout: [chr6, chr7], # can also be proportional, e.g. 0.10
sequence_length: 1000,
center_bin_to_predict: 200,
feature_thresholds: 0.5,
mode: train,
sample_negative: True, # train on samples with no targets present
save_datasets: [test]
}
train_model: !obj:selene_sdk.TrainModel {
@@ -38,6 +38,10 @@ train_model: !obj:selene_sdk.TrainModel {
cpu_n_threads: 32,
use_cuda: True,
data_parallel: True, # multiple GPUs
logging_verbosity: 2,
# if resuming training, replace `False` below with the path to the trained
# model weights file created in a previous training run with Selene
checkpoint_resume: False
}
random_seed: 133
output_dir: /path/to/output_dir
25 changes: 13 additions & 12 deletions config_examples/variant_effect_prediction.yml
@@ -1,32 +1,33 @@
---
ops: [analyze]
model: {
file: /absolute/path/to/model/architecture.py,
path: /absolute/path/to/model/architecture.py,
class: ModelArchitectureClassName,
sequence_length: 1000,
n_classes_to_predict: 2,
non_strand_specific: {
use_module: True,
mode: mean
}
class_args: {
arg1: val1,
arg2: val2
},
non_strand_specific: mean
}
analyze_sequences: !obj:selene_sdk.predict.AnalyzeSequences {
trained_model_path: /path/to/trained/model.pth.tar,
sequence_length: 1000,
features: !obj:selene_sdk.utils.load_features_list {
    input_path: /path/to/distinct_features.txt
},
trained_model_path: /path/to/trained/model.pth.tar,
batch_size: 64,
use_cuda: True,
reference_sequence: !obj:selene_sdk.sequences.Genome {
input_path: /path/to/genome/hg.fa
input_path: /path/to/reference_sequence.fa
}
}
variant_effect_prediction: {
vcf_files: [
/path/to/file1.vcf,
/path/to/file2.vcf
],
save_data: [predictions, diffs],
output_dir: /path/to/output_dir
save_data: [predictions, abs_diffs],
output_dir: /path/to/output/predicts/dir
}
random_seed: 123
...
Binary file added docs/source/_static/img/favicon.ico
2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -127,6 +127,8 @@
],
}

html_favicon = "_static/img/favicon.ico"

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
3 changes: 2 additions & 1 deletion docs/source/index.rst
@@ -17,7 +17,8 @@ The Github repository is located `here <https://github.com/FunctionLab/selene>`_
overview/overview
overview/installation
overview/tutorials
overview/extending
overview/cli
overview/faq

.. toctree::
:maxdepth: 1