Enhancements and some bug fixes after initial feedback. #57

kathyxchen · 2018-11-27T20:20:25Z

This PR is in response to recent feedback we've received (and to what we've learned from our own continued use of Selene). It will address a number of different issues that each require relatively small changes (I will mention these issues in follow-up comments). I may also include a large documentation update with information about configuration files in this PR.

kathyxchen · 2018-11-27T21:07:58Z

- Addressed issue "TrainModel class has an incorrectly named variable - prevents loading a model checkpoint." (#56) here.
- Addressed issue "Make loading a model architecture/optimizer/loss from file more robust" (#54) here.
- Addressed issue "Avoid keeping test set in memory unnecessarily" (#53) here and here.
- Addressed issue "update MatFileSampler to avoid loading entire HDF5 file into memory" (#52) here.

kathyxchen · 2018-12-03T14:57:45Z

@ksenia007 - for some reason, I can't request a review from you (looks like I can only request reviewers within the FunctionLab organization/contributors to Selene). Could you still go ahead and take a look at this document? No need to spend too long on it (I know it is very long). Thanks very much :)

ksenia007

Overall I really like this. It is well written, easy to understand and concise. I tried to add some notes along the way about the parts I either found confusing or would've liked to see some more info about :)

ksenia007 · 2018-12-04T19:05:50Z

docs/source/overview/cli.md

+        - [Matrix file sampler](#Matrix-file-sampler)
+
+## Overview
+Selene's CLI accepts configuration files that are composed of 4 main (high-level) groups:


It would be nice to have some more info about the config file format, for example, "CLI accepts configuration files in [YAML] (link to outside(?) page about it) format".

ksenia007 · 2018-12-04T19:10:26Z

docs/source/overview/cli.md

+### Expected input class and methods
+There are two possible formats you can use to do this:
+
+- single Python file: We expect that most people will start using Selene with model architectures in this format. In this case, you implement your architecture as a class and include 2 static methods, `criterion` and `get_optimizer` in the same file. 


I found that explanation a little confusing. Perhaps it would be nice to add a link to an example file right here? I think just linking https://github.com/FunctionLab/selene/blob/master/models/deepsea.py might be helpful
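
To make the single-file format discussed here more concrete, the following is a minimal sketch of loading such a file by path. It is not Selene's loader: the file name `toy_model.py`, the class `ToyModel`, and the helper `load_from_file` are hypothetical, and the toy `criterion` and `get_optimizer` return plain Python values instead of PyTorch objects so that the example runs with the standard library alone. The quoted documentation calls the two methods "static methods"; defining them as plain functions in the same file is an assumption about the intended layout.

```python
import importlib.util
import tempfile
import textwrap
from pathlib import Path

# Contents of a hypothetical single-file architecture. A real file would
# define a torch.nn.Module subclass and return PyTorch objects.
TOY_MODEL_SOURCE = textwrap.dedent('''
    class ToyModel:
        def __init__(self, sequence_length, n_targets):
            self.sequence_length = sequence_length
            self.n_targets = n_targets

    def criterion():
        return "binary_cross_entropy"  # stand-in for a loss object

    def get_optimizer(lr):
        return ("sgd", {"lr": lr})  # stand-in for (optimizer class, kwargs)
''')


def load_from_file(path, class_name):
    """Import `path` as a module and return (class, criterion, get_optimizer)."""
    path = Path(path)
    spec = importlib.util.spec_from_file_location(path.stem, str(path))
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Fail early with a readable message if the file is incomplete.
    missing = [name for name in (class_name, "criterion", "get_optimizer")
               if not hasattr(module, name)]
    if missing:
        raise AttributeError(f"{path} is missing: {', '.join(missing)}")
    return getattr(module, class_name), module.criterion, module.get_optimizer


with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "toy_model.py"
    model_path.write_text(TOY_MODEL_SOURCE)
    model_class, criterion, get_optimizer = load_from_file(model_path, "ToyModel")
    model = model_class(sequence_length=100, n_targets=3)
    loss_name = criterion()
    optimizer_name, optimizer_kwargs = get_optimizer(0.01)
```

Checking for all three names before returning means an incomplete architecture file fails with one clear message rather than partway through training setup.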

ksenia007 · 2018-12-04T19:25:56Z

docs/source/overview/cli.md

+- `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None`
+   - and the data sampler you use is of type `selene_sdk.samplers.OnlineSampler`, we will by default retrieve 32000 validation samples.
+   - and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file.
+- `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None`


Since the sampler info comes later, it is a bit confusing how the test sampler will work. Maybe add a link that would bring the reader to the bottom of the page?
I guess I would also add here that if you do not have a test partition, you will train on everything? (if that is true)

ksenia007 · 2018-12-04T19:28:59Z

docs/source/overview/cli.md

+   - and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file.
+- `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None`
+    - and the sampler you specified has no test partition, it is assumed that you will not be evaluating your trained model using the `evaluate` method available in `selene_sdk.TrainModel`. (i.e. `evaluate` should not be one of the operations in the `ops` list)
+    - and the sampler you use is of type `selene_sdk.samplers.OnlineSampler` (and the test partition exists), we will retrieve 640000 test samples.


I could not find OnlineSampler on this page, only in the documentation. I think if you mention it here it might be a good idea to add some info in the sampler section too.
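
The defaults described in the quoted lines can be summarized as a small decision function. This is only a sketch of the rules as quoted: `resolve_n_samples` and the string labels `"online"`, `"multifile"`, `"all"`, and `"skip"` are hypothetical names, not part of `selene_sdk`, and cases the excerpt does not state are deliberately left unimplemented.

```python
# Defaults quoted in the docs for selene_sdk.samplers.OnlineSampler.
ONLINE_SAMPLER_DEFAULTS = {"validate": 32000, "test": 640000}


def resolve_n_samples(sampler_kind, mode, requested=None, has_test_partition=True):
    """Return an int, "all", or "skip" for the given sampler and partition.

    Hypothetical helper; only the cases stated in the quoted docs are encoded.
    """
    if mode not in ("validate", "test"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "test" and not has_test_partition:
        # The excerpt states this for the `None` case; applying it whenever
        # the partition is missing is an assumption of this sketch.
        return "skip"  # `evaluate` should not be in `ops`
    if requested is not None:
        return requested
    if sampler_kind == "online":
        return ONLINE_SAMPLER_DEFAULTS[mode]
    if sampler_kind == "multifile" and mode == "validate":
        return "all"  # use every validation sample in the data file
    raise NotImplementedError(
        f"default for {sampler_kind!r} in mode {mode!r} is not covered by the excerpt")


online_validation = resolve_n_samples("online", "validate")
online_test = resolve_n_samples("online", "test")
multifile_validation = resolve_n_samples("multifile", "validate")
no_test_partition = resolve_n_samples("online", "test", has_test_partition=False)
```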

ksenia007 · 2018-12-04T19:32:05Z

docs/source/overview/cli.md

+- `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None`
+   - and the data sampler you use is of type `selene_sdk.samplers.OnlineSampler`, we will by default retrieve 32000 validation samples.
+   - and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file.
+- `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None`


Also, in the documentation, you have different combinations of the samplers. Is there a reason for it? http://selene.flatironinstitute.org/selene.html#trainmodel

What do you mean by different combinations of samplers? (I've updated the TrainModel docs in this PR to match what I've written in cli.md, so I might have addressed this already)

In the documentation's explanation of `n_test_samples`, you mention None + IntervalsSampler, None + RandomSampler, and None + MatFileSampler. Here you explain what happens with None + OnlineSampler and None + MultiFileSampler.

ksenia007 · 2018-12-04T19:34:46Z

docs/source/overview/cli.md

+ - `checkpoint_resume`: Default is `None`. If not `None`, you should pass in the path to a model weights file generated by `torch.save` (and can now be read by `torch.load`) to resume training. 
+
+#### Additional notes
+An important thing to observe about the contents of `train_model`: the [documentation for the `TrainModel` class](http://selene.flatironinstitute.org/selene.html#trainmodel) in `selene_sdk` shows that we are missing a number of arguments needed to instantiate the class.


Might need to rephrase this. For example, 'attentive readers might have noticed that in the documentation there are more arguments that are required to instantiate the class. This is because they are assumed to be carried through/retrieved from other parts to ensure consistency...'

ksenia007 · 2018-12-04T19:39:43Z

docs/source/overview/cli.md

+- `logits` (log-fold change scores): The difference between `logit(alt)` and `logit(ref)` predictions.
+You'll find examples of how this is specified in the [variant effect prediction](#Variant-effect-prediction) and [_in silico_ mutagenesis](#In-silico-mutagenesis) sections.
+
+In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key, which we will explain generally here. The other one is dependent on which of the 3 sub-operations you use and will be explained in the appropriate subsection below. 


I would remove the 'explain generally here' and just proceed without noting when you will talk about them. You cover them consecutively, so I do not think it would cause confusion.

Suggested change

-In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key, which we will explain generally here. The other one is dependent on which of the 3 sub-operations you use and will be explained in the appropriate subsection below.
+In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key and the other one is dependent on which of the 3 sub-operations you use - `prediction`, `variant_effect_prediction` or `in_silico_mutagenesis`.
+### Analyze sequences
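
As a worked illustration of the `logits` score quoted in this hunk (the difference between `logit(alt)` and `logit(ref)` predictions), here is a minimal sketch using only the standard library. The function names and the example probabilities are assumptions chosen for illustration; they are not taken from `selene_sdk`.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def logit_difference(ref_prediction, alt_prediction):
    """Score a variant as logit(alt) - logit(ref)."""
    return logit(alt_prediction) - logit(ref_prediction)


# Hypothetical predictions for the reference and alternate alleles.
score = logit_difference(ref_prediction=0.2, alt_prediction=0.5)
```

With these inputs the score is `log(4)`, about 1.386: the alternate allele raises the predicted odds fourfold, from 0.25 to 1.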

ksenia007 · 2018-12-04T19:43:17Z

docs/source/overview/cli.md

+- `shuffle`: Optional, default is `True`. Shuffle the order of the samples in the matrix before sampling from it.
+- `sequence_batch_axis`: Optional, default is 0. Specify the batch axis for the sequences matrix.
+- `sequence_alphabet_axis`: Optional, default is 1. Specify the alphabet axis.
+- `targets_batch_axis`: Optional, default is 0. Specify the batch axis for the targets matrix.


At the very end, as a reader, I would like to see an expanded YAML file with all those parts that were just explained being used. I know you have a link in the beginning, but having it here would make it easier to comprehend as a whole
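
To make the axis options quoted above concrete, the sketch below interprets the shape of a sequences matrix under different `sequence_batch_axis` and `sequence_alphabet_axis` settings. It assumes a 3-D matrix whose remaining axis is the sequence length; that assumption, the helper name `describe_sequence_matrix`, and the example shapes are illustrative and not taken from the sampler's code.

```python
def describe_sequence_matrix(shape, sequence_batch_axis=0, sequence_alphabet_axis=1):
    """Interpret a 3-D sequences matrix shape using the two axis settings."""
    axes = {sequence_batch_axis, sequence_alphabet_axis}
    if len(shape) != 3:
        raise ValueError("expected a 3-D shape")
    if len(axes) != 2 or not axes <= {0, 1, 2}:
        raise ValueError("axes must be two different values from 0, 1, 2")
    # The one axis left over is taken to be the sequence-length axis.
    (length_axis,) = {0, 1, 2} - axes
    return {
        "n_samples": shape[sequence_batch_axis],
        "alphabet_size": shape[sequence_alphabet_axis],
        "sequence_length": shape[length_axis],
    }


# Default layout: (batch, alphabet, length).
default_layout = describe_sequence_matrix((2500, 4, 1000))

# The same data stored as (length, alphabet, batch) needs a different batch axis.
transposed_layout = describe_sequence_matrix(
    (1000, 4, 2500), sequence_batch_axis=2, sequence_alphabet_axis=1)
```

Both calls describe 2500 samples over a 4-letter alphabet with sequences of length 1000; only the storage order differs.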

ksenia007 · 2018-12-05T19:00:00Z

docs/source/overview/cli.md

+ops: [train, evaluate, analyze]
+```
+The `ops` key expects one or more of `[train, evaluate, analyze]` to be specified as a list. In addition to the general & model architecture configurations described in the next 2 sections, each of these operations has an expected set of configurations:
+- `train`: `train_model` (see [Train](#Train)) and `sampler` (see [Samplers used for training](#Samplers-used-for-training))


For some reason, the links don't work for Analyze or Samplers

Thanks! I have gone through and tested all the links
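
The relationship between `ops` and the configuration keys each operation expects can be sketched as a small check. This is an illustration, not Selene's configuration parser: `check_config` is a hypothetical helper, only the requirements quoted in this thread are encoded (nothing is assumed for `evaluate`), and reading "2 configuration keys" as "exactly one sub-operation key" is an interpretation.

```python
ALLOWED_OPS = ("train", "evaluate", "analyze")

# Required top-level keys per operation, limited to what the thread states.
REQUIRED_KEYS = {
    "train": ("train_model", "sampler"),
    "analyze": ("analyze_sequences",),
}
ANALYZE_SUB_OPERATIONS = (
    "prediction", "variant_effect_prediction", "in_silico_mutagenesis")


def check_config(config):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    ops = config.get("ops") or []
    if not ops:
        problems.append("`ops` must list at least one operation")
    for op in ops:
        if op not in ALLOWED_OPS:
            problems.append(f"unknown op: {op}")
            continue
        for key in REQUIRED_KEYS.get(op, ()):
            if key not in config:
                problems.append(f"`{op}` requires `{key}`")
    # Assumption: exactly one of the three sub-operation keys accompanies `analyze`.
    if "analyze" in ops and sum(k in config for k in ANALYZE_SUB_OPERATIONS) != 1:
        problems.append("`analyze` requires exactly one sub-operation key")
    return problems


# A config dict as it might look after parsing the YAML file.
train_problems = check_config({"ops": ["train"], "train_model": {}})
```

Here `train_problems` reports the missing `sampler` key, since `train` is listed in `ops` but only `train_model` is configured.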

kathyxchen · 2018-12-07T19:47:14Z

This PR addresses #55 with cli.md

kathyxchen · 2018-12-10T19:38:29Z

This PR is ready to be merged. I will do at least 1 follow-up PR to add in links to the new CLI documentation, update some tutorials/manuscript cases after running Selene with the updated code, fix any additional issues found along the way, etc.

kathyxchen added 8 commits November 26, 2018 15:45

update the configuration file to take in a model arch path

d0b9d7e

documentation update

de7d2d4

update matrix file sampler to avoid loading in a large matrix file

df96b01

add stdout logger

e6613b6

Merge branch 'master' into revisions-1

c3a4baf

Merge remote-tracking branch 'upstream/master'

bb7b8be

Merge branch 'master' into revisions-1

06bbd4c

add more description to case READMEs

e5142cc

kathyxchen added 11 commits November 27, 2018 17:13

update config examples based on updated code/documentation

997a97b

add comments to config examples

83d0a29

update docs FAQ and overview pages

d8f838d

update log information for train/eval and documentation for train

2bdbe0b

update the column name for test_performance.txt

557b003

adjust wording on hyperparam opt

64b7ba2

typo fix in overview

db95cf7

update README with more description about tutorials vs case studies

58d27ad

add configuration file doc

4331978

fix typos in config examples

df7e34c

update manuscript case study configs

ce63d18

update trainmodel to save model file in 2 diff formats

ed6848f

kathyxchen requested a review from ksenia007 December 3, 2018 19:18

kathyxchen added 3 commits December 3, 2018 16:02

update READMEs with note about data processing, typo fix

3f161de

update script to create TF intervals file

b56773e

remove saving whole model to file for now

801b0b4

ksenia007 reviewed Dec 4, 2018

kathyxchen added 3 commits December 4, 2018 14:59

bug fixes to matfilesampler

a205d8d

updated config for matfilesampler based on code update

68a2abd

update how model archs are loaded from file

e3d6215

clarify analyze sequences

bd85ab2

ksenia007 reviewed Dec 5, 2018

kathyxchen added 4 commits December 6, 2018 11:12

fix links in CLI doc

c3a98fa

bugfix in matfilesampler and docs update in faq

1861f4c

update getting started YAML file

6593ee8

update the getting started tutorial with new logging output and documentation

2fbbfba

kathyxchen added 17 commits December 7, 2018 15:40

update train model with a save_new_checkpoints_after_n_steps parameter

584bf92

update get_data_and_targets order of arguments

0e1e8c4

update documentation on when the test dataset is loaded and add a param to control it

9a7e9b5

update quickstart training tutorial

08596fc

resolve trainmodel merge conflict

40c7880

update documentation with parameter

0cfd109

update mpra example with new logging outputs

5ad9da2

update getting started with some symlinks

c7d45c8

update deeperdeepsea as the actual model file vs symlink

38c5033

remove TF intervals file

11741c2

finalize getting started tutorial

e06c8bf

remove instantiate param from tutorials

f1c4e25

update configs and documentation

07bf0e5

update documentation for tutorials and CLI docs

3b3fd24

small bugfixes

710e7ed

small update to log statement

d335e66

link update in tutorials README

2a9ede7

kathyxchen changed the title ~~[WIP] Enhancements and some bug fixes after initial feedback.~~ Enhancements and some bug fixes after initial feedback. Dec 10, 2018

adjust mode and n_sample handling in online sampler

2a2fc50

kathyxchen merged commit 7b0e702 into FunctionLab:master Dec 10, 2018

kathyxchen mentioned this pull request Dec 11, 2018

Update remaining manuscript and tutorial configurations with new configuration file parsing #60

Merged

kathyxchen deleted the revisions-1 branch September 20, 2020 18:11

