-
Notifications
You must be signed in to change notification settings - Fork 91
Enhancements and some bug fixes after initial feedback. #57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
@ksenia007 - for some reason, I can't request a review from you (looks like I can only request reviewers within the FunctionLab organization/contributors to Selene). Could you still go ahead and take a look at this document? No need to spend too long on it (I know it is very long). Thanks very much :) |
ksenia007
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I really like this. It is well written, easy to understand and concise. I tried to add some notes along the way about the parts I either found confusing or would've liked to see some more info about :)
docs/source/overview/cli.md
Outdated
| - [Matrix file sampler](#Matrix-file-sampler) | ||
|
|
||
| ## Overview | ||
| Selene's CLI accepts configuration files that are composed of 4 main (high-level) groups: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will be nice to have some more info about the config files format, for example, "CLI accepts configuration files in [YAML] (link to outside(?) page about it) format".
docs/source/overview/cli.md
Outdated
| ### Expected input class and methods | ||
| There are two possible formats you can use to do this: | ||
|
|
||
| - single Python file: We expect that most people will start using Selene with model architectures in this format. In this case, you implement your architecture as a class and include 2 static methods, `criterion` and `get_optimizer` in the same file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that explanation a little confusing. Perhaps it would be nice to add a link to an example file right here? I think just linking https://github.com/FunctionLab/selene/blob/master/models/deepsea.py might be helpful
docs/source/overview/cli.md
Outdated
| - `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None` | ||
| - and the data sampler you use is of type `selene_sdk.samplers.OnlineSampler`, we will by default retrieve 32000 validation samples. | ||
| - and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file. | ||
| - `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you have sampler info later, it is a bit confusing on how the test sampler will work. Maybe add a link that would bring the reader to the bottom of the page?
I guess I would add here that if you do not have a test partition, you will train on everything? (if that is true)
docs/source/overview/cli.md
Outdated
| - and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file. | ||
| - `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None` | ||
| - and the sampler you specified has no test partition, it is assumed that you will not be evaluating your trained model using the `evaluate` method available in `selene_sdk.TrainModel`. (i.e. `evaluate` should not be one of the operations in the `ops` list) | ||
| - and the sampler you use is of type `selene_sdk.samplers.OnlineSampler` (and the test partition exists), we will retrieve 640000 test samples. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could not find OnlineSampler on this page, only in the documentation. I think if you mention it here it might be a good idea to add some info in the sampler section too.
docs/source/overview/cli.md
Outdated
| - `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None` | ||
| - and the data sampler you use is of type `selene_sdk.samplers.OnlineSampler`, we will by default retrieve 32000 validation samples. | ||
| - and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file. | ||
| - `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, in the documentation, you have different combinations of the samplers. Is there a reason for it? http://selene.flatironinstitute.org/selene.html#trainmodel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by different combinations of samplers? (I've updated the TrainModel docs in this PR to match what I've written in cli.md, so I might have addressed this already)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In explanation to what is n_test_samples in the documentation you mention None+ IntervalsSampler, None+RandomSampler, None+MatFileSampler. Here you explain what happens with None+OnlineSampler and None+MultiFileSampler
docs/source/overview/cli.md
Outdated
| - `checkpoint_resume`: Default is `None`. If not `None`, you should pass in the path to a model weights file generated by `torch.save` (and can now be read by `torch.load`) to resume training. | ||
|
|
||
| #### Additional notes | ||
| An important thing to observe about the contents of `train_model`: the [documentation for the `TrainModel` class](http://selene.flatironinstitute.org/selene.html#trainmodel) in `selene_sdk` shows that we are missing a number of arguments needed to instantiate the class. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might need to rephrase this. For example, 'attentive readers might have noticed that in the documentation there are more arguments that are required to instantiate the class. This is because they are assumed to be carried through/retrieved from other parts to ensure consistency...'
docs/source/overview/cli.md
Outdated
| - `logits` (log-fold change scores): The difference between `logit(alt)` and `logit(ref)` predictions. | ||
| You'll find examples of how this is specified in the [variant effect prediction](#Variant-effect-prediction) and [_in silico_ mutagenesis](#In-silico-mutagenesis) sections. | ||
|
|
||
| In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key, which we will explain generally here. The other one is dependent on which of the 3 sub-operations you use and will be explained in the appropriate subsection below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the 'explain generally here' and just proceed without noting when you will talk about them. You are talking about those in a consecutive manner, so I do not think it would cause a confusion.
| In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key, which we will explain generally here. The other one is dependent on which of the 3 sub-operations you use and will be explained in the appropriate subsection below. | |
| In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key and the other one is dependent on which of the 3 sub-operations you use - `prediction`, `variant_effect_prediction` or `in_silico_mutagenesis`. | |
| ### Analyze sequences | |
| - `shuffle`: Optional, default is `True`. Shuffle the order of the samples in the matrix before sampling from it. | ||
| - `sequence_batch_axis`: Optional, default is 0. Specify the batch axis for the sequences matrix. | ||
| - `sequence_alphabet_axis`: Optional, default is 1. Specify the alphabet axis. | ||
| - `targets_batch_axis`: Optional, default is 0. Specify the batch axis for the targets matrix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the very end, as a reader, I would like to see an expanded YAML file with all those parts that were just explained being used. I know you have a link in the beginning, but having it here would make it easier to comprehend as a whole
docs/source/overview/cli.md
Outdated
| ops: [train, evaluate, analyze] | ||
| ``` | ||
| The `ops` key expects one or more of `[train, evaluate, analyze]` to be specified as a list. In addition to the general & model architecture configurations described in the next 2 sections, each of these operations has an expected set of configurations: | ||
| - `train`: `train_model` (see [Train](#Train)) and `sampler` (see [Samplers used for training](#Samplers-used-for-training)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By some reason, the links don't work for Analyze or Samplers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I have gone through and tested all the links
|
This PR is ready to be merged. I will do at least 1 follow-up PR to add in links to the new CLI documentation, update some tutorials/manuscript cases after running Selene with the updated code, fix any additional issues found along the way, etc. |
This PR is in response to recent feedback we've received (and from our own continued use of Selene). It will address a number of different issues that each require relatively small changes (I will mention these issues in follow-up comments). I may also include a large documentation update with information about configuration files in this PR.