Skip to content

Conversation

@kathyxchen
Copy link
Collaborator

This PR is in response to recent feedback we've received (and from our own continued use of Selene). It will address a number of different issues that each require relatively small changes (I will mention these issues in follow-up comments). I may also include a large documentation update with information about configuration files in this PR.

@kathyxchen
Copy link
Collaborator Author

@ksenia007 - for some reason, I can't request a review from you (looks like I can only request reviewers within the FunctionLab organization/contributors to Selene). Could you still go ahead and take a look at this document? No need to spend too long on it (I know it is very long). Thanks very much :)

@kathyxchen kathyxchen requested a review from ksenia007 December 3, 2018 19:18
Copy link

@ksenia007 ksenia007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I really like this. It is well written, easy to understand and concise. I tried to add some notes along the way about the parts I either found confusing or would've liked to see some more info about :)

- [Matrix file sampler](#Matrix-file-sampler)

## Overview
Selene's CLI accepts configuration files that are composed of 4 main (high-level) groups:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will be nice to have some more info about the config files format, for example, "CLI accepts configuration files in [YAML] (link to outside(?) page about it) format".

### Expected input class and methods
There are two possible formats you can use to do this:

- single Python file: We expect that most people will start using Selene with model architectures in this format. In this case, you implement your architecture as a class and include 2 static methods, `criterion` and `get_optimizer` in the same file.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found that explanation a little confusing. Perhaps it would be nice to add a link to an example file right here? I think just linking https://github.com/FunctionLab/selene/blob/master/models/deepsea.py might be helpful

- `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None`
- and the data sampler you use is of type `selene_sdk.samplers.OnlineSampler`, we will by default retrieve 32000 validation samples.
- and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file.
- `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you have sampler info later, it is a bit confusing on how the test sampler will work. Maybe add a link that would bring the reader to the bottom of the page?
I guess I would add here that if you do not have a test partition, you will train on everything? (if that is true)

- and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file.
- `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None`
- and the sampler you specified has no test partition, it is assumed that you will not be evaluating your trained model using the `evaluate` method available in `selene_sdk.TrainModel`. (i.e. `evaluate` should not be one of the operations in the `ops` list)
- and the sampler you use is of type `selene_sdk.samplers.OnlineSampler` (and the test partition exists), we will retrieve 640000 test samples.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not find OnlineSampler on this page, only in the documentation. I think if you mention it here it might be a good idea to add some info in the sampler section too.

- `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None`
- and the data sampler you use is of type `selene_sdk.samplers.OnlineSampler`, we will by default retrieve 32000 validation samples.
- and you are using a `selene_sdk.samplers.MultiFileSampler`, we will use all the validation samples available in the appropriate data file.
- `n_test_samples`: Default is `None`. Specify the number of test samples in the test set. If `None`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, in the documentation, you have different combinations of the samplers. Is there a reason for it? http://selene.flatironinstitute.org/selene.html#trainmodel

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by different combinations of samplers? (I've updated the TrainModel docs in this PR to match what I've written in cli.md, so I might have addressed this already)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In explanation to what is n_test_samples in the documentation you mention None+ IntervalsSampler, None+RandomSampler, None+MatFileSampler. Here you explain what happens with None+OnlineSampler and None+MultiFileSampler

- `checkpoint_resume`: Default is `None`. If not `None`, you should pass in the path to a model weights file generated by `torch.save` (and can now be read by `torch.load`) to resume training.

#### Additional notes
An important thing to observe about the contents of `train_model`: the [documentation for the `TrainModel` class](http://selene.flatironinstitute.org/selene.html#trainmodel) in `selene_sdk` shows that we are missing a number of arguments needed to instantiate the class.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might need to rephrase this. For example, 'attentive readers might have noticed that in the documentation there are more arguments that are required to instantiate the class. This is because they are assumed to be carried through/retrieved from other parts to ensure consistency...'

- `logits` (log-fold change scores): The difference between `logit(alt)` and `logit(ref)` predictions.
You'll find examples of how this is specified in the [variant effect prediction](#Variant-effect-prediction) and [_in silico_ mutagenesis](#In-silico-mutagenesis) sections.

In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key, which we will explain generally here. The other one is dependent on which of the 3 sub-operations you use and will be explained in the appropriate subsection below.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove the 'explain generally here' and just proceed without noting when you will talk about them. You are talking about those in a consecutive manner, so I do not think it would cause a confusion.

Suggested change
In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key, which we will explain generally here. The other one is dependent on which of the 3 sub-operations you use and will be explained in the appropriate subsection below.
In all `analyze`-related operations, we ask that you specify 2 configuration keys. One will always be the `analyze_sequences` key and the other one is dependent on which of the 3 sub-operations you use - `prediction`, `variant_effect_prediction` or `in_silico_mutagenesis`.
### Analyze sequences

- `shuffle`: Optional, default is `True`. Shuffle the order of the samples in the matrix before sampling from it.
- `sequence_batch_axis`: Optional, default is 0. Specify the batch axis for the sequences matrix.
- `sequence_alphabet_axis`: Optional, default is 1. Specify the alphabet axis.
- `targets_batch_axis`: Optional, default is 0. Specify the batch axis for the targets matrix.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very end, as a reader, I would like to see an expanded YAML file with all those parts that were just explained being used. I know you have a link in the beginning, but having it here would make it easier to comprehend as a whole

ops: [train, evaluate, analyze]
```
The `ops` key expects one or more of `[train, evaluate, analyze]` to be specified as a list. In addition to the general & model architecture configurations described in the next 2 sections, each of these operations has an expected set of configurations:
- `train`: `train_model` (see [Train](#Train)) and `sampler` (see [Samplers used for training](#Samplers-used-for-training))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By some reason, the links don't work for Analyze or Samplers

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I have gone through and tested all the links

@kathyxchen
Copy link
Collaborator Author

kathyxchen commented Dec 7, 2018

This PR addresses #55 with cli.md

@kathyxchen kathyxchen changed the title [WIP] Enhancements and some bug fixes after initial feedback. Enhancements and some bug fixes after initial feedback. Dec 10, 2018
@kathyxchen
Copy link
Collaborator Author

kathyxchen commented Dec 10, 2018

This PR is ready to be merged. I will do at least 1 follow-up PR to add in links to the new CLI documentation, update some tutorials/manuscript cases after running Selene with the updated code, fix any additional issues found along the way, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants