Implement sample variant annotation dataflow pipeline #37
@deflaux let me know if you'd prefer I fork into my own options at this point. Not sure how much we want to jam into this object.
I think this is a fine place for it for now.
Branch updated from 4dded6d to 6798954.
To make sure I'm understanding this correctly, are these assumptions true?
- By default, this pipeline will yield an output record for every alternate allele in 1,000 Genomes within BRCA1 that is a SNP and has an effect other than synonymous.
- For 1,000 Genomes, restricting to sample HG00261 has no bearing on the output of this pipeline, since all samples have calls for all variants (and we are also not retrieving or looking at the genotype within the call).
- If we change the job parameters to run on Platinum Genomes and a callSetId within it, we will only annotate the variants that the specified callSetId has.
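The three assumptions above can be sketched as a simple filter. This is a minimal illustration in plain Python, not the pipeline's actual code: the record layout (`ref`, `alts`, `effects`, `call_set_ids`), the effect label `"SYNONYMOUS"`, and the function name `alleles_to_annotate` are all hypothetical, and skipping alleles with no predicted effect is an assumption.

```python
from typing import Any, Dict, Iterator, List, Optional, Tuple

# Hypothetical record layout, assumed for illustration only:
#   ref          reference bases
#   alts         list of alternate alleles
#   effects      predicted effect per alternate allele
#   call_set_ids call sets that have a call for this variant
Variant = Dict[str, Any]


def alleles_to_annotate(
    variants: List[Variant], call_set_id: Optional[str] = None
) -> Iterator[Tuple[str, str, str]]:
    """Yield (ref, alt, effect) for each non-synonymous SNP allele."""
    for v in variants:
        # Restricting to a call set only drops variants that the call set
        # has no call for; the genotype itself is never inspected.
        if call_set_id is not None and call_set_id not in v["call_set_ids"]:
            continue
        for alt in v["alts"]:
            is_snp = len(v["ref"]) == 1 and len(alt) == 1
            effect = v["effects"].get(alt)
            # Assumption: alleles without a predicted effect are skipped.
            if is_snp and effect is not None and effect != "SYNONYMOUS":
                yield (v["ref"], alt, effect)
```

Under this sketch, a dataset in which every call set has a call for every variant produces the same output with or without `call_set_id`, while a dataset with per-sample variants produces a subset.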
That's correct. People typically run a variant annotation program on a single VCF, so I think the behavior is reasonably well aligned with a user's expectations.
CH is right, but you still want to keep track of what you're annotating, since metadata remains important if you combine datasets or compare them. If you can cache them, that will save you time later on.
This looks good to me - merge it at your convenience.
Is this necessary? Why not convert the list of Contigs to a PCollection directly?
Done
Branch updated from 64aa02f to 6b46e28.
Rebased, made some performance changes, and added some timing information. The end result is that it currently works well on small regions but performs quite poorly on whole variant sets, on account of SearchVariants throughput. This should improve over time.
Implement sample variant annotation dataflow pipeline
Nice sample, CH!
…tation Implement sample variant annotation dataflow pipeline
See various caveats and disclaimers in comments: this is a limited sample application.
One thing that may need revision is the output; right now it's really only human-readable (at best). Open to suggestions on a better output format.
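One possible direction, offered only as a sketch: emit one JSON object per line so downstream tools can parse each record independently. The field names below and the helper `format_annotation` are hypothetical and do not reflect what the pipeline currently writes.

```python
import json
from typing import Any, Dict


def format_annotation(record: Dict[str, Any]) -> str:
    """Serialize one annotation record as a single compact JSON line."""
    # Sorted keys keep the output stable across runs, which makes
    # diffing two result files easier.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


# Hypothetical record; the field names and values are illustrative only.
line = format_annotation(
    {"referenceName": "17", "start": 100, "alt": "G", "effect": "NONSYNONYMOUS"}
)
```

Each line can then be read back with `json.loads`, and the same records could be converted to a tab-separated layout if a tabular format is preferred.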