Compute variables in variant_stats concurrently #1116
Comments
This should be achievable with a gufunc that computes all of the variables at the same time, without requiring an argument for eager evaluation. The gufunc would return a separate array for each variable and can be applied lazily using sg.variant_stats(ds, merge=False).compute()
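As a rough illustration of the "one gufunc, several output arrays" idea, here is a minimal sketch using NumPy's np.vectorize with a gufunc-style signature. It is not sgkit's implementation: the function names, the diploid-only logic, and the convention that negative alleles mean "missing" are all assumptions made for the example. np.vectorize still loops in Python, so it shows only the calling pattern; a compiled gufunc (for example via numba) would be needed for an actual speed-up.

```python
import numpy as np


def _stats_one_variant(calls):
    """Hypothetical per-variant kernel; calls has shape (samples, ploidy=2)."""
    # Assumption: a negative allele marks a missing call.
    called = (calls >= 0).all(axis=-1)
    het = called & (calls[:, 0] != calls[:, 1])
    hom_ref = called & (calls == 0).all(axis=-1)
    hom_alt = called & ~het & ~hom_ref
    # All four statistics come from the same view of the variant's calls.
    return called.sum(), het.sum(), hom_ref.sum(), hom_alt.sum()


# The signature maps one (samples, ploidy) block to four scalars per variant,
# so the wrapped function returns four separate arrays.
variant_stats_sketch = np.vectorize(
    _stats_one_variant, signature="(s,p)->(),(),(),()"
)

# Two variants, four diploid samples each.
calls = np.array(
    [
        [[0, 0], [0, 1], [1, 1], [-1, -1]],
        [[1, 1], [1, 1], [0, 1], [0, 0]],
    ]
)
n_called, n_het, n_hom_ref, n_hom_alt = variant_stats_sketch(calls)
```

Each returned array has one entry per variant, which is the shape a lazily applied version would also produce chunk by chunk.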
On a related note, there are several variables that could potentially be combined into a single
Replacing these with a
Ah, that sounds perfect then! I've changed the issue title to remove the I agree combining the various hom/het variables into a single multidimensional count would be good, but I guess that's a different issue. (But maybe the more efficient version gufunc could compute this value, and then things like
I forgot that we already have count_variant_genotypes. So, I don't think there's much value in duplicating that functionality here anyway.
We should link to that function from
Do you mean link in the docs or call it from
I just meant that we should link to
Just looking at the implementation,
I think it's reasonable to want these basic QC stats to run quickly @timothymillar, as they'll be the basic building blocks for a bunch of other things. There's always going to be some extra code involved in good performance, and having the existing (independently computed) variables does make it easy to test the code.
* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error if variant_stats is used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls
Doing downstream calculations with variables returned by variant_stats is quite slow (see here: sgkit-dev/sgkit-publication#34). It seems likely that this is mostly because each stat requires its own pass through the whole variant matrix. While this is fine for a single variable, it adds up quickly if you want to look at the number of hets, homs, allele counts, etc.
The variables are defined as Dask arrays here using numpy operations:
https://github.com/pystatgen/sgkit/blob/cc048581be1b44be7208634107877f3512f29823/sgkit/stats/aggregation.py#L435
It would be much more efficient to compute all these values at once, in a single pass through the genotype matrix. Would it make sense to add an option to this (and I guess sample_stat) to do this? We could call it either eager or lazy; since things are lazy by default, I guess it makes sense to call the parameter eager to make it stand out?
In terms of implementation, I guess this would need to be a numba gufunc? The code should be straightforward.
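To make the "single pass" idea concrete, here is a small pure-Python sketch of the per-variant loop such a gufunc might contain. The function name, the statistic names, the diploid assumption, and the use of negative values for missing alleles are all illustrative assumptions, not the existing sgkit code; a real version would presumably be compiled (for example with numba) and applied across variants.

```python
from typing import Dict, Sequence, Tuple


def variant_stats_single_pass(calls: Sequence[Tuple[int, int]]) -> Dict[str, int]:
    """Count genotype classes for one diploid variant in one loop over samples.

    Assumptions (illustrative only): each call is an (allele, allele) pair,
    0 is the reference allele, and a negative allele marks a missing call.
    """
    stats = {"n_called": 0, "n_het": 0, "n_hom_ref": 0, "n_hom_alt": 0, "n_non_ref": 0}
    for a, b in calls:
        if a < 0 or b < 0:
            continue  # skip missing or partial calls
        stats["n_called"] += 1
        if a != b:
            # Two different alleles: at least one is non-reference.
            stats["n_het"] += 1
            stats["n_non_ref"] += 1
        elif a == 0:
            stats["n_hom_ref"] += 1
        else:
            stats["n_hom_alt"] += 1
            stats["n_non_ref"] += 1
    return stats


example = [(0, 0), (0, 1), (1, 1), (-1, -1), (1, 0)]
result = variant_stats_single_pass(example)
```

Every counter is updated while each sample is visited once, in contrast to computing each statistic with its own full scan of the calls.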