Compute variables in variant_stats concurrently #1116

jeromekelleher · 2023-08-20T08:16:59Z

Doing downstream calculations with variables returned by variant_stats is quite slow (see here: sgkit-dev/sgkit-publication#34). It seems likely that this is mostly because each stat requires its own pass through the whole variant matrix. While this is fine for a single variable, it adds up quickly if you want to look at the number of hets, homs, allele counts etc.

The variables are defined as Dask arrays here using numpy operations:

https://github.com/pystatgen/sgkit/blob/cc048581be1b44be7208634107877f3512f29823/sgkit/stats/aggregation.py#L435

It would be much more efficient to compute all these values at once, in a single pass through the genotype matrix. Would it make sense to add an option to this (and I guess sample_stats) to do this? We could call it either eager or lazy; since things are lazy by default, I guess it makes sense to call the parameter eager to make it stand out?

In terms of implementation, I guess this would need to be a numba gufunc? The code should be straightforward.


timothymillar · 2023-08-20T23:06:14Z

This should be achievable with a gufunc that computes all of the variables at the same time, without requiring an argument for eager evaluation. The gufunc would return a separate array for each variable and can be applied lazily using da.apply_gufunc which allows multiple return values. The resulting dask arrays need to be evaluated simultaneously to avoid duplicated compute, e.g.:

sg.variant_stats(ds, merge=False).compute()
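To make the single-pass idea concrete, here is a minimal sketch of a kernel that fills several per-variant outputs in one loop over the genotype calls. It is plain NumPy/Python rather than a compiled numba gufunc, and the function name `variant_stats_single_pass`, the choice of outputs, and the decision to skip any call containing a missing allele are all assumptions for illustration, not sgkit's actual implementation.

```python
import numpy as np


def variant_stats_single_pass(call_genotype):
    """Hypothetical single-pass kernel (not sgkit's implementation).

    call_genotype: int array of shape (variants, samples, ploidy),
    with negative values marking missing alleles.
    Returns four (variants,) arrays: n_called, n_het, n_hom_ref, n_hom_alt.
    """
    n_variants, n_samples, _ = call_genotype.shape
    n_called = np.zeros(n_variants, dtype=np.int64)
    n_het = np.zeros(n_variants, dtype=np.int64)
    n_hom_ref = np.zeros(n_variants, dtype=np.int64)
    n_hom_alt = np.zeros(n_variants, dtype=np.int64)
    for i in range(n_variants):
        for j in range(n_samples):
            alleles = call_genotype[i, j]
            # Assumption: calls with any missing allele are not counted.
            if (alleles < 0).any():
                continue
            n_called[i] += 1
            if (alleles == alleles[0]).all():
                # Homozygous: reference if the shared allele is 0.
                if alleles[0] == 0:
                    n_hom_ref[i] += 1
                else:
                    n_hom_alt[i] += 1
            else:
                n_het[i] += 1
    return n_called, n_het, n_hom_ref, n_hom_alt


# Two variants, four diploid samples; the last call of variant 0 is missing.
gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1], [-1, -1]],
        [[0, 1], [1, 0], [0, 0], [1, 1]],
    ]
)
n_called, n_het, n_hom_ref, n_hom_alt = variant_stats_single_pass(gt)
```

Every output is produced from the same read of each call, which is the property the gufunc approach relies on. In a real implementation the loop body would presumably be compiled with numba and applied per chunk with da.apply_gufunc, as described above.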

timothymillar · 2023-08-20T23:49:17Z

On a related note, there are several variables that could potentially be combined into a single variant_genotype_count array (variants * genotypes)

sgkit.variables.variant_n_het_spec: Second genotype for bi-allelic diploid case (multiple columns in general case)
sgkit.variables.variant_n_hom_ref_spec: First genotype
sgkit.variables.variant_n_hom_alt_spec: Third genotype for bi-allelic diploid case (one genotype per alt allele in general case)
sgkit.variables.variant_n_non_ref_spec: Sum of all genotypes except the first

Replacing these with a variant_genotype_count array would make it easy to generalize to the multi-allelic and polyploid cases (not mixed-ploidy). But it would be a significant breaking change.
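As a rough illustration of the mapping listed above, the sketch below derives the four existing variables from a genotype count array for the bi-allelic diploid case. The helper name `stats_from_genotype_counts` and the example counts are made up, and the column order (hom-ref, het, hom-alt) is taken from the description in this comment.

```python
import numpy as np


def stats_from_genotype_counts(variant_genotype_count):
    """Hypothetical helper: split a (variants, 3) genotype count array
    into per-variant counts, assuming columns are ordered
    hom-ref, het, hom-alt (bi-allelic diploid case only)."""
    n_hom_ref = variant_genotype_count[:, 0]
    n_het = variant_genotype_count[:, 1]
    n_hom_alt = variant_genotype_count[:, 2]
    # Non-reference count is the sum of every genotype except the first.
    n_non_ref = variant_genotype_count[:, 1:].sum(axis=1)
    return n_hom_ref, n_het, n_hom_alt, n_non_ref


# Made-up counts for three variants.
counts = np.array([[5, 3, 2], [10, 0, 0], [1, 4, 5]])
n_hom_ref, n_het, n_hom_alt, n_non_ref = stats_from_genotype_counts(counts)
```

In the multi-allelic or polyploid case the het and hom-alt counts would each span several columns, so the fixed column indices used here would no longer apply.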

jeromekelleher · 2023-08-21T08:32:40Z

> This should be achievable with a gufunc that computes all of the variables at the same time, without requiring an argument for eager evaluation.

Ah, that sounds perfect then! I've changed the issue title to remove the eager bit.

I agree combining the various hom/het variables into a single multidimensional count would be good, but I guess that's a different issue. (But maybe the more efficient gufunc version could compute this value, and then things like n_het could be defined in terms of it?)

timothymillar · 2023-08-21T21:07:43Z

> I agree combining the various hom/het variables into a single multidimensional count would be good, but i guess that's a different issue

I forgot that we already have count_variant_genotypes. So, I don't think there's much value in duplicating that functionality here anyway.

jeromekelleher · 2023-08-22T08:13:38Z

We should link to that function from variant_stats too I guess.

timothymillar · 2023-08-22T08:44:24Z

Do you mean link in the docs or call it from variant_stats? The variant_genotype_count array only contains some of the data reported by variant_stats, so that would still result in multiple passes over call_genotype.

jeromekelleher · 2023-08-22T12:02:05Z

I just meant that we should link to variant_genotype_count from the docs of variant_stats, since it's closely related to the variables computed by variant_stats

timothymillar · 2023-08-27T22:40:43Z

Just looking at the implementation, variant_stats calls count_variant_alleles which in turn calls count_call_alleles. The count_call_alleles function creates an array of shape (variants, samples, alleles) which is obviously more memory intensive than the (variants, alleles) array returned by count_variant_alleles. We could re-write count_variant_alleles to compute the (variants, alleles) array directly, which should also be faster. But this all comes back to the question of how we find a balance between variable re-use and performance.
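The memory difference can be sketched in plain NumPy. Both functions below are hypothetical stand-ins (the `_np` and `_direct` names are made up and neither is sgkit's code): the first builds the (variants, samples, alleles) intermediate and then sums over samples, while the second accumulates the (variants, alleles) result directly.

```python
import numpy as np


def count_call_alleles_np(gt, n_alleles):
    """Stand-in for the per-call counts: shape (variants, samples, alleles)."""
    out = np.zeros(gt.shape[:2] + (n_alleles,), dtype=np.int64)
    for a in range(n_alleles):
        # Count occurrences of allele `a` within each call.
        out[..., a] = (gt == a).sum(axis=-1)
    return out


def count_variant_alleles_direct(gt, n_alleles):
    """Stand-in for a direct computation: shape (variants, alleles),
    without materialising the per-sample intermediate."""
    out = np.zeros((gt.shape[0], n_alleles), dtype=np.int64)
    for i in range(gt.shape[0]):
        flat = gt[i].ravel()
        called = flat[flat >= 0]  # drop missing alleles (negative values)
        out[i] = np.bincount(called, minlength=n_alleles)[:n_alleles]
    return out


gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1], [-1, -1]],
        [[0, 1], [1, 0], [0, 0], [1, 1]],
    ]
)
via_intermediate = count_call_alleles_np(gt, 2).sum(axis=1)
direct = count_variant_alleles_direct(gt, 2)
```

The two results agree, but the intermediate in the first route grows with the number of samples, whereas the direct route only ever holds one row of counts per variant.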

jeromekelleher · 2023-08-29T15:09:43Z

I think it's reasonable to want these basic QC stats to run quickly @timothymillar, as they'll be the basic building blocks for a bunch of other things. There's always going to be some extra code involved in good performance, and having the existing (independently computed) variables does make it easy to test the code.

* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error if variant_stats is used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls

jeromekelleher added the performance label Aug 20, 2023

jeromekelleher changed the title ~~Add 'eager=False' option to variant_stats~~ Compute variables in variant_stats concurrently Aug 21, 2023

timothymillar mentioned this issue Aug 30, 2023

Improve performance of variant_stats #1119

Merged

5 tasks

tomwhite closed this as completed in #1119 Oct 3, 2023
