Skip to content

Compute variables in variant_stats concurrently #1116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jeromekelleher opened this issue Aug 20, 2023 · 9 comments · Fixed by #1119
Closed

Compute variables in variant_stats concurrently #1116

jeromekelleher opened this issue Aug 20, 2023 · 9 comments · Fixed by #1119

Comments

@jeromekelleher
Copy link
Collaborator

Doing downstream calculations with variables returned by variant_stats is quite slow (see here: sgkit-dev/sgkit-publication#34). It seems likely that this is mostly due to requiring us go through the whole variant matrix to compute it for each stat. While this is fine for a single variable, it adds up quickly if you want to look at the number of hets, homs, allele counts etc.

The variables are defined as Dask arrays here using numpy operations:

https://github.com/pystatgen/sgkit/blob/cc048581be1b44be7208634107877f3512f29823/sgkit/stats/aggregation.py#L435

It would be much more efficient to compute all these values at once, in a single pass through the genotype matrix. Would it make sense to add an option to this (and I guess sample_stat) to do this? We could call it either eager or lazy --- since things are lazy by default, I guess it makes sense to call the parameter eager to make it stand out?

In terms of implementation, I guess this would need to be a numba gufunc? The code should be straightforward.

@timothymillar
Copy link
Collaborator

timothymillar commented Aug 20, 2023

This should be achievable with a gufunc that computes all of the variables at the same time, without requiring an argument for eager evaluation. The gufunc would return a separate array for each variable and can be applied lazily using da.apply_gufunc which allows multiple return values. The resulting dask arrays need to be evaluated simultaneously to avoid duplicated compute, e.g.:

sg.variant_stats(ds, merge=False).compute()

@timothymillar
Copy link
Collaborator

On a related note, there are several variables that could potentially be combined into a single variant_genotype_count array (variants * genotypes)

  • sgkit.variables.variant_n_het_spec: Second genotype for bi-allelic diploid case (multiple columns in general case)
  • sgkit.variables.variant_n_hom_ref_spec: First genotype
  • sgkit.variables.variant_n_hom_alt_spec: Third genotype for bi-allelic diploid case (one genotype per alt allele in general case)
  • sgkit.variables.variant_n_non_ref_spec: Sum of all genotypes except the first

Replacing these with a variant_genotype_count array would make it easy to generalize to the multi-allelic and polyploid cases (not mixed-ploidy). But it would be a significant breaking change.

@jeromekelleher jeromekelleher changed the title Add 'eager=False' option to variant_stats Compute variables in variant_stats concurrently Aug 21, 2023
@jeromekelleher
Copy link
Collaborator Author

jeromekelleher commented Aug 21, 2023

This should be achievable with a gufunc that computes all of the variables at the same time, without requiring an argument for eager evaluation.

Ah, that sounds perfect then! I've changed the issue title to remove the eager bit.

I agree combining the various hom/het variables into a single multidimensional count would be good, but i guess that's a different issue. (But maybe the more efficient version gufunc could compute this value, and then things like n_het defined in terms of that?)

@timothymillar
Copy link
Collaborator

I agree combining the various hom/het variables into a single multidimensional count would be good, but i guess that's a different issue

I forgot that we already have we already have count_variant_genotypes. So, I don't think there's much value in duplicating that functionality here anyway.

@jeromekelleher
Copy link
Collaborator Author

We should link to that function from variant_stats too I guess.

@timothymillar
Copy link
Collaborator

Do you mean link in the docs or call it from variant_stats? The variant_genotype_count array only contains some of the data reported by variant_stats, so that would still result in multiple passes over call_genotype.

@jeromekelleher
Copy link
Collaborator Author

I just meant that we should link to variant_genotype_count from the docs of variant_stats, since it's closely related to the variables computed by variant_stats

@timothymillar
Copy link
Collaborator

Just looking at the implementation, variant_stats calls count_variant_alleles which in turn calls count_call_alleles. The count_call_alleles function creates an array of shape (variants, samples, alleles) which is obviously more memory intensive than the (variants, alleles) array returned by count_variant_alleles. We could re-write count_variant_alleles to compute the (variants, alleles) array directly which should also be faster. But this all comes back to the question of how do we find a balance between variable re-use with performance?

@jeromekelleher
Copy link
Collaborator Author

I think it's reasonable to want these basic QC stats to run quickly @timothymillar, as they'll be the basic building blocks for a bunch of other things. There's always going to be some extra code involved in good performance, and having the existing (independently computed) variables does make it easy to test the code.

timothymillar added a commit to timothymillar/sgkit that referenced this issue Aug 29, 2023
* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error is variant_stats used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls
timothymillar added a commit to timothymillar/sgkit that referenced this issue Aug 30, 2023
* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error is variant_stats used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls
timothymillar added a commit to timothymillar/sgkit that referenced this issue Sep 8, 2023
* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error is variant_stats used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls
timothymillar added a commit to timothymillar/sgkit that referenced this issue Sep 9, 2023
* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error is variant_stats used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls
tomwhite pushed a commit that referenced this issue Oct 3, 2023
* Add count_variant_alleles option to calculate directly from calls
* Improve performance of variant_stats using gufuncs
* Raise error is variant_stats used on mixed-ploidy data
* Document behavior of variant_stats with partial genotype calls
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants