Use variable-length strings for storing alleles in Zarr #643

@tomwhite

Currently we use fixed-length strings for storing alleles, but this is inefficient because every entry is padded to the length of the longest allele in the whole dataset.

For example, in some 1000 Genomes data (chr22) I noticed that the longest allele is 414 base pairs, which means the data type is "S414", unnecessarily large for the vast majority of variants. As a result, the variant_allele data took up 15 MB (compressed), rather than roughly one tenth of that if a variable-length encoding were used (as scikit-allel does).
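
To make the padding cost concrete, here is a minimal sketch using NumPy. The allele values are made up for illustration (they are not from the 1000 Genomes data), and the sizes are uncompressed, so they are not comparable to the 15 MB figure above. The offsets-plus-bytes layout is just one possible variable-length representation, not necessarily what Zarr or scikit-allel use internally.

```python
import numpy as np

# Hypothetical alleles: mostly single bases plus one long 414 bp allele.
alleles = [b"A", b"T", b"G", b"C", b"A" * 414]

# Fixed-length storage: NumPy infers dtype "S414", so every entry
# occupies 414 bytes regardless of its actual length.
fixed = np.array(alleles)
fixed_bytes = fixed.nbytes  # 5 entries * 414 bytes

# Variable-length layout (illustrative): concatenated bytes plus
# one int32 offset per entry boundary.
data = b"".join(alleles)
offsets = np.cumsum([0] + [len(a) for a in alleles], dtype=np.int32)
vlen_bytes = len(data) + offsets.nbytes

print(fixed.dtype, fixed_bytes, vlen_bytes)
```

A single long allele dominates the fixed-length size here, which is the same effect described above at dataset scale.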

It would be worth investigating how we could use Zarr's variable-length strings (the number of alt alleles would remain fixed though, in contrast to #634). In particular, there may be some work to get this representation to work nicely with xarray.
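
As a rough sketch of the in-memory shape this implies, the array below keeps a fixed allele dimension while letting each string have its own length. The variant data and the use of an empty string as the fill value are assumptions for illustration. In Zarr v2, an object array like this is typically paired with an object codec such as `numcodecs.VLenUTF8`; whether that round-trips cleanly through xarray is exactly the open question here.

```python
import numpy as np

# Hypothetical variants: (ref, alt, ...) with differing numbers of alleles.
variants = [("A", "T"), ("G", "GA", "GAA"), ("C", "T" * 10)]

# The allele dimension stays fixed at the maximum allele count.
max_alleles = max(len(v) for v in variants)

# Object dtype holds Python strings of any length; unused slots are "".
variant_allele = np.full((len(variants), max_alleles), "", dtype=object)
for i, alleles in enumerate(variants):
    variant_allele[i, : len(alleles)] = alleles

print(variant_allele.shape)
```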

Labels: data representation (issues related to how data is represented: data types, data structures, indexes, access methods, etc.)
