
allow hashlib.file_digest() to calculate hashes for multiple algorithms at once #106053

Closed as not planned

Description

@calestyo

Feature or enhancement

Please consider extending the interface of hashlib.file_digest() so that it can calculate a file's hashes for multiple algorithms efficiently, that is, without re-reading the file once per algorithm.

One idea would be that, if its digest parameter is a list (or perhaps even any iterable), it simply creates multiple digest objects (one per algorithm) and calls .update() on each of them.

In the end it might e.g. return a dict where the key is the algorithm name and the value the hash value, though I guess this wouldn't work properly if digest isn't a string.

So maybe it should just return an (ordered) list of hash values and make it the caller's responsibility to know the order of the algorithms as passed in digest.
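
To make this concrete, here's a naive sketch of what I mean (the name file_digests and the dict return are just illustrative, not a concrete API proposal; it only handles string algorithm names):

```python
import hashlib

def file_digests(fileobj, algorithms, /, *, _bufsize=2**18):
    """Hypothetical multi-algorithm variant of hashlib.file_digest():
    read fileobj once and update one digest object per algorithm."""
    digests = [hashlib.new(name) for name in algorithms]
    buf = bytearray(_bufsize)  # reusable buffer, as in file_digest()
    view = memoryview(buf)
    while True:
        size = fileobj.readinto(buf)
        if size == 0:
            break  # EOF
        for d in digests:
            d.update(view[:size])  # same chunk for every algorithm
    return {d.name: d for d in digests}

# e.g.:
# with open("data.bin", "rb") as f:
#     results = file_digests(f, ["sha256", "sha512"])
# print(results["sha256"].hexdigest())
```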

In principle the implementation seems easy at first glance, but at a second look it may be more complex (well, at least for me, being a Python noob):

file_digest() calls update() in two different places, depending on the object type, I guess:

digestobj.update(fileobj.getbuffer())

and

digestobj.update(view[:size])

In both cases I don't really know whether it's possible (and, if so, efficient) to simply read the source (i.e. fileobj.getbuffer() and view[:size], respectively) multiple times and always get the same data without any additional reading.

Probably not so in the first case?
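
Playing around a bit, at least the io.BytesIO case actually seems fine, since getbuffer() just returns a memoryview over the in-memory buffer (toy example, not stdlib code):

```python
import hashlib
import io

fileobj = io.BytesIO(b"some data")
buf = fileobj.getbuffer()  # memoryview over the in-memory buffer

sha256 = hashlib.sha256()
sha512 = hashlib.sha512()
sha256.update(buf)
sha512.update(buf)  # same bytes again, no additional read

assert sha256.hexdigest() == hashlib.sha256(b"some data").hexdigest()
```

The readinto() case should behave the same, as long as every digest is updated from view[:size] before the next readinto() overwrites the buffer.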

Pitch

Admittedly, most use cases need only one hash algorithm. But there are some more cases, beyond a utility that prints various hash algorithms for a given file 😉, that could benefit from this. For example in security, when verifying files it's not so uncommon to verify against multiple different hash algorithms. E.g. Debian’s secure APT files (the Release and Packages files) typically contain several hash values per file.

Of course, one can simply read the file manually in binary mode, call .update() on a number of digest objects, and not use file_digest() at all.
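
For example, roughly like this (hypothetical helper, chunk size chosen arbitrarily):

```python
import hashlib

def manual_digests(path, names, chunk_size=2**18):
    # Plain workaround: read the file once in binary mode and feed
    # every chunk to each digest object.
    digests = [hashlib.new(n) for n in names]
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            for d in digests:
                d.update(chunk)
    return {d.name: d.hexdigest() for d in digests}
```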

But this loses any optimisations done by file_digest() (like the zero-copy buffer for BytesIO objects, or, should it ever materialise, the already-suggested approach of using AF_ALG sockets and sendfile() for zero-copy hashing with hardware acceleration). For users it would just be nice to have a function that does it right out of the box.
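
(For reference, that AF_ALG approach would look roughly like this on Linux; untested sketch based on the AF_ALG example in the socket module docs, with sendall() standing in for the eventual zero-copy sendfile():)

```python
import socket

# Linux-only: let the kernel compute the hash via an AF_ALG socket.
with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as algo:
    algo.bind(("hash", "sha256"))
    op, _ = algo.accept()
    with op, open("/etc/os-release", "rb") as f:
        op.sendall(f.read())      # op.sendfile(f) would avoid the userspace copy
        print(op.recv(32).hex())  # 32 bytes = SHA-256 digest size
```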

Previous discussion

No real discussion yet, I guess, but I've asked for opinions in #89313 (comment) and was recommended to open a new issue.
