allow hashlib.file_digest() to calculate hashes for multiple algorithms at once #106053
Comments
Maybe there's a solution for the problem without changing the API: a small wrapper object whose update() method feeds several hash objects, passed to file_digest() as the digest callable and invoked like in the sketch after this comment. It seems to "work" with file handles.

I just don't really understand whether this is still "efficient", in the sense that no more data is read or copied (except, of course, that each hash is updated) than would be the case with a single algorithm.

For the case of a file handle, I'd guess it actually is as efficient (cpython/Lib/hashlib.py, lines 228 to 236 in d2cbb6e): the data seems to be read (copied) once into buf anyway, and AFAIU my code would simply .update() each hash with that. So it's read multiple times from memory, but not from storage?

For the case of an io.BytesIO object (cpython/Lib/hashlib.py, lines 213 to 216 in d2cbb6e) I don't quite understand it, TBH. ^^ It would seem to me that each hash object is updated from the same in-memory buffer, and I'd say its data is anyway guaranteed to be already fully in memory. Right?

So if some expert on Python IO/memory/buffers, etc. could confirm whether this already solves the problem (as efficiently as it can be), the problem would be solved... and one could only consider whether it's worth adding such a helper at all.

Thanks,
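A minimal sketch of the wrapper being described (the class name _MultiHash, its hexdigests() helper, and the file name are illustrative assumptions, not the original snippet):

```python
import hashlib

class _MultiHash:
    """Fan-out wrapper: a single update() call feeds several hash objects.
    Illustrative only; not part of hashlib."""

    def __init__(self, *algorithms):
        self._hashes = {name: hashlib.new(name) for name in algorithms}

    def update(self, data):
        for h in self._hashes.values():
            h.update(data)

    def hexdigests(self):
        return {name: h.hexdigest() for name, h in self._hashes.items()}


# In the implementation referenced above, file_digest() only calls update()
# on the object returned by the digest callable, so the wrapper slots in
# without any API change:
with open("some_file.bin", "rb") as f:          # illustrative path
    multi = hashlib.file_digest(f, lambda: _MultiHash("sha256", "sha512"))

print(multi.hexdigests())
```

Since file_digest() returns the object it constructed from the digest callable, the per-algorithm digests can be read back from the wrapper afterwards.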
Thanks for the suggestion. I don't think such a thing is common enough to need it in the standard library. The core of …
Feature or enhancement
Please consider extending the interface of hashlib.file_digest() so that it can calculate a file's hashsums for multiple algorithms efficiently, that is, without re-reading the file multiple times.

One idea would be that if its digest parameter is a list (or perhaps even any iterable), it would simply create multiple digest objects (one per algorithm) and call .update() on each of those. In the end it might e.g. return a dict, where the key is the algorithm and the value the hash value, though this wouldn't work properly, I guess, if digest isn't a string. So maybe just return an (ordered) list of hash values and put it in the responsibility of the caller to know the order of algorithms as passed in digest.

In principle the implementation seems easy at first glance, but at a second one it may be more complex (well, at least for me, being a Python noob): file_digest() calls update() in two different places, depending on the object type, I guess:

cpython/Lib/hashlib.py, line 215 in 4849a80
and
cpython/Lib/hashlib.py, line 236 in 4849a80

In both cases I don't really know whether it's possible (and if so, efficiently) to simply use the source (i.e. fileobj.getbuffer() respectively view[:size]) multiple times and always get the same data without any additional reading. Probably not so in the first case?
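A minimal sketch of what such an extension could look like, written here as a standalone helper rather than a patch to hashlib (the name file_digests(), the dict return type, and the algorithm-names-only interface are assumptions for illustration):

```python
import hashlib

def file_digests(fileobj, algorithms, /, *, _bufsize=2**18):
    """Compute several digests of fileobj in a single pass.

    Sketch only: mirrors the structure of hashlib.file_digest(), but keeps
    one digest object per algorithm name and returns a dict mapping each
    name to its digest object.
    """
    digests = {name: hashlib.new(name) for name in algorithms}

    # In-memory path for objects like io.BytesIO that expose getbuffer():
    # every digest hashes the same buffer, no file I/O involved.
    if hasattr(fileobj, "getbuffer"):
        buf = fileobj.getbuffer()
        for d in digests.values():
            d.update(buf)
        return digests

    # Binary file path: fill a reusable buffer from storage once per chunk,
    # then feed the same memoryview slice to every digest object.
    buf = bytearray(_bufsize)
    view = memoryview(buf)
    while True:
        size = fileobj.readinto(buf)
        if size == 0:
            break  # EOF
        for d in digests.values():
            d.update(view[:size])
    return digests


with open("some_file.bin", "rb") as f:          # illustrative path
    result = file_digests(f, ["sha256", "sha512"])
print({name: d.hexdigest() for name, d in result.items()})
```

The point of the sketch is that each chunk is read from storage only once; the extra cost of additional algorithms is purely the repeated update() calls over data that is already in memory.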
Pitch
Admittedly, most use cases need only one hash algorithm. But there are some more cases, beyond a utility that prints various hash algos for a given file 😉, that could benefit from this. For example in security, when verifying files it's not so uncommon to verify against multiple different hash algos. E.g. Debian's secure APT files (Release and Packages files) typically contain several hash algos for a given file.

Of course one can simply manually read the file in binary mode, .update() a number of digests, and not use file_digest() at all. But this loses any optimisations done by that function (like the zero-copy buffer, or, should it ever get it, the already-indicated method using AF_ALG sockets and sendfile() for zero-copy hashing with hardware acceleration). For users it would be just nice to have a function that does it right out of the box.
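For comparison, the manual alternative mentioned above might look roughly like this (file name and chunk size are illustrative); it works, but reimplements the read loop by hand and forgoes file_digest()'s buffer reuse and any future acceleration:

```python
import hashlib

# Read once, update every digest per chunk; no file_digest() involved.
hashes = {name: hashlib.new(name) for name in ("sha256", "sha512")}
with open("some_file.bin", "rb") as f:        # illustrative file name
    while chunk := f.read(2**18):             # plain read(): a new bytes object per chunk
        for h in hashes.values():
            h.update(chunk)

print({name: h.hexdigest() for name, h in hashes.items()})
```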
Previous discussion
No real discussion I guess, but I've asked for opinions in #89313 (comment) and was recommended to open a new issue.