
allow hashlib.file_digest() to calculate hashes for multiple algorithms at once #106053

Closed
calestyo opened this issue Jun 24, 2023 · 2 comments
Labels
stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@calestyo
Contributor

Feature or enhancement

Please consider extending the interface of hashlib.file_digest() so that it can calculate a file's hashes for multiple algorithms efficiently, that is, without re-reading the file multiple times.

One idea would be that if its digest parameter is a list (or perhaps even any iterable) it would simply create multiple digest objects (one per algorithm) and call .update() on each of those.

In the end it might e.g. return a dict, where the key is the algorithm and the value the hash value, though I guess this wouldn't work properly if digest isn't a string.

So maybe just return an (ordered) list of hash values and leave it to the caller to know the order of algorithms as passed in digest.

In principle the implementation seems easy at first glance, but at second glance it may be more complex (well, at least for me, being a Python noob):

file_digest() calls update() in one of two places, depending on the object type, I guess:

digestobj.update(fileobj.getbuffer())

and
digestobj.update(view[:size])

In both cases I don't really know whether it's possible (and if so, efficient) to simply use the source (i.e. fileobj.getbuffer() or view[:size], respectively) multiple times and always get the same data without any additional reading.

Probably not so in the first case?
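FWIW, AFAIU both fileobj.getbuffer() and view[:size] are just memoryviews over data that is already in memory, so feeding them to several .update() calls shouldn't cause any additional reading from the file. A fan-out over a list of hash objects could then look roughly like this in both branches (only a sketch; the name _update_all and the buffer size are made up here):

import hashlib, io

def _update_all(fileobj, digestobjs, _bufsize=2**18):
    """Sketch only: feed one file object into several existing hash objects."""
    if hasattr(fileobj, "getbuffer"):
        # io.BytesIO: getbuffer() is a memoryview over the in-memory data,
        # so each digest just re-reads memory; nothing is read again from a file
        buf = fileobj.getbuffer()
        for d in digestobjs:
            d.update(buf)
    else:
        # binary file: read each chunk once, then feed it to every digest
        buf = bytearray(_bufsize)
        view = memoryview(buf)
        while size := fileobj.readinto(buf):
            for d in digestobjs:
                d.update(view[:size])
    return digestobjs

# e.g. with the in-memory branch:
sha1, sha512 = _update_all(io.BytesIO(b"abcdef" * 100),
                           [hashlib.sha1(), hashlib.sha512()])
print(sha1.hexdigest(), sha512.hexdigest())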

Pitch

Admittedly, most use cases need only one hash algorithm. But there are some cases, beyond a utility that prints various hash algorithms for a given file 😉, that could benefit from this. For example, in security it's not so uncommon to verify files against multiple different hash algorithms; e.g. Debian’s secure APT files (Release and Packages files) typically contain several hashes for a given file.

Of course one can simply manually read the file in binary mode and .update() a number of digests and not use file_digest() at all.

But this loses any optimisations done by that function (like the zero-copy buffer, or, should it ever get it, the already indicated method using AF_ALG sockets and sendfile() for zero-copy hashing with hardware acceleration). For users it would just be nice to have a function that does this right out of the box.
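For reference, that manual variant is itself only a few lines (a sketch; the chunk size and the choice of algorithms are arbitrary):

import hashlib

digests = [hashlib.sha256(), hashlib.sha512()]
with open("bin.img", "rb") as f:          # any binary file
    while chunk := f.read(2**18):
        for d in digests:
            d.update(chunk)
for d in digests:
    print(d.name, d.hexdigest())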

Previous discussion

No real discussion I guess, but I've asked for opinions in #89313 (comment) and was recommended to open a new issue.

@calestyo added the type-feature label on Jun 24, 2023
@calestyo
Contributor Author

Maybe there's a solution to the problem without changing the API:

class MultiDigest:
    def __init__(self, digests):
        # already constructed hash objects, e.g. [hashlib.sha1(), hashlib.sha512()]
        self.digests = digests

    def __call__(self):
        # file_digest() calls digest() to create the hash object; returning
        # self lets an instance be passed where a constructor is expected
        return self

    def update(self, data):
        # fan the same data out to every underlying hash object
        for d in self.digests:
            d.update(data)

invoked like this:

m = MultiDigest([hashlib.sha1(), hashlib.sha512()])

The idea would be that the MultiDigest class provides the interface (well, only the necessary parts) of a hash object but internally calls update on multiple "actual" hash objects.

It seems to "work" with file handles:

f = open("bin.img", "rb")
d = hashlib.file_digest(f, m)
d.digests[0].hexdigest()
d.digests[1].hexdigest()

and with io.BytesIO, which, AFAICS, is the only thing that provides getbuffer() (??):

b = io.BytesIO(b"abcdef"*100)
d = hashlib.file_digest(b, m)
d.digests[0].hexdigest()
d.digests[1].hexdigest()

I just don't really understand whether this is still "efficient" in the sense that no more data is read or copied (except, of course, for updating each hash) than would be the case with a single algorithm when hashlib.file_digest() is used.

For the case of a file handle, I'd guess it actually is as efficient:

cpython/Lib/hashlib.py

Lines 228 to 236 in d2cbb6e

# binary file, socket.SocketIO object
# Note: socket I/O uses different syscalls than file I/O.
buf = bytearray(_bufsize)  # Reusable buffer to reduce allocations.
view = memoryview(buf)
while True:
    size = fileobj.readinto(buf)
    if size == 0:
        break  # EOF
    digestobj.update(view[:size])

The data seems to be read in (copied) once into buf anyway ... and AFAIU, my code would simply .update() each hash with that.

So it's read multiple times from memory, but not from storage?

For the case of an io.BytesIO object:

cpython/Lib/hashlib.py

Lines 213 to 216 in d2cbb6e

if hasattr(fileobj, "getbuffer"):
    # io.BytesIO object, use zero-copy buffer
    digestobj.update(fileobj.getbuffer())
    return digestobj

I don't quite understand it, TBH. ^^

It would seem to me that each hash object is .update()ed with the full data at once.

Purely from the documentation:

A binary stream using an in-memory bytes buffer.

I'd say its data is guaranteed to be fully in memory anyway.

So at least there should be no multiple reads from storage, nor any multiple buffer copies (except, of course, whatever the hash algorithm itself may copy internally ... but we can't really avoid that anyway).

Right?

So if some expert on Python I/O/memory/buffers, etc. could confirm that this is already as efficient as it can be, the problem would be solved... and one would only need to consider whether it's worth adding such a MultiDigest to Python, or perhaps including it as an example in the documentation?

Thanks,
Chris.

@iritkatriel added the stdlib label on Nov 27, 2023
@hauntsaninja
Contributor

Thanks for the suggestion. I don't think such a thing is common enough to need it in the standard library. The core of hashlib.file_digest when reading files is seven lines of simple pure Python, so I think you can use that to roll your own helper.
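For anyone who lands here later: such a self-rolled helper might look roughly like the following (just a sketch built on the loop quoted above, not an official recipe; the name multi_file_digest is made up):

import hashlib

def multi_file_digest(fileobj, *algorithms, _bufsize=2**18):
    """Return {algorithm name: hex digest} for a binary file object, read once."""
    digestobjs = {name: hashlib.new(name) for name in algorithms}
    buf = bytearray(_bufsize)  # reusable buffer to reduce allocations
    view = memoryview(buf)
    while True:
        size = fileobj.readinto(buf)
        if size == 0:
            break  # EOF
        for d in digestobjs.values():
            d.update(view[:size])
    return {name: d.hexdigest() for name, d in digestobjs.items()}

# e.g.:
# with open("bin.img", "rb") as f:
#     print(multi_file_digest(f, "sha256", "sha512"))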

@hauntsaninja closed this as not planned on Jan 8, 2024