
Use caching for the etag using stored hash if available #182


Merged · 6 commits into mongodb-labs:main · Jan 21, 2025

Conversation

@blink1073 (Member)

Fixes #181

blink1073 requested a review from ShaneHarvey · January 15, 2025 20:47
blink1073 marked this pull request as draft · January 15, 2025 20:58
blink1073 marked this pull request as ready for review · January 15, 2025 21:16
@blink1073 (Member Author)

Another option is for us to have a weak ref dictionary, so we can associate an object with its sha1sum and also handle grid files not created by us, wdyt?
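
A rough sketch of that idea (hypothetical names; assuming GridOut instances accept weak references):

import weakref

# Hypothetical sketch of the weak-ref idea: map each GridOut object to its
# sha1sum; entries vanish when the file object is garbage collected.
_sha_cache = weakref.WeakKeyDictionary()

def _cached_sha1(fileobj):
    sha1_sum = _sha_cache.get(fileobj)
    if sha1_sum is None:
        sha1_sum = _compute_sha(fileobj)  # the helper added in this PR
        _sha_cache[fileobj] = sha1_sum
    return sha1_sum

As blink1073 points out further down, this only pays off if the same GridOut instance is reused across reads; a fresh object per request would never hit the cache.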

pos = fileobj.tell()
raw_data = fileobj.read()
fileobj.seek(pos)
return hashlib.sha1(raw_data).hexdigest()
@ShaneHarvey (Collaborator) commented on the diff above:

Instead of reading the whole file, it's better to iterate the file data in chunks and calculate the hash:

sha1 = hashlib.sha1()
while True:
    # GridOut.readchunk() returns one GridFS chunk at a time.
    chunk = fileobj.readchunk()
    if not chunk:
        break
    sha1.update(chunk)
sha1_sum = sha1.hexdigest()

That way we don't have to assemble the entire file in memory at once. However, I suggest we don't implement this optimization now and defer it as future work because there are too many other questions in flight that would change this code.

sha1_sum = self._compute_sha(fileobj)
metadata = dict(sha1_sum=sha1_sum)
id = storage.put(
    fileobj, filename=filename, content_type=content_type, metadata=metadata, **kwargs
)
@ShaneHarvey (Collaborator) commented on the diff above:

Hmm, I just had another thought. GridOut is seekable, but fileobj in save_file() here can be any file-like object, and IIRC not every such object is seekable, is that true? If so we may need to upload the file and compute the hash in one pass.
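
For reference, a duck-typed probe along these lines could decide between the two paths (an illustrative sketch; the helper name is made up):

def _can_rewind(fileobj):
    # io objects advertise seekability directly.
    try:
        return fileobj.seekable()
    except AttributeError:
        pass
    # Otherwise probe: seeking back to the current position is a no-op.
    try:
        fileobj.seek(fileobj.tell())
        return True
    except (AttributeError, OSError):
        return False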

metadata = fileobj.metadata
sha1_sum = metadata and metadata.get("sha1_sum")
sha1_sum = sha1_sum or self._compute_sha(fileobj)
response.set_etag(sha1_sum)
@ShaneHarvey (Collaborator) commented on the diff above:

Should we check for an existing md5 and use that hash here? That way it would be compatible with existing flask-pymongo databases, and apps won't pay the cost of recomputing the hash every time they read existing data.

Alternatively we could update the fileobj and add the sha1 hash here so that the cost is only paid on the first read.

@ShaneHarvey (Collaborator) commented Jan 15, 2025

> Another option is for us to have a weak ref dictionary, so we can associate an object with its sha1sum and also handle grid files not created by us, wdyt?

My preferred approach would be to update the GridFS file with the new hash if it's not found when reading the file. That way the file hash only needs to be calculated once.

The one case this could be problematic is when an app is deployed with read-only user permissions on the GridFS database, although I'm not sure if flask pymongo supports that use case at all.
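
A sketch of that approach, assuming the default fs.files collection and the _compute_sha helper from this PR:

sha1_sum = self._compute_sha(fileobj)
# Write the hash back into the existing file document so later reads are cheap.
# With a read-only user this update_one raises OperationFailure.
db["fs.files"].update_one(
    {"_id": fileobj._id},
    {"$set": {"metadata.sha1_sum": sha1_sum}},
)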

@blink1073 (Member Author)

> My preferred approach would be to update the GridFS file with the new hash if it's not found when reading the file. That way the file hash only needs to be calculated once.
> The one case this could be problematic is when an app is deployed with read-only user permissions on the GridFS database, although I'm not sure if flask pymongo supports that use case at all.

This conflicts with the version parameter of send_file, since we'd end up creating new versions to add the shasum if it doesn't exist, and the ordering would get all garbled. I think that the weak ref approach makes the most sense, as it is an attached property of the gridfs file.

@blink1073 (Member Author)

A weakref won't work either, since these objects are created on demand. The latest version uses a fixed-size cache to at least store the most recently used checksums. I didn't set it on write because it wouldn't be uniform for pre-existing files, and you might not end up reading the file from flask, so it would be wasted effort.
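
For reference, a bounded cache like that can be a thin wrapper around an OrderedDict (a sketch assuming single-threaded access; the PR's actual implementation may differ):

from collections import OrderedDict

class _LRUCache:
    """Keep at most maxsize entries, evicting the least recently used."""

    def __init__(self, maxsize=128):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict the oldest entry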


# GridFS does not manage its own checksum, so we manage our own using its
# metadata storage, to be used for the etag.
sha1_sum = self._hash_cache.get(str(fileobj._id))
@ShaneHarvey (Collaborator) commented on the diff above:

This is problematic because although _id is unique, _ids can be reused over time by deleting the old file and uploading a new one with the same _id. In that case this cache becomes out of date.
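
One illustrative way to guard against stale entries (a sketch, not what the PR does): fold the upload timestamp into the key, since a re-uploaded file gets a fresh uploadDate even when the _id is recycled.

# GridOut exposes the file document's uploadDate as upload_date.
cache_key = (str(fileobj._id), fileobj.upload_date)
sha1_sum = self._hash_cache.get(cache_key)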

metadata = fileobj.metadata
sha1_sum = metadata and metadata.get("sha1_sum")
sha1_sum = sha1_sum or self._compute_sha(fileobj)
response.set_etag(sha1_sum)
@ShaneHarvey (Collaborator) commented on the diff above:

> This conflicts with the version parameter of send_file, since we'd end up creating new versions to add the shasum if it doesn't exist, and the ordering would get all garbled

I believe the file "version" concept is for the file data itself. Also, I do feel more strongly that we should continue to use the md5 field if it's available for backwards compat. So the full read logic would be:

  1. Use sha hash from file metadata if available,
  2. Fallback to using md5 hash from file metadata if available,
  3. Fallback to recompute sha hash.

This should be good enough since step 3 only happens when reading files that were not uploaded via save_file() and that's probably a niche use case. If that's an incorrect assumption we can revisit the idea later on.
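
Put together, the read path would look roughly like this (a sketch; md5 is the legacy field older drivers stored, and a missing field is simply an absent attribute on GridOut):

metadata = fileobj.metadata or {}
# 1. Our own sha hash stored in the file's metadata.
etag = metadata.get("sha1_sum")
# 2. Legacy md5 field written by older drivers / flask-pymongo versions.
if not etag:
    etag = getattr(fileobj, "md5", None)
# 3. Last resort: recompute the sha hash.
if not etag:
    etag = self._compute_sha(fileobj)
response.set_etag(etag)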

Regardless of the read-path implementation details, save_file should add the hash automatically so that future reads are cheap. One clever way would be to add a wrapper class around fileobj that hashes the data as it's read:

class Wrapper:
    def __init__(self, file):
        self.file = file
        self.hash = hashlib.sha1()

    def read(self, n):
        # Hash the data as GridFS pulls it through, so no second pass is needed.
        data = self.file.read(n)
        if data:
            self.hash.update(data)
        return data

def save_file(...):
    storage = GridFS(db_obj, base)
    hashingfile = Wrapper(fileobj)
    with storage.new_file(filename=filename, content_type=content_type, **kwargs) as grid_file:
        # GridIn.write() accepts a file-like object and reads it in chunks.
        grid_file.write(hashingfile)
        grid_file.sha1 = hashingfile.hash.hexdigest()
        return grid_file._id

@blink1073 (Member Author)

I updated the logic accordingly

blink1073 requested a review from ShaneHarvey · January 17, 2025 13:46
@ShaneHarvey (Collaborator) left a review comment:

Thanks for the changes. They look good to me. Are there any tests we can add for this?

@blink1073 (Member Author)

> Are there any tests we can add for this?

We already had a test for the etag behavior. Do you mean for md5 handling?

blink1073 requested a review from ShaneHarvey · January 17, 2025 22:39
@ShaneHarvey (Collaborator)

Yeah like a test that creates a file with the previous md5 format and verifies that the etag is the md5 hash.
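
Something along these lines could work (a hypothetical sketch; the app/mongo fixtures and the exact send_file call are assumptions about the test suite):

import hashlib

from gridfs import GridFS

def test_md5_etag(app, mongo):
    data = b"legacy data"
    md5_sum = hashlib.md5(data).hexdigest()
    # Simulate a file written by an older flask-pymongo: md5 field, no sha metadata.
    GridFS(mongo.db).put(data, filename="legacy.txt", md5=md5_sum)
    with app.test_request_context():
        response = mongo.send_file("legacy.txt")
    assert response.get_etag()[0] == md5_sum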

@blink1073 (Member Author)

> Yeah like a test that creates a file with the previous md5 format and verifies that the etag is the md5 hash.

Done

@ShaneHarvey (Collaborator) left a review comment:

LGTM!

@ShaneHarvey (Collaborator) commented Jan 21, 2025

When you merge, could you update the PR title to reflect the new approach?

blink1073 changed the title from "Use caching for the etag" to "Use caching for the etag using stored hash if available" · Jan 21, 2025
blink1073 merged commit e2e8d10 into mongodb-labs:main · Jan 21, 2025
21 checks passed
Successfully merging this pull request may close these issues:
save_file should store the file etag hash so that it doesn't need to be calculated on each read