
Use caching for the etag using stored hash if available #182


Merged · 6 commits into mongodb-labs:main · Jan 21, 2025

Conversation

@blink1073 (Member)

Fixes #181

blink1073 requested a review from ShaneHarvey · January 15, 2025 20:47
blink1073 marked this pull request as draft · January 15, 2025 20:58
blink1073 marked this pull request as ready for review · January 15, 2025 21:16
@blink1073 (Member Author)

Another option is for us to have a weak ref dictionary, so we can associate an object with its sha1sum and also handle grid files not created by us, wdyt?
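
A rough sketch of that idea (hypothetical names; assuming GridOut instances accept weak references):

import weakref

# Hypothetical sketch of the weak-ref idea: map each GridOut object to its
# sha1sum; entries vanish when the file object is garbage collected.
_sha_cache = weakref.WeakKeyDictionary()

def _cached_sha1(fileobj):
    sha1_sum = _sha_cache.get(fileobj)
    if sha1_sum is None:
        sha1_sum = _compute_sha(fileobj)  # the helper added in this PR
        _sha_cache[fileobj] = sha1_sum
    return sha1_sum

As blink1073 points out further down, this only pays off if the same GridOut instance is reused across reads; a fresh object per request would never hit the cache.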

pos = fileobj.tell()
raw_data = fileobj.read()
fileobj.seek(pos)
return hashlib.sha1(raw_data).hexdigest()
@ShaneHarvey (Collaborator) commented on the diff above:

Instead of reading the whole file, it's better to iterate the file data in chunks and calculate the hash:

sha1 = hashlib.sha1()
while True:
    # GridOut.readchunk() returns one GridFS chunk at a time.
    chunk = fileobj.readchunk()
    if not chunk:
        break
    sha1.update(chunk)
sha1_sum = sha1.hexdigest()

That way we don't have to assemble the entire file in memory at once. However, I suggest we don't implement this optimization now and defer it as future work because there are too many other questions in flight that would change this code.

sha1_sum = self._compute_sha(fileobj)
metadata = dict(sha1_sum=sha1_sum)
id = storage.put(
    fileobj, filename=filename, content_type=content_type, metadata=metadata, **kwargs
)
@ShaneHarvey (Collaborator) commented on the diff above:

Hmm, I just had another thought. GridOut is seekable, but fileobj in save_file() here can be any file-like object, and IIRC not every such object is seekable, is that true? If so we may need to upload the file and compute the hash in one pass.
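
For reference, a duck-typed probe along these lines could decide between the two paths (an illustrative sketch; the helper name is made up):

def _can_rewind(fileobj):
    # io objects advertise seekability directly.
    try:
        return fileobj.seekable()
    except AttributeError:
        pass
    # Otherwise probe: seeking back to the current position is a no-op.
    try:
        fileobj.seek(fileobj.tell())
        return True
    except (AttributeError, OSError):
        return False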

metadata = fileobj.metadata
sha1_sum = metadata and metadata.get("sha1_sum")
sha1_sum = sha1_sum or self._compute_sha(fileobj)
response.set_etag(sha1_sum)
@ShaneHarvey (Collaborator) commented on the diff above:

Should we check for an existing md5 and use that hash here? That way it would be compatible with existing flask-pymongo databases, and apps won't pay the cost of recomputing the hash every time they read existing data.

Alternatively we could update the fileobj and add the sha1 hash here so that the cost is only paid on the first read.

@ShaneHarvey (Collaborator) commented Jan 15, 2025

> Another option is for us to have a weak ref dictionary, so we can associate an object with its sha1sum and also handle grid files not created by us, wdyt?

My preferred approach would be to update the GridFS file with the new hash if it's not found when reading the file. That way the file hash only needs to be calculated once.

The one case this could be problematic is when an app is deployed with read-only user permissions on the GridFS database, although I'm not sure if flask pymongo supports that use case at all.
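
A sketch of that approach, assuming the default fs.files collection and the _compute_sha helper from this PR:

sha1_sum = self._compute_sha(fileobj)
# Write the hash back into the existing file document so later reads are cheap.
# With a read-only user this update_one raises OperationFailure.
db["fs.files"].update_one(
    {"_id": fileobj._id},
    {"$set": {"metadata.sha1_sum": sha1_sum}},
)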

@blink1073 (Member Author)

> My preferred approach would be to update the GridFS file with the new hash if it's not found when reading the file. That way the file hash only needs to be calculated once.
> The one case this could be problematic is when an app is deployed with read-only user permissions on the GridFS database, although I'm not sure if flask pymongo supports that use case at all.

This conflicts with the version parameter of send_file, since we'd end up creating new versions to add the shasum if it doesn't exist, and the ordering would get all garbled. I think that the weak ref approach makes the most sense, as it is an attached property of the gridfs file.

@blink1073 (Member Author)

A weakref won't work either, since these objects are created on demand. The latest version uses a fixed-size cache to at least store the most recently used checksums. I didn't set it on write because it wouldn't be uniform for pre-existing files, and you might not end up reading the file from flask, so it would be wasted effort.
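
For reference, a bounded cache like that can be a thin wrapper around an OrderedDict (a sketch assuming single-threaded access; the PR's actual implementation may differ):

from collections import OrderedDict

class _LRUCache:
    """Keep at most maxsize entries, evicting the least recently used."""

    def __init__(self, maxsize=128):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict the oldest entry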


# GridFS does not manage its own checksum, so we manage our own using its
# metadata storage, to be used for the etag.
sha1_sum = self._hash_cache.get(str(fileobj._id))
@ShaneHarvey (Collaborator) commented on the diff above:

This is problematic because although _id is unique, _ids can be reused over time by deleting the old file and uploading a new one with the same _id. In that case this cache becomes out of date.
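
One illustrative way to guard against stale entries (a sketch, not what the PR does): fold the upload timestamp into the key, since a re-uploaded file gets a fresh uploadDate even when the _id is recycled.

# GridOut exposes the file document's uploadDate as upload_date.
cache_key = (str(fileobj._id), fileobj.upload_date)
sha1_sum = self._hash_cache.get(cache_key)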

metadata = fileobj.metadata
sha1_sum = metadata and metadata.get("sha1_sum")
sha1_sum = sha1_sum or self._compute_sha(fileobj)
response.set_etag(sha1_sum)
@ShaneHarvey (Collaborator) commented on the diff above:

> This conflicts with the version parameter of send_file, since we'd end up creating new versions to add the shasum if it doesn't exist, and the ordering would get all garbled

I believe the file "version" concept is for the file data itself. Also, I do feel more strongly that we should continue to use the md5 field if it's available for backwards compat. So the full read logic would be:

  1. Use sha hash from file metadata if available,
  2. Fallback to using md5 hash from file metadata if available,
  3. Fallback to recompute sha hash.

This should be good enough since step 3 only happens when reading files that were not uploaded via save_file() and that's probably a niche use case. If that's an incorrect assumption we can revisit the idea later on.
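
Put together, the read path would look roughly like this (a sketch; md5 is the legacy field older drivers stored, and a missing field is simply an absent attribute on GridOut):

metadata = fileobj.metadata or {}
# 1. Our own sha hash stored in the file's metadata.
etag = metadata.get("sha1_sum")
# 2. Legacy md5 field written by older drivers / flask-pymongo versions.
if not etag:
    etag = getattr(fileobj, "md5", None)
# 3. Last resort: recompute the sha hash.
if not etag:
    etag = self._compute_sha(fileobj)
response.set_etag(etag)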

Regardless of the read-path implementation details, save_file should add the hash automatically so that future reads are cheap. One clever way would be to add a wrapper class around fileobj that hashes the data as it's read:

class Wrapper:
    def __init__(self, file):
        self.file = file
        self.hash = hashlib.sha1()

    def read(self, n):
        # Hash the data as GridFS pulls it through, so no second pass is needed.
        data = self.file.read(n)
        if data:
            self.hash.update(data)
        return data

def save_file(...):
    storage = GridFS(db_obj, base)
    hashingfile = Wrapper(fileobj)
    with storage.new_file(filename=filename, content_type=content_type, **kwargs) as grid_file:
        # GridIn.write() accepts a file-like object and reads it in chunks.
        grid_file.write(hashingfile)
        grid_file.sha1 = hashingfile.hash.hexdigest()
        return grid_file._id

@blink1073 (Member Author)

I updated the logic accordingly

blink1073 requested a review from ShaneHarvey · January 17, 2025 13:46
@ShaneHarvey (Collaborator) left a review comment:

Thanks for the changes. They look good to me. Are there any tests we can add for this?

@blink1073 (Member Author)

> Are there any tests we can add for this?

We already had a test for the etag behavior. Do you mean for md5 handling?

blink1073 requested a review from ShaneHarvey · January 17, 2025 22:39
@ShaneHarvey (Collaborator)

Yeah like a test that creates a file with the previous md5 format and verifies that the etag is the md5 hash.
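
Something along these lines could work (a hypothetical sketch; the app/mongo fixtures and the exact send_file call are assumptions about the test suite):

import hashlib

from gridfs import GridFS

def test_md5_etag(app, mongo):
    data = b"legacy data"
    md5_sum = hashlib.md5(data).hexdigest()
    # Simulate a file written by an older flask-pymongo: md5 field, no sha metadata.
    GridFS(mongo.db).put(data, filename="legacy.txt", md5=md5_sum)
    with app.test_request_context():
        response = mongo.send_file("legacy.txt")
    assert response.get_etag()[0] == md5_sum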

@blink1073 (Member Author)

> Yeah like a test that creates a file with the previous md5 format and verifies that the etag is the md5 hash.

Done

@ShaneHarvey (Collaborator) left a review comment:

LGTM!

@ShaneHarvey (Collaborator) commented Jan 21, 2025

When you merge, could you update the PR title to reflect the new approach?

blink1073 changed the title from "Use caching for the etag" to "Use caching for the etag using stored hash if available" · Jan 21, 2025
blink1073 merged commit e2e8d10 into mongodb-labs:main · Jan 21, 2025
21 checks passed
Successfully merging this pull request may close these issues:
save_file should store the file etag hash so that it doesn't need to be calculated on each read