Use caching for the etag using stored hash if available #182
Conversation
Another option is for us to have a weak ref dictionary, so we can associate an object with its sha1sum and also handle grid files not created by us, wdyt?
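A rough sketch of that idea (illustrative only; `_compute_sha` is the helper from this PR, the rest is assumed, and it presumes the file object supports weak references):

import weakref

# Map a file object to its sha1 so the cached entry disappears
# automatically when the object is garbage collected.
_sha_cache = weakref.WeakKeyDictionary()

def _cached_sha(fileobj):
    try:
        return _sha_cache[fileobj]
    except KeyError:
        sha1_sum = _compute_sha(fileobj)  # the PR's existing helper
        _sha_cache[fileobj] = sha1_sum
        return sha1_sum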
flask_pymongo/__init__.py
Outdated
pos = fileobj.tell()
raw_data = fileobj.read()
fileobj.seek(pos)
return hashlib.sha1(raw_data).hexdigest()
Instead of reading the whole file, it's better to iterate the file data in chunks and calculate the hash:
hash = hashlib.sha1()
while True:
    chunk = fileobj.readchunk()  # GridOut.readchunk() returns one GridFS chunk at a time
    if not chunk:
        break
    hash.update(chunk)
sha1_sum = hash.hexdigest()
That way we don't have to assemble the entire file in memory at once. However, I suggest we don't implement this optimization now and defer it as future work because there are too many other questions in flight that would change this code.
flask_pymongo/__init__.py
Outdated
sha1_sum = self._compute_sha(fileobj)
metadata = dict(sha1_sum=sha1_sum)
id = storage.put(
    fileobj, filename=filename, content_type=content_type, metadata=metadata, **kwargs
Hmm, I just had another thought. GridOut is seekable, but fileobj passed to save_file() here can be any file-like object, and IIRC not every such object is seekable, is that true? If so we may need to upload the file and compute the hash in one pass.
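A minimal sketch of that check (assuming fileobj follows the standard io interface; the helper name is illustrative):

import hashlib

def _compute_sha_if_seekable(fileobj):
    # A non-seekable stream can't be rewound after hashing, so signal
    # the caller to fall back to hashing during the upload itself.
    if not getattr(fileobj, "seekable", lambda: False)():
        return None
    pos = fileobj.tell()
    hash = hashlib.sha1()
    for chunk in iter(lambda: fileobj.read(8192), b""):
        hash.update(chunk)
    fileobj.seek(pos)
    return hash.hexdigest()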
flask_pymongo/__init__.py
Outdated
metadata = fileobj.metadata
sha1_sum = metadata and metadata.get("sha1_sum")
sha1_sum = sha1_sum or self._compute_sha(fileobj)
response.set_etag(sha1_sum)
Should we check for an existing md5 and use that hash here? That way it would be compatible with existing flask-pymongo databases and apps won't pay the cost of recomputing the hash every time when reading existing data.
Alternatively we could update the fileobj and add the sha1 hash here so that the cost is only paid on the first read.
This conflicts with the version parameter of send_file, since we'd end up creating new versions to add the shasum if it doesn't exist, and the ordering would get all garbled
I believe the file "version" concept is for the file data itself. Also, I do feel more strongly that we should continue to use the md5 field if it's available for backwards compat. So the full read logic would be:
- Use sha hash from file metadata if available,
- Fallback to using md5 hash from file metadata if available,
- Fallback to recompute sha hash.
This should be good enough since step 3 only happens when reading files that were not uploaded via save_file() and that's probably a niche use case. If that's an incorrect assumption we can revisit the idea later on.
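That fallback chain could look roughly like this (a sketch; _compute_sha is the PR's helper, the other names are assumptions):

def _resolve_etag(fileobj):
    metadata = fileobj.metadata or {}
    # 1. Prefer the sha1 hash stored by save_file().
    sha1_sum = metadata.get("sha1_sum")
    if sha1_sum:
        return sha1_sum
    # 2. Fall back to the legacy md5 field; GridOut exposes fields from
    #    the files document as attributes when they are present.
    md5_sum = getattr(fileobj, "md5", None)
    if md5_sum:
        return md5_sum
    # 3. Last resort: recompute the sha1 from the file data.
    return _compute_sha(fileobj)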
Regardless of the read_file impl details, send_file should add the hash automatically so that future reads are cheap. One clever way would be to add a wrapper class around fileobj that hashes the data as it's read:
import hashlib

class Wrapper:
    def __init__(self, file):
        self.file = file
        self.hash = hashlib.sha1()

    def read(self, n=-1):
        # Hash the data as it streams through to GridFS.
        data = self.file.read(n)
        if data:
            self.hash.update(data)
        return data

def save_file(...):
    storage = GridFS(db_obj, base)
    hashingfile = Wrapper(fileobj)
    with storage.new_file(filename=filename, content_type=content_type, **kwargs) as grid_file:
        # GridIn.write() accepts a file-like object and reads it in chunks.
        grid_file.write(hashingfile)
        grid_file.sha1 = hashingfile.hash.hexdigest()
    return grid_file._id
My preferred approach would be to update the GridFS file with the new hash if it's not found when reading the file. That way the file hash only needs to be calculated once. The one case this could be problematic is when an app is deployed with read-only user permissions on the GridFS database, although I'm not sure if flask-pymongo supports that use case at all.
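A sketch of that back-fill (helper name and default "fs" collection are assumptions; it writes the hash onto the existing files document, so it needs write permission):

def _backfill_sha(db, fileobj):
    # Store the freshly computed hash on the existing files document so
    # the next read can skip recomputation.
    sha1_sum = _compute_sha(fileobj)
    db.fs.files.update_one(
        {"_id": fileobj._id},
        {"$set": {"metadata.sha1_sum": sha1_sum}},
    )
    return sha1_sum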
A weakref won't work either, since these objects are created on demand. The latest version uses a fixed-size cache to at least store the most recently used checksums. I didn't set it on write because it wouldn't be uniform for pre-existing files, and you might not end up reading the file from Flask, so it would be wasted effort.
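A minimal fixed-size cache along those lines (a sketch; the PR's actual implementation may differ):

from collections import OrderedDict

class HashCache:
    def __init__(self, maxsize=128):
        self._data = OrderedDict()
        self._maxsize = maxsize

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._maxsize:
            self._data.popitem(last=False)  # evict the least recently used entry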
flask_pymongo/__init__.py
Outdated
# GridFS does not manage its own checksum, so we manage our own using its
# metadata storage, to be used for the etag.
sha1_sum = self._hash_cache.get(str(fileobj._id))
This is problematic because although _id is unique, _ids can be reused over time by deleting the old file and uploading a new one with the same _id. In that case this cache becomes out of date.
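One way to sidestep that staleness (illustrative only, not what the thread settled on) would be to key the cache on both _id and the upload date, since a re-upload gets a fresh uploadDate:

# GridOut exposes the files document's uploadDate as upload_date.
cache_key = (str(fileobj._id), fileobj.upload_date)
sha1_sum = self._hash_cache.get(cache_key)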
> Regardless of the read_file impl details, send_file should add the hash automatically so that future reads are cheap.
I updated the logic accordingly.
Thanks for the changes. They look good to me. Are there any tests we can add for this?
We already had a test for the etag behavior. Do you mean for md5 handling?
Yeah, like a test that creates a file with the previous md5 format and verifies that the etag is the md5 hash.
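Something like this, roughly (a sketch; the fixture names self.app and self.mongo are assumptions about the test suite):

from gridfs import GridFS

def test_md5_etag_fallback(self):
    # Simulate a file written by an older flask-pymongo: an md5 field on
    # the files document and no sha1_sum in the metadata.
    oid = GridFS(self.mongo.db).put(b"hello world", filename="legacy.txt")
    md5_hex = "0123456789abcdef0123456789abcdef"
    self.mongo.db.fs.files.update_one({"_id": oid}, {"$set": {"md5": md5_hex}})
    with self.app.test_request_context():
        response = self.mongo.send_file("legacy.txt")
    assert md5_hex in response.headers["ETag"]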
Done
LGTM!
When you merge, could you update the PR title to reflect the new approach?
Fixes #181