Use caching for the etag using stored hash if available #182
Changes from 3 commits
```diff
@@ -183,12 +183,13 @@ def get_upload(filename):
     response.headers["Content-Disposition"] = f"attachment; filename=(unknown)"
     response.content_length = fileobj.length
     response.last_modified = fileobj.upload_date
-    # Compute the sha1 sum of the file for the etag.
-    pos = fileobj.tell()
-    raw_data = fileobj.read()
-    fileobj.seek(pos)
-    digest = hashlib.sha1(raw_data).hexdigest()
-    response.set_etag(digest)
+    # Get or compute the sha1 sum for the etag.
+    metadata = fileobj.metadata
+    sha1_sum = metadata and metadata.get("sha1_sum")
+    sha1_sum = sha1_sum or self._compute_sha(fileobj)
+    response.set_etag(sha1_sum)
+
     response.cache_control.max_age = cache_for
     response.cache_control.public = True
     response.make_conditional(request)
```
```diff
@@ -237,5 +238,19 @@ def save_upload(filename):
         db_obj = self.db
         assert db_obj is not None, "Please initialize the app before calling save_file!"
         storage = GridFS(db_obj, base)
-        id = storage.put(fileobj, filename=filename, content_type=content_type, **kwargs)
+
+        # GridFS does not manage its own hash, so we manage our own using its
+        # metadata storage, to be used for the etag.
+        sha1_sum = self._compute_sha(fileobj)
+        metadata = dict(sha1_sum=sha1_sum)
+        id = storage.put(
+            fileobj, filename=filename, content_type=content_type, metadata=metadata, **kwargs
+        )
         return id
+
+    def _compute_sha(self, fileobj: Any) -> str:
+        """Compute the sha sum of a file object."""
+        pos = fileobj.tell()
+        raw_data = fileobj.read()
+        fileobj.seek(pos)
+        return hashlib.sha1(raw_data).hexdigest()
```

Review comment (on the `storage.put` call):

Hmm, I just had another thought. `GridOut` is seekable, but `fileobj` here in `save_file()` can be any file-like object, and IIRC not every such object is seekable, is that true? If so, we may need to upload the file and compute the hash in one pass.
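The seekability concern can be probed with the standard `seekable()` method that Python file objects expose; `can_rewind` below is a hypothetical helper for illustration, not part of the PR:

```python
import io

def can_rewind(fileobj):
    # Streams such as sockets and pipes are not seekable, so the
    # read-then-seek-back pattern in _compute_sha would fail on them.
    seekable = getattr(fileobj, "seekable", None)
    if callable(seekable):
        return seekable()
    return hasattr(fileobj, "seek")

# In-memory buffers can be rewound; a pipe or socket stream cannot.
assert can_rewind(io.BytesIO(b"data"))
```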
Review comment (on `_compute_sha`):

Instead of reading the whole file, it's better to iterate the file data in chunks and calculate the hash:

```python
hash = hashlib.sha1()
while True:
    chunk = fileobj.readchunk()
    if not chunk:
        break
    hash.update(chunk)
sha1_sum = hash.hexdigest()
```

That way we don't have to assemble the entire file in memory at once. However, I suggest we don't implement this optimization now and defer it as future work, because there are too many other questions in flight that would change this code.
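Note that `readchunk()` is specific to GridFS's `GridOut`; for an arbitrary file-like object, the same idea works with plain `read(n)`. A minimal runnable sketch (not code from the PR):

```python
import hashlib
import io

def sha1_in_chunks(fileobj, chunk_size=8192):
    # Hash fixed-size chunks so the whole file never sits in memory at once.
    digest = hashlib.sha1()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
    return digest.hexdigest()

data = b"spam and eggs" * 4096
assert sha1_in_chunks(io.BytesIO(data)) == hashlib.sha1(data).hexdigest()
```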
Review comment:

Should we check for an existing md5 and use that hash here? That way it would be compatible with existing flask-pymongo databases, and apps won't pay the cost of recomputing the hash every time they read existing data.

Alternatively, we could update the fileobj and add the sha1 hash here so that the cost is only paid on the first read.
Review comment:

I believe the file "version" concept is for the file data itself. Also, I do feel more strongly that we should continue to use the md5 field if it's available, for backwards compatibility. So the full read logic would be:

1. Use the `sha1_sum` stored in the file's metadata, if present.
2. Otherwise, fall back to the legacy GridFS `md5` field, if present.
3. Otherwise, compute the hash from the file data.

This should be good enough, since step 3 only happens when reading files that were not uploaded via `save_file()`, and that's probably a niche use case. If that's an incorrect assumption, we can revisit the idea later on.
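The three-step lookup described here could look roughly like the following sketch; `etag_for` and the fake classes are illustrative assumptions, not code from the PR:

```python
import hashlib

def etag_for(fileobj):
    # 1. Prefer the sha1 stored in metadata at upload time by save_file().
    metadata = getattr(fileobj, "metadata", None) or {}
    if metadata.get("sha1_sum"):
        return metadata["sha1_sum"]
    # 2. Fall back to the legacy GridFS md5 field for older databases.
    md5 = getattr(fileobj, "md5", None)
    if md5:
        return md5
    # 3. Last resort: hash the file data, then rewind to where we started.
    pos = fileobj.tell()
    digest = hashlib.sha1(fileobj.read()).hexdigest()
    fileobj.seek(pos)
    return digest

class _FakeGridOut:  # stand-in for a GridOut, for illustration only
    metadata = {"sha1_sum": "cafe"}

assert etag_for(_FakeGridOut()) == "cafe"
```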
Regardless of the read_file impl details, send_file should add the hash automatically so that future reads are cheap. One clever way would be to add a wrapper class around fileobj that hashes the data as it's read:
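The wrapper itself was elided from this capture; a minimal sketch of the idea (the `HashingReader` name and shape are assumptions, not code from the PR):

```python
import hashlib
import io

class HashingReader:
    """Proxy that hashes bytes as the wrapped file object is read."""

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self._hash = hashlib.sha1()

    def read(self, size=-1):
        data = self._fileobj.read(size)
        self._hash.update(data)
        return data

    def hexdigest(self):
        # Only correct once the stream has been fully consumed.
        return self._hash.hexdigest()

wrapped = HashingReader(io.BytesIO(b"payload"))
body = wrapped.read()
assert wrapped.hexdigest() == hashlib.sha1(b"payload").hexdigest()
```

`send_file` could then serve the response through such a wrapper and write `hexdigest()` back to the file's metadata afterwards, so the hashing cost is paid only on the first read.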