
Add S3OptimizedUploadStorage #128


Merged: 9 commits merged into codingjoe:master on Sep 8, 2021

Conversation

@drakon (Contributor) commented Apr 8, 2021

Hi Joe

As mentioned, I've now added the code to your library. I tested it locally with my setup and for my project, and it worked as before.

The only thing I'm not happy about is that I had to put it into a separate file, "storages_optimized.py": when I tried it in "storages.py" I ran into import issues because of the import of "storages.backends.s3boto3". I found some workarounds, but they seemed rather risky compared to just using another file name. If you have a better idea, please do change it.

I also added some comments to the readme.

I'm not sure if it needs more cautionary notes, as I don't know all the cases that no longer work (for example, it's probably not a good idea to use this as the DEFAULT_FILE_STORAGE in Django, since file uploads/saves from sources other than S3 probably won't work as expected).
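
To illustrate the idea, here is a rough sketch of what the backend does. It is not the exact merged code (the error message and the check are illustrative); the authoritative version lives in s3file/storages_optimized.py:

```python
# Rough sketch, illustrative only: if the content was already uploaded to S3
# by the client, copy the object server-side instead of streaming it through
# Django again.
from storages.backends.s3boto3 import S3Boto3Storage


class S3OptimizedUploadStorage(S3Boto3Storage):
    def _save(self, name, content):
        cleaned_name = self._clean_name(name)
        name = self._normalize_name(cleaned_name)
        params = self._get_write_parameters(name, content)

        if not hasattr(content, "obj") or not hasattr(content.obj, "key"):
            # Illustrative error handling; the backend assumes the file
            # already lives in S3 (uploaded client-side via django-s3file).
            raise TypeError("Content must already be an S3 object with a key.")

        # Server-side copy instead of a fresh upload.
        obj = self.bucket.Object(name)
        obj.copy({"Bucket": self.bucket.name, "Key": content.obj.key}, ExtraArgs=params)
        return cleaned_name
```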

Looking forward to your inputs. And thanks for your patience. ;-)

Best,
Christian

@codecov bot commented Apr 8, 2021

Codecov Report

Merging #128 (efe5a64) into master (366e0ac) will increase coverage by 0.04%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #128      +/-   ##
==========================================
+ Coverage   99.38%   99.43%   +0.04%     
==========================================
  Files           7        8       +1     
  Lines         162      176      +14     
==========================================
+ Hits          161      175      +14     
  Misses          1        1              
Impacted Files Coverage Δ
s3file/storages_optimized.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@drakon (Contributor, Author) commented Apr 8, 2021

OK, I fixed the issues with black and pydocstyle. The failures from CI/pytest and bandit don't seem to be caused by any of my changes.

But I'm having a hard time adding a test/coverage for this, as I don't know how to emulate the initial upload and then the save/copy call on MockS3. I'll keep looking into it, but I'd be really glad to get some pointers.

@drakon (Contributor, Author) commented Apr 9, 2021

@codingjoe OK, I now have some tests covering all my added code. For some reason the tests don't run through, though, and the coverage doesn't get updated here on GitHub. I'll leave it for now until I get your input.

@codingjoe (Owner) commented

I'll look into it next week, since I'm on vacation right now. But thanks in advance for all the hard work.

@drakon (Contributor, Author) commented Apr 9, 2021

No worries, take your time and enjoy your vacation! :)
I hope my commits didn't spam you too much. Next time I'll use the draft feature before pushing every commit. ^^

@codingjoe (Owner) left a comment

Hi @drakon,

Thanks a lot. This looks like a good start. I left you a couple of comments, but this will require a couple of rounds to really nail the implementation. Don't be discouraged, though.

It would be great if you could rebase your branch, since I fixed the CI suite.

@drakon (Contributor, Author) commented Apr 12, 2021

Hi Joe, yeah, sure. Thanks for the comments! I'll go through them and address them in the PR. It took some time to set everything up in the first place, so I haven't spent enough focused time on the actual changes yet. Will do!

@codingjoe mentioned this pull request Aug 12, 2021
@drakon (Contributor, Author) commented Sep 6, 2021

Hi @codingjoe, it took me a while, as I was engaged with other projects. But I've finally improved the PR and commented on the changes I made above. Hope it's good now. :)

@codingjoe (Owner) left a comment

OK, LGTM, let's give this a go!

@codingjoe merged commit 83d1d17 into codingjoe:master on Sep 8, 2021
@codingjoe (Owner) commented

It's released in 5.4.0 🎉

@drakon (Contributor, Author) commented Sep 8, 2021

Awesome, thanks! :)

@codingjoe (Owner) commented

@drakon, we tried switching to the new backend but saw some errors. I believe you are returning the wrong value. You should be returning name, not cleaned_name. Could you please check? Some of our image processors complain that the key does not exist, and you are using name as the copy key, not cleaned_name.

@drakon (Contributor, Author) commented Sep 13, 2021

That strikes me as odd, as this is a part that I didn't change, and it is consistent with the behaviour of the default implementation (current version):

def _save(self, name, content):
    cleaned_name = self._clean_name(name)
    name = self._normalize_name(cleaned_name)
    params = self._get_write_parameters(name, content)

    if (self.gzip and
            params['ContentType'] in self.gzip_content_types and
            'ContentEncoding' not in params):
        content = self._compress_content(content)
        params['ContentEncoding'] = 'gzip'

    obj = self.bucket.Object(name)
    content.seek(0, os.SEEK_SET)
    obj.upload_fileobj(content, ExtraArgs=params)
    return cleaned_name

Secondly, the name here is actually the cleaned and normalized one (see the first two lines).

So, the only real difference should be confined to this line:

# Copy the file instead of uploading
obj.copy({"Bucket": self.bucket.name, "Key": content.obj.key}, ExtraArgs=params)

content.obj.key refers to the source key, and I'd expect it to already be cleaned.

Can you provide more details about the error you receive or the keys/names/paths involved?

@codingjoe (Owner) commented

@amureki will give you an update about that. I also thought this might have been a race condition, but S3 should be consistent.

Maybe the errors we saw were also related to other issues. @amureki will have more details when he is back from vacation next week.

@amureki (Contributor) commented Sep 21, 2021

@drakon hey Christian! First, thank you for this PR, this is a nice improvement. 👍

I investigated a bit more here and found out that the errors (the TypeError from this PR) appear when the upload is done with this backend but is triggered from somewhere other than django-s3file (the case you mentioned in your PR description).

Now, I wonder if we can tweak this custom backend to handle such cases and check whether content actually contains an S3 file. Do you think that would defeat the original purpose and reduce the optimization?

@drakon (Contributor, Author) commented Sep 21, 2021

Hi @amureki no worries. :)

Hmm, I see. Well, the initial idea was that this storage only optimises uploads where you already know you have an S3 file. That's also why there's an error when that's not the case (the assumption is that content.obj.key is an S3 key). So it only works in combination with this library (where we already know the file has been uploaded from the client side), and it's applied only to the fields where you want the optimisation: models.FileField(storage=OptimizedStorage, ...). Other fields would behave as usual.
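
For illustration, a minimal sketch of that field-level setup. The model and field names are just placeholders; only the import path and the FileField storage argument come from this PR and Django itself:

```python
# Illustrative only: attach the optimised backend to the fields whose uploads
# go through django-s3file; other fields keep the project's default storage.
from django.db import models
from s3file.storages_optimized import S3OptimizedUploadStorage


class Document(models.Model):
    # Uploaded client-side via django-s3file, so the object is already in S3
    # and _save() only has to copy it to its final key.
    attachment = models.FileField(
        storage=S3OptimizedUploadStorage(), upload_to="documents/"
    )
    # Regular upload path using the default storage, unaffected by the backend.
    report = models.FileField(upload_to="reports/")
```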

From what I understand, you would want to do uploads with this optimised storage from other sources as well?

It should certainly be doable, but I was a bit hesitant (also given Joe's comment) to silently fall back to an un-optimised path when the condition is not met, for two reasons:

  • The condition for the upload might also be met in other cases. In logical terms: the condition used in the check is necessary for S3 files but not sufficient.
  • The rest of the implementation is a copy/paste of the original one. Keeping it up to date is a bit harder, so we should only use it when really necessary.

Having said that, it should be straightforward to implement: instead of throwing an error, we could just use lines 32 & 33. Unfortunately, I don't even have a use case where I can test this in a real-world scenario. And I'd probably recommend changing the check to something that is not only necessary but also sufficient to determine whether content is really an S3 file, i.e. it identifies all S3 files as such and excludes everything else. But in that case one could also add this directly to django-storages.
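
For illustration, a rough sketch of such a fallback. It is not part of this PR, the class name is made up, and the check is still the necessary-but-not-sufficient one discussed above:

```python
# Illustrative fallback, not part of the merged PR: take the optimised copy
# path only when the content already looks like an S3 object; otherwise defer
# to the plain S3Boto3Storage upload.
from s3file.storages_optimized import S3OptimizedUploadStorage


class S3OptimizedFallbackStorage(S3OptimizedUploadStorage):
    def _save(self, name, content):
        if not hasattr(content, "obj") or not hasattr(content.obj, "key"):
            # Not an already-uploaded S3 object: skip the optimised _save and
            # call the regular S3Boto3Storage upload instead.
            return super(S3OptimizedUploadStorage, self)._save(name, content)
        # Already in S3: use the copy-based _save from S3OptimizedUploadStorage.
        return super()._save(name, content)
```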

But to be clear: in that case it's just about convenience; there would be no optimization compared to the normal backend (it would literally be the same code).

@amureki (Contributor) commented Sep 23, 2021

@drakon thank you for digging into this issue with us. :)

I agree that making more changes to this storage backend might do harm. Instead, we could just configure the relevant fields with the custom backend (or introduce a custom field with the storage backend plugged in).

So, I think we are good here, thank you again for the help and for the initial contribution!

@drakon (Contributor, Author) commented Sep 23, 2021

My pleasure. Thanks for providing this library in the first place. :)

Yes, I think that's probably the easiest way. If there is growing need and more people want this, one could think about getting it into django-storages directly as an optimization for such cases, since it would then be an efficiency gain for others too. But at this point I'm not even sure it's a real issue outside of this library.
