
Add S3OptimizedUploadStorage #128


Merged: 9 commits merged into codingjoe:master on Sep 8, 2021

Conversation

@drakon (Contributor) commented Apr 8, 2021

Hi Joe

As mentioned, I've now added the code to your library. I tested it locally with my setup and for my project, and it worked as before.

The only thing I'm not happy about is that I had to put it into a separate file, "storages_optimized.py": when I tried it in "storages.py" I ran into import issues because of the import of "storages.backends.s3boto3". I found some workarounds, but they seemed rather risky compared to just using another file name. If you have a better idea, please do change it.

I also added some comments to the readme.

I'm not sure if it needs more cautionary notes, as I don't know all the cases that no longer work (for example, it's probably not a good idea to use this as the DEFAULT_FILE_STORAGE in Django, since file uploads/saves from sources other than S3 probably won't work as expected).
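
To illustrate the idea, here is a rough sketch of what the backend does. It is not the exact merged code (the error message and the check are illustrative); the authoritative version lives in s3file/storages_optimized.py:

```python
# Rough sketch, illustrative only: if the content was already uploaded to S3
# by the client, copy the object server-side instead of streaming it through
# Django again.
from storages.backends.s3boto3 import S3Boto3Storage


class S3OptimizedUploadStorage(S3Boto3Storage):
    def _save(self, name, content):
        cleaned_name = self._clean_name(name)
        name = self._normalize_name(cleaned_name)
        params = self._get_write_parameters(name, content)

        if not hasattr(content, "obj") or not hasattr(content.obj, "key"):
            # Illustrative error handling; the backend assumes the file
            # already lives in S3 (uploaded client-side via django-s3file).
            raise TypeError("Content must already be an S3 object with a key.")

        # Server-side copy instead of a fresh upload.
        obj = self.bucket.Object(name)
        obj.copy({"Bucket": self.bucket.name, "Key": content.obj.key}, ExtraArgs=params)
        return cleaned_name
```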

Looking forward to your inputs. And thanks for your patience. ;-)

Best,
Christian

@codecov bot commented Apr 8, 2021

Codecov Report

Merging #128 (efe5a64) into master (366e0ac) will increase coverage by 0.04%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #128      +/-   ##
==========================================
+ Coverage   99.38%   99.43%   +0.04%     
==========================================
  Files           7        8       +1     
  Lines         162      176      +14     
==========================================
+ Hits          161      175      +14     
  Misses          1        1              
Impacted Files Coverage Δ
s3file/storages_optimized.py 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@drakon (Contributor, Author) commented Apr 8, 2021

OK, I fixed the issues with black and pydocstyle. The failures from CI/pytest and bandit don't seem to be caused by any of my changes.

But I'm having a hard time adding a test/coverage for this, as I don't know how to emulate the initial upload and then the save/copy call on MockS3. I'll keep looking into it, but I'd be really glad to get some pointers.

@drakon (Contributor, Author) commented Apr 9, 2021

@codingjoe OK, I now have some tests covering all my added code. For some reason the tests don't run through, though, and the coverage doesn't get updated here on GitHub. I'll leave it for now until I get your input.

@codingjoe (Owner) commented

I'll look into it next week, since I'm on vacation right now. But thanks in advance for all the hard work.

@drakon (Contributor, Author) commented Apr 9, 2021

No worries, take your time and enjoy your vacation! :)
I hope my commits didn't spam you too much. Next time I'll use the draft feature before pushing every commit. ^^

@codingjoe (Owner) left a comment

Hi @drakon,

Thanks a lot. This looks like a good start. I left you a couple of comments, but this will require a couple of rounds to really nail the implementation. Don't be discouraged, though.

It would be great if you could rebase your branch, since I fixed the CI suite.

@drakon (Contributor, Author) commented Apr 12, 2021

Hi Joe, yeah, sure. Thanks for the comments! I'll go through them and address them in the PR. It took some time to set everything up in the first place, so I haven't spent enough focused time on the actual changes yet. Will do!

@codingjoe mentioned this pull request Aug 12, 2021
@drakon (Contributor, Author) commented Sep 6, 2021

Hi @codingjoe, it took me a while, as I was engaged with other projects. But I've finally improved the PR and commented on the changes I made above. Hope it's good now. :)

@codingjoe (Owner) left a comment

OK, LGTM, let's give this a go!

@codingjoe merged commit 83d1d17 into codingjoe:master on Sep 8, 2021
@codingjoe (Owner) commented

It's released in 5.4.0 🎉

@drakon (Contributor, Author) commented Sep 8, 2021

Awesome, thanks! :)

@codingjoe (Owner) commented

@drakon, we tried switching to the new backend but saw some errors. I believe you are returning the wrong value. You should be returning name, not cleaned_name. Could you please check? Some of our image processors complain that the key does not exist, and you are using name as the copy key, not cleaned_name.

@drakon (Contributor, Author) commented Sep 13, 2021

That strikes me as odd, as this is a part that I didn't change, and it is consistent with the behaviour of the default implementation (current version):

def _save(self, name, content):
    cleaned_name = self._clean_name(name)
    name = self._normalize_name(cleaned_name)
    params = self._get_write_parameters(name, content)

    if (self.gzip and
            params['ContentType'] in self.gzip_content_types and
            'ContentEncoding' not in params):
        content = self._compress_content(content)
        params['ContentEncoding'] = 'gzip'

    obj = self.bucket.Object(name)
    content.seek(0, os.SEEK_SET)
    obj.upload_fileobj(content, ExtraArgs=params)
    return cleaned_name

Secondly, the name here is actually the cleaned and normalized one (see the first two lines).

So, the only real difference should be confined to this line:

# Copy the file instead of uploading
obj.copy({"Bucket": self.bucket.name, "Key": content.obj.key}, ExtraArgs=params)

content.obj.key refers to the source key, and I'd expect it to already be cleaned.

Can you provide more details about the error you receive or the keys/names/paths involved?

@codingjoe (Owner) commented

@amureki will give you an update about that. I also thought this might have been a race condition, but S3 should be consistent.

Maybe the errors we saw were also related to other issues. @amureki will have more details when he is back from vacation next week.

@amureki (Contributor) commented Sep 21, 2021

@drakon hey Christian! First, thank you for this PR, this is a nice improvement. 👍

I investigated a bit more here and found out that the errors (the TypeError from this PR) appear when the upload is done with this backend but is triggered from somewhere other than django-s3file (the case you mentioned in your PR description).

Now, I wonder if we can tweak this custom backend to handle such cases and check whether content actually contains an S3 file. Do you think that would defeat the original purpose and reduce the optimization?

@drakon (Contributor, Author) commented Sep 21, 2021

Hi @amureki no worries. :)

Hmm, I see. Well, the initial idea was that this storage only optimises uploads where you already know you have an S3 file. That's also why there's an error when that's not the case (the assumption is that content.obj.key is an S3 key). So it only works in combination with this library (where we already know the file has been uploaded from the client side), and it's applied only to the fields where you want the optimisation: models.FileField(storage=OptimizedStorage, ...). Other fields would behave as usual.
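
For illustration, a minimal sketch of that field-level setup. The model and field names are just placeholders; only the import path and the FileField storage argument come from this PR and Django itself:

```python
# Illustrative only: attach the optimised backend to the fields whose uploads
# go through django-s3file; other fields keep the project's default storage.
from django.db import models
from s3file.storages_optimized import S3OptimizedUploadStorage


class Document(models.Model):
    # Uploaded client-side via django-s3file, so the object is already in S3
    # and _save() only has to copy it to its final key.
    attachment = models.FileField(
        storage=S3OptimizedUploadStorage(), upload_to="documents/"
    )
    # Regular upload path using the default storage, unaffected by the backend.
    report = models.FileField(upload_to="reports/")
```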

From what I understand, you would want to do uploads with this optimised storage from other sources as well?

It should certainly be doable, but I was a bit hesitant (also given Joe's comment) to silently fall back to an un-optimised path when the condition is not met, for two reasons:

  • The condition for the upload might also be met in other cases. In logical terms: the condition used in the check is necessary for S3 files but not sufficient.
  • The rest of the implementation is a copy/paste of the original one. Keeping it up to date is a bit harder, so we should only use it when really necessary.

Having said that, it should be straightforward to implement: instead of throwing an error, we could just use lines 32 & 33. Unfortunately, I don't even have a use case where I can test this in a real-world scenario. And I'd probably recommend changing the check to something that is not only necessary but also sufficient to determine whether content is really an S3 file, i.e. it identifies all S3 files as such and excludes everything else. But in that case one could also add this directly to django-storages.
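
For illustration, a rough sketch of such a fallback. It is not part of this PR, the class name is made up, and the check is still the necessary-but-not-sufficient one discussed above:

```python
# Illustrative fallback, not part of the merged PR: take the optimised copy
# path only when the content already looks like an S3 object; otherwise defer
# to the plain S3Boto3Storage upload.
from s3file.storages_optimized import S3OptimizedUploadStorage


class S3OptimizedFallbackStorage(S3OptimizedUploadStorage):
    def _save(self, name, content):
        if not hasattr(content, "obj") or not hasattr(content.obj, "key"):
            # Not an already-uploaded S3 object: skip the optimised _save and
            # call the regular S3Boto3Storage upload instead.
            return super(S3OptimizedUploadStorage, self)._save(name, content)
        # Already in S3: use the copy-based _save from S3OptimizedUploadStorage.
        return super()._save(name, content)
```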

But to be clear: in that case it's just about convenience; there would be no optimization compared to the normal backend (it would literally be the same code).

@amureki (Contributor) commented Sep 23, 2021

@drakon thank you for digging into this issue with us. :)

I agree that making more changes to this storage backend might do harm. Instead, we could just configure the relevant fields with the custom backend (or introduce a custom field with the storage backend plugged in).

So, I think we are good here, thank you again for the help and for the initial contribution!

@drakon (Contributor, Author) commented Sep 23, 2021

My pleasure. Thanks for providing this library in the first place. :)

Yes, I think that's probably the easiest way. If there is growing need and more people want this, one could think about getting it into django-storages directly as an optimization for such cases, since it would then be an efficiency gain for others too. But at this point I'm not even sure it's a real issue outside of this library.
