In-memory zero-copy S3 PutObject / GetObject with custom buffers in 2020 #1430


Closed · 2 tasks done
tilsche opened this issue Aug 5, 2020 · 3 comments
Labels: closed-for-staleness, guidance (Question that needs advice or information.)


tilsche commented Aug 5, 2020

Confirm by changing [ ] to [x] below:

Platform/OS/Hardware/Device
Linux x86-64

Describe the question

I have large memory buffers that I need to transfer to S3 and vice versa at optimal performance. I need to avoid redundant allocations and copies. How is this possible in a clean idiomatic way?

This question has been asked in various issues, e.g., #64, #533, #785. However, they were all closed without a clearly documented solution, and it has been suggested to open a new issue if needed. In general, the suggested approaches revolve around a custom stream implementation, which is unfortunately not trivial to get right and would preferably not be reimplemented in user code again and again.

I am particularly referencing a comment from @JonathanHenson from 2018:

[...] in the very near future, we’ll be providing an interface that will make sure you don’t have to ever touch one again if you don’t want to. Basically we have a stream_buf implementation that will run your simpler interface implementation for you automatically.

I would hope that the very near future is now, but unfortunately I was unable to find such an implementation. I consider it entirely possible, though, that it exists somewhere in the vastness of the AWS SDK.

Logs/output
N/A

@tilsche tilsche added guidance Question that needs advice or information. needs-triage This issue or PR still needs to be triaged. labels Aug 5, 2020
KaibaLopez (Contributor) commented:

Hi @tilsche ,
We've got a sample for S3's transfer manager that may be useful to you; let me know if that's not what you're looking for, though.

@KaibaLopez KaibaLopez self-assigned this Aug 5, 2020
@KaibaLopez KaibaLopez added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. and removed needs-triage This issue or PR still needs to be triaged. labels Aug 5, 2020

tilsche commented Aug 6, 2020

This does go in the right direction. Even though I don't think I want to use Aws::Transfer::TransferManager::Create, the important piece seems to be Aws::Utils::Stream::PreallocatedStreamBuf. Apparently it was changed recently to accept raw pointers to data instead of Aws::Utils::Array.

I am using the following for PutObject:

// In this example the operation is completed within the scope, so the streambuf can live on the stack
Aws::Utils::Stream::PreallocatedStreamBuf streambuf(reinterpret_cast<unsigned char*>(buffer), buffer_size);
auto preallocated_stream = Aws::MakeShared<Aws::IOStream>("", &streambuf);

Aws::S3::Model::PutObjectRequest request;
request.SetBody(preallocated_stream);

And GetObject:

Aws::Utils::Stream::PreallocatedStreamBuf streambuf(reinterpret_cast<unsigned char*>(buffer), buffer_size);
Aws::S3::Model::GetObjectRequest request;
request.SetResponseStreamFactory([&streambuf]() { return Aws::New<Aws::IOStream>("", &streambuf); });

Now that's still more boilerplate than I'm comfortable with. That seems fixable with another custom class:

class PreallocatedIOStream : public Aws::IOStream
{
public:
    PreallocatedIOStream(std::byte* memory, std::size_t size)
    : Aws::IOStream(new Aws::Utils::Stream::PreallocatedStreamBuf(
          reinterpret_cast<unsigned char*>(memory), size))
    {
    }

    ~PreallocatedIOStream()
    {
        // The IOStream does not own its streambuf, so delete it here.
        delete rdbuf();
    }
};

Now we can just:

request.SetBody(Aws::MakeShared<PreallocatedIOStream>("", memory, size));

request.SetResponseStreamFactory([memory, size]() {
    return Aws::New<PreallocatedIOStream>("", memory, size);
});

For PutObject and GetObject respectively, without worrying about the lifetime of the PreallocatedStreamBuf since it is now managed by the IOStream. Of course we still manage the memory itself.

I'm not entirely happy with delete rdbuf(), but it seems to be correct and is concise. Of course this relies on the fact that Aws::IOStream behaves like std::iostream with respect to non-ownership of the provided streambuf.

Now, coming back to the example: I do not get the point of MyUnderlyingStream. What functionality does it add?

At the same time, it seems to me that the allocated PreallocatedStreamBuf leaks, does it not?

Would it not make sense to provide a PreallocatedIOStream to the user in the first place?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Aug 7, 2020
KaibaLopez (Contributor) commented:

Hi @tilsche ,
Sorry for the late response. It seems like you've reached a conclusion on how to do this; regarding your questions:

I do not get the point of MyUnderlyingStream. What functionality does it add?

The sample is not meant to be "the way" but to showcase how to approach this. MyUnderlyingStream has no real functionality other than to show that a custom buffer can be used. Your solution is more complete.

PreallocatedStreamBuf leaks, does it not?

Hmm, you may actually be correct here.

Would it not make sense to provide a PreallocatedIOStream to the user in the first place?

Maybe, but honestly, if we take it as a feature request it would be very low priority. If you have a specific idea of how to implement it, please make a PR so we can review it.

@KaibaLopez KaibaLopez added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Sep 10, 2020
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Sep 15, 2020
joe-maley pushed a commit to TileDB-Inc/TileDB that referenced this issue May 7, 2021
This fixes a rare read corruption that may manifest with multiple different error
messages. One example:
`[TileDB::Buffer] Error: Read failed; Trying to read beyond buffer size`

The issue is that the `Aws::IOStream` constructor that we use within the callback
set in `GetObjectRequest::SetResponseStreamFactory()` accepts a pointer to a
`boost::interprocess::bufferbuf`. We currently allocate one of these `bufferbuf`
instances on the heap and free it when we return from `S3::read`. Experimentally,
I have determined that this can cause a corruption to the buffer we stream into.

We are not responsible for what the AWS SDK does with objects created from the
callback we assigned in `GetObjectRequest::SetResponseStreamFactory()`. This
patch introduces a small wrapper so that the `Aws::IOStream` manages the lifetime
of the `boost::interprocess::bufferbuf`.

I also noticed that we use an SDK version with a `Aws::Utils::Stream::PreallocatedStreamBuf`
that we can use instead of the boost `bufferbuf`.

Relevant discussion on stack overflow:
aws/aws-sdk-cpp#1430

---

TYPE: BUG
DESC: Fix rare read corruption in S3
Co-authored-by: Joe Maley <[email protected]>