In-memory zero-copy S3 PutObject / GetObject with custom buffers in 2020 #1430


Closed · 2 tasks done
tilsche opened this issue Aug 5, 2020 · 3 comments
Labels: closed-for-staleness, guidance (Question that needs advice or information.)


tilsche commented Aug 5, 2020

Confirm by changing [ ] to [x] below:

Platform/OS/Hardware/Device
Linux x86-64

Describe the question

I have large memory buffers that I need to transfer to S3 and vice versa at optimal performance. I need to avoid redundant allocations and copies. How is this possible in a clean idiomatic way?

This question has been asked in various issues, e.g., #64, #533, #785. However, they were all closed without a clearly documented solution, and it has been suggested to open a new issue if needed. In general, the suggested approaches revolve around a custom stream implementation, which is unfortunately not trivial to get right and would preferably not be reimplemented in user code again and again.

I am particularly referencing a comment from @JonathanHenson from 2018:

[...] in the very near future, we’ll be providing an interface that will make sure you don’t have to ever touch one again if you don’t want to. Basically we have a stream_buf implementation that will run your simpler interface implementation for you automatically.

I would hope that the very near future is now, but unfortunately I was unable to find such an implementation. I consider it entirely possible, though, that it exists somewhere in the vastness of the AWS SDK.

Logs/output
N/A

@tilsche tilsche added guidance Question that needs advice or information. needs-triage This issue or PR still needs to be triaged. labels Aug 5, 2020
KaibaLopez (Contributor) commented:

Hi @tilsche ,
We've got a sample for S3's transfer manager that may be useful to you; let me know if that's not what you're looking for, though.

@KaibaLopez KaibaLopez self-assigned this Aug 5, 2020
@KaibaLopez KaibaLopez added response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. and removed needs-triage This issue or PR still needs to be triaged. labels Aug 5, 2020

tilsche commented Aug 6, 2020

This does go in the right direction. Even though I don't think I want to use Aws::Transfer::TransferManager::Create, the important piece seems to be Aws::Utils::Stream::PreallocatedStreamBuf. Apparently it was changed recently to accept raw pointers to data instead of Aws::Utils::Array.

I am using the following for PutObject:

// In this example the operation is completed within the scope, so the streambuf can live on the stack
Aws::Utils::Stream::PreallocatedStreamBuf streambuf(reinterpret_cast<unsigned char*>(buffer), buffer_size);
auto preallocated_stream = Aws::MakeShared<Aws::IOStream>("", &streambuf);

Aws::S3::Model::PutObjectRequest request;
request.SetBody(preallocated_stream);

And GetObject:

Aws::Utils::Stream::PreallocatedStreamBuf streambuf(reinterpret_cast<unsigned char*>(buffer), buffer_size);
Aws::S3::Model::GetObjectRequest request;
request.SetResponseStreamFactory([&streambuf]() { return Aws::New<Aws::IOStream>("", &streambuf); });

Now that's still more boilerplate than I'm comfortable with. That seems fixable with another custom class:

class PreallocatedIOStream : public Aws::IOStream
{
public:
    PreallocatedIOStream(std::byte* memory, std::size_t size)
    : Aws::IOStream(new Aws::Utils::Stream::PreallocatedStreamBuf(
          reinterpret_cast<unsigned char*>(memory), size))
    {
    }

    ~PreallocatedIOStream()
    {
        // The IOStream does not own its streambuf, so delete it here.
        delete rdbuf();
    }
};

Now we can just:

request.SetBody(Aws::MakeShared<PreallocatedIOStream>("", memory, size));

request.SetResponseStreamFactory([memory, size]() {
    return Aws::New<PreallocatedIOStream>("", memory, size);
});

For PutObject and GetObject respectively, without worrying about the lifetime of the PreallocatedStreamBuf since it is now managed by the IOStream. Of course we still manage the memory itself.

I'm not entirely happy with delete rdbuf(), but it seems to be correct and is concise. Of course this relies on the fact that Aws::IOStream behaves like std::iostream with respect to non-ownership of the provided streambuf.

Now, coming back to the example: I do not get the point of MyUnderlyingStream. What functionality does it add?

At the same time, it seems to me that the allocated PreallocatedStreamBuf leaks, does it not?

Would it not make sense to provide a PreallocatedIOStream to the user in the first place?

@github-actions github-actions bot removed the response-requested Waiting on additional info and feedback. Will move to "closing-soon" in 10 days. label Aug 7, 2020
KaibaLopez (Contributor) commented:

Hi @tilsche ,
Sorry for the late response. It seems like you've reached a conclusion on how to do this; regarding your questions:

I do not get the point of MyUnderlyingStream. What functionality does it add?

The sample is not meant to be "the way" but to showcase how to approach this. MyUnderlyingStream has no real functionality other than to show that a custom buffer can be used. Your solution is more complete.

PreallocatedStreamBuf leaks, does it not?

Hmm, you may actually be correct here.

Would it not make sense to provide a PreallocatedIOStream to the user in the first place?

Maybe, but honestly, if we take it as a feature request it would be very low priority. If you have a specific idea of how to implement it, please make a PR so we can review it.

@KaibaLopez KaibaLopez added the closing-soon This issue will automatically close in 4 days unless further comments are made. label Sep 10, 2020
@github-actions github-actions bot added closed-for-staleness and removed closing-soon This issue will automatically close in 4 days unless further comments are made. labels Sep 15, 2020
joe-maley pushed a commit to TileDB-Inc/TileDB that referenced this issue May 7, 2021
This fixes a rare read corruption that may manifest with multiple different error
messages. One example:
`[TileDB::Buffer] Error: Read failed; Trying to read beyond buffer size`

The issue is that the `Aws::IOStream` constructor that we use within the callback
set in `GetObjectRequest::SetResponseStreamFactory()` accepts a pointer to a
`boost::interprocess::bufferbuf`. We currently allocate one of these `bufferbuf`
instances on the heap and free it when we return from `S3::read`. Experimentally,
I have determined that this can cause a corruption to the buffer we stream into.

We are not responsible for what the AWS SDK does with objects created from the
callback we assigned in `GetObjectRequest::SetResponseStreamFactory()`. This
patch introduces a small wrapper so that the `Aws::IOStream` manages the lifetime
of the `boost::interprocess::bufferbuf`.

I also noticed that we use an SDK version with a `Aws::Utils::Stream::PreallocatedStreamBuf`
that we can use instead of the boost `bufferbuf`.

Relevant discussion on stack overflow:
aws/aws-sdk-cpp#1430

---

TYPE: BUG
DESC: Fix rare read corruption in S3
Co-authored-by: Joe Maley <[email protected]>