Multiple buffer writes to a single s3 object #1351


Closed
kkbachu opened this issue Mar 30, 2020 · 20 comments
Labels: guidance (Question that needs advice or information.)
kkbachu commented Mar 30, 2020

Platform/OS/Hardware/Device
Ubuntu 18.04

Describe the question
My application reads a file from a proprietary filesystem one 1 MB block at a time, and I want to push one buffer at a time to a single S3 object. I couldn't find any documentation on this. I looked at #64, which is close to what I'm looking for, but it still refers to one custom buffer per PutObject request.

Is there a way to do this with the AWS SDK for C++ for S3?

PutObjectRequest and TransferManager refer to passing a file or a single buffer, but I couldn't find a way to write buffer by buffer to a single object in a loop or similar.

Any help is appreciated.

Logs/output
If applicable, add logs or error output.

To enable logging, set the following system properties:

REMEMBER TO SANITIZE YOUR PERSONAL INFO

```cpp
options.loggingOptions.logLevel = Aws::Utils::Logging::LogLevel::Trace;
Aws::InitAPI(options);
```
kkbachu added the guidance (Question that needs advice or information.) and needs-triage (This issue or PR still needs to be triaged.) labels Mar 30, 2020

kkbachu commented Mar 31, 2020

Can anyone provide an idea of how to move forward? I'd appreciate it. Thanks.


webbnh commented Mar 31, 2020

In my code, I use the Transfer Manager (Aws::Transfer::TransferManager) to do uploads and downloads. For downloads, I supply my own stream implementation, so that I can process the buffers as soon as they arrive. When I implemented that, I also experimented with doing something similar for the upload, but I ended up not pursuing it. However, for your case, it might be just what you want.

If you implement your own stream (Aws::IOStream), it can read your filesystem in whatever way you want and supply the results to the Transfer Manager. I can't promise that the TM will abide by your block-size choices (it will quite likely make multiple requests to your stream in order to fill its own buffer), but at least the filesystem reads will be under your control. (There might be TM configuration/tuning parameters which allow you to control its buffering...I didn't have a need to pursue that.)


kkbachu commented Mar 31, 2020

Thanks @webbnh.
My own Aws::IOStream seems like a good idea. Will try it out and post any further questions here.

KaibaLopez commented:

@kkbachu,
@webbnh is correct that you can make your own buffer, use your own filesystem, and tell the Transfer Manager to use those.
But it sounds to me like you can just use the Transfer Manager upload directly: you can specify the buffer size, and the Transfer Manager will take care of dividing the object into parts and uploading them either one by one or asynchronously.
Let me know if I'm missing something, or why the Transfer Manager upload would not work for you; sample code for what you're trying to do would also be great.

KaibaLopez self-assigned this Mar 31, 2020
KaibaLopez removed the needs-triage (This issue or PR still needs to be triaged.) label Mar 31, 2020

kkbachu commented Mar 31, 2020

@KaibaLopez,
Regarding uploads: TransferManager reads from the local file (the one to be uploaded) through Aws::IOStream (a C++ iostream), uses its own buffers, and uploads to S3. What I'm looking for is to replace Aws::IOStream with MyClass::MyIOStream so that I can read from a proprietary storage system while still leveraging TransferManager's functionality. How can I do this?
Regarding downloads: it looks like there is a way to have a custom callback that receives a buffer as each part is downloaded, which would let me write back to the proprietary storage system. I haven't tried this yet, since I'm still figuring out how to upload with TransferManager without using Aws::IOStream.


kkbachu commented Mar 31, 2020

I tried a MyIOStream class derived from Aws::IOStream, but it doesn't get far. How can I substitute MyIOStream for Aws::IOStream?

[ERROR] 2020-03-31 22:24:47.016 TransferManager [140637072394112] Failed to read from input stream to upload file to bucket: xxx with key:

It looks like I need my own version of TransferManager that reads/writes the proprietary filesystem instead of making regular fstream calls.


webbnh commented Apr 1, 2020

In my code, I created a class derived from std::iostream, but my class doesn't do anything other than allow me to specify the std::streambuf that the stream uses. Then I created another class derived from std::streambuf, and that's where all the action is: I implemented overrides for the underflow(), xsgetn(), overflow(), xsputn(), showmanyc(), seekpos(), seekoff(), and sync() methods. Your concerns may or may not extend that far.
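[Editor's note] For reference, here is a minimal, self-contained sketch of that streambuf approach. The class name and the std::string backing store are purely illustrative stand-ins for the proprietary filesystem; a real implementation would issue block reads against the storage API inside these same overrides:

```cpp
#include <cassert>
#include <cstring>
#include <iostream>
#include <streambuf>
#include <string>

// Illustrative stand-in for a proprietary block device: the backing store is
// just a std::string here, but the same overrides could issue 1 MB block
// reads against any storage API.
class BlockSourceBuf : public std::streambuf {
public:
    explicit BlockSourceBuf(std::string data) : data_(std::move(data)) {
        char* base = &data_[0];
        setg(base, base, base + data_.size()); // whole "device" as the get area
    }

protected:
    // Called when the get area is exhausted.
    int_type underflow() override {
        return gptr() < egptr() ? traits_type::to_int_type(*gptr())
                                : traits_type::eof();
    }

    // Bulk read: istream::read() on the owning stream ends up here.
    std::streamsize xsgetn(char* s, std::streamsize n) override {
        std::streamsize avail = egptr() - gptr();
        std::streamsize count = (n < avail) ? n : avail;
        std::memcpy(s, gptr(), static_cast<std::size_t>(count));
        gbump(static_cast<int>(count));
        return count;
    }

    // seekg(off, dir) on the owning stream dispatches here.
    pos_type seekoff(off_type off, std::ios_base::seekdir dir,
                     std::ios_base::openmode which) override {
        if (!(which & std::ios_base::in)) return pos_type(off_type(-1));
        char* base = eback();
        off_type pos = (dir == std::ios_base::beg) ? off
                     : (dir == std::ios_base::cur) ? (gptr() - base) + off
                                                   : (egptr() - base) + off;
        if (pos < 0 || pos > egptr() - base) return pos_type(off_type(-1));
        setg(base, base + pos, egptr());
        return pos_type(pos);
    }

    // seekg(pos) dispatches here.
    pos_type seekpos(pos_type pos, std::ios_base::openmode which) override {
        return seekoff(off_type(pos), std::ios_base::beg, which);
    }

    // Bytes immediately available without "device" I/O.
    std::streamsize showmanyc() override { return egptr() - gptr(); }

private:
    std::string data_;
};
```

Wrapping it in a stream is then just `std::istream in(&buf);` and the stream's seekg()/read() calls land in the overrides above.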

You should be able to write your own version of the TransferManager -- I don't think it does anything that you couldn't do -- but I decided not to re-invent that particular wheel...it was more effective for me to use it as is and hook in at the bottom of the stream implementation.

If you follow my path, I have two caveats for you. First, be aware that the TransferManager may use multiple threads, so your implementation has to take the appropriate precautions to handle or guard against concurrent reentry. Second, when you get to the download side, be aware that the downloaded blocks do not necessarily arrive in order (because of concurrent I/O requests and vagaries of network flow), so, if your filesystem requires that the file be written linearly, then your implementation should be prepared to buffer the blocks as they are received. (In my case, I needed to process the file linearly, so the callbacks weren't sufficient for me; also, IIRC, the callbacks didn't give me access to the data -- just to status -- so, I needed hooks into the stream in order to be able to process the data before the download was complete.)
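[Editor's note] On the second caveat, one way to linearize out-of-order parts can be sketched as follows: buffer each arriving block by its byte offset and flush only the prefix that has become contiguous. The class and method names are made up for illustration, and the in-memory string stands in for the proprietary filesystem writes:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Sketch of linearizing out-of-order download parts: hold each part keyed by
// its byte offset, and flush whenever the next contiguous piece is present.
class LinearReassembler {
public:
    // Called as each part arrives (possibly out of order).
    void OnPart(std::size_t offset, std::string data) {
        pending_[offset] = std::move(data);
        Flush();
    }

    // Bytes delivered to the (here: in-memory) linear sink so far.
    const std::string& Written() const { return written_; }

private:
    void Flush() {
        auto it = pending_.begin();
        while (it != pending_.end() && it->first == written_.size()) {
            written_ += it->second;  // "write to the proprietary filesystem"
            it = pending_.erase(it);
        }
    }

    std::map<std::size_t, std::string> pending_; // parts waiting their turn
    std::string written_;                        // contiguous prefix flushed
};
```

The per-part completion hook would call OnPart(); until the missing earlier block arrives, later blocks simply wait in the map.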

Good luck!


kkbachu commented Apr 1, 2020

Thanks @webbnh for sharing your insights.
In my case, it's more about filesystem reading and writing. I wish TransferManager had the ability to accept a custom iostream object and read/write through it (which would avoid re-inventing the wheel, as you said); the streambuf comes later.

This is what I am doing as an experiment with the put_object_async.cpp example code:

```cpp
MyIOStream myfileStream; // proprietary stream reader
unsigned char* buffer = new unsigned char[length];
myfileStream.read((char*)buffer, length);

// code is from TransferManager - begin
static const char CLASS_TAG[] = "EventHeader";
auto streamBuf = Aws::New<Aws::Utils::Stream::PreallocatedStreamBuf>(
    CLASS_TAG, buffer, static_cast<size_t>(length));
auto preallocatedStreamReader = Aws::MakeShared<Aws::IOStream>(CLASS_TAG, streamBuf);
object_request.SetBody(preallocatedStreamReader);
object_request.SetContentMD5(Aws::Utils::HashingUtils::Base64Encode(
    Aws::Utils::HashingUtils::CalculateMD5(*preallocatedStreamReader)));
// code is from TransferManager - end

// Set up AsyncCallerContext. Pass the S3 object name to the callback.
auto context =
    Aws::MakeShared<Aws::Client::AsyncCallerContext>("PutObjectAllocationTag");
context->SetUUID(s3_object_name);

// Put the object asynchronously
s3_client->PutObjectAsync(object_request,
                          put_object_async_finished,
                          context);
```
The download and multipart upload will probably be a lot more involved.


webbnh commented Apr 2, 2020

> Wish TransferManager had ability to pass custom iostream object

Does this grant your wish? (The second UploadFile() method.)


kkbachu commented Apr 2, 2020

It doesn't look like it. I did look at the second UploadFile() method that takes an Aws::IOStream as input, but that's basically a regular C++ iostream under the hood (it's typedef'ed).
UploadFile() then performs operations like:

```cpp
fileStream->seekg(0, std::ios_base::end);
streamToPut->read((char*)buffer, lengthToWrite);
```


webbnh commented Apr 2, 2020

Right, so if you pass it an instance of a class derived from std::iostream which uses your custom std::streambuf, then when the TransferManager calls seekg() or read() on the stream it will end up in your custom methods, and then you can do with it as you like.
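[Editor's note] This works because, although istream::seekg() and istream::read() themselves are not virtual, they forward to the stream's streambuf through the virtual seekoff()/seekpos()/xsgetn() hooks. A tiny self-contained demonstration (names illustrative, no AWS types involved):

```cpp
#include <cassert>
#include <cstring>
#include <iostream>
#include <streambuf>
#include <string>

// A streambuf that records when its virtual hooks are reached, to show that
// calls made through a plain std::iostream dispatch into a derived streambuf.
struct TracingBuf : std::streambuf {
    bool seek_hit = false;
    bool read_hit = false;
    char storage[16] = "abcdefghij"; // 10 payload bytes

    TracingBuf() { setg(storage, storage, storage + 10); }

    // istream::seekg(off, dir) lands here (no bounds checks: demo only).
    pos_type seekoff(off_type off, std::ios_base::seekdir dir,
                     std::ios_base::openmode) override {
        seek_hit = true;
        char* base = eback();
        off_type pos = (dir == std::ios_base::end) ? 10 + off
                     : (dir == std::ios_base::cur) ? (gptr() - base) + off
                                                   : off;
        setg(base, base + pos, base + 10);
        return pos_type(pos);
    }

    // istream::read() lands here.
    std::streamsize xsgetn(char* s, std::streamsize n) override {
        read_hit = true;
        std::streamsize avail = egptr() - gptr();
        if (n > avail) n = avail;
        std::memcpy(s, gptr(), static_cast<std::size_t>(n));
        gbump(static_cast<int>(n));
        return n;
    }
};
```

Even when the caller only holds a plain std::iostream bound to this buffer, stream.seekg(...) and stream.read(...) set seek_hit and read_hit; that is the hook point a TransferManager-style caller would exercise.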


kkbachu commented Apr 2, 2020

Thanks @webbnh for staying with me to help out.

I tried a class derived from Aws::IOStream, but it doesn't reach my methods. Maybe I am doing something wrong.
```cpp
/**
 * In-memory stream implementation
 */
class MyUnderlyingStream : public Aws::IOStream
{
public:
    using Base = Aws::IOStream;
    MyUnderlyingStream() = default;

    // Provide a customer-controlled streambuf, so as to put all transferred
    // data into this in-memory buffer.
    MyUnderlyingStream(std::streambuf* buf) : Base(buf)
    {}

    // C++ interfaces that need to be supported to make Transfer Manager happy
    bool good() const {
        std::cout << "My good" << std::endl;
        return true;
    }

    std::basic_istream<char, std::char_traits<char>>& seekg(pos_type pos) {
        std::cout << "My seek" << std::endl;
        return *this;
    }
    std::basic_istream<char, std::char_traits<char>>& seekg(off_type off, std::ios_base::seekdir dir) {
        std::cout << "My seek" << std::endl;
        return *this;
    }

    virtual ~MyUnderlyingStream() = default;
};
```
[ERROR] 2020-04-02 17:47:43.913 TransferManager [140227555768192] Failed to read from input stream to upload file to bucket


webbnh commented Apr 2, 2020

I'm guessing that the problem is that Aws::IOStream::seekg() (et al.) are not virtual member functions. So, in your derived class, your definitions for them replace them without overriding them...and, so, when you pass your class instance as though it were an Aws::IOStream instance, it's the methods in the parent class which get called.

I'm guessing that that is why I hooked into the std::streambuf class instead of the std::iostream class: std::streambuf::seekpos() is virtual, so the definition in the derived class instance will override the parent class definition.

When you define your member functions, you should use the override keyword in the declaration/definition -- this tells the compiler that you intend the function to override the one in the parent class, and, the compiler will warn you at build-time if it's not going to work.


kkbachu commented Apr 3, 2020

> When you define your member functions, you should use the override keyword in the declaration/definition -- this tells the compiler that you intend the function to override the one in the parent class, and the compiler will warn you at build time if it's not going to work.

The compiler complains: "marked 'override', but does not override".
It looks like the iostream member functions (read, seekg, tellg, good, etc.) are not virtual, so I cannot override them.


kkbachu commented Apr 9, 2020

I tried a single PutObject with an in-memory buffer, as below. But when it is uploaded to S3 and I manually download the file from the S3 console to cross-check the validity of the content (a simple text file), it does not contain line feeds ('\n' or '\r').

I pretty much implemented parts of TransferManager.cpp's single-part upload, except that the source file is read from the proprietary filesystem.

```cpp
static const char CLASS_TAG[] = "EventHeader";
auto streamBuf = Aws::New<Aws::Utils::Stream::PreallocatedStreamBuf>(
    CLASS_TAG, buffer, static_cast<size_t>(bytesRead));
auto preallocatedStreamReader = Aws::MakeShared<Aws::IOStream>(CLASS_TAG, streamBuf);
object_request.SetBody(preallocatedStreamReader);
object_request.SetContentType("text/plain"); // tried "binary/octet-stream"
```

Hexdump of the buffer that I passed to the streambuf:

```
data length: 21
00000000  61 61 61 61 61 61 0a 62 62 62 62 62 62 0a 63 63  |aaaaaa.bbbbbb.cc|
00000010  63 63 63 63 0a                                   |cccc.|
```

'0a' is a line feed.

What am I missing?


webbnh commented Apr 9, 2020

Is there a parameter which specifies whether the stream is binary or text? (Text streams are typically read line by line and have their line terminators removed.)


kkbachu commented Apr 9, 2020

Since I am using a streambuf and passing it to an iostream, I can't find anything specific to this issue. Need help.


kkbachu commented Apr 9, 2020

Sorry, false alarm. When I open the file directly in the browser, the default viewer strips newlines, but if I open the text file with a text editor like WordPad, the newlines are there.

Different question - I'm curious whether SetContentType() is important or not.


kkbachu commented Apr 9, 2020

Although it's working for me to upload/download using application buffers instead of reading/writing a file stream, I have to do pretty much what TransferManager does. TransferManager has a lot of goodness that I would love to leverage instead of reinventing the wheel. Maybe we should add a feature request to have TransferManager support filestream-like callbacks?


webbnh commented Apr 9, 2020

> Different question - curious if SetContentType() is important or not.

This is passing out of my areas of expertise, but I think the answer can be summarized as: "it's only important if something looks at it". That is, I believe that if you access the contents directly (e.g., using the SDK, here), you won't know the difference unless you ask. However, if you use a RESTful interface (e.g., a web browser) to fetch it, the client will try to format and present the contents, and it will make choices based on what you set the type to.

> May be we should add a feature request to have transfermanager support filestream like callbacks?

That would have been great at one point...but that's all water over the dam for me now. :-)

kkbachu closed this as completed May 3, 2020