BYOB ("Bring Your Own Buffer") read interface #160
Description
Currently, `HDFile.read(...)` involves both allocation and copying of buffers. When reading locally, with short-circuit reads enabled, this can become a bottleneck. Here is `HDFile.read` annotated with the allocations and copies:
```python
def read(self, length=None):
    """ Read bytes from open file """
    if not _lib.hdfsFileIsOpenForRead(self._handle):
        raise IOError('File not read mode')
    buffers = []
    buffer_size = self.buff if self.buff != 0 else DEFAULT_READ_BUFFER_SIZE
    if length is None:
        out = 1
        while out:
            out = self.read(buffer_size)
            buffers.append(out)
    else:
        while length:
            bufsize = min(buffer_size, length)
            p = ctypes.create_string_buffer(bufsize)  # <-- allocation for each slice that is read
            ret = _lib.hdfsRead(
                self._fs, self._handle, p, ctypes.c_int32(bufsize))
            if ret == 0:
                break
            if ret > 0:
                if ret < bufsize:
                    buffers.append(p.raw[:ret])  # <-- .raw creates a copy
                elif ret == bufsize:
                    buffers.append(p.raw)  # <-- .raw again
                length -= ret
            else:
                raise IOError('Read file %s Failed:' % self.path, -ret)
    return b''.join(buffers)  # <-- this of course has to create a copy again
```
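To make the `.raw` annotation concrete, here is a tiny illustration (CPython behavior):

```python
import ctypes

p = ctypes.create_string_buffer(8)
a = p.raw
b = p.raw
# every access to .raw builds a fresh bytes copy of the buffer, so each
# slice is copied once here and once more in the final b''.join(...)
assert a == b and a is not b
```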
I suggest adding a way to pass in the output buffer, without any additional copying or buffering. Here is a prototype implementation that, in one of my tests, speeds up reading large binary data by a factor of about 4:
```python
import ctypes

import hdfs3.core


def byob_read(self, length, out):
    """Read ``length`` bytes from the file into the ``out`` buffer.

    ``out`` needs to be a ctypes array, for example created with
    ``ctypes.create_string_buffer``, and must be at least ``length``
    bytes long.
    """
    _lib = hdfs3.core._lib
    if not _lib.hdfsFileIsOpenForRead(self._handle):
        raise IOError('File not read mode')
    bufsize = length
    bufpos = 0
    while length:
        bufp = ctypes.byref(out, bufpos)  # pointer into out at the current offset
        ret = _lib.hdfsRead(
            self._fs, self._handle, bufp, ctypes.c_int32(bufsize - bufpos))
        if ret == 0:  # EOF
            break
        if ret > 0:
            length -= ret
            bufpos += ret
        else:
            raise IOError('Read file %s Failed:' % self.path, -ret)
    return out
```
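A minimal usage sketch (the monkeypatching onto `HDFile`, the connection parameters, and the path are illustrative only):

```python
import ctypes

import hdfs3
import hdfs3.core

# attach the prototype for experimentation (illustrative only)
hdfs3.core.HDFile.byob_read = byob_read

hdfs = hdfs3.HDFileSystem(host='localhost', port=8020)  # placeholder connection
length = 64 * 2**20  # 64 MiB
buf = ctypes.create_string_buffer(length)  # allocated once, reused for every read

with hdfs.open('/tmp/large.bin', 'rb') as f:
    f.byob_read(length, buf)  # fills buf in place; no intermediate copies

view = memoryview(buf)  # zero-copy view for downstream consumers
```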
The final interface should probably be something like `read(self, length=None, out=None)`, to support both modes of operation without polluting the API namespace, and there should be some range checks to prevent buffer overflows. A rough sketch follows.
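Here, `_read_buffers` and `_read_into` are hypothetical helper names for the two code paths shown above, not existing API:

```python
def read(self, length=None, out=None):
    """Read bytes; if ``out`` is given, fill it in place and return it."""
    if out is None:
        return self._read_buffers(length)  # hypothetical: current allocating path
    if length is None:
        length = len(out)
    if not 0 <= length <= len(out):  # range check against buffer overflows
        raise ValueError('length must fit within the out buffer')
    return self._read_into(length, out)  # hypothetical: BYOB path as above
```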
Thoughts?
By the way, there are still copies happening inside of libhdfs3 itself which, when patched out/worked around, give another nice speedup (in short: setting `input.localread.default.buffersize=1` and patching out checksumming, so that reads go directly into the buffer given by the user).
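The buffer-size setting can presumably be passed via `HDFileSystem`'s `pars` argument (assuming `pars` is handed straight through to libhdfs3's configuration, as the hdfs3 docs suggest; the actual effect depends on the libhdfs3 build):

```python
import hdfs3

# assumption: pars passes raw libhdfs3 config keys straight through
hdfs = hdfs3.HDFileSystem(host='localhost', port=8020,
                          pars={'input.localread.default.buffersize': '1'})
```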
Where is libhdfs3 development happening these days? Is ContinuumIO/libhdfs3-downstream the right place to work on this?