This repository was archived by the owner on Feb 10, 2021. It is now read-only.

BYOB ("Bring Your Own Buffer") read interface #160

Closed
@sk1p

Description

Currently, HDFile.read(...) involves both allocating and copying buffers. When reading locally with short-circuit reads enabled, this can become a bottleneck. Here is HDFile.read annotated with the allocations and copies:

    def read(self, length=None):
        """ Read bytes from open file """
        if not _lib.hdfsFileIsOpenForRead(self._handle):
            raise IOError('File not read mode')
        buffers = []
        buffer_size = self.buff if self.buff != 0 else DEFAULT_READ_BUFFER_SIZE

        if length is None:
            out = 1
            while out:
                out = self.read(buffer_size)
                buffers.append(out)
        else:
            while length:
                bufsize = min(buffer_size, length)
                p = ctypes.create_string_buffer(bufsize)  # <-- allocation for each slice that is read
                ret = _lib.hdfsRead(
                    self._fs, self._handle, p, ctypes.c_int32(bufsize))
                if ret == 0:
                    break
                if ret > 0:
                    if ret < bufsize:
                        buffers.append(p.raw[:ret])   # <-- .raw creates a copy
                    elif ret == bufsize:
                        buffers.append(p.raw)  # <-- .raw again
                    length -= ret
                else:
                    raise IOError('Read file %s Failed:' % self.path, -ret)

        return b''.join(buffers)  # <-- this of course has to create a copy again

I suggest allowing the caller to supply the output buffer, so that no additional copying or buffering is needed. Here is a prototype implementation that, in one of my tests, speeds up reading large binary data by a factor of about 4:

    import ctypes

    import hdfs3.core

    def byob_read(self, length, out):
        """
        Read ``length`` bytes from the file into the ``out`` buffer.

        ``out`` needs to be a ctypes array, for example created with
        ``ctypes.create_string_buffer``, and must be at least ``length``
        bytes long.
        """
        _lib = hdfs3.core._lib
        if not _lib.hdfsFileIsOpenForRead(self._handle):
            raise IOError('File not read mode')
        bufsize = length
        bufpos = 0

        while length:
            # Pointer into ``out`` at the current write offset, so each
            # hdfsRead call appends directly after the bytes read so far.
            bufp = ctypes.byref(out, bufpos)
            ret = _lib.hdfsRead(
                self._fs, self._handle, bufp, ctypes.c_int32(bufsize - bufpos))
            if ret == 0:  # EOF
                break
            if ret > 0:
                length -= ret
                bufpos += ret
            else:
                raise IOError('Read file %s Failed:' % self.path, -ret)
        return out

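To illustrate the intended usage pattern, here is a hedged sketch that fills one preallocated buffer over and over; ``hdfs`` (an HDFileSystem instance), the fixed record size, and the numpy post-processing step are assumptions for illustration only:

    import ctypes
    import numpy as np

    RECORD = 4 * 1024 * 1024                      # assumed fixed record size
    buf = ctypes.create_string_buffer(RECORD)     # allocated once, reused

    n_records = hdfs.info('/data/records.bin')['size'] // RECORD
    with hdfs.open('/data/records.bin', 'rb') as f:
        for _ in range(n_records):
            byob_read(f, RECORD, buf)             # fills buf in place, no copies
            arr = np.frombuffer(buf, dtype='u1')  # zero-copy view over buf
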
The final interface should probably be something like read(self, length=None, out=None) to support both modes of operation without polluting the API namespace, and there should be some range checks to prevent buffer overflows.
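
A minimal sketch of how the two modes could share one entry point (the ``ctypes.sizeof`` range check and the ``_read_copying``/``_read_into`` helper names are hypothetical, not the final design):

    import ctypes

    def read(self, length=None, out=None):
        """Read bytes; if ``out`` is given, fill it in place without copies."""
        if out is None:
            return self._read_copying(length)    # existing allocate-and-join path
        if length is None:
            length = ctypes.sizeof(out)          # default: fill the whole buffer
        if length > ctypes.sizeof(out):          # range check to prevent overflow
            raise ValueError('out buffer is smaller than requested length')
        return self._read_into(length, out)      # byob_read-style in-place fill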

Thoughts?

By the way, there are still copies happening inside libhdfs3 which, when patched out or worked around, give another nice speedup (in short: setting input.localread.default.buffersize=1 and patching out checksumming → reads go directly into the user-supplied buffer). Where is libhdfs3 development happening these days? Is ContinuumIO/libhdfs3-downstream the right place to work on this?
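
For what it's worth, libhdfs3 configuration keys like the one above can be passed through hdfs3's ``pars`` argument; whether this particular setting helps depends on the local-read setup (host and port here are placeholders):

    import hdfs3

    # Pass the libhdfs3 option discussed above straight to the client.
    hdfs = hdfs3.HDFileSystem('namenode', port=8020,
                              pars={'input.localread.default.buffersize': '1'})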
