BYOB ("Bring Your Own Buffer") read interface #160
Description
Currently, `HDFile.read(...)` involves both allocation and copying of buffers. When reading locally, with short-circuit reads enabled, this can become a bottleneck. Here is `HDFile.read` annotated with the allocations and copies:
```python
def read(self, length=None):
    """ Read bytes from open file """
    if not _lib.hdfsFileIsOpenForRead(self._handle):
        raise IOError('File not read mode')
    buffers = []
    buffer_size = self.buff if self.buff != 0 else DEFAULT_READ_BUFFER_SIZE
    if length is None:
        out = 1
        while out:
            out = self.read(buffer_size)
            buffers.append(out)
    else:
        while length:
            bufsize = min(buffer_size, length)
            p = ctypes.create_string_buffer(bufsize)  # <-- allocation for each slice that is read
            ret = _lib.hdfsRead(
                self._fs, self._handle, p, ctypes.c_int32(bufsize))
            if ret == 0:
                break
            if ret > 0:
                if ret < bufsize:
                    buffers.append(p.raw[:ret])  # <-- .raw creates a copy
                elif ret == bufsize:
                    buffers.append(p.raw)  # <-- .raw again
                length -= ret
            else:
                raise IOError('Read file %s Failed:' % self.path, -ret)
    return b''.join(buffers)  # <-- this of course has to create a copy again
```
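To make the `.raw` annotation concrete, here is a tiny illustration (CPython behavior):

```python
import ctypes

p = ctypes.create_string_buffer(8)
a = p.raw
b = p.raw
# every access to .raw builds a fresh bytes copy of the buffer, so each
# slice is copied once here and once more in the final b''.join(...)
assert a == b and a is not b
```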
I suggest adding a way to pass in the output buffer, without any additional copying or buffering. Here is a prototype implementation that, in one of my tests, speeds up reading large binary data by a factor of about 4:
```python
import ctypes

import hdfs3.core


def byob_read(self, length, out):
    """Read ``length`` bytes from the file into the ``out`` buffer.

    ``out`` needs to be a ctypes array, for example created with
    ``ctypes.create_string_buffer``, and must be at least ``length``
    bytes long.
    """
    _lib = hdfs3.core._lib
    if not _lib.hdfsFileIsOpenForRead(self._handle):
        raise IOError('File not read mode')
    bufsize = length
    bufpos = 0
    while length:
        bufp = ctypes.byref(out, bufpos)  # pointer into out at the current offset
        ret = _lib.hdfsRead(
            self._fs, self._handle, bufp, ctypes.c_int32(bufsize - bufpos))
        if ret == 0:  # EOF
            break
        if ret > 0:
            length -= ret
            bufpos += ret
        else:
            raise IOError('Read file %s Failed:' % self.path, -ret)
    return out
```
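A minimal usage sketch (the monkeypatching onto `HDFile`, the connection parameters, and the path are illustrative only):

```python
import ctypes

import hdfs3
import hdfs3.core

# attach the prototype for experimentation (illustrative only)
hdfs3.core.HDFile.byob_read = byob_read

hdfs = hdfs3.HDFileSystem(host='localhost', port=8020)  # placeholder connection
length = 64 * 2**20  # 64 MiB
buf = ctypes.create_string_buffer(length)  # allocated once, reused for every read

with hdfs.open('/tmp/large.bin', 'rb') as f:
    f.byob_read(length, buf)  # fills buf in place; no intermediate copies

view = memoryview(buf)  # zero-copy view for downstream consumers
```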
The final interface should probably be something like `read(self, length=None, out=None)`, to support both modes of operation without polluting the API namespace, and there should be some range checks to prevent buffer overflows. A rough sketch follows.
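Here, `_read_buffers` and `_read_into` are hypothetical helper names for the two code paths shown above, not existing API:

```python
def read(self, length=None, out=None):
    """Read bytes; if ``out`` is given, fill it in place and return it."""
    if out is None:
        return self._read_buffers(length)  # hypothetical: current allocating path
    if length is None:
        length = len(out)
    if not 0 <= length <= len(out):  # range check against buffer overflows
        raise ValueError('length must fit within the out buffer')
    return self._read_into(length, out)  # hypothetical: BYOB path as above
```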
Thoughts?
By the way, there are still copies happening inside of libhdfs3 itself which, when patched out/worked around, give another nice speedup (in short: setting `input.localread.default.buffersize=1` and patching out checksumming, so that reads go directly into the buffer given by the user).
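The buffer-size setting can presumably be passed via `HDFileSystem`'s `pars` argument (assuming `pars` is handed straight through to libhdfs3's configuration, as the hdfs3 docs suggest; the actual effect depends on the libhdfs3 build):

```python
import hdfs3

# assumption: pars passes raw libhdfs3 config keys straight through
hdfs = hdfs3.HDFileSystem(host='localhost', port=8020,
                          pars={'input.localread.default.buffersize': '1'})
```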
Where is libhdfs3 development happening these days? Is ContinuumIO/libhdfs3-downstream the right place to work on this?