Allow for unbuffered reading #10
Comments
Is there any way to tell how much of the extra time is spent copying and how much in CRC verification? Of course, I agree that the user should be able to disable verification for the sake of performance, but perhaps the copy can be avoided in every case. I see you mention that in the linked issue. To your specific question: yes, we can add any configuration parameter at the time of creating the C hdfs client. Hadoop services generally ignore any parameter not meant for them anyway, but I would not put this into any site.xml. The user need never know that the config exists; there could be a
I'm actually not sure. The profile I'm looking at right now is a bit confusing (all over the place); I'll look at it in detail tomorrow. First guess: the buffering destroys cache efficiency and other parts of the program are affected by that. Anyways, here is the profile with And this with Both profiles of my Python hdfs3 testcase recorded with
Ahh, nice! Hadn't looked into how the lib is initialized, makes sense now!
So, the really bad performance was actually a combination of setting a very small
Here is a profile with verification and buffering enabled (more detailed as I took it with The CRC verification takes up ~34% of samples,
@sk1p, this is now released on conda-forge: conda-forge/libhdfs3-feedstock#19
If you would like to add a kwarg to HDFileSystem, which passes the correct config to turn off CRC, that would be very reasonable.
Thanks! PR for
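For illustration, here is a minimal sketch of how such a kwarg could be turned into client configuration. The keyword name `verify_checksum` and the helper `build_pars` are hypothetical, and the key `input.read.default.verify` is only the name proposed in this issue, so it may differ from whatever libhdfs3 actually accepts. The helper only builds a dict, so it runs without a cluster.

```python
# Property name as proposed in this issue; treat it as an assumption,
# not a confirmed libhdfs3 setting.
VERIFY_KEY = "input.read.default.verify"


def build_pars(verify_checksum=True, pars=None):
    """Return a config dict with the checksum setting merged in.

    `verify_checksum` is a hypothetical keyword name. Explicit entries in
    `pars` take precedence, so a caller can still set the key by hand.
    """
    merged = {VERIFY_KEY: "true" if verify_checksum else "false"}
    merged.update(pars or {})
    return merged


if __name__ == "__main__":
    print(build_pars(verify_checksum=False))
```

The resulting dict would then be handed to the client at construction time, for example via the `pars` argument of `HDFileSystem`, assuming that argument is forwarded to the libhdfs3 builder as configuration overrides.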
In this dask/hdfs3 issue I detailed that for some I/O-bound workloads, buffering and copying can become a bottleneck. For these cases it would be nice if the buffering could be disabled. I found out that setting `input.localread.default.buffersize` to a small number and disabling CRC verification results in zero memory copies; that is, data is read directly into the buffer provided by the user. That should be exactly the case if this condition is hit.

I think the best long-term way would be to compute the CRC on the user buffer somehow, but in the meantime, it would be nice if verification could be disabled from the C API.
I propose adding a new configuration property, let's say `input.read.default.verify`, to set the value for verification when opening the file via the C API.

Now, my question is: does the configuration only apply to this one place in libhdfs3, or to others as well? And can we just invent our own configuration parameters? Is the namespace somehow divided such that libhdfs3 owns `input.*`?
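To make the intended semantics of the proposed property concrete, here is a small Python model (not libhdfs3 code). It assumes verification stays enabled when the key is absent, so existing behaviour is unchanged, and that unrelated keys are simply ignored. The helper names and the accepted spellings of true/false are assumptions for the sketch.

```python
def get_bool(conf, key, default):
    """Read a boolean property from a string-valued config mapping."""
    raw = conf.get(key)
    if raw is None:
        return default  # key not set: fall back to the default
    value = raw.strip().lower()
    if value in ("true", "1"):
        return True
    if value in ("false", "0"):
        return False
    raise ValueError(f"invalid boolean for {key}: {raw!r}")


def should_verify(conf):
    # Assumed default: verify checksums unless explicitly disabled.
    return get_bool(conf, "input.read.default.verify", True)


if __name__ == "__main__":
    print(should_verify({}))
    print(should_verify({"input.read.default.verify": "false"}))
```

Defaulting to verification keeps the safe behaviour for every caller who does not know about the property, which matches the idea that users should only opt out deliberately.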