Skip to content

Fix file-like objects without seek support not working #81

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

ulikoehler
Copy link
Contributor

This is the revised version of #80

Currently the Reader constructor checks if hasattr(shp, "seek"). If not, then seek(0) is not called - however, the other functions in the Reader class will call seek() in any case, so it doesn't make sense to avoid calling seek() here.

This PR fixes this by introducing a new argument to Reader.__init__ (allow_copy). If this is set to True, the code will initialize a io.BytesIO with the file-like object's content.

Docs for allow_copy:

    If initializing the reader with a file-like object which
    does not support seek(), you must set allow_copy=True
    to allow the Reader to copy the entire file in memory.
    This is set to False by default in order to avoid
    large files being copyied into memory without user intention.

One usecase where this is useful is to support reading directly from a ZIP file without unzipping, because the file-likes which can be opened using the standard python zipfile library derive from io.BufferedIOBase which raises io.UnsupportedOperation when calling seek().

@ulikoehler
Copy link
Contributor Author

I wrote a blog article detailing how to read from a ZIP file without actually unzipping which depends on this pull request:
https://techoverflow.net/2017/02/22/reading-a-shapefile-directly-from-a-zip-using-pyshp/

Any feedback is welcome.

@karimbahgat
Copy link
Collaborator

This is a great initiative. In a way, this PR seems to internalize Joel's recipe for reading directly from zipfiles.

However, the use-case is also a relatively rare one, so that the cost of adding a new argument (confusion and complexity for the user) seems to outweighs the benefits.

So my two cents for now is to not merge this. I would however be interested in a PR that automatically supports this functionality only when necessary, e.g. loading with BytesIO if the path is inside a zipfile.

PS: The link to your article seems to be broken.

@ulikoehler
Copy link
Contributor Author

Thanks for your feedback. The link didnt work due to to a server misconfiguration on my side, it has been fixed now.

Maybe I can first tell you about my usecase in order to convince you that the usecase isn't that esoteric ;-)
I dont really deal with GIS data on a day-to-day level but I noticed there is no openly available, customizable source for country contours (including vector graphics).

That is why I decided to develop MapzMaker which allows anyone to create vector or rasterized contour maps with customizable projections, colors and formats.

MapzMaker uses NaturalEarth. Although, as I said, I'm not often dealing with shapefiles, I assume files being distributed just like NE (as a ZIP) is very common.

That being said I'd like to offer you a compromise.

The reason why I added the extra option is to avoid a problem I often encounter with large datasets: Some libraries copy data to memory without asking - but I might not have enough RAM to deal with that. Therefore, my computer starts to thrash. The option I added in the PR tries to avoid that by forcing the user to actively allow the implementation to copy the data (allow_copy=True).

But as your complaint seems to be only the option (which is not required to implement the fix for ZIP files and directly-from-the-internet features), I'll be happy to remove it in order to avoid confusing the users.

Just to make sure its purpose is clear to you: The allow_copy option is only relevant if the following conditions are met simultaneously:

  • The user reads a dataset which is too large to be copied into RAM and
  • The file-like object does not support the seek operation
    The 2nd condition is true for all types of in-memory files and other wrapper structures. AFAIK there is no technical reason why io.BytesIO etc. cant implement seek. It just has not been done yet.

Copying the data is required in some cases to avoid issues with a file-like object being invalidated (this is the case e.g. for ZIP files).

In all other cases, allow_copy won't have any effect at all.

I hope this compromise and information will allow you to reiterate the decision not to merge that ;-) . If you'd like the option to be removed, please feel free to tell me. I'll implement it as soon as possible.

@karimbahgat
Copy link
Collaborator

Thanks for your detailed followup. Shapefiles do indeed frequently come inside zipfiles, so it would be useful with out-of-the-box support for reading them, esp reading from the internet via urllib.

Curious to hear what @GeospatialPython thinks, but I think its worth it to drop the option and just default to loading the entire file in memory when there is no seek method (eg zip and urllib file objects). This memory aspect can be mentioned in the readme, i think theres already a section on reading from zipfiles, so great if you update that too.

So if you make a new PR without the option, I think this would be a nice convenience to the user, and I would be happy to merge it.

@karimbahgat
Copy link
Collaborator

Updated and merged in #96.

@karimbahgat karimbahgat closed this May 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants