write large dataset without iterating? #311

Open
ear9mrn opened this issue Mar 6, 2025 · 1 comment

ear9mrn commented Mar 6, 2025

What's your question?

Hi,
I have my data (and records) in a list/array. Is there a way to create a new shapefile from these without having to iterate over all the points/records (it's a bit slow), i.e. to pass the list/array directly?

Thanks.

JamesParrott (Collaborator) commented Mar 10, 2025

Hi there,
Unfortunately, I'm fairly sure this is a no.

Can PyShp skip any unnecessary internal iterations and offer the user a shortcut?
No. There aren't any to skip. After the fields and shape type are defined, sequentially defining shapes and records, e.g. via iteration, is fundamentally how PyShp must be used; the usual pattern is sketched below. Under the hood, PyShp writes to a file-like or stream-like object, which is naturally traversed by iteration (although it may also have tell and seek methods).
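For reference, the usual pattern looks something like this (a minimal sketch; the field name and point data are placeholders):

```python
import shapefile  # PyShp

# Standard PyShp usage: define the fields, then add each shape and
# its matching record sequentially. The points list is made up.
points = [(-120.0, 35.0, "a"), (-74.0, 40.7, "b")]

with shapefile.Writer("out", shapeType=shapefile.POINT) as w:
    w.field("name", "C")   # one character field
    for x, y, name in points:
        w.point(x, y)      # write the shape...
        w.record(name)     # ...and its matching record
```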

Can any shapefile library do this?
Creating an async/non-sequential, parallelisable way of writing a shapefile is challenging for any library that supports interspersing records with null shapes. It wouldn't be unreasonable to drop null shapes (or at least to offer the option to do so). That would allow the record offsets in the .shp file to be calculated a priori, at least for fixed-size shape types such as points, and written to directly with f.seek, as the sketch below illustrates.
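To show why fixed-size records matter, here is a rough sketch (not PyShp's API, just raw .shp records per the ESRI spec) of seeking to a precomputed offset for Point shapes, where every record is a fixed 28 bytes:

```python
import struct

# Illustration only: with null shapes dropped, every Point record in
# a .shp file is the same size, so each record's offset is known in
# advance and can be written to directly.
HEADER_SIZE = 100   # .shp files start with a 100-byte header
RECORD_SIZE = 28    # 8-byte record header + 20-byte point content

def write_point_at(f, index, x, y):
    """Write point number `index` (0-based) directly at its offset."""
    f.seek(HEADER_SIZE + index * RECORD_SIZE)
    # Record header (big-endian): record number (1-based) and
    # content length in 16-bit words (20 bytes -> 10 words).
    f.write(struct.pack(">2i", index + 1, 10))
    # Record content (little-endian): shape type 1 (Point), then x, y.
    f.write(struct.pack("<i2d", 1, x, y))
```

(The 100-byte file header and the companion .shx and .dbf files would still have to be written as usual.)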

For there to be any speed-up in practice, however, some way to allow multiple processes to write to the same file would be needed. I don't know about SSDs, but I'm fairly sure that won't work on a traditional spinning HDD with a single magnetic read/write head (nor on tape), and I'm not sure how it interacts with disk buses and operating-system file permissions, let alone how to manage it from Python. It's also an IO-bound task, not a CPU-bound one: ultimately, the speed the user sees when writing a shapefile (with any tool) is limited by the maximum write speed of their file system. Even on storage systems and OSs that do allow multiple concurrent writes to the same file, in GIS applications it often makes sense instead to split the shapes and records over multiple shapefiles, according to some geographical or other criterion the user chooses (sketched below).
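As a rough sketch of that last approach (the file names, the partitioning criterion, and the data are all made up), each worker process can own its own shapefile, so no file handle is ever shared:

```python
import shapefile  # PyShp
from concurrent.futures import ProcessPoolExecutor

def write_partition(args):
    # Each worker writes a separate shapefile; the same sequential
    # Writer loop as above, just one per process.
    path, points = args
    with shapefile.Writer(path, shapeType=shapefile.POINT) as w:
        w.field("name", "C")
        for x, y, name in points:
            w.point(x, y)
            w.record(name)

if __name__ == "__main__":
    # Partitioned here by longitude; any criterion would do.
    partitions = [
        ("west", [(-120.0, 35.0, "a"), (-118.0, 34.0, "b")]),
        ("east", [(-74.0, 40.7, "c"), (-71.0, 42.3, "d")]),
    ]
    with ProcessPoolExecutor() as pool:
        list(pool.map(write_partition, partitions))
```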

I recommend looking into alternative data formats to shapefiles, e.g. geospatial databases, which are designed as backend storage and better support multiple concurrent write operations.
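For example (a sketch assuming geopandas and shapely, which are separate libraries, not PyShp), a GeoPackage is a single-file spatial database backed by SQLite:

```python
import geopandas as gpd
from shapely.geometry import Point

# Write the same placeholder points to a GeoPackage instead of a
# shapefile; many GIS tools and databases can read and write .gpkg.
gdf = gpd.GeoDataFrame(
    {"name": ["a", "b"]},
    geometry=[Point(-120.0, 35.0), Point(-74.0, 40.7)],
    crs="EPSG:4326",
)
gdf.to_file("points.gpkg", driver="GPKG")
```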
