Skip to content

Performance to construct a large ItemCollection #49

@TomAugspurger

Description

@TomAugspurger

This issue documents some slowness on moderately large queries. In the snippet below we fetch 8,234 items. It takes about a minute to construct the results.

import rasterio.features
import pystac_client

area_of_interest = {
    "type": "Polygon",
        "coordinates": [
          [
            [
              -123.46435546875,
              46.4605655457854
            ],
            [
              -119.608154296875,
              46.4605655457854
            ],
            [
              -119.608154296875,
              48.26125565204099
            ],
            [
              -123.46435546875,
              48.26125565204099
            ],
            [
              -123.46435546875,
              46.4605655457854
            ]
          ]
        ]
}
bbox = rasterio.features.bounds(area_of_interest)
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

search = stac.search(
    bbox=bbox,
    datetime="2016-01-01/2020-12-31",
    collections=["sentinel-2-l2a"],
    limit=2500,  # fetch items in batches of 2500
)

print(search.matched())  # 8234
items = list(search.items())

I ran list(search.items()) under snakeviz and came up with this result: https://gistcdn.rawgit.org/TomAugspurger/fb5b3bde8cee09d2d9aa2f7215edf2b2/94e4ec2ae97bec2169f9263e8f41183418e885d9/mosaic-static.html

A few notes:

  1. We're spend roughly 2/3s of our time in stac_io.get_pages, which includes IO, waiting for the endpoint (and maybe parsing the JSON into Python objects?)
  2. We spend the other 1/3 of our time in item_collection.from_dict

Some ideas for optimization:

  1. Most of the time in item_collections.from_dict is spent on a deepcopy in pystac.Item.from_dict. It might be safe to skip that copy (since these should be coming off the network with no other references) and provide a copy=False flag to pystac.Item.from_dict, to allow it to mutate the incoming dict.
  2. Maybe pystac_client.Client or .search could provide a raw=True/False flag to allow skipping constructing pystac Items?
  3. Maybe some kind of async magic would speed up the reads? Hard to say, since I don't know how much time is spent waiting for results vs. parsing JSON. I don't know if it's a good idea to parse JSON on the asyncio event loop.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions