Description
While converting some stackstac examples to pystac-client for gjoseph92/stackstac#81, I noticed some queries that were substantially slower with pystac-client than the equivalent query with sat-search. Not as bad as stac-utils/pystac#546, but 4x slower:
import satsearch
import pystac_client
%%time
items = satsearch.Search(
url="https://earth-search.aws.element84.com/v0",
intersects=dict(type="Point", coordinates=[-106, 35.7]),
collections=["sentinel-s2-l2a-cogs"],
datetime="2019-01-01/2020-01-01"
).items()
# CPU times: user 110 ms, sys: 31.9 ms, total: 142 ms
# Wall time: 2.23 s
%%time
items = pystac_client.Client.open(
"https://earth-search.aws.element84.com/v0"
).search(
intersects=dict(type="Point", coordinates=[-106, 35.7]),
collections=["sentinel-s2-l2a-cogs"],
datetime="2019-01-01/2020-01-01"
).get_all_items()
# CPU times: user 334 ms, sys: 59.5 ms, total: 394 ms
# Wall time: 7.05 s
Profiling, I realized pystac-client is fetching substantially more pages. sat-search sets limit=10000 by default, so this search required just a single request, whereas pystac-client does not specify a limit by default and ends up getting only 10 items per page. Indeed, setting limit=10000 for pystac-client gives the same performance:
%%time
items = pystac_client.Client.open(
"https://earth-search.aws.element84.com/v0"
).search(
intersects=dict(type="Point", coordinates=[-106, 35.7]),
collections=["sentinel-s2-l2a-cogs"],
datetime="2019-01-01/2020-01-01",
limit=10000,
).get_all_items()
# CPU times: user 125 ms, sys: 22.9 ms, total: 148 ms
# Wall time: 1.59 s
I think we should set a default value for limit in pystac-client, to something high (like 10,000). It makes things work out-of-the-box better, and from an API compliance perspective I think it's a valid choice.
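To make the cost of small pages concrete, here is a rough sketch of how the number of HTTP requests scales with page size. The helper name `requests_needed` and the match count of 95 are hypothetical, chosen only for illustration; the actual number of items matched by the query above isn't stated here.

```python
import math


def requests_needed(total_matches: int, page_size: int) -> int:
    """Estimate how many page requests a paginated search needs.

    Assumes the server returns full pages until the results run out,
    and that at least one request is always made (even for zero matches).
    """
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    return max(1, math.ceil(total_matches / page_size))


# Hypothetical match count, for illustration only.
matches = 95
print(requests_needed(matches, 10))      # page size of 10 (the spec default)
print(requests_needed(matches, 10_000))  # page size of 10000 (sat-search's default)
```

Since each request is a full network round trip, the wall time grows roughly with the request count, which is consistent with the CPU time being small in all three timings above.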
Justification
The docstring for limit in pystac-client says
The maximum number of items to return per page. Defaults to None, which falls back to the limit set by the service.
This is slightly misleading: it uses the default client limit set by the service, not the service's own limit for the maximum number of items it's willing to send per page.
Looking at the STAC API spec:
| Parameter | Type | Source API | Description |
|---|---|---|---|
| limit | integer | OAFeat | The maximum number of results to return (page size). Defaults to 10 |
Hence where the 10 comes from.
If we read about limits in the OGC Features API spec (OAFeat) [emphasis mine]:
So (using the default/maximum values of 10/10000 from the OpenAPI fragment in requirement /req/core/fc-limit-definition):
- If you ask for 10, you will get 0 to 10 (as requested) and if there are more, a next link;
- If you don’t specify a limit, you will get 0 to 10 (default) and if there are more, a next link;
- If you ask for 50000, you might get up to 10000 (server-limited) and if there are more, a next link;
- If you follow the next link from the previous response, you might get up to 10000 additional features and if there are more, a next link.
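The four cases above can be sketched as a small function. This is only a model of the behavior described in the list, assuming the example default/maximum of 10/10000; the constant and function names are made up here, and a real server may use different values.

```python
from typing import Optional

# Example values from the OpenAPI fragment referenced above; real servers may differ.
SERVER_DEFAULT_LIMIT = 10
SERVER_MAX_LIMIT = 10_000


def effective_page_size(requested: Optional[int]) -> int:
    """Return the largest page the server would send for a given `limit`.

    None models a request that omits the `limit` parameter entirely.
    """
    if requested is None:
        return SERVER_DEFAULT_LIMIT
    return min(requested, SERVER_MAX_LIMIT)


print(effective_page_size(10))     # client asked for 10
print(effective_page_size(None))   # no limit given: server default applies
print(effective_page_size(50000))  # capped by the server maximum
```

In this model, omitting `limit` is the only case where the server default is used; any explicit value is honored up to the server's own cap.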
Basically, the limit parameter is meant to protect the client from getting too much data at once. The server will protect itself from processing too much. So pystac-client should set limit to the maximum number of items it's reasonably willing to process at once. Of course this could vary based on the data source, but I feel like >10 is a reasonable default. But I don't know what the policy is in pystac-client around setting defaults that aren't derived from the STAC spec.
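As a sketch of what such a default could look like on the client side, the helper below fills in `limit` only when the caller hasn't set one. `DEFAULT_LIMIT` and `with_default_limit` are hypothetical names, not part of pystac-client, and 10,000 is just the value suggested above.

```python
# Hypothetical client-side default; not an existing pystac-client setting.
DEFAULT_LIMIT = 10_000


def with_default_limit(search_kwargs: dict, default: int = DEFAULT_LIMIT) -> dict:
    """Return a copy of the search arguments with `limit` filled in if unset.

    An explicit `limit` from the caller is always kept as-is.
    """
    merged = dict(search_kwargs)
    if merged.get("limit") is None:
        merged["limit"] = default
    return merged


params = with_default_limit(
    dict(
        intersects=dict(type="Point", coordinates=[-106, 35.7]),
        collections=["sentinel-s2-l2a-cogs"],
        datetime="2019-01-01/2020-01-01",
    )
)
print(params["limit"])
```

The resulting dict could then be passed along as `client.search(**params)`. Users who want smaller pages would still be able to pass their own `limit`, so the default only changes the out-of-the-box behavior.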