Description
I am working on creating a mosaic of Sentinel-2 data for a sizable area of interest -- around 600km by 400km. The goal is to get a median of the non-cloudy pixels over a 2-month period. My experience is that, with the code I've developed, this isn't reliably doable even on a 400-core cluster.
To dig into this, I have a reproducible example over a smaller area -- 1/16 of the above (roughly 150km x 100km) -- running on a 100-core cluster. It succeeds in about 1 out of 3 runs, otherwise dying with various errors. Below is the code, followed by the output of 3 runs: the first died with a "could not find dependent" error, the second succeeded, and the third died with "set changed size during iteration". (Note, regarding my previously filed issue #11: this is with a static cluster size, where I wait for all workers before proceeding.)
Just to clarify why the code is organized the way it is: I take the median over time of each band after NaN-ing out the pixels that the Sentinel-2 SCL band identifies as clouds. In other words, I'm doing pixel-by-pixel cloud removal rather than filtering the STAC query on overall scene cloud cover, in order to maximize the information I can get from the tiles. This is probably contributing to the complexity, but it also seems like a very reasonable thing to do.
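In isolation, that masking-and-median step amounts to the following (just a sketch pulled out for clarity; `data` and `rgb_bands` are the objects built in the full script below, and 8/9/10 are the SCL classes for cloud medium probability, cloud high probability, and thin cirrus):

cloud_scl_values = [8, 9, 10]                        # SCL cloud classes
mask = ~data.sel(band='SCL').isin(cloud_scl_values)  # True where a pixel is not flagged as cloud
img = data.sel(band=rgb_bands).where(mask).median(dim='time')  # per-pixel temporal median of the non-cloudy values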
My questions are:
- Is the code inefficiently organized? Is there an obviously better way to do it? Can the where/isin/median processing be merged into the stack call (even though the isin uses the SCL band to mask pixels in the other bands)?
- Why are these various "missing dependent" / "set changed size" errors happening? Am I trying to do too much with too little worker memory? Is there a rule of thumb that would tell me how many workers I need for a particular computation? The full ~670 GB cube should roughly fit in the aggregate memory of the workers -- that's the rule of thumb I used to arrive at 100 cores (see the back-of-envelope sketch after these questions).
Side note regarding the scheduler: for larger areas (e.g. the full 600km x 400km) I can't even reliably get the computation going, even with 400 cores. I think it's overwhelming the scheduler. Should I try to (and can I) increase the scheduler's memory somehow?
Or perhaps the real problem is that the computation is simply out of scope unless it's done in a more algorithmically aligned manner.
Very interested in your thoughts.
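For reference, the back-of-envelope behind the 100-worker figure (the cube size is the number reported in the run logs below; the per-worker share is just the implied division, not a measured or configured limit):

cube_gb = 673.4   # "Full data size GB" reported by the script
nworkers = 100
print(f"{cube_gb / nworkers:.1f} GB of cube data per worker")  # ~6.7 GB, before any task intermediates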
import numpy as np
import dask
import matplotlib.pyplot as plt
from pystac_client import Client as pystac_client
#from pystac.extensions.eo import EOExtension as eo
import planetary_computer as pc
import stackstac
from dask.distributed import LocalCluster, Client, wait
import time
start_time = time.time()
def log(*args):
    el = time.time() - start_time
    current_time = time.strftime("%H:%M:%S", time.localtime())
    print(current_time, el, *args, flush=True)
log("STARTING")
import dask_gateway
nworkers = 100
cluster = dask_gateway.GatewayCluster()
client = cluster.get_client()
cluster.scale(nworkers)
log("DASHBOARD:", cluster.dashboard_link)
client.wait_for_workers(nworkers)
log("WORKERS READY.")
aoi_bounds_zim = (27.11764655, -21.18948202, 32.8563901, -16.99215633) # big chunk of zimbabwe
(xmn,ymn,xmx,ymx) = aoi_bounds_zim
aoi_bounds_half = (xmn, ymn, (xmn+xmx)/2, (ymn+ymx)/2) # cut it in half in each dimension (/4 in area)
aoi_bounds_quarter = (xmn, ymn, (3*xmn+xmx)/4, (3*ymn+ymx)/4) # cut it in quarter in each dimension (/16 in area)
#aoi_bounds_tiny = (xmn, ymn, xmn+0.1, ymn+0.1)
aoi_bounds = aoi_bounds_quarter
date_range = "2020-01-01/2020-03-01"
all_bands = ['B02', 'B03', 'B04', 'B08', 'SCL']
rgb_bands = ['B04', 'B03', 'B02']
catalog = pystac_client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=aoi_bounds,
    datetime=date_range,
    #query={"eo:cloud_cover": {"lt": 10}},
)
# Check how many items were returned
items = list(search.get_items())
print(f"Returned {len(items)} Items")
# Note that there are more than 1 tile here:
print("Sentinel 2 Unique Tiles:", np.unique([i.properties['s2:mgrs_tile'] for i in items]))
# Print unique CRSs:
print("EPSGS:", np.unique([i.properties['proj:epsg'] for i in items]))
# Stack it
epsg_int = 4326 # because we (may) have multiple UTM zones, stick with lat/long
chunksize = 2048
signed_items = [pc.sign(item).to_dict() for item in search.get_items()]
data = (
    stackstac.stack(
        signed_items,
        epsg=epsg_int,
        #resolution=10,
        #bounds=bounds,
        bounds_latlon=aoi_bounds,
        assets=all_bands,
        chunksize=chunksize,
    )
    .where(lambda x: x > 0, other=np.nan)  # Sentinel-2 uses 0 as nodata
    .assign_coords(time=lambda x: x.time.dt.round("D"))  # round time to daily for nicer plot labels
    .groupby('time')  # one tile per datetime
    .apply(stackstac.mosaic)  # mosaic together tiles for the same datetime
)
log("Full data size GB:", data.nbytes/1e9)
log("Dimensions:", data.dims)
cloud_scl_values = [8, 9, 10]  # Sentinel-2 SCL cloud classes: medium prob., high prob., thin cirrus
mask = ~data.sel(band='SCL').isin(cloud_scl_values)
img = (data.sel(band=rgb_bands)
       .where(mask)
       .median(dim='time'))
log("Image size GB:", img.nbytes/1e9)
log("CALLING PERSIST")
img_p = img.persist()
log("AFTER CALL TO PERSIST, WAITING")
wait(img_p)
log("PERSIST COMPLETE, PLOTTING")
SCALE = 4
img_c = (img_p
         .coarsen({'x': SCALE, 'y': SCALE}, boundary='pad')
         .mean(skipna=True, keep_attrs=True)
         .compute())
fig,ax = plt.subplots(1,1)
img_c.plot.imshow(ax=ax, x='x', y='y', robust=True)
fig.savefig("foo.png")
plt.show()
log("AFTER PLOT")
cluster.close()
RUN 1 -- missing dependent
17:20:04 9.5367431640625e-07 starting
17:20:17 13.552658557891846 DASHBOARD: https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/clusters/prod.cb21467cb71b46dc9aa9634aa6e82591/status
17:25:24 319.86250376701355 workers ready.
Returned 204 Items
Sentinel 2 Unique Tiles: ['35KNS' '35KNT' '35KNU' '35KPS' '35KPT' '35KPU' '35KQS' '35KQT' '35KQU'
'35KRS' '35KRT' '35KRU']
EPSGS: [32735]
/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/accumulate_metadata.py:151: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
props_arr = np.squeeze(np.array(props))
17:25:28 323.98693227767944 Full data size GB: 673.44177984
17:25:28 323.9869964122772 Dimensions: ('time', 'band', 'y', 'x')
17:25:29 324.8450291156769 Image size GB: 16.836044496
17:25:29 324.84509015083313 CALLING PERSIST
17:26:34 390.26140332221985 AFTER CALL TO PERSIST, WAITING
17:32:58 773.6941854953766 PERSIST COMPLETE, PLOTTING
Traceback (most recent call last):
File "big_image.py", line 99, in <module>
img_c = (img_p
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataarray.py", line 951, in compute
return new.load(**kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataarray.py", line 925, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataset.py", line 862, in load
evaluated_data = da.compute(*lazy_data.values(), **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/base.py", line 568, in compute
results = schedule(dsk, keys, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 2671, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 1948, in gather
return self.sync(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 845, in sync
return sync(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
raise exc.with_traceback(tb)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
result[0] = yield future
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 1813, in _gather
raise exception.with_traceback(traceback)
ValueError: Could not find dependent ('where-getitem-db29c7abdfe1267a3a46298a39ef2497', 2, 0, 0). Check worker logs
real 13m6.952s
user 1m57.003s
sys 0m4.014s
RUN 2 -- successful
(notebook) jovyan@jupyter-mike-40beller-2etech:~/github/carbon/msft$ time python big_image.py
17:34:32 1.9073486328125e-06 starting
17:34:44 11.641934633255005 DASHBOARD: https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/clusters/prod.ec4908e8f66c4ef78afd69b17e6ace5b/status
17:34:56 23.47306203842163 workers ready.
Returned 204 Items
Sentinel 2 Unique Tiles: ['35KNS' '35KNT' '35KNU' '35KPS' '35KPT' '35KPU' '35KQS' '35KQT' '35KQU'
'35KRS' '35KRT' '35KRU']
EPSGS: [32735]
/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/accumulate_metadata.py:151: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
props_arr = np.squeeze(np.array(props))
17:35:01 28.43052911758423 Full data size GB: 673.44177984
17:35:01 28.430601119995117 Dimensions: ('time', 'band', 'y', 'x')
17:35:01 29.03855848312378 Image size GB: 16.836044496
17:35:01 29.038624048233032 CALLING PERSIST
17:36:08 95.49182105064392 AFTER CALL TO PERSIST, WAITING
17:42:17 464.612357378006 PERSIST COMPLETE, PLOTTING
17:42:45 492.436571598053 AFTER PLOT
real 8m16.301s
user 2m2.096s
sys 0m8.124s
RUN 3 -- set changed size error
(notebook) jovyan@jupyter-mike-40beller-2etech:~/github/carbon/msft$ time python big_image.py
17:45:01 2.1457672119140625e-06 STARTING
17:45:12 10.9552001953125 DASHBOARD: https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/clusters/prod.45fe85a379c447da88f06e9e6c407135/status
17:45:25 23.309569597244263 WORKERS READY.
Returned 204 Items
Sentinel 2 Unique Tiles: ['35KNS' '35KNT' '35KNU' '35KPS' '35KPT' '35KPU' '35KQS' '35KQT' '35KQU'
'35KRS' '35KRT' '35KRU']
EPSGS: [32735]
/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/accumulate_metadata.py:151: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
props_arr = np.squeeze(np.array(props))
17:45:30 28.56306266784668 Full data size GB: 673.44177984
17:45:30 28.563136100769043 Dimensions: ('time', 'band', 'y', 'x')
17:45:31 29.170450448989868 Image size GB: 16.836044496
17:45:31 29.17051339149475 CALLING PERSIST
17:46:37 95.30292534828186 AFTER CALL TO PERSIST, WAITING
17:53:00 478.3570439815521 PERSIST COMPLETE, PLOTTING
Traceback (most recent call last):
File "big_image.py", line 94, in <module>
img_c = (img_p
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataarray.py", line 951, in compute
return new.load(**kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataarray.py", line 925, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataset.py", line 862, in load
evaluated_data = da.compute(*lazy_data.values(), **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/base.py", line 568, in compute
results = schedule(dsk, keys, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 2671, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 1948, in gather
return self.sync(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 845, in sync
return sync(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
raise exc.with_traceback(tb)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
result[0] = yield future
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 1813, in _gather
raise exception.with_traceback(traceback)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/optimization.py", line 969, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 151, in get
result = _execute_task(task, cache)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/to_dask.py", line 172, in fetch_raster_window
data = reader.read(current_window)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 423, in read
reader = self.dataset
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 419, in dataset
self._dataset = self._open()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 369, in _open
log_event("open_dataset_initial", dict(url=self.url))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 42, in log_event
worker.log_event(topic, dict(msg, thread=_curthread()))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 820, in log_event
self.batched_stream.send(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/batched.py", line 142, in send
self.waker.set()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/locks.py", line 224, in set
for fut in self._waiters:
RuntimeError: Set changed size during iteration
real 8m8.199s
user 1m53.300s
sys 0m3.607s