Description
I am working on creating a mosaic of Sentinel-2 data for a sizable area of interest -- around 600km by 400km. The goal is to get a median of the non-cloudy pixels over a 2-month period. My experience is that, with the code I've developed, this isn't reliably doable even on a 400-core cluster.
To dig into this, I have a reproducible example over a smaller area -- 1/16 of the above (roughly 150km x 100km) -- running on a 100-core cluster. It succeeds in about 1 out of 3 runs, otherwise dying with various errors. Below is the code, followed by the output of 3 runs: the first died with a "could not find dependent" error, the second succeeded, and the third died with "set changed size during iteration". (Note, regarding my previously filed issue #11: this is with a static cluster size, where I wait for all workers before proceeding.)
Just to clarify why the code is organized the way it is: I take the median over time of each band after NaN-ing out the pixels that the Sentinel-2 SCL band identifies as clouds. In other words, I'm doing pixel-by-pixel cloud removal rather than filtering the STAC query on overall scene cloud cover, in order to maximize the information I can get from the tiles. This is probably contributing to the complexity, but it also seems like a very reasonable thing to do.
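In isolation, that masking-and-median step amounts to the following (just a sketch pulled out for clarity; `data` and `rgb_bands` are the objects built in the full script below, and 8/9/10 are the SCL classes for cloud medium probability, cloud high probability, and thin cirrus):

cloud_scl_values = [8, 9, 10]                        # SCL cloud classes
mask = ~data.sel(band='SCL').isin(cloud_scl_values)  # True where a pixel is not flagged as cloud
img = data.sel(band=rgb_bands).where(mask).median(dim='time')  # per-pixel temporal median of the non-cloudy values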
My questions are:
- Is the code inefficiently organized? Is there an obviously better way to do it? Can the where/isin/median processing be merged into the stack call (even though the isin uses the SCL band to mask pixels in the other bands)?
- Why are these various "missing dependent" / "set changed size" errors happening? Am I trying to do too much with too little worker memory? Is there a rule of thumb that would tell me how many workers I need for a particular computation? The full ~670 GB cube should roughly fit in the aggregate memory of the workers -- that's the rule of thumb I used to arrive at 100 cores (see the back-of-envelope sketch after these questions).
Side note regarding the scheduler: for larger areas (e.g. the full 600km x 400km) I can't even reliably get the computation going, even with 400 cores. I think it's overwhelming the scheduler. Should I try to (and can I) increase the scheduler's memory somehow?
Or perhaps the real problem is that the computation is simply out of scope unless it's done in a more algorithmically aligned manner.
Very interested in your thoughts.
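For reference, the back-of-envelope behind the 100-worker figure (the cube size is the number reported in the run logs below; the per-worker share is just the implied division, not a measured or configured limit):

cube_gb = 673.4   # "Full data size GB" reported by the script
nworkers = 100
print(f"{cube_gb / nworkers:.1f} GB of cube data per worker")  # ~6.7 GB, before any task intermediates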
import numpy as np
import dask
import matplotlib.pyplot as plt
from pystac_client import Client as pystac_client
#from pystac.extensions.eo import EOExtension as eo
import planetary_computer as pc
import stackstac
from dask.distributed import LocalCluster, Client, wait
import time
start_time = time.time()
def log(*args):
    el = time.time() - start_time
    current_time = time.strftime("%H:%M:%S", time.localtime())
    print(current_time, el, *args, flush=True)
log("STARTING")
import dask_gateway
nworkers = 100
cluster = dask_gateway.GatewayCluster()
client = cluster.get_client()
cluster.scale(nworkers)
log("DASHBOARD:", cluster.dashboard_link)
client.wait_for_workers(nworkers)
log("WORKERS READY.")
aoi_bounds_zim = (27.11764655, -21.18948202, 32.8563901, -16.99215633) # big chunk of zimbabwe
(xmn,ymn,xmx,ymx) = aoi_bounds_zim
aoi_bounds_half = (xmn, ymn, (xmn+xmx)/2, (ymn+ymx)/2) # cut it in half in each dimension (/4 in area)
aoi_bounds_quarter = (xmn, ymn, (3*xmn+xmx)/4, (3*ymn+ymx)/4) # cut it in quarter in each dimension (/16 in area)
#aoi_bounds_tiny = (xmn, ymn, xmn+0.1, ymn+0.1)
aoi_bounds = aoi_bounds_quarter
date_range = "2020-01-01/2020-03-01"
all_bands = ['B02', 'B03', 'B04', 'B08', 'SCL']
rgb_bands = ['B04', 'B03', 'B02']
catalog = pystac_client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=aoi_bounds,
    datetime=date_range,
    #query={"eo:cloud_cover": {"lt": 10}},
)
# Check how many items were returned
items = list(search.get_items())
print(f"Returned {len(items)} Items")
# Note that there are more than 1 tile here:
print("Sentinel 2 Unique Tiles:", np.unique([i.properties['s2:mgrs_tile'] for i in items]))
# Print unique CRSs:
print("EPSGS:", np.unique([i.properties['proj:epsg'] for i in items]))
# Stack it
epsg_int = 4326 # because we (may) have multiple UTM zones, stick with lat/long
chunksize = 2048
signed_items = [pc.sign(item).to_dict() for item in search.get_items()]
data = (
    stackstac.stack(
        signed_items,
        epsg=epsg_int,
        #resolution=10,
        #bounds=bounds,
        bounds_latlon=aoi_bounds,
        assets=all_bands,
        chunksize=chunksize,
    )
    .where(lambda x: x > 0, other=np.nan)  # Sentinel-2 uses 0 as nodata
    .assign_coords(time=lambda x: x.time.dt.round("D"))  # round time to daily for nicer plot labels
    .groupby('time')  # one tile per datetime
    .apply(stackstac.mosaic)  # mosaic together tiles for the same datetime
)
log("Full data size GB:", data.nbytes/1e9)
log("Dimensions:", data.dims)
cloud_scl_values = [8, 9, 10]  # Sentinel-2 SCL cloud classes: medium prob., high prob., thin cirrus
mask = ~data.sel(band='SCL').isin(cloud_scl_values)
img = (data.sel(band=rgb_bands)
       .where(mask)
       .median(dim='time'))
log("Image size GB:", img.nbytes/1e9)
log("CALLING PERSIST")
img_p = img.persist()
log("AFTER CALL TO PERSIST, WAITING")
wait(img_p)
log("PERSIST COMPLETE, PLOTTING")
SCALE = 4
img_c = (img_p
         .coarsen({'x': SCALE, 'y': SCALE}, boundary='pad')
         .mean(skipna=True, keep_attrs=True)
         .compute())
fig,ax = plt.subplots(1,1)
img_c.plot.imshow(ax=ax, x='x', y='y', robust=True)
fig.savefig("foo.png")
plt.show()
log("AFTER PLOT")
cluster.close()
RUN 1 -- missing dependent
17:20:04 9.5367431640625e-07 starting
17:20:17 13.552658557891846 DASHBOARD: https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/clusters/prod.cb21467cb71b46dc9aa9634aa6e82591/status
17:25:24 319.86250376701355 workers ready.
Returned 204 Items
Sentinel 2 Unique Tiles: ['35KNS' '35KNT' '35KNU' '35KPS' '35KPT' '35KPU' '35KQS' '35KQT' '35KQU'
'35KRS' '35KRT' '35KRU']
EPSGS: [32735]
/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/accumulate_metadata.py:151: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
props_arr = np.squeeze(np.array(props))
17:25:28 323.98693227767944 Full data size GB: 673.44177984
17:25:28 323.9869964122772 Dimensions: ('time', 'band', 'y', 'x')
17:25:29 324.8450291156769 Image size GB: 16.836044496
17:25:29 324.84509015083313 CALLING PERSIST
17:26:34 390.26140332221985 AFTER CALL TO PERSIST, WAITING
17:32:58 773.6941854953766 PERSIST COMPLETE, PLOTTING
Traceback (most recent call last):
File "big_image.py", line 99, in <module>
img_c = (img_p
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataarray.py", line 951, in compute
return new.load(**kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataarray.py", line 925, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataset.py", line 862, in load
evaluated_data = da.compute(*lazy_data.values(), **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/base.py", line 568, in compute
results = schedule(dsk, keys, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 2671, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 1948, in gather
return self.sync(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 845, in sync
return sync(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
raise exc.with_traceback(tb)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
result[0] = yield future
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 1813, in _gather
raise exception.with_traceback(traceback)
ValueError: Could not find dependent ('where-getitem-db29c7abdfe1267a3a46298a39ef2497', 2, 0, 0). Check worker logs
real 13m6.952s
user 1m57.003s
sys 0m4.014s
RUN 2 -- successful
(notebook) jovyan@jupyter-mike-40beller-2etech:~/github/carbon/msft$ time python big_image.py
17:34:32 1.9073486328125e-06 starting
17:34:44 11.641934633255005 DASHBOARD: https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/clusters/prod.ec4908e8f66c4ef78afd69b17e6ace5b/status
17:34:56 23.47306203842163 workers ready.
Returned 204 Items
Sentinel 2 Unique Tiles: ['35KNS' '35KNT' '35KNU' '35KPS' '35KPT' '35KPU' '35KQS' '35KQT' '35KQU'
'35KRS' '35KRT' '35KRU']
EPSGS: [32735]
/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/accumulate_metadata.py:151: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
props_arr = np.squeeze(np.array(props))
17:35:01 28.43052911758423 Full data size GB: 673.44177984
17:35:01 28.430601119995117 Dimensions: ('time', 'band', 'y', 'x')
17:35:01 29.03855848312378 Image size GB: 16.836044496
17:35:01 29.038624048233032 CALLING PERSIST
17:36:08 95.49182105064392 AFTER CALL TO PERSIST, WAITING
17:42:17 464.612357378006 PERSIST COMPLETE, PLOTTING
17:42:45 492.436571598053 AFTER PLOT
real 8m16.301s
user 2m2.096s
sys 0m8.124s
RUN 3 -- set changed size error
(notebook) jovyan@jupyter-mike-40beller-2etech:~/github/carbon/msft$ time python big_image.py
17:45:01 2.1457672119140625e-06 STARTING
17:45:12 10.9552001953125 DASHBOARD: https://pccompute.westeurope.cloudapp.azure.com/compute/services/dask-gateway/clusters/prod.45fe85a379c447da88f06e9e6c407135/status
17:45:25 23.309569597244263 WORKERS READY.
Returned 204 Items
Sentinel 2 Unique Tiles: ['35KNS' '35KNT' '35KNU' '35KPS' '35KPT' '35KPU' '35KQS' '35KQT' '35KQU'
'35KRS' '35KRT' '35KRU']
EPSGS: [32735]
/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/accumulate_metadata.py:151: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
props_arr = np.squeeze(np.array(props))
17:45:30 28.56306266784668 Full data size GB: 673.44177984
17:45:30 28.563136100769043 Dimensions: ('time', 'band', 'y', 'x')
17:45:31 29.170450448989868 Image size GB: 16.836044496
17:45:31 29.17051339149475 CALLING PERSIST
17:46:37 95.30292534828186 AFTER CALL TO PERSIST, WAITING
17:53:00 478.3570439815521 PERSIST COMPLETE, PLOTTING
Traceback (most recent call last):
File "big_image.py", line 94, in <module>
img_c = (img_p
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataarray.py", line 951, in compute
return new.load(**kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataarray.py", line 925, in load
ds = self._to_temp_dataset().load(**kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/xarray/core/dataset.py", line 862, in load
evaluated_data = da.compute(*lazy_data.values(), **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/base.py", line 568, in compute
results = schedule(dsk, keys, **kwargs)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 2671, in get
results = self.gather(packed, asynchronous=asynchronous, direct=direct)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 1948, in gather
return self.sync(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 845, in sync
return sync(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py", line 326, in sync
raise exc.with_traceback(tb)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/utils.py", line 309, in f
result[0] = yield future
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
value = future.result()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/client.py", line 1813, in _gather
raise exception.with_traceback(traceback)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/optimization.py", line 969, in __call__
return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 151, in get
result = _execute_task(task, cache)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/dask/core.py", line 121, in _execute_task
return func(*(_execute_task(a, cache) for a in args))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/to_dask.py", line 172, in fetch_raster_window
data = reader.read(current_window)
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 423, in read
reader = self.dataset
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 419, in dataset
self._dataset = self._open()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 369, in _open
log_event("open_dataset_initial", dict(url=self.url))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/stackstac/rio_reader.py", line 42, in log_event
worker.log_event(topic, dict(msg, thread=_curthread()))
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/worker.py", line 820, in log_event
self.batched_stream.send(
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/distributed/batched.py", line 142, in send
self.waker.set()
File "/srv/conda/envs/notebook/lib/python3.8/site-packages/tornado/locks.py", line 224, in set
for fut in self._waiters:
RuntimeError: Set changed size during iteration
real 8m8.199s
user 1m53.300s
sys 0m3.607s