
Commit 03acdec

Add snapshot configuration links and examples to cluster pages
We want to steer users towards always using a snapshot repository.
- expand Docker Compose example with Minio
- update all cluster references to link to the snapshots page
- add strong recommendation to always use snapshots for clusters
1 parent 4d81af6 commit 03acdec

File tree

5 files changed: +68 -44 lines changed

docs/deploy/server/cluster/deployment.mdx

Lines changed: 5 additions & 1 deletion
@@ -11,7 +11,7 @@ import Admonition from '@theme/Admonition';
 This page describes how you can deploy a distributed Restate cluster.
 
 <Admonition type="tip" title="Quickstart using Docker">
-Check out the [Restate cluster guide](/guides/cluster) for a docker-compose ready-made example.
+Check out the [Restate cluster guide](/guides/cluster) for a Docker Compose ready-made example.
 </Admonition>
 
 <Admonition type="tip" title="Migrating an existing single-node deployment">
@@ -24,6 +24,10 @@ This page describes how you can deploy a distributed Restate cluster.
 To understand the terminology used on this page, it might be helpful to read through the [architecture reference](/references/architecture).
 </Admonition>
 
+<Admonition type="caution">
+Snapshots are essential to support safe log trimming and also allow you to set partition replication to a subset of all cluster nodes, while still allowing for fast partition fail-over to any live node. Snapshots are also necessary to add more nodes in the future.
+</Admonition>
+
 To deploy a distributed Restate cluster without external dependencies, you need to configure the following settings in your [server configuration](/operate/configuration/server):
 
 ```toml restate.toml
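
For reference, here is a hedged sketch of what such a `restate.toml` could contain. The keys are inferred from the `RESTATE_*` environment variables in this commit's Docker Compose example (double underscores nest TOML sections); the hostnames are illustrative, and this is not the exact block elided by the diff above.

```toml
# Sketch only: inferred from the RESTATE_* env vars used elsewhere in this commit
cluster-name = "restate-cluster"
roles = ["admin", "worker", "log-server", "metadata-server"]

[bifrost]
default-provider = "replicated"

[bifrost.replicated-loglet]
default-log-replication = 2  # every log record must be stored on at least 2 nodes

[metadata-server]
type = "replicated"

[metadata-client]
addresses = ["http://restate-1:5122", "http://restate-2:5122", "http://restate-3:5122"]
```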

docs/deploy/server/cluster/growing-cluster.mdx

Lines changed: 1 addition & 1 deletion
@@ -15,7 +15,7 @@ This allows the new node to discover the metadata servers and join the cluster.
 <Admonition type="note" title="Growing the cluster in the future">
 If you plan to scale your cluster over time, we strongly recommend enabling snapshotting.
 Without it, newly added nodes may not be fully utilized by the system.
-See the [snapshotting documentation](/operate/data-backup#snapshotting) for more details.
+See the [snapshotting documentation](/operate/snapshots) for more details.
 </Admonition>
 
 <Admonition type="note" title="Shrinking the cluster">
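
As a concrete illustration of the admonition above, a node joining an already-provisioned cluster would be configured along these lines. This is a hedged sketch that mirrors the `RESTATE_AUTO_PROVISION: "false"` pattern used for `restate-2` and `restate-3` in the cluster guide; the node name and address are hypothetical.

```toml
# Hypothetical restate.toml for a fourth node joining the existing cluster
cluster-name = "restate-cluster"               # must match the running cluster
node-name = "restate-4"                        # illustrative
advertised-address = "http://restate-4:5122"   # illustrative; other nodes must be able to reach it
auto-provision = false                         # the cluster is already provisioned

[metadata-client]
addresses = ["http://restate-1:5122", "http://restate-2:5122", "http://restate-3:5122"]
```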

docs/guides/cluster.mdx

Lines changed: 55 additions & 39 deletions
@@ -19,88 +19,95 @@ This guide shows how to deploy a distributed Restate cluster consisting of 3 nodes
 
 <Step stepLabel="1" title="Deploy the Restate cluster using Docker">
 
-To deploy a 3 node distributed Restate cluster, copy the `docker-compose.yml` and run `docker compose up`.
+To deploy a 3-node distributed Restate cluster, create a file `docker-compose.yml` and run `mkdir restate-data object-store
+&& docker compose up`.
 
 ```yaml docker-compose.yml
-x-environment: &common-envs
-  RESTATE_CLUSTER_NAME: "my-cluster"
-  # In this setup every node fulfills every role.
-  RESTATE_ROLES: '["admin","worker","log-server","metadata-server"]'
-  # To customize logging, check https://docs.restate.dev/operate/monitoring/logging
+x-environment: &common-env
+  RESTATE_CLUSTER_NAME: "restate-cluster"
+  # Every node runs every role
+  RESTATE_ROLES: '["admin", "worker", "log-server", "metadata-server"]'
+  # For more on logging, see: https://docs.restate.dev/operate/monitoring/logging
   RESTATE_LOG_FILTER: "restate=info"
   RESTATE_BIFROST__DEFAULT_PROVIDER: "replicated"
-  RESTATE_BIFROST__REPLICATED_LOGLET__DEFAULT_LOG_REPLICATION: 2
+  RESTATE_BIFROST__REPLICATED_LOGLET__DEFAULT_LOG_REPLICATION: 2 # We require a minimum of 2 nodes to accept writes
   RESTATE_METADATA_SERVER__TYPE: "replicated"
-  # This needs to be configured with the hostnames/ports the nodes can use to talk to each other.
-  # In this setup, they interact within the "internal" Docker compose network setup.
+  # The addresses where nodes can reach each other over the "internal" Docker Compose network
   RESTATE_METADATA_CLIENT__ADDRESSES: '["http://restate-1:5122","http://restate-2:5122","http://restate-3:5122"]'
+  # Partition snapshotting, see: https://docs.restate.dev/operate/snapshots
+  RESTATE_WORKER__SNAPSHOTS__DESTINATION: "s3://restate/snapshots"
+  RESTATE_WORKER__SNAPSHOTS__SNAPSHOT_INTERVAL_NUM_RECORDS: "1000"
+  RESTATE_WORKER__SNAPSHOTS__AWS_REGION: "local"
+  RESTATE_WORKER__SNAPSHOTS__AWS_ENDPOINT_URL: "http://minio:9000"
+  RESTATE_WORKER__SNAPSHOTS__AWS_ALLOW_HTTP: true
+  RESTATE_WORKER__SNAPSHOTS__AWS_ACCESS_KEY_ID: "minioadmin"
+  RESTATE_WORKER__SNAPSHOTS__AWS_SECRET_ACCESS_KEY: "minioadmin"
+
+x-defaults: &defaults
+  image: docker.restate.dev/restatedev/restate:1.2
+  extra_hosts:
+    - "host.docker.internal:host-gateway"
 
 services:
   restate-1:
-    image: docker.restate.dev/restatedev/restate:1.2
+    <<: *defaults
     ports:
-      # Ingress port
-      - "8080:8080"
-      # Admin/UI port
-      - "9070:9070"
-      # Admin query port (psql)
-      - "9071:9071"
-      # Node port
-      - "5122:5122"
+      - "8080:8080" # Ingress
+      - "9070:9070" # Admin
+      - "5122:5122" # Node-to-node communication
     environment:
-      <<: *common-envs
+      <<: *common-env
       RESTATE_NODE_NAME: restate-1
       RESTATE_FORCE_NODE_ID: 1
-      # This needs to be configured with the hostname/port the other Restate nodes can use to talk to this node.
-      RESTATE_ADVERTISED_ADDRESS: "http://restate-1:5122"
-      # Only restate-1 provisions the cluster
-      RESTATE_AUTO_PROVISION: "true"
-    extra_hosts:
-      - "host.docker.internal:host-gateway"
+      RESTATE_ADVERTISED_ADDRESS: "http://restate-1:5122" # Other Restate nodes must be able to reach us using this address
+      RESTATE_AUTO_PROVISION: "true" # Only the first node provisions the cluster
 
   restate-2:
-    image: docker.restate.dev/restatedev/restate:1.2
+    <<: *defaults
     ports:
       - "25122:5122"
       - "29070:9070"
-      - "29071:9071"
       - "28080:8080"
     environment:
-      <<: *common-envs
+      <<: *common-env
       RESTATE_NODE_NAME: restate-2
       RESTATE_FORCE_NODE_ID: 2
       RESTATE_ADVERTISED_ADDRESS: "http://restate-2:5122"
-      # Only restate-1 provisions the cluster
       RESTATE_AUTO_PROVISION: "false"
-    extra_hosts:
-      - "host.docker.internal:host-gateway"
 
   restate-3:
-    image: docker.restate.dev/restatedev/restate:1.2
+    <<: *defaults
     ports:
       - "35122:5122"
       - "39070:9070"
-      - "39071:9071"
       - "38080:8080"
     environment:
-      <<: *common-envs
+      <<: *common-env
       RESTATE_NODE_NAME: restate-3
      RESTATE_FORCE_NODE_ID: 3
       RESTATE_ADVERTISED_ADDRESS: "http://restate-3:5122"
-      # Only restate-1 provisions the cluster
       RESTATE_AUTO_PROVISION: "false"
-    extra_hosts:
-      - "host.docker.internal:host-gateway"
+
+  minio:
+    image: quay.io/minio/minio
+    # volumes:
+    #   - object-store:/data
+    entrypoint: "/bin/sh"
+    # Ensure a bucket called "restate" exists on startup:
+    command: "-c 'mkdir -p /data/restate && /usr/bin/minio server --quiet /data'"
+    ports:
+      - "9000:9000"
 ```
 
-The cluster uses the `replicated` Bifrost provider and replicates data to 2 nodes.
+The cluster uses the `replicated` Bifrost provider and replicates log writes to a minimum of 2 nodes.
 Since we are running with 3 nodes, the cluster can tolerate 1 node failure without becoming unavailable.
+By default, partition state is replicated to all workers (though each partition has only one acting leader at a time).
 
 The `replicated` metadata cluster consists of all nodes since they all run the `metadata-server` role.
 Since the `replicated` metadata cluster requires a majority quorum to operate, the cluster can tolerate 1 node failure without becoming unavailable.
 
 Take a look at the [cluster deployment documentation](/deploy/server/cluster/deployment) for more information on how to configure and deploy a distributed Restate cluster.
-
+In this example we also deployed a Minio server to host the cluster snapshots bucket. Visit [Snapshots](/operate/snapshots) to learn more about why this is strongly recommended for all clusters.
 </Step>
 
 <Step stepLabel="2" title="Check the cluster status">
@@ -143,10 +150,19 @@ Take a look at the [cluster deployment documentation](/deploy/server/cluster/deployment)
 ```
 </Step>
 
+<Step stepLabel="7" title="Create snapshots">
+Try instructing the partition processors to create a snapshot of their state in the object store bucket:
+```shell
+docker compose exec restate-1 restatectl snapshot create
+```
+Navigate to the Minio console at [http://localhost:9000](http://localhost:9000) and browse the bucket contents (default credentials: `minioadmin`/`minioadmin`).
+</Step>
+
 <Step end={true} stepLabel="🎉" title="Congratulations, you managed to run your first distributed Restate cluster and simulated some failures!"/>
 
 
 Here are some next steps for you to try:
 
 - Try to configure a 5 server Restate cluster that can tolerate up to 2 server failures.
+- Trim the logs (either manually, or by setting up automatic trimming) _before_ adding more nodes.
 - Try to deploy a 3 server Restate cluster using Kubernetes.
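
The snapshots created in step 7 can also be inspected without the Minio console, using any S3-compatible client pointed at the mapped port. For example, assuming the AWS CLI is installed locally and using the credentials from the compose file above:

```shell
# List the snapshot objects written by `restatectl snapshot create`
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_DEFAULT_REGION=local
aws --endpoint-url http://localhost:9000 s3 ls s3://restate/snapshots/ --recursive
```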

docs/guides/local-to-replicated.mdx

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ Once you restart your Restate server, it will start using the replicated metadata
 type = "replicated"
 ```
 
-If you plan to extend your single-node deployment to a multi-node deployment, you also need to [configure the snapshot repository](/operate/data-backup#snapshotting).
+If you plan to extend your single-node deployment to a multi-node deployment, you also need to [configure the snapshot repository](/operate/snapshots).
 This allows new nodes to join the cluster by restoring the latest snapshot.
 
 ```toml restate.toml
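
The repository configuration block itself is truncated in the diff. As a minimal hedged sketch (the bucket name is illustrative, and the keys are inferred from the `RESTATE_WORKER__SNAPSHOTS__*` variables added in the cluster guide), it would look roughly like:

```toml
[worker.snapshots]
destination = "s3://my-snapshots-bucket/my-cluster"  # illustrative bucket and prefix
snapshot-interval-num-records = 1000                 # publish a snapshot every 1000 records
```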

docs/operate/snapshots.mdx

Lines changed: 6 additions & 2 deletions
@@ -11,12 +11,16 @@ import Admonition from '@theme/Admonition';
 This page covers configuring a Restate cluster to share partition snapshots for fast fail-over and bootstrapping new nodes. For backup of Restate nodes, see [Data Backup](/operate/data-backup).
 </Admonition>
 
-Restate workers can be configured to periodically publish snapshots of their partition state to a shared destination. Snapshots are not necessarily backups. Rather, snapshots allow nodes that had not previously served a partition to bootstrap a copy of its state. Without snapshots, placing a partition processor on a node that wasn't previously a follower would require the full replay of that partition's log. Replaying the log might take a long time - and is impossible if the log gets trimmed.
-
 <Admonition type="note" title="Architectural overview">
 To understand the terminology used on this page, it might be helpful to read through the [architecture reference](/references/architecture).
 </Admonition>
 
+<Admonition type="caution">
+Snapshots are essential to support safe log trimming and also allow you to set partition replication to a subset of all cluster nodes, while still allowing for fast partition fail-over to any live node. Snapshots are also necessary to add more nodes in the future.
+</Admonition>
+
+Restate workers can be configured to periodically publish snapshots of their partition state to a shared destination. Snapshots are not necessarily backups. Rather, snapshots allow nodes that had not previously served a partition to bootstrap a copy of its state. Without snapshots, placing a partition processor on a node that wasn't previously a follower would require the full replay of that partition's log. Replaying the log might take a long time - and is impossible if the log gets trimmed.
+
 ## Configuring Snapshots
 Restate clusters should always be configured with a snapshot repository to allow nodes to efficiently share partition state, and for new nodes to be added to the cluster in the future.
 Restate currently supports using Amazon S3 (or an API-compatible object store) as a shared snapshot repository.
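
For an API-compatible store such as the Minio container used in the cluster guide, the environment variables added in this commit translate to TOML roughly as follows. This is a hedged sketch; the endpoint and credentials are taken from that example, and the kebab-case keys are inferred from the env-var names.

```toml
[worker.snapshots]
destination = "s3://restate/snapshots"
snapshot-interval-num-records = 1000
# Overrides for an S3-compatible object store (values from the guide's Minio setup)
aws-region = "local"
aws-endpoint-url = "http://minio:9000"
aws-allow-http = true
aws-access-key-id = "minioadmin"
aws-secret-access-key = "minioadmin"
```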
