Room deletion (shutdown) fails in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294
Description
When the room deletion API is used to remove a large room (such as Matrix HQ) from the server, the purge can enter a constant fail-retry loop if it needs more than a few seconds to finish. Each attempt fails with "unable to serialize access", because on a running server the affected tables are constantly being read and modified by other concurrent transactions.
Steps to reproduce
- On a moderately busy server (e.g. one that participates in several moderately sized federated rooms), try to purge a large federated room using the delete room API.
- Observe that the process gets stuck, with "unable to serialize access" being reported in the logs. The behavior is also visible in PgHero: the same long-running query appears again and again at a regular interval, indicating that Synapse keeps retrying it.
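The fail-retry pattern described above can be illustrated with a small, database-free sketch. Everything here is hypothetical and only models the behavior: `SerializationFailure` stands in for whatever error the database driver raises, and `run_with_retries` is not Synapse's actual retry logic. The point is that a transaction which conflicts on every attempt never succeeds, no matter how often it is retried.

```python
class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver error behind 'unable to serialize access'."""


def run_with_retries(txn_func, max_attempts=5):
    """Run txn_func, retrying on SerializationFailure.

    Returns (result, attempts). Re-raises once max_attempts is exhausted.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return txn_func(), attempts
        except SerializationFailure:
            if attempts >= max_attempts:
                raise


def make_purge(conflicts_before_success):
    """Build a fake purge that conflicts a fixed number of times, then succeeds."""
    state = {"remaining": conflicts_before_success}

    def purge():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise SerializationFailure("unable to serialize access")
        return "purged"

    return purge
```

A short purge that conflicts only occasionally gets through after a retry or two. A purge that runs long enough to conflict with other writers on every attempt exhausts its retries, which matches the loop observed in the logs.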
Version information
- Version: 1.37.1
- Install method: pip
- Platform: Debian 10 "buster", in an LXC container
Notes
The error goes away if the isolation level of the room-purging transaction is lowered to READ COMMITTED, the lowest level. I am not sure whether this is correct, but I assume it should be fine, given that we are just deleting everything related to a room.
diff --git a/synapse/storage/databases/main/purge_events.py b/synapse/storage/databases/main/purge_events.py
index 7fb7780d0..2619a6602 100644
--- a/synapse/storage/databases/main/purge_events.py
+++ b/synapse/storage/databases/main/purge_events.py
@@ -313,6 +313,7 @@ class PurgeEventsStore(StateGroupWorkerStore, CacheInvalidationWorkerStore):
         )
     def _purge_room_txn(self, txn, room_id: str) -> List[int]:
+        txn.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
         # First we fetch all the state groups that should be deleted, before
         # we delete that information.
         txn.execute(
On a related note, is there a reason why the isolation level is set to REPEATABLE READ globally by default? Does Synapse really need REPEATABLE READ for every transaction?
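As a rough sketch of the per-transaction override used in the diff above, the helper below issues `SET TRANSACTION ISOLATION LEVEL` through any cursor-like object that has an `execute` method. The names `set_txn_isolation` and `RecordingCursor` are hypothetical and not part of Synapse. Note that PostgreSQL only accepts this statement before the first query or data-modifying statement of the transaction, so it has to run first.

```python
# Isolation levels that PostgreSQL implements as distinct behaviors.
VALID_LEVELS = ("READ COMMITTED", "REPEATABLE READ", "SERIALIZABLE")


def set_txn_isolation(txn, level):
    """Set the isolation level for the current transaction only.

    Must be called before any other statement runs in the transaction.
    `txn` is any object with an execute(sql) method (an assumption).
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported isolation level: {level!r}")
    # The level comes from a fixed whitelist, so formatting it into SQL is safe.
    txn.execute(f"SET TRANSACTION ISOLATION LEVEL {level}")


class RecordingCursor:
    """Hypothetical fake cursor that records the statements it receives."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)
```

With a real database cursor, `set_txn_isolation(txn, "READ COMMITTED")` would be the first call inside the purge transaction, followed by the existing queries. The setting applies to that transaction alone, so the global REPEATABLE READ default stays in place for everything else.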