Skip to content
This repository was archived by the owner on Apr 26, 2024. It is now read-only.

Commit 40bb37e

Browse files
Stop getting missing prev_events after we already know their signature is invalid (#13816)
While #13635 stops us from doing the slow thing after we've already done it once, this PR stops us from doing one of the slow things in the first place. Related to - #13622 - #13635 - #13676 Part of #13356 Follow-up to #13815 which tracks event signature failures. With this PR, we avoid the call to the costly `_get_state_ids_after_missing_prev_event` because the signature failure will count as an attempt before and we filter events based on the backoff before calling `_get_state_ids_after_missing_prev_event` now. For example, this will save us 156s out of the 185s total that this `matrix.org` `/messages` request. If you want to see the full Jaeger trace of this, you can drag and drop this `trace.json` into your own Jaeger, https://gist.github.com/MadLittleMods/4b12d0d0afe88c2f65ffcc907306b761 To explain this exact scenario around `/messages` -> backfill, we call `/backfill` and first check the signatures of the 100 events. We see bad signature for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` and `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` (both member events). Then we process the 98 events remaining that have valid signatures but one of the events references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event`. So we have to do the whole `_get_state_ids_after_missing_prev_event` rigmarole which pulls in those same events which fail again because the signatures are still invalid. - `backfill` - `outgoing-federation-request` `/backfill` - `_check_sigs_and_hash_and_fetch` - `_check_sigs_and_hash_and_fetch_one` for each event received over backfill - ❗ `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)` - ❗ `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` fails with `Signature on retrieved event was invalid.`: `unable to verify signature for sender domain xxx: 401: Failed to find any key to satisfy: _FetchKeyRequest(...)` - `_process_pulled_events` - `_process_pulled_event` for each validated event - ❗ Event `$Q0iMdqtz3IJYfZQU2Xk2WjB5NDF8Gg8cFSYYyKQgKJ0` references `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` as a `prev_event` which is missing so we try to get it - `_get_state_ids_after_missing_prev_event` - `outgoing-federation-request` `/state_ids` - ❗ `get_pdu` for `$luA4l7QHhf_jadH3mI-AyFqho0U2Q-IXXUbGSMq6h6M` which fails the signature check again - ❗ `get_pdu` for `$zuOn2Rd2vsC7SUia3Hp3r6JSkSFKcc5j3QTTqW_0jDw` which fails the signature check
1 parent 616dcc1 commit 40bb37e

File tree

7 files changed

+386
-2
lines changed

7 files changed

+386
-2
lines changed

changelog.d/13816.feature

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1 @@
1+
Stop fetching missing `prev_events` after we already know their signature is invalid.

synapse/api/errors.py

Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -640,6 +640,27 @@ def get_dict(self) -> "JsonDict":
640640
}
641641

642642

643+
class FederationPullAttemptBackoffError(RuntimeError):
644+
"""
645+
Raised to indicate that we are are deliberately not attempting to pull the given
646+
event over federation because we've already done so recently and are backing off.
647+
648+
Attributes:
649+
event_id: The event_id which we are refusing to pull
650+
message: A custom error message that gives more context
651+
"""
652+
653+
def __init__(self, event_ids: List[str], message: Optional[str]):
654+
self.event_ids = event_ids
655+
656+
if message:
657+
error_message = message
658+
else:
659+
error_message = f"Not attempting to pull event_ids={self.event_ids} because we already tried to pull them recently (backing off)."
660+
661+
super().__init__(error_message)
662+
663+
643664
class HttpResponseException(CodeMessageException):
644665
"""
645666
Represents an HTTP-level failure of an outbound request

synapse/handlers/federation.py

Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -45,6 +45,7 @@
4545
Codes,
4646
FederationDeniedError,
4747
FederationError,
48+
FederationPullAttemptBackoffError,
4849
HttpResponseException,
4950
LimitExceededError,
5051
NotFoundError,
@@ -1720,7 +1721,22 @@ async def _sync_partial_state_room(
17201721
destination, event
17211722
)
17221723
break
1724+
except FederationPullAttemptBackoffError as exc:
1725+
# Log a warning about why we failed to process the event (the error message
1726+
# for `FederationPullAttemptBackoffError` is pretty good)
1727+
logger.warning("_sync_partial_state_room: %s", exc)
1728+
# We do not record a failed pull attempt when we backoff fetching a missing
1729+
# `prev_event` because not being able to fetch the `prev_events` just means
1730+
# we won't be able to de-outlier the pulled event. But we can still use an
1731+
# `outlier` in the state/auth chain for another event. So we shouldn't stop
1732+
# a downstream event from trying to pull it.
1733+
#
1734+
# This avoids a cascade of backoff for all events in the DAG downstream from
1735+
# one event backoff upstream.
17231736
except FederationError as e:
1737+
# TODO: We should `record_event_failed_pull_attempt` here,
1738+
# see https://github.com/matrix-org/synapse/issues/13700
1739+
17241740
if attempt == len(destinations) - 1:
17251741
# We have tried every remote server for this event. Give up.
17261742
# TODO(faster_joins) giving up isn't the right thing to do

synapse/handlers/federation_event.py

Lines changed: 31 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -44,6 +44,7 @@
4444
AuthError,
4545
Codes,
4646
FederationError,
47+
FederationPullAttemptBackoffError,
4748
HttpResponseException,
4849
RequestSendFailed,
4950
SynapseError,
@@ -567,6 +568,9 @@ async def update_state_for_partial_state_event(
567568
event: partial-state event to be de-partial-stated
568569
569570
Raises:
571+
FederationPullAttemptBackoffError if we are are deliberately not attempting
572+
to pull the given event over federation because we've already done so
573+
recently and are backing off.
570574
FederationError if we fail to request state from the remote server.
571575
"""
572576
logger.info("Updating state for %s", event.event_id)
@@ -901,6 +905,18 @@ async def _process_pulled_event(
901905
context,
902906
backfilled=backfilled,
903907
)
908+
except FederationPullAttemptBackoffError as exc:
909+
# Log a warning about why we failed to process the event (the error message
910+
# for `FederationPullAttemptBackoffError` is pretty good)
911+
logger.warning("_process_pulled_event: %s", exc)
912+
# We do not record a failed pull attempt when we backoff fetching a missing
913+
# `prev_event` because not being able to fetch the `prev_events` just means
914+
# we won't be able to de-outlier the pulled event. But we can still use an
915+
# `outlier` in the state/auth chain for another event. So we shouldn't stop
916+
# a downstream event from trying to pull it.
917+
#
918+
# This avoids a cascade of backoff for all events in the DAG downstream from
919+
# one event backoff upstream.
904920
except FederationError as e:
905921
await self._store.record_event_failed_pull_attempt(
906922
event.room_id, event_id, str(e)
@@ -947,6 +963,9 @@ async def _compute_event_context_with_maybe_missing_prevs(
947963
The event context.
948964
949965
Raises:
966+
FederationPullAttemptBackoffError if we are are deliberately not attempting
967+
to pull the given event over federation because we've already done so
968+
recently and are backing off.
950969
FederationError if we fail to get the state from the remote server after any
951970
missing `prev_event`s.
952971
"""
@@ -957,6 +976,18 @@ async def _compute_event_context_with_maybe_missing_prevs(
957976
seen = await self._store.have_events_in_timeline(prevs)
958977
missing_prevs = prevs - seen
959978

979+
# If we've already recently attempted to pull this missing event, don't
980+
# try it again so soon. Since we have to fetch all of the prev_events, we can
981+
# bail early here if we find any to ignore.
982+
prevs_to_ignore = await self._store.get_event_ids_to_not_pull_from_backoff(
983+
room_id, missing_prevs
984+
)
985+
if len(prevs_to_ignore) > 0:
986+
raise FederationPullAttemptBackoffError(
987+
event_ids=prevs_to_ignore,
988+
message=f"While computing context for event={event_id}, not attempting to pull missing prev_event={prevs_to_ignore[0]} because we already tried to pull recently (backing off).",
989+
)
990+
960991
if not missing_prevs:
961992
return await self._state_handler.compute_event_context(event)
962993

synapse/storage/databases/main/event_federation.py

Lines changed: 54 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1501,6 +1501,12 @@ async def record_event_failed_pull_attempt(
15011501
event_id: The event that failed to be fetched or processed
15021502
cause: The error message or reason that we failed to pull the event
15031503
"""
1504+
logger.debug(
1505+
"record_event_failed_pull_attempt room_id=%s, event_id=%s, cause=%s",
1506+
room_id,
1507+
event_id,
1508+
cause,
1509+
)
15041510
await self.db_pool.runInteraction(
15051511
"record_event_failed_pull_attempt",
15061512
self._record_event_failed_pull_attempt_upsert_txn,
@@ -1530,6 +1536,54 @@ def _record_event_failed_pull_attempt_upsert_txn(
15301536

15311537
txn.execute(sql, (room_id, event_id, 1, self._clock.time_msec(), cause))
15321538

1539+
@trace
1540+
async def get_event_ids_to_not_pull_from_backoff(
1541+
self,
1542+
room_id: str,
1543+
event_ids: Collection[str],
1544+
) -> List[str]:
1545+
"""
1546+
Filter down the events to ones that we've failed to pull before recently. Uses
1547+
exponential backoff.
1548+
1549+
Args:
1550+
room_id: The room that the events belong to
1551+
event_ids: A list of events to filter down
1552+
1553+
Returns:
1554+
List of event_ids that should not be attempted to be pulled
1555+
"""
1556+
event_failed_pull_attempts = await self.db_pool.simple_select_many_batch(
1557+
table="event_failed_pull_attempts",
1558+
column="event_id",
1559+
iterable=event_ids,
1560+
keyvalues={},
1561+
retcols=(
1562+
"event_id",
1563+
"last_attempt_ts",
1564+
"num_attempts",
1565+
),
1566+
desc="get_event_ids_to_not_pull_from_backoff",
1567+
)
1568+
1569+
current_time = self._clock.time_msec()
1570+
return [
1571+
event_failed_pull_attempt["event_id"]
1572+
for event_failed_pull_attempt in event_failed_pull_attempts
1573+
# Exponential back-off (up to the upper bound) so we don't try to
1574+
# pull the same event over and over. ex. 2hr, 4hr, 8hr, 16hr, etc.
1575+
if current_time
1576+
< event_failed_pull_attempt["last_attempt_ts"]
1577+
+ (
1578+
2
1579+
** min(
1580+
event_failed_pull_attempt["num_attempts"],
1581+
BACKFILL_EVENT_EXPONENTIAL_BACKOFF_MAXIMUM_DOUBLING_STEPS,
1582+
)
1583+
)
1584+
* BACKFILL_EVENT_EXPONENTIAL_BACKOFF_STEP_MILLISECONDS
1585+
]
1586+
15331587
async def get_missing_events(
15341588
self,
15351589
room_id: str,

0 commit comments

Comments
 (0)