
qsync testing plan


Current state of the RFC as of 15.06.2020: https://github.com/tarantool/tarantool/commit/a0236e5891f97426a62634557560c4adf32fc967

Bugs

1st iteration

  • [RFC, summary] switch async replicas into sync ones and vice versa, expected success and data consistency on the leader and replicas (see the is_sync sketch after this list)
  • [RFC, summary] switch from leader to replica and vice versa, expected success and data consistency on the leader and replicas (see the promote/demote sketch after this list)
  • [RFC, quorum commit] happy path: write/read data on the leader of a sync cluster, expected data consistency on the leader and replicas (see the quorum sketch after this list)
  • happy path: read/write data in a sync cluster with the maximum allowed number of replicas, expected success and data consistency on the leader and replicas
  • [RFC, quorum commit] no quorum achieved, expected transaction rollback and data consistency on the leader and replicas
  • [RFC, quorum commit] check behaviour with no answer from a replica during a write; a failure answer is expected
  • [RFC, quorum commit] check behaviour with a failure answer from a replica during a write; the replica is expected to be disconnected from replication
  • [RFC, quorum commit] write multiple transactions; when the quorum is achieved, they are expected to commit in the same order as issued by the client
  • [RFC, quorum commit] write multiple transactions; the latest transaction that collects the quorum is expected to be considered complete, as well as all transactions prior to it
  • [RFC, quorum commit] failure while confirming a transaction on the leader, expected rollback and data consistency on the leader and replicas
  • a leader got a quorum, but a replica that participated in the quorum leaves the cluster right after answering the leader, expected (TBD)
  • [RFC, quorum commit] check the case when a write reached the WAL and SUCCESS was sent in reply, but the WAL was lost afterwards
  • read the rollback code ("guarantee of rollback on leader and sync replicas")
  • consistency on replicas on enabling and disabling sync replication
  • [RFC, connection liveness] replication_connect_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_sync_lag works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_sync_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_synchro_timeout works as expected with a sync cluster
  • [RFC, connection liveness] replication_synchro_quorum works as expected with a sync cluster (all six options are gathered in the box.cfg sketch after this list)
  • [RFC, connection liveness] when the leader gets no response from a replica for another heartbeat interval, it should consider the replica lost
  • [RFC, connection liveness] when the leader finds itself without enough replicas to achieve the quorum, it should stop accepting write requests
  • [RFC, connection liveness] a leader that has stopped accepting write requests can be switched back to write mode when the cluster configuration is updated
  • [RFC, connection liveness] some replicas become unavailable during quorum collection; the leader is expected to wait at most replication_synchro_quorum_timeout, after which it issues a rollback pointing to the oldest TXN in the waiting list
  • test with a leader and a single replica in a cluster, expected ??? (TBD)
  • [RFC, Leader role assignment] promote a replica to a leader manually with unconfirmed transactions; expected success, the leader switched to rw mode, and "all transactions that have no corresponding confirm message should be confirmed" (see the leader role assignment and the promote/demote sketch after this list)
  • [RFC, Leader role assignment] promote a replica to a leader manually; expected success, the leader switched to rw mode, and it "writes CURRENT_LEADER_ID as a FORMER_LEADER_ID in the _voting space and puts its ID as a CURRENT_LEADER_ID"
  • [RFC, Leader role assignment] demote a leader manually; expected success, the leader switched to ro mode, and "The Leader has to switch to ro mode and wait until its undo log is empty. This effectively means all transactions are committed in the cluster and it is safe to pass the leadership. Then it should write CURRENT_LEADER_ID as a FORMER_LEADER_ID and set CURRENT_LEADER_ID to 0."
  • [RFC, Recovery and failover] "In case there's not enough replicas to set up a quorum the cluster can be switched into a read-only mode."
  • [RFC, Recovery and failover] TODO
  • [RFC, Snapshot generation] all txns confirmed, then snapshot, expected success (check both master and replica; see the snapshot sketch after this list)
  • [RFC, Snapshot generation] snapshot started, then confirm arrived, expected success (check both master and replica)
  • [RFC, Snapshot generation] snapshot started, then rollback arrived, expected snapshot abort (check both master and replica)
  • [RFC, Snapshot generation] successful snapshot contains all txns created before LSN that was latest when snapshot creation started (check both master and replica)
  • [RFC, Asynchronous replication] successful transaction applied on async replica
  • [RFC, Asynchronous replication] failed transaction rolled back on async replica
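
A few sketches of the cases above follow. This first one covers the async/sync switch: per the RFC, synchronicity is a per-space property, so the switch is an alter of the is_sync option (assuming the space-level alter accepts is_sync, as the RFC proposes); data consistency is then checked on the leader and the replicas.

```lua
-- async -> sync -> async switch of a single space; run on the leader
box.schema.space.create('test', {is_sync = false})
box.space.test:create_index('pk')

box.space.test:alter({is_sync = true})    -- async -> sync
box.space.test:insert{1}                  -- now waits for the quorum
box.space.test:alter({is_sync = false})   -- sync -> async
box.space.test:insert{2}                  -- a plain asynchronous write
```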
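The quorum sketch, assuming the RFC option names replication_synchro_quorum and replication_synchro_timeout, exercises the happy path and the no-quorum rollback on the leader of a three-node cluster:

```lua
box.cfg{
    replication_synchro_quorum  = 2,  -- the leader plus one replica
    replication_synchro_timeout = 5,  -- seconds to wait for the quorum
}
local s = box.schema.space.create('sync_test', {is_sync = true})
s:create_index('pk')

-- happy path: insert() returns only after the quorum has confirmed the write
s:insert{1, 'committed'}

-- no quorum: with the quorum raised above the cluster size the write must
-- time out and be rolled back on the leader and the replicas
box.cfg{replication_synchro_quorum = 4}
local ok = pcall(s.insert, s, {2, 'rolled back'})
assert(not ok)
```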
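The box.cfg sketch gathers the connection-liveness options named above in one place; the values are illustrative only, and each test would vary one of them and check the documented behaviour:

```lua
box.cfg{
    replication_connect_timeout = 30,  -- how long to wait for peers on bootstrap
    replication_sync_lag        = 10,  -- max lag for a replica to count as synced
    replication_sync_timeout    = 300, -- how long to wait to reach the synced state
    replication_timeout         = 1,   -- heartbeat interval for liveness detection
    replication_synchro_quorum  = 2,   -- quorum size for synchronous transactions
    replication_synchro_timeout = 5,   -- how long a sync txn may collect the quorum
}
```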
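The promote/demote sketch assumes the RFC-era model where the _voting space is not implemented yet and leadership is expressed through read_only alone; wait_until_undo_log_is_empty() is a hypothetical test helper standing in for the RFC requirement that the old leader waits until its undo log is empty before passing leadership:

```lua
-- on the current leader: demote
box.cfg{read_only = true}
wait_until_undo_log_is_empty()  -- hypothetical helper, see the RFC quote above

-- on the replica being promoted: promote
box.cfg{read_only = false}
-- per the RFC, the new leader must issue CONFIRM for all transactions
-- that have no corresponding confirm message yet
```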
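The snapshot sketch covers the simplest of the snapshot cases, where all synchronous transactions are already confirmed; the cases where a confirm or rollback arrives while the snapshot is running need an error-injection build to stall the snapshot and are not shown:

```lua
-- run on both the master and a replica
local s = box.schema.space.create('snap_test', {is_sync = true})
s:create_index('pk')
s:insert{1}      -- waits for CONFIRM, so it is committed cluster-wide
box.snapshot()   -- expected to succeed and to contain tuple {1}
```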

2nd iteration

Notes

  • Testing should be done with both engines: memtx and vinyl (see the sketch after this list)
  • How many nodes should be in a cluster?
    • See *Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems* by Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm:

> Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes. 84% will manifest on no more than 2 nodes… It is not necessary to have a large cluster to test for and reproduce failures.
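
A minimal sketch of running the same case under both engines, assuming is_sync is accepted for memtx and vinyl alike:

```lua
for _, engine in ipairs({'memtx', 'vinyl'}) do
    local s = box.schema.space.create('test_' .. engine,
                                      {engine = engine, is_sync = true})
    s:create_index('pk')
    s:insert{1}
    s:drop()
end
```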
