
qsync testing plan


Current state of the RFC as of 15.06.2020: https://github.com/tarantool/tarantool/commit/a0236e5891f97426a62634557560c4adf32fc967

Bugs

1st iteration

  • [RFC, summary] switch async replicas into sync ones and vice versa, expected success and data consistency on the leader and replicas (see the is_sync sketch after this list)
  • [RFC, summary] switch from leader to replica and vice versa, expected success and data consistency on the leader and replicas (see the promote/demote sketch after this list)
  • [RFC, quorum commit] happy path: write/read data on the leader of a sync cluster, expected data consistency on the leader and replicas (see the quorum sketch after this list)
  • happy path: read/write data in a sync cluster with the maximum allowed number of replicas, expected success and data consistency on the leader and replicas
  • [RFC, quorum commit] no quorum achieved, expected transaction rollback and data consistency on the leader and replicas
  • [RFC, quorum commit] check behaviour with no answer from a replica during a write; a failure answer is expected
  • [RFC, quorum commit] check behaviour with a failure answer from a replica during a write; the replica is expected to be disconnected from replication
  • [RFC, quorum commit] write multiple transactions; when the quorum is achieved, they are expected to commit in the same order as issued by the client
  • [RFC, quorum commit] write multiple transactions; the latest transaction that collects the quorum is expected to be considered complete, as well as all transactions prior to it
  • [RFC, quorum commit] failure while confirming a transaction on the leader, expected rollback and data consistency on the leader and replicas
  • a leader got a quorum, but a replica that participated in the quorum leaves the cluster right after answering the leader, expected (TBD)
  • [RFC, quorum commit] check the case when a write reached the WAL and SUCCESS was sent in reply, but the WAL was lost afterwards
  • read the rollback code ("guarantee of rollback on leader and sync replicas")
  • consistency on replicas on enabling and disabling sync replication
  • [RFC, connection liveness] replication_connect_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_sync_lag works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_sync_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_timeout works as expected with a sync cluster (see documentation)
  • [RFC, connection liveness] replication_synchro_timeout works as expected with a sync cluster
  • [RFC, connection liveness] replication_synchro_quorum works as expected with a sync cluster (all six options are gathered in the box.cfg sketch after this list)
  • [RFC, connection liveness] when the leader gets no response from a replica for another heartbeat interval, it should consider the replica lost
  • [RFC, connection liveness] when the leader finds itself without enough replicas to achieve the quorum, it should stop accepting write requests
  • [RFC, connection liveness] a leader that has stopped accepting write requests can be switched back to write mode when the cluster configuration is updated
  • [RFC, connection liveness] some replicas become unavailable during quorum collection; the leader is expected to wait at most replication_synchro_quorum_timeout, after which it issues a rollback pointing to the oldest TXN in the waiting list
  • test with a leader and a single replica in a cluster, expected ??? (TBD)
  • [RFC, Leader role assignment] promote a replica to a leader manually with unconfirmed transactions; expected success, the leader switched to rw mode, and "all transactions that have no corresponding confirm message should be confirmed" (see the leader role assignment and the promote/demote sketch after this list)
  • [RFC, Leader role assignment] promote a replica to a leader manually; expected success, the leader switched to rw mode, and it "writes CURRENT_LEADER_ID as a FORMER_LEADER_ID in the _voting space and puts its ID as a CURRENT_LEADER_ID"
  • [RFC, Leader role assignment] demote a leader manually; expected success, the leader switched to ro mode, and "The Leader has to switch to ro mode and wait until its undo log is empty. This effectively means all transactions are committed in the cluster and it is safe to pass the leadership. Then it should write CURRENT_LEADER_ID as a FORMER_LEADER_ID and set CURRENT_LEADER_ID to 0."
  • [RFC, Recovery and failover] "In case there's not enough replicas to set up a quorum the cluster can be switched into a read-only mode."
  • [RFC, Recovery and failover] TODO
  • [RFC, Snapshot generation] all txns confirmed, then snapshot, expected success (check both master and replica; see the snapshot sketch after this list)
  • [RFC, Snapshot generation] snapshot started, then confirm arrived, expected success (check both master and replica)
  • [RFC, Snapshot generation] snapshot started, then rollback arrived, expected snapshot abort (check both master and replica)
  • [RFC, Snapshot generation] successful snapshot contains all txns created before LSN that was latest when snapshot creation started (check both master and replica)
  • [RFC, Asynchronous replication] successful transaction applied on async replica
  • [RFC, Asynchronous replication] failed transaction rolled back on async replica
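
A few sketches of the cases above follow. This first one covers the async/sync switch: per the RFC, synchronicity is a per-space property, so the switch is an alter of the is_sync option (assuming the space-level alter accepts is_sync, as the RFC proposes); data consistency is then checked on the leader and the replicas.

```lua
-- async -> sync -> async switch of a single space; run on the leader
box.schema.space.create('test', {is_sync = false})
box.space.test:create_index('pk')

box.space.test:alter({is_sync = true})    -- async -> sync
box.space.test:insert{1}                  -- now waits for the quorum
box.space.test:alter({is_sync = false})   -- sync -> async
box.space.test:insert{2}                  -- a plain asynchronous write
```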
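The quorum sketch, assuming the RFC option names replication_synchro_quorum and replication_synchro_timeout, exercises the happy path and the no-quorum rollback on the leader of a three-node cluster:

```lua
box.cfg{
    replication_synchro_quorum  = 2,  -- the leader plus one replica
    replication_synchro_timeout = 5,  -- seconds to wait for the quorum
}
local s = box.schema.space.create('sync_test', {is_sync = true})
s:create_index('pk')

-- happy path: insert() returns only after the quorum has confirmed the write
s:insert{1, 'committed'}

-- no quorum: with the quorum raised above the cluster size the write must
-- time out and be rolled back on the leader and the replicas
box.cfg{replication_synchro_quorum = 4}
local ok = pcall(s.insert, s, {2, 'rolled back'})
assert(not ok)
```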
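The box.cfg sketch gathers the connection-liveness options named above in one place; the values are illustrative only, and each test would vary one of them and check the documented behaviour:

```lua
box.cfg{
    replication_connect_timeout = 30,  -- how long to wait for peers on bootstrap
    replication_sync_lag        = 10,  -- max lag for a replica to count as synced
    replication_sync_timeout    = 300, -- how long to wait to reach the synced state
    replication_timeout         = 1,   -- heartbeat interval for liveness detection
    replication_synchro_quorum  = 2,   -- quorum size for synchronous transactions
    replication_synchro_timeout = 5,   -- how long a sync txn may collect the quorum
}
```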
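The promote/demote sketch assumes the RFC-era model where the _voting space is not implemented yet and leadership is expressed through read_only alone; wait_until_undo_log_is_empty() is a hypothetical test helper standing in for the RFC requirement that the old leader waits until its undo log is empty before passing leadership:

```lua
-- on the current leader: demote
box.cfg{read_only = true}
wait_until_undo_log_is_empty()  -- hypothetical helper, see the RFC quote above

-- on the replica being promoted: promote
box.cfg{read_only = false}
-- per the RFC, the new leader must issue CONFIRM for all transactions
-- that have no corresponding confirm message yet
```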
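The snapshot sketch covers the simplest of the snapshot cases, where all synchronous transactions are already confirmed; the cases where a confirm or rollback arrives while the snapshot is running need an error-injection build to stall the snapshot and are not shown:

```lua
-- run on both the master and a replica
local s = box.schema.space.create('snap_test', {is_sync = true})
s:create_index('pk')
s:insert{1}      -- waits for CONFIRM, so it is committed cluster-wide
box.snapshot()   -- expected to succeed and to contain tuple {1}
```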

2nd iteration

Notes

  • Testing should be done with both engines: memtx and vinyl (see the sketch after this list)
  • How many nodes should be in a cluster?
    • See *Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems* by Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm:

> Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes. 84% will manifest on no more than 2 nodes… It is not necessary to have a large cluster to test for and reproduce failures.
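
A minimal sketch of running the same case under both engines, assuming is_sync is accepted for memtx and vinyl alike:

```lua
for _, engine in ipairs({'memtx', 'vinyl'}) do
    local s = box.schema.space.create('test_' .. engine,
                                      {engine = engine, is_sync = true})
    s:create_index('pk')
    s:insert{1}
    s:drop()
end
```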
