qsync testing plan
Current state of the RFC as of 15.06.2020: https://github.com/tarantool/tarantool/commit/a0236e5891f97426a62634557560c4adf32fc967
- [RFC, summary] switch async replicas into sync ones and vice versa; expected success and data consistency on the leader and replicas
- [RFC, summary] switch from a leader to a replica and vice versa; expected success and data consistency on the leader and replicas
- [RFC, quorum commit] happy path: write/read data on a leader in a sync cluster; expected data consistency on the leader and replicas (see the sketch after the quorum commit items below)
- happy path: read/write data in a sync cluster with the maximum allowed number of replicas; expected success and data consistency on the leader and replicas
- [RFC, quorum commit] no quorum achieved; expected transaction rollback and data consistency on the leader and replicas
- [RFC, quorum commit] check the behaviour when a replica does not answer during a write; expected a failure answer
- [RFC, quorum commit] check the behaviour when a replica answers with a failure during a write; expected disconnect from replication
- [RFC, quorum commit] attempt to write multiple transactions; expected the same order as on the client when the quorum is achieved
- [RFC, quorum commit] attempt to write multiple transactions; expected that the latest transaction that collects the quorum is considered complete, as well as all transactions prior to it
- [RFC, quorum commit] failure while the leader confirms a transaction; expected rollback and data consistency on the leader and replicas
- the leader got a quorum, but a replica that participated in the quorum leaves the cluster right after answering the leader; expected behaviour (TBD)
- [RFC, quorum commit] check the situation when a transaction has been written to the WAL and SUCCESS has been answered, but the WAL is lost afterwards
- read the rollback code ("guarantee of rollback on leader and sync replicas")
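A minimal Lua sketch of the happy-path items above, written against the option and flag names proposed in the RFC (`replication_synchro_quorum`, `replication_synchro_timeout`, a per-space `is_sync` flag); the exact names and behaviour are assumptions until the implementation lands.

```lua
-- Leader side; option names follow the RFC draft and may change.
box.cfg{
    listen = 3301,
    -- replication = {...},          -- replica URIs omitted in this sketch
    replication_synchro_quorum = 2,  -- leader plus one replica must confirm
    replication_synchro_timeout = 5, -- seconds to wait for the quorum
}

-- A synchronous space: a transaction commits only after the quorum confirms it.
local s = box.schema.space.create('sync_space', {is_sync = true})
s:create_index('pk')

-- Happy path: the insert returns only after the quorum has confirmed it,
-- so the same tuple must be readable on every sync replica right away.
s:insert{1, 'confirmed by the quorum'}
```

The replica-side check is simply reading the same tuple (`box.space.sync_space:get(1)`) once the leader call has returned.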
- consistency on replicas when enabling and disabling sync replication
- [RFC, connection liveness] replication_connect_timeout works as expected with a sync cluster (see documentation)
- [RFC, connection liveness] replication_sync_lag works as expected with a sync cluster (see documentation)
- [RFC, connection liveness] replication_sync_timeout works as expected with a sync cluster (see documentation)
- [RFC, connection liveness] replication_timeout works as expected with a sync cluster (see documentation)
- [RFC, connection liveness] replication_synchro_timeout
- [RFC, connection liveness] replication_synchro_quorum
- [RFC, connection liveness] when the leader has no response from a replica for another heartbeat interval, it should consider the replica lost
- [RFC, connection liveness] when the leader finds itself without enough replicas to achieve the quorum, it should stop accepting write requests
- [RFC, connection liveness] a leader that has stopped accepting write requests can be switched back to write mode when the cluster configuration is updated
- [RFC, connection liveness] some replicas become unavailable during quorum collection; expected: the leader waits at most replication_synchro_quorum_timeout, after which it issues a rollback pointing to the oldest TXN in the waiting list
- test with a leader and a single replica in a cluster; expected ??? (TBD)
- [RFC, Leader role assignment] promote a replica to a leader manually with unconfirmed transactions; expected success, the leader switched to rw mode, and "all transactions that have no corresponding confirm message should be confirmed (see the leader role assignment)"
- [RFC, Leader role assignment] promote a replica to a leader manually; expected success, the leader switched to rw mode, and it "writes CURRENT_LEADER_ID as a FORMER_LEADER_ID in the _voting space and put its ID as a CURRENT_LEADER_ID" (see the switch-over sketch after these items)
- [RFC, Leader role assignment] demote a leader manually; expected success, the leader switched to ro mode, and "The Leader has to switch in ro mode and wait for its undo log is empty. This effectively means all transactions are committed in the cluster and it is safe to pass the leadership. Then it should write CURRENT_LEADER_ID as a FORMER_LEADER_ID and put CURRENT_LEADER_ID into 0."
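A sketch of how a test could drive the manual promote/demote items; it only toggles box.cfg.read_only and observes box.info.ro, while the _voting space bookkeeping quoted from the RFC is not modelled here.

```lua
-- Old leader: stop accepting writes; the test then waits until the
-- undo log is empty, i.e. all pending sync transactions are confirmed.
box.cfg{read_only = true}
assert(box.info.ro == true)

-- Promoted replica: once it has caught up, open it for writes.
box.cfg{read_only = false}
assert(box.info.ro == false)
box.space.sync_space:insert{2, 'written by the new leader'}
```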
- [RFC, Recovery and failover] "In case there's not enough replicas to set up a quorum the cluster can be switched into a read-only mode."
- [RFC, Recovery and failover] TODO
- [RFC, Snapshot generation] all txns confirmed, then snapshot; expected success (check both master and replica); see the snapshot sketch after these items
- [RFC, Snapshot generation] snapshot started, then a confirm arrived; expected success (check both master and replica)
- [RFC, Snapshot generation] snapshot started, then rollback arrived, expected snapshot abort (check both master and replica)
- [RFC, Snapshot generation] successful snapshot contains all txns created before LSN that was latest when snapshot creation started (check both master and replica)
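For the snapshot items, the confirmed-transactions case can be exercised directly with box.snapshot(); the abort-on-rollback case additionally needs a lost quorum or an error injection and is not shown here.

```lua
-- All synchronous transactions have collected their quorum by this point.
box.space.sync_space:insert{3, 'confirmed before the snapshot'}

-- Expected: the snapshot succeeds on the leader and on every replica,
-- and the resulting .snap file contains the confirmed transaction.
box.snapshot()
```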
- [RFC, Asynchronous replication] successful transaction applied on async replica
- [RFC, Asynchronous replication] failed transaction rolled back on async replica
- fault injections at different steps to fail the "WAL Ok" answer from a replica: network, disk, etc. (TBD; see the error injection sketch after this list)
- test with a time difference between the leader and replicas; expected success
- proper quorum number calculation (the quorum is more than half of the nodes in the cluster, i.e. floor(N/2) + 1, where N is the total number of nodes in the cluster; see the helper after these examples):
- In a 5-node cluster, quorum is 3
- In a 4-node cluster, quorum is 3
- In a 3-node cluster, quorum is 2
- In a 2-node cluster, quorum is 2
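The same calculation as a small Lua helper that a test can use as an oracle (integer division, then plus one):

```lua
-- Quorum is the majority of the cluster: floor(N / 2) + 1.
local function quorum(n)
    return math.floor(n / 2) + 1
end

assert(quorum(5) == 3)
assert(quorum(4) == 3)
assert(quorum(3) == 2)
assert(quorum(2) == 2)
```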
- [RFC, Leader role assignment] TODO: automated leader promotion with Raft
- test the Raft implementation itself with random state generation and the invariants described in the Raft paper
- test the new cluster CLI
- testing with nemesis
- transactional consistency checker
- checking linearizability
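For the fault-injection item above, debug builds of Tarantool expose box.error.injection; a sketch of failing the replica's WAL write so the leader never receives its confirmation (the injection name assumes a debug build):

```lua
-- Replica side, debug build only: make WAL writes fail, so the pending
-- synchronous transaction is never confirmed by this replica.
box.error.injection.set('ERRINJ_WAL_IO', true)

-- ... the leader writes to the sync space and, lacking the quorum,
-- is expected to roll the transaction back after the synchro timeout ...

box.error.injection.set('ERRINJ_WAL_IO', false)
```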
- Testing should be done with both engines: memtx and vinyl (see the engine parameterization sketch below)
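Both engines can be covered by parameterizing space creation; whether vinyl spaces may be synchronous is itself a question for the RFC, so the is_sync flag on the vinyl space below is an assumption:

```lua
-- Run the same test body for both storage engines.
for _, engine in ipairs({'memtx', 'vinyl'}) do
    local s = box.schema.space.create('sync_' .. engine,
                                      {engine = engine, is_sync = true})
    s:create_index('pk')
    s:insert{1, 'engine: ' .. engine}
end
```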
- How many nodes should be in a cluster?
- Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-intensive Systems (Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm): "Almost all (98%) of the failures are guaranteed to manifest on no more than 3 nodes. 84% will manifest on no more than 2 nodes… It is not necessary to have a large cluster to test for and reproduce failures."
- RFC https://github.com/tarantool/tarantool/blob/sergos/quorum-based-synchro/doc/rfc/quorum-based-synchro.md
- https://www.tarantool.io/ru/doc/2.3/reference/configuration/
- https://www.tarantool.io/ru/doc/2.3/book/replication/repl_architecture/
- https://github.com/tarantool/tarantool/issues/980