bank-lua fails with 1.8% probability by construction #94

Totktonada · 2021-10-30T15:58:58Z

The test creates 10 bank accounts and set initial balances:

$ grep :read tarantool-3f568c339221ac10948c0dd29e405c0f224597dc/20211029T163440.000Z/jepsen.log | grep -v :invoke | head -n 1
2021-10-29 16:44:43,402{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:ok	:read	{0 100, 1 0, 2 0, 3 0, 4 0, 5 0, 6 0, 7 0, 8 0, 9 0}

The 0th account has balance 100, others have 0.

Next, the test starts to perform transfers from account to account:

$ grep :transfer tarantool-3f568c339221ac10948c0dd29e405c0f224597dc/20211029T163440.000Z/jepsen.log | grep -v :invoke | head -n 10
2021-10-29 16:44:44,169{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 4, :to 2, :amount 5}
2021-10-29 16:44:47,393{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 9, :to 4, :amount 2}
2021-10-29 16:44:48,177{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 2, :to 0, :amount 2}
2021-10-29 16:44:49,756{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 7, :to 6, :amount 5}
2021-10-29 16:44:51,306{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 4, :to 9, :amount 5}
2021-10-29 16:44:53,736{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 7, :to 0, :amount 4}
2021-10-29 16:44:54,518{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 1, :to 2, :amount 5}
2021-10-29 16:44:55,311{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 2, :to 7, :amount 4}
2021-10-29 16:44:56,089{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 2, :to 1, :amount 5}
2021-10-29 16:44:57,632{GMT}	INFO	[jepsen worker 0] jepsen.util: 0	:fail	:transfer	{:from 9, :to 3, :amount 4}

When 'from' account has not enough balance, the transfer fails (it does not fail the whole test). 'from'/'to' accounts are choosen randomly. When all transfer fail, the test fails (see here).

The test is time bounded: 60 seconds. We have no concurrency in the test and amount of transfers is around 38. There is (9/10)^38 ≈ 1.8% probability that we'll never attempt to transfer from 0th account, so all transfers will fail.

The result of the test is reported this way in given case:

2021-10-29 16:46:06,559{GMT}	INFO	[jepsen test runner] jepsen.core: {:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :clock-skew {:valid? true},
 :crash {:valid? true},
 :timeline {:valid? true},
 :stats
 {:valid? false,
  :count 77,
  :ok-count 39,
  :fail-count 38,
  :info-count 0,
  :by-f
  {:read
   {:valid? true,
    :count 39,
    :ok-count 39,
    :fail-count 0,
    :info-count 0},
   :transfer
   {:valid? false,
    :count 38,
    :ok-count 0,
    :fail-count 38,
    :info-count 0}}},
 :exceptions {:valid? true},
 :workload
 {:SI
  {:valid? true,
   :read-count 39,
   :error-count 0,
   :first-error nil,
   :errors {}},
  :plot {:valid? true},
  :valid? true},
 :valid? false}


Analysis invalid! (ﾉಥ益ಥ）ﾉ ┻━┻

(Not quite obvious on the first glance, but look at :ok-count 0.)

There are some ways to resolve it or decrease probability of the fail:

Distribute the initial total amount across all accounts.
Enable concurrency.
Always make the first transfer from 0th account.
Allow negative balances.
A fix for Don't create a new connection for each request #96 may also decrease probability of the fail due to insreased overall amount of transfer attempts per second.

Uniform initial distribution or permitting negative balances are only true solutions, they also resolve the problem with a lot of failed transfers. See below.

I looked over several other tests implementations:

CockroachDB distributed balance uniformly (see here and here).
Ignite distributes them too (see here, here and here).
YugabyteDB sets the first account to an initial value and others to zero (see here and here).

So, it seems, there is no canonical way to choose initial balances. OTOH, if we'll distribute them uniformly, we'll make more successful transfers, so the test will be more powerful.

Full log: logs.txt.
More detailed info: jepsen-single-instance.zip (see tarantool-3f568c339221ac10948c0dd29e405c0f224597dc/20211029T163440.000Z/).

I would also highlight that even when the test in its current implementation does not fails, it generates much less successful transactions, if we'll compare it with the uniform balances distribution case or negative balances case. So our test lacks some stress testing power just because of the bad initial balances distribution.

The text was updated successfully, but these errors were encountered:

Before this commit the initial distribution was the following: first account has `total-amount` balances, other 9 ones have zero balance. From time to time is appears that there are no attempts to transfer from the first account. It means that all transfers are marked as failed and the history analysis reports fail. Even when there are some successful transfers, there are many attempts to transfer from an account with zero balance: those transfers are useless in terms of tarantool testing. This commit changes the initial distribution to the uniform one and resolves problems described above. See the linked issue for details. Aside of the problematic bank-lua I changed other banking tests in the same way to unify the approach. How each test is changed: * bank-lua: [100 0 0 0 0 0 0 0 0 0] -> [10 10 10 10 10 10 10 10 10 10]. * bank-multitable-lua: the test already distributes initial balances uniformly, but hardcodes balance value. Now it is deduced from the `total-amount` parameter. * bank: same as bank-lua. * bank-multitable: same as bank-lua. Fixes #94

Before this commit the initial distribution was the following: first account has `total-amount` balances, other 9 ones have zero balance. From time to time is appears that there are no attempts to transfer from the first account. It means that all transfers are marked as failed and the history analysis reports fail. Even when there are some successful transfers, there are many attempts to transfer from an account with zero balance: those transfers are useless in terms of tarantool testing. This commit changes the initial distribution to the uniform one and resolves the problems, which are described above. See the linked issue for details. Aside of the problematic bank-lua I changed other banking tests in the same way to unify the approach. How each test is changed: * bank-lua: [100 0 0 0 0 0 0 0 0 0] -> [10 10 10 10 10 10 10 10 10 10]. * bank-multitable-lua: the test already distributes initial balances uniformly, but hardcodes balance value. Now it is deduced from the `total-amount` parameter. * bank: same as bank-lua. * bank-multitable: same as bank-lua. 'bank' and 'bank-multitable' fixes are blind, because they're disabled. See #83. 'bank-lua' and 'bank-multitable-lua' are passed in a manual run. 'bank-lua' gives 2-3 failed transfers of ~40 ones. Before it was like 5-15 *successful* ones of 30-40. (I have no representative statistics, just looked over several runs in CI, but the numbers looks expected.) Fixes #94

Totktonada mentioned this issue Nov 1, 2021

Sometime dependencies installation failed #87

Closed

Totktonada added the bug Something isn't working label Nov 14, 2021

Totktonada self-assigned this Nov 14, 2021

Totktonada mentioned this issue Nov 14, 2021

bank*: distribute initial balances uniformly #99

Merged

ligurio closed this as completed in #99 Nov 24, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

bank-lua fails with 1.8% probability by construction #94

bank-lua fails with 1.8% probability by construction #94

Totktonada commented Oct 30, 2021 •

edited

Loading

bank-lua fails with 1.8% probability by construction #94

bank-lua fails with 1.8% probability by construction #94

Comments

Totktonada commented Oct 30, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Totktonada commented Oct 30, 2021 •

edited

Loading