BATCH-1767: fix optimistic locking exception in multi-threaded step #591

fmbenhassine · 2018-03-21T13:52:02Z

This PR is an attempt to resolve BATCH-1767. I will first explain the problem and then suggest a possible solution.

The problem:

Currently, when the commit of the chunk fails in a multi-threaded step, the step execution is rolled back to a previous, eventually obsolete version. This previous version might be obsolete because it could have been modified by another thread. In this case, a OptimisticLockingFailureException is thrown when trying to persist the step execution leaving the step in an UNKNOWN state while it should be in a FAILED state. Here is a simple diagram that illustrates the issue:

SE: StepExecution | SC: StepContribution | OV: old version variable (of type StepExecution)

In this diagram, 3 threads are running in parallel to execute the TaskletStep. All of them start with a StepExecution instance at version 1. This instance is shared between threads. In pseudo code, each thread will do the following:

1. copy the StepExecution in oldVersion variable (which is also shared)
2. create a StepContribution instance (not shared), do the work and assign metrics to it
3. take the lock {
      3.1 apply contribution to the StepExecution
      3.2 try {
               3.2.1 update the StepExecution in the database 
               3.2.2 commit
          catch(Exception e) {
               3.2.3 rollback the in-memory StepExecution instance to oldVersion
          }
}

There are two issues with this:

The StepContribution might be applied on an obsolete version of the StepExecution (which might have changed between step 1 and and step 3). This results in OptimisticLockingFailureException when one of the threads tries to persist the step execution after applying its contribution
In case of a commit failure, the StepExecution is rolled back to a possibly obsolete version (oldVersion which is always at version 1). This also results in a OptimisticLockingFailureException when the control returns back to AbstractStep which tries to persist the in-memory StepExecution at version 1 while the StepExecution in the database is at version 3 (This is the reported issue in BATCH-1767).

A possible solution

This PR fixes:

Issue 1) by refreshing the step execution to the latest correctly persisted version just before applying the step contribution so that each thread applies its contribution on a fresh and correct state.
Issue 2) by reverting all metrics but the version when reverting the in-memory StepExecution in case of rollback. The goal of the rollback is to undo the (failed) contribution metrics and not the technical field version. "Reverting" a version is not compatible with an optimistic locking strategy, we use the version to check if there is a version conflict but we don't need to revert it (in the in-memory instance of StepExecution, note that the StepExecution is correctly reverted in the database since the update is executed within the transaction (doInTransaction method) being rolled back).

The most important detail in the story is that each thread creates its own contribution to the step, but the ChunkContext (and the StepExecution) is shared between threads. This shared state might end up in an inconsistent state if one of the transactions is rolled back. Tests in this PR do not assert any result on the (undefined) execution context.

It is technically possible to make the execution context consistent by using ThreadLocals for instance, but I would not go down this path for two reasons:

It will increase memory footprint by making each thread have its own copy of the execution context (and probably introduce a performance degradation).
It will also introduce another level of complexity to the code (and hence impacts maintainability) just to solve a very specific problem. It's always a trade-off..

The "practical limitations of multi-threaded Step" section of the documentation already mentions to turn-off the state in this case. This PR updates the section by mentioning that it is not advised to manipulate the execution context in a multi-threaded step.

quaff · 2019-12-10T02:16:15Z

@benas Any updates? I have encounter OptimisticLockingFailureException while testing distributed database such as TiDB and YugabyteDB, but works fine with traditional database such as MySQL and PostgreSQL and SqlServer.

2019-12-10 10:06:24,911           taskExecutor-52                                      o.s.b.c.s.t.TaskletStep ERROR JobRepository failure forcing rollback
 org.springframework.dao.OptimisticLockingFailureException: Attempt to update step execution id=4 with wrong version (50), where current version is 1
	at org.springframework.batch.core.repository.dao.JdbcStepExecutionDao.updateStepExecution(JdbcStepExecutionDao.java:279) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.repository.support.SimpleJobRepository.update(SimpleJobRepository.java:196) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at sun.reflect.GeneratedMethodAccessor149.invoke(Unknown Source) ~[?:?]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_231]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_231]
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:366) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:99) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at com.sun.proxy.$Proxy190.update(Unknown Source) ~[?:?]
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:457) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262) ~[spring-batch-infrastructure-4.2.1.jar:4.2.1.RELEASE]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]
2019-12-10 10:06:24,912           taskExecutor-52                                      o.s.b.c.s.t.TaskletStep INFO Commit failed while step execution data was already updated. Reverting to old version.
2019-12-10 10:06:24,920           taskExecutor-68                                      o.s.b.c.s.t.TaskletStep ERROR JobRepository failure forcing rollback
 org.springframework.dao.OptimisticLockingFailureException: Attempt to update step execution id=4 with wrong version (1), where current version is 50
	at org.springframework.batch.core.repository.dao.JdbcStepExecutionDao.updateStepExecution(JdbcStepExecutionDao.java:279) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.repository.support.SimpleJobRepository.update(SimpleJobRepository.java:196) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at sun.reflect.GeneratedMethodAccessor149.invoke(Unknown Source) ~[?:?]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_231]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_231]
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:366) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:99) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at com.sun.proxy.$Proxy190.update(Unknown Source) ~[?:?]
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:457) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262) ~[spring-batch-infrastructure-4.2.1.jar:4.2.1.RELEASE]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]

quaff · 2019-12-10T02:27:50Z

I tried this patch, it doesn't fix the problem, but job ends with FAILED instead of UNKNOWN.
I'm wondering why failed with those newSQL, perhaps they are not so ACID as advertised.

fmbenhassine · 2020-03-24T11:58:45Z

@quaff

I have encounter OptimisticLockingFailureException while testing distributed database such as TiDB and YugabyteDB, but works fine with traditional database such as MySQL and PostgreSQL and SqlServer

If it works fine with traditional databases, it is already a good sign about the fix. Distributed databases are not supported in Spring Batch yet (there are open issues to support them: #2060, #1870, #1317) so they are not in the scope of this issue anyway right now.

I tried this patch, it doesn't fix the problem, but job ends with FAILED instead of UNKNOWN.

Please note that fixing the problem does not mean the OptimisticLockingFailureException does not happen anymore, we can't prevent that (that's the idea behind an optimistic strategy in general). Fixing the problem means when an optimistic locking exception happens, the job should end with status FAILED and not UNKNOWN. This is required to be able to restart it where it left off. I explained this in the "problem" section:

In this case, a OptimisticLockingFailureException is thrown when trying
to persist the step execution leaving the step in an UNKNOWN state 
while it should be in a FAILED state.

So for me, if you tried this patch and your job ended in a FAILED state, that means the patch fixes the problem.

Pessimistic strategies are safer in regards to concurrency at the cost of performance. Optimistic strategies usually perform better in concurrent environments at the cost of predictability. Spring Batch adopted an optimistic strategy (See MetaData versioning) by design for good reasons I believe, so the goal here is not preventing the exception from happening, it is rather making things possible to be retied/restarted, in this case, it is the step and its surrounding job: both should end in a FAILED state and not in an UNKNOWN state.

Thank you for your feedback!

quaff · 2020-03-25T00:26:33Z

@benas What about retrying if OptimisticLockingFailureException occurred? similar to limited spin with CAS.

fmbenhassine · 2020-03-26T11:11:28Z

That would be a feature, and I'm not sure if it is possible to implement it without a big refactoring which is out of scope for this issue.

We would like to focus on fixing the issue first (which is making sure the job fails with the correct status of FAILED and not UNKNOWN to be able to restart it where it left off from a consistent state, at least manually) before extending the scope to add automatic retries, etc. Make it work, then make it better 😉

quaff · 2020-03-27T00:11:28Z

This PR is pending for two years, is there any concern?

fmbenhassine · 2020-03-31T16:32:29Z

No, please see my comment here. We will let you know when this is ready to merge.

snagunoori · 2020-04-30T05:25:37Z

@benas Before available this in release do you have any hack so that mean while which we can use to fix our prod issue . Looks like this issue is pending from long time since 2011 if am not wrong #1826

Currently, when the commit of the chunk fails in a multi-threaded step, the step execution is rolled back to a previous, eventually obsolete version. This previous version might be obsolete because it could be modified by another thread. In this case, a OptimisticLockingFailureException is thrown when trying to persist the step execution leaving the step in an UNKNOWN state while it should be FAILED. This commit fixes the issue by refreshing the step execution to the latest correctly persisted version before applying the step contribution so that the contribution is applied on a fresh correct state. Resolves BATCH-1767

…g-projects#1289

fmbenhassine · 2021-03-23T16:29:34Z

I rebased this PR on the latest master branch and updated the tests as needed (7ba5c10). However, there is now one regression in FaultTolerantStepFactoryBeanUnexpectedRollbackTests that was not present when I first created this PR. This regression could be fixed by reverting this line, but this goes against the idea of the PR in the first place..

This issue is actually deeper than I initially thought. The problem is that the current implementation of a chunk-oriented step is not friendly to concurrent access. The current model tries to lock access to shared state in memory by using a semaphore coupled with an optimistic locking strategy at the job repository level. However, this shared state is not properly synchronized and ensuring a consistent state between what's in memory and what's in the database is not guaranteed when things go wrong (as detailed in my initial comment). There are several issues related to that behaviour like #1189 and #1145 which are still valid.

For this reason, I decided to close this PR without merging the fix. While this PR could be merged, I believe it only treats the symptom, not the underlying problem. I will open an issue to reconsider the implementation of a chunk-oriented step in order to properly fix all related issues.

fmbenhassine · 2022-08-05T08:44:24Z

I will open an issue to reconsider the implementation of a chunk-oriented step in order to properly fix all related issues.

Done: #3950.

fmbenhassine force-pushed the BATCH-1767 branch from 1d45ed3 to 0561396 Compare April 4, 2018 21:43

fmbenhassine added ready-to-review labels Apr 10, 2018

fmbenhassine added the pr-for: bug label Oct 6, 2018

spring-projects-issues mentioned this pull request Dec 16, 2019

OptimisticLockingFailureException updating step execution after commit failure #1826

Open

fmbenhassine removed pr-status: do-not-merge-yet labels Jan 13, 2020

fmbenhassine added in: core status: waiting-for-triage Issues that we did not analyse yet labels Nov 9, 2020

fmbenhassine linked an issue Jan 14, 2021 that may be closed by this pull request

OptimisticLockingFailureException updating step execution after commit failure #1826

Open

fmbenhassine removed the status: waiting-for-triage Issues that we did not analyse yet label Jan 14, 2021

fmbenhassine and others added 2 commits March 23, 2021 16:54

Update tests to use the new way of providing a tx manager as in sprin…

7ba5c10

…g-projects#1289

fmbenhassine force-pushed the BATCH-1767 branch from 0561396 to 7ba5c10 Compare March 23, 2021 16:02

fmbenhassine closed this Mar 23, 2021

fmbenhassine mentioned this pull request May 22, 2025

Retry support for exceptions in transaction commit phase without re-executing reader #4114

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BATCH-1767: fix optimistic locking exception in multi-threaded step #591

BATCH-1767: fix optimistic locking exception in multi-threaded step #591

Uh oh!

fmbenhassine commented Mar 21, 2018 •

edited

Loading

Uh oh!

quaff commented Dec 10, 2019

Uh oh!

quaff commented Dec 10, 2019

Uh oh!

fmbenhassine commented Mar 24, 2020 •

edited

Loading

Uh oh!

quaff commented Mar 25, 2020

Uh oh!

fmbenhassine commented Mar 26, 2020

Uh oh!

quaff commented Mar 27, 2020

Uh oh!

fmbenhassine commented Mar 31, 2020

Uh oh!

snagunoori commented Apr 30, 2020

Uh oh!

fmbenhassine commented Mar 23, 2021

Uh oh!

fmbenhassine commented Aug 5, 2022

Uh oh!

Uh oh!

BATCH-1767: fix optimistic locking exception in multi-threaded step #591

BATCH-1767: fix optimistic locking exception in multi-threaded step #591

Uh oh!

Conversation

fmbenhassine commented Mar 21, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The problem:

A possible solution

Uh oh!

quaff commented Dec 10, 2019

Uh oh!

quaff commented Dec 10, 2019

Uh oh!

fmbenhassine commented Mar 24, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

quaff commented Mar 25, 2020

Uh oh!

fmbenhassine commented Mar 26, 2020

Uh oh!

quaff commented Mar 27, 2020

Uh oh!

fmbenhassine commented Mar 31, 2020

Uh oh!

snagunoori commented Apr 30, 2020

Uh oh!

fmbenhassine commented Mar 23, 2021

Uh oh!

fmbenhassine commented Aug 5, 2022

Uh oh!

Uh oh!

fmbenhassine commented Mar 21, 2018 •

edited

Loading

fmbenhassine commented Mar 24, 2020 •

edited

Loading