Skip to content

BATCH-1767: fix optimistic locking exception in multi-threaded step #591

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

fmbenhassine
Copy link
Contributor

@fmbenhassine fmbenhassine commented Mar 21, 2018

This PR is an attempt to resolve BATCH-1767. I will first explain the problem and then suggest a possible solution.

The problem:

Currently, when the commit of the chunk fails in a multi-threaded step, the step execution is rolled back to a previous, eventually obsolete version. This previous version might be obsolete because it could have been modified by another thread. In this case, a OptimisticLockingFailureException is thrown when trying to persist the step execution leaving the step in an UNKNOWN state while it should be in a FAILED state. Here is a simple diagram that illustrates the issue:

batch-1767

SE: StepExecution | SC: StepContribution | OV: old version variable (of type StepExecution)

In this diagram, 3 threads are running in parallel to execute the TaskletStep. All of them start with a StepExecution instance at version 1. This instance is shared between threads. In pseudo code, each thread will do the following:

1. copy the StepExecution in oldVersion variable (which is also shared)
2. create a StepContribution instance (not shared), do the work and assign metrics to it
3. take the lock {
      3.1 apply contribution to the StepExecution
      3.2 try {
               3.2.1 update the StepExecution in the database 
               3.2.2 commit
          catch(Exception e) {
               3.2.3 rollback the in-memory StepExecution instance to oldVersion
          }
}

There are two issues with this:

  1. The StepContribution might be applied on an obsolete version of the StepExecution (which might have changed between step 1 and and step 3). This results in OptimisticLockingFailureException when one of the threads tries to persist the step execution after applying its contribution

  2. In case of a commit failure, the StepExecution is rolled back to a possibly obsolete version (oldVersion which is always at version 1). This also results in a OptimisticLockingFailureException when the control returns back to AbstractStep which tries to persist the in-memory StepExecution at version 1 while the StepExecution in the database is at version 3 (This is the reported issue in BATCH-1767).

A possible solution

This PR fixes:

  • Issue 1) by refreshing the step execution to the latest correctly persisted version just before applying the step contribution so that each thread applies its contribution on a fresh and correct state.

  • Issue 2) by reverting all metrics but the version when reverting the in-memory StepExecution in case of rollback. The goal of the rollback is to undo the (failed) contribution metrics and not the technical field version. "Reverting" a version is not compatible with an optimistic locking strategy, we use the version to check if there is a version conflict but we don't need to revert it (in the in-memory instance of StepExecution, note that the StepExecution is correctly reverted in the database since the update is executed within the transaction (doInTransaction method) being rolled back).

The most important detail in the story is that each thread creates its own contribution to the step, but the ChunkContext (and the StepExecution) is shared between threads. This shared state might end up in an inconsistent state if one of the transactions is rolled back. Tests in this PR do not assert any result on the (undefined) execution context.

It is technically possible to make the execution context consistent by using ThreadLocals for instance, but I would not go down this path for two reasons:

  • It will increase memory footprint by making each thread have its own copy of the execution context (and probably introduce a performance degradation).
  • It will also introduce another level of complexity to the code (and hence impacts maintainability) just to solve a very specific problem. It's always a trade-off..

The "practical limitations of multi-threaded Step" section of the documentation already mentions to turn-off the state in this case. This PR updates the section by mentioning that it is not advised to manipulate the execution context in a multi-threaded step.

@quaff
Copy link
Contributor

quaff commented Dec 10, 2019

@benas Any updates? I have encounter OptimisticLockingFailureException while testing distributed database such as TiDB and YugabyteDB, but works fine with traditional database such as MySQL and PostgreSQL and SqlServer.

2019-12-10 10:06:24,911           taskExecutor-52                                      o.s.b.c.s.t.TaskletStep ERROR JobRepository failure forcing rollback
 org.springframework.dao.OptimisticLockingFailureException: Attempt to update step execution id=4 with wrong version (50), where current version is 1
	at org.springframework.batch.core.repository.dao.JdbcStepExecutionDao.updateStepExecution(JdbcStepExecutionDao.java:279) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.repository.support.SimpleJobRepository.update(SimpleJobRepository.java:196) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at sun.reflect.GeneratedMethodAccessor149.invoke(Unknown Source) ~[?:?]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_231]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_231]
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:366) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:99) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at com.sun.proxy.$Proxy190.update(Unknown Source) ~[?:?]
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:457) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262) ~[spring-batch-infrastructure-4.2.1.jar:4.2.1.RELEASE]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]
2019-12-10 10:06:24,912           taskExecutor-52                                      o.s.b.c.s.t.TaskletStep INFO Commit failed while step execution data was already updated. Reverting to old version.
2019-12-10 10:06:24,920           taskExecutor-68                                      o.s.b.c.s.t.TaskletStep ERROR JobRepository failure forcing rollback
 org.springframework.dao.OptimisticLockingFailureException: Attempt to update step execution id=4 with wrong version (1), where current version is 50
	at org.springframework.batch.core.repository.dao.JdbcStepExecutionDao.updateStepExecution(JdbcStepExecutionDao.java:279) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.repository.support.SimpleJobRepository.update(SimpleJobRepository.java:196) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at sun.reflect.GeneratedMethodAccessor149.invoke(Unknown Source) ~[?:?]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_231]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_231]
	at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:366) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:99) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:212) ~[spring-aop-5.2.2.jar:5.2.2.RELEASE]
	at com.sun.proxy.$Proxy190.update(Unknown Source) ~[?:?]
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:457) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$ChunkTransactionCallback.doInTransaction(TaskletStep.java:331) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.transaction.support.TransactionTemplate.execute(TransactionTemplate.java:140) ~[spring-tx-5.2.2.jar:5.2.2.RELEASE]
	at org.springframework.batch.core.step.tasklet.TaskletStep$2.doInChunkContext(TaskletStep.java:273) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.core.scope.context.StepContextRepeatCallback.doInIteration(StepContextRepeatCallback.java:82) ~[spring-batch-core-4.2.1.jar:4.2.1.RELEASE]
	at org.springframework.batch.repeat.support.TaskExecutorRepeatTemplate$ExecutingRunnable.run(TaskExecutorRepeatTemplate.java:262) ~[spring-batch-infrastructure-4.2.1.jar:4.2.1.RELEASE]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]

@quaff
Copy link
Contributor

quaff commented Dec 10, 2019

I tried this patch, it doesn't fix the problem, but job ends with FAILED instead of UNKNOWN.
I'm wondering why failed with those newSQL, perhaps they are not so ACID as advertised.

@fmbenhassine
Copy link
Contributor Author

fmbenhassine commented Mar 24, 2020

@quaff

I have encounter OptimisticLockingFailureException while testing distributed database such as TiDB and YugabyteDB, but works fine with traditional database such as MySQL and PostgreSQL and SqlServer

If it works fine with traditional databases, it is already a good sign about the fix. Distributed databases are not supported in Spring Batch yet (there are open issues to support them: #2060, #1870, #1317) so they are not in the scope of this issue anyway right now.

I tried this patch, it doesn't fix the problem, but job ends with FAILED instead of UNKNOWN.

Please note that fixing the problem does not mean the OptimisticLockingFailureException does not happen anymore, we can't prevent that (that's the idea behind an optimistic strategy in general). Fixing the problem means when an optimistic locking exception happens, the job should end with status FAILED and not UNKNOWN. This is required to be able to restart it where it left off. I explained this in the "problem" section:

In this case, a OptimisticLockingFailureException is thrown when trying
to persist the step execution leaving the step in an UNKNOWN state 
while it should be in a FAILED state.

So for me, if you tried this patch and your job ended in a FAILED state, that means the patch fixes the problem.

Pessimistic strategies are safer in regards to concurrency at the cost of performance. Optimistic strategies usually perform better in concurrent environments at the cost of predictability. Spring Batch adopted an optimistic strategy (See MetaData versioning) by design for good reasons I believe, so the goal here is not preventing the exception from happening, it is rather making things possible to be retied/restarted, in this case, it is the step and its surrounding job: both should end in a FAILED state and not in an UNKNOWN state.

Thank you for your feedback!

@quaff
Copy link
Contributor

quaff commented Mar 25, 2020

@benas What about retrying if OptimisticLockingFailureException occurred? similar to limited spin with CAS.

@fmbenhassine
Copy link
Contributor Author

That would be a feature, and I'm not sure if it is possible to implement it without a big refactoring which is out of scope for this issue.

We would like to focus on fixing the issue first (which is making sure the job fails with the correct status of FAILED and not UNKNOWN to be able to restart it where it left off from a consistent state, at least manually) before extending the scope to add automatic retries, etc. Make it work, then make it better 😉

@quaff
Copy link
Contributor

quaff commented Mar 27, 2020

This PR is pending for two years, is there any concern?

@fmbenhassine
Copy link
Contributor Author

No, please see my comment here. We will let you know when this is ready to merge.

@snagunoori
Copy link

@benas Before available this in release do you have any hack so that mean while which we can use to fix our prod issue . Looks like this issue is pending from long time since 2011 if am not wrong #1826

@fmbenhassine fmbenhassine added in: core status: waiting-for-triage Issues that we did not analyse yet labels Nov 9, 2020
@fmbenhassine fmbenhassine removed the status: waiting-for-triage Issues that we did not analyse yet label Jan 14, 2021
fmbenhassine and others added 2 commits March 23, 2021 16:54
Currently, when the commit of the chunk fails in a multi-threaded step,
the step execution is rolled back to a previous, eventually obsolete
version. This previous version might be obsolete because it could be
modified by another thread. In this case, a OptimisticLockingFailureException
is thrown when trying to persist the step execution leaving the step in
an UNKNOWN state while it should be FAILED.

This commit fixes the issue by refreshing the step execution to the
latest correctly persisted version before applying the step contribution
so that the contribution is applied on a fresh correct state.

Resolves BATCH-1767
@fmbenhassine
Copy link
Contributor Author

I rebased this PR on the latest master branch and updated the tests as needed (7ba5c10). However, there is now one regression in FaultTolerantStepFactoryBeanUnexpectedRollbackTests that was not present when I first created this PR. This regression could be fixed by reverting this line, but this goes against the idea of the PR in the first place..

This issue is actually deeper than I initially thought. The problem is that the current implementation of a chunk-oriented step is not friendly to concurrent access. The current model tries to lock access to shared state in memory by using a semaphore coupled with an optimistic locking strategy at the job repository level. However, this shared state is not properly synchronized and ensuring a consistent state between what's in memory and what's in the database is not guaranteed when things go wrong (as detailed in my initial comment). There are several issues related to that behaviour like #1189 and #1145 which are still valid.

For this reason, I decided to close this PR without merging the fix. While this PR could be merged, I believe it only treats the symptom, not the underlying problem. I will open an issue to reconsider the implementation of a chunk-oriented step in order to properly fix all related issues.

@fmbenhassine
Copy link
Contributor Author

I will open an issue to reconsider the implementation of a chunk-oriented step in order to properly fix all related issues.

Done: #3950.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OptimisticLockingFailureException updating step execution after commit failure
3 participants