BATCH-1767: fix optimistic locking exception in multi-threaded step #591
@benas Any updates? I have encounter
I tried this patch, it doesn't fix the problem, but job ends with |
If it works fine with traditional databases, it is already a good sign about the fix. Distributed databases are not supported in Spring Batch yet (there are open issues to support them: #2060, #1870, #1317) so they are not in the scope of this issue anyway right now.
Please note that fixing the problem does not mean the
So for me, if you tried this patch and your job ended in a
Pessimistic strategies are safer with regard to concurrency, at the cost of performance. Optimistic strategies usually perform better in concurrent environments, at the cost of predictability. Spring Batch adopted an optimistic strategy (see MetaData versioning) by design, for good reasons I believe, so the goal here is not to prevent the exception from happening but to make it possible to retry or restart, in this case, the step and its surrounding job: both should end in a
Thank you for your feedback!
That would be a feature, and I'm not sure if it is possible to implement it without a big refactoring which is out of scope for this issue. We would like to focus on fixing the issue first (which is making sure the job fails with the correct status of |
This PR has been pending for two years. Is there any concern?
No, please see my comment here. We will let you know when this is ready to merge. |
Currently, when the commit of the chunk fails in a multi-threaded step, the step execution is rolled back to a previous, possibly obsolete version. This previous version might be obsolete because it may have been modified by another thread. In this case, an OptimisticLockingFailureException is thrown when trying to persist the step execution, leaving the step in an UNKNOWN state while it should be FAILED. This commit fixes the issue by refreshing the step execution to the latest correctly persisted version before applying the step contribution, so that the contribution is applied on a fresh, correct state.
Resolves BATCH-1767
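To make the version conflict concrete, here is a minimal sketch of the mechanism described in the commit message. It does not use Spring Batch: `Execution`, `Repository`, and `VersionConflictException` are hypothetical stand-ins for `StepExecution`, the job repository, and `OptimisticLockingFailureException`. It also assumes each thread works on its own snapshot, which is a simplification of the shared in-memory instance used by the real step.

```java
final class RefreshBeforeApplyDemo {

    /** Hypothetical, simplified stand-in for a versioned StepExecution. */
    static final class Execution {
        int version;
        int writeCount;

        Execution(int version, int writeCount) {
            this.version = version;
            this.writeCount = writeCount;
        }
    }

    /** Hypothetical stand-in for OptimisticLockingFailureException. */
    static final class VersionConflictException extends RuntimeException {
        VersionConflictException(String message) {
            super(message);
        }
    }

    /** Hypothetical in-memory repository that rejects updates made from a stale version. */
    static final class Repository {
        private int storedVersion = 1;
        private int storedWriteCount = 0;

        synchronized void update(Execution execution) {
            if (execution.version != storedVersion) {
                throw new VersionConflictException(
                        "stored version is " + storedVersion + " but update carries " + execution.version);
            }
            storedVersion++;
            storedWriteCount = execution.writeCount;
            execution.version = storedVersion;
        }

        synchronized Execution latest() {
            return new Execution(storedVersion, storedWriteCount);
        }
    }

    /** Applies a contribution on a stale snapshot; returns true if the update is rejected. */
    static boolean staleContributionIsRejected() {
        Repository repository = new Repository();
        Execution stale = repository.latest();  // version 1
        Execution other = repository.latest();  // another thread's view, also version 1
        other.writeCount += 10;
        repository.update(other);               // repository moves to version 2
        stale.writeCount += 5;
        try {
            repository.update(stale);           // still carries version 1
            return false;
        } catch (VersionConflictException expected) {
            return true;
        }
    }

    /** Refreshes to the latest persisted state before applying; returns the persisted write count. */
    static int refreshedContributionWriteCount() {
        Repository repository = new Repository();
        Execution other = repository.latest();
        other.writeCount += 10;
        repository.update(other);               // version 2, writeCount 10
        Execution fresh = repository.latest();  // refresh just before applying the contribution
        fresh.writeCount += 5;
        repository.update(fresh);               // version 3, writeCount 15
        return repository.latest().writeCount;
    }

    public static void main(String[] args) {
        System.out.println(staleContributionIsRejected());      // true
        System.out.println(refreshedContributionWriteCount());  // 15
    }
}
```

The second method mirrors the idea of this commit: the contribution is applied on the latest persisted state, so the version check passes.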
0561396 to 7ba5c10
I rebased this PR on the latest master branch and updated the tests as needed (7ba5c10). However, there is now one regression in

This issue is actually deeper than I initially thought. The problem is that the current implementation of a chunk-oriented step is not friendly to concurrent access. The current model tries to lock access to shared in-memory state by using a semaphore, coupled with an optimistic locking strategy at the job repository level. However, this shared state is not properly synchronized, and consistency between what is in memory and what is in the database is not guaranteed when things go wrong (as detailed in my initial comment). There are several issues related to that behaviour, like #1189 and #1145, which are still valid.

For this reason, I decided to close this PR without merging the fix. While this PR could be merged, I believe it only treats the symptom, not the underlying problem. I will open an issue to reconsider the implementation of a chunk-oriented step in order to properly fix all related issues.
Done: #3950. |
This PR is an attempt to resolve BATCH-1767. I will first explain the problem and then suggest a possible solution.

The problem:

Currently, when the commit of the chunk fails in a multi-threaded step, the step execution is rolled back to a previous, possibly obsolete version. This previous version might be obsolete because it could have been modified by another thread. In this case, an `OptimisticLockingFailureException` is thrown when trying to persist the step execution, leaving the step in an `UNKNOWN` state while it should be in a `FAILED` state. Here is a simple diagram that illustrates the issue:

SE: StepExecution | SC: StepContribution | OV: old version variable (of type StepExecution)

In this diagram, 3 threads are running in parallel to execute the `TaskletStep`. All of them start with a `StepExecution` instance at version 1. This instance is shared between threads. In pseudo code, each thread will do the following:

There are two issues with this:
1. The `StepContribution` might be applied on an obsolete version of the `StepExecution` (which might have changed between step 1 and step 3). This results in an `OptimisticLockingFailureException` when one of the threads tries to persist the step execution after applying its contribution.
2. In case of a commit failure, the `StepExecution` is rolled back to a possibly obsolete version (`oldVersion`, which is always at version 1). This also results in an `OptimisticLockingFailureException` when control returns to `AbstractStep`, which tries to persist the in-memory `StepExecution` at version 1 while the `StepExecution` in the database is at version 3 (this is the reported issue in BATCH-1767).

A possible solution
This PR fixes:

Issue 1) by refreshing the step execution to the latest correctly persisted version just before applying the step contribution, so that each thread applies its contribution on a fresh and correct state.

Issue 2) by reverting all metrics but the version when reverting the in-memory `StepExecution` in case of rollback. The goal of the rollback is to undo the (failed) contribution metrics and not the technical field `version`. "Reverting" a version is not compatible with an optimistic locking strategy: we use the version to check whether there is a version conflict, but we don't need to revert it in the in-memory instance of `StepExecution`. Note that the `StepExecution` is correctly reverted in the database, since the update is executed within the transaction (`doInTransaction` method) being rolled back.

The most important detail in the story is that each thread creates its own contribution to the step, but the `ChunkContext` (and the `StepExecution`) is shared between threads. This shared state might end up inconsistent if one of the transactions is rolled back. Tests in this PR do not assert any result on the (undefined) execution context.

It is technically possible to make the execution context consistent by using `ThreadLocal`s for instance, but I would not go down this path for two reasons:

The "practical limitations of multi-threaded Step" section of the documentation already mentions turning off the state in this case. This PR updates the section by mentioning that it is not advised to manipulate the execution context in a multi-threaded step.
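As an illustration of Issue 2, the following sketch contrasts a rollback that also reverts the version with one that reverts only the metrics. It is not the code of this PR: `Execution` and `persistSucceedsAfterRollback` are hypothetical names, the database is reduced to a single version number, and the metrics contributed by the other threads are left out for brevity.

```java
final class RollbackKeepsVersionDemo {

    /** Hypothetical, simplified stand-in for a versioned StepExecution. */
    static final class Execution {
        int version;
        int writeCount;

        Execution(int version, int writeCount) {
            this.version = version;
            this.writeCount = writeCount;
        }
    }

    /**
     * Simulates one thread whose commit fails after other threads have already
     * persisted the shared execution twice. Returns true if a later persist of
     * the in-memory execution would pass the version check.
     */
    static boolean persistSucceedsAfterRollback(boolean revertVersion) {
        int databaseVersion = 1;
        Execution shared = new Execution(1, 0);

        // This thread keeps a copy so that it can undo its contribution later.
        Execution oldVersion = new Execution(shared.version, shared.writeCount);

        // Two other threads persist the shared execution (their metrics are omitted here).
        shared.version = 3;
        databaseVersion = 3;

        // This thread applies its contribution, then its commit fails.
        shared.writeCount += 5;

        // Rollback: always undo the failed contribution metrics.
        shared.writeCount = oldVersion.writeCount;
        if (revertVersion) {
            // Reverting the technical field as well brings back the obsolete version 1.
            shared.version = oldVersion.version;
        }

        // The caller now tries to persist the in-memory execution (optimistic check).
        return shared.version == databaseVersion;
    }

    public static void main(String[] args) {
        System.out.println(persistSucceedsAfterRollback(true));   // false: version 1 vs 3
        System.out.println(persistSucceedsAfterRollback(false));  // true: version 3 vs 3
    }
}
```

When the version is left untouched, the final persist carries the version that matches the database, so the step can be marked with its real status instead of hitting a version conflict.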