-
Notifications
You must be signed in to change notification settings - Fork 2.4k
Explicit handling of nested splits in SplitState #3903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @hpoettker, thank you for your PR!
I did not check this PR and its corresponding bug report in details yet, but it seems that it could introduce a small breaking change. So yes, this could be addressed in v5. However, I'm not sure it will be in v5.0 as we are planning to focus on major changes first. This is a good candidate for a minor release of the v5 line (v5.1, v5.2, etc). Please follow the updates on this issue to see in which release it will be addressed. Thank you. |
Thanks for the update! I'd also consider the PR a breaking change in behavior although I assume that no one knowingly relies on the current behavior as it is rather unexpected. Just for completeness: The fix would also solve a problem in the Data Flow's Composed Task Runner which actively uses nested splits and the thread pool size of a |
I've resolved some merge conflicts and bumped the year in the copyright header. It would be great to have open discussion about the PR and the issue that it addresses. I think it's closely linked to the deprecation of the throttle limit, i.e. #2218. Concurrency of batch steps should be fully controllable by using only a suitable Input from the Spring Cloud Dataflow team would be particularly interesting as the Composed Task Runner would benefit greatly from this change, in my opinion. |
This PR is now more than 2 years old but I'm convinced more than ever that it addresses an important issue. I think that the concept of the chunk-oriented step in Spring Batch is extremely valuable. But it can only reach its full potential if it allows to scale batch jobs in a straightforward manner by "just" adding a thread pool. The removal of the throttle limit is the most important part of this story. But concurrency control for split jobs follows right after that in my opinion.
As we're already closer in time to the release of 5.1 than that of 5.0, I'd say that this is now! Please share your ideas on this topic. I'd be happy to implement any change or test that is required. |
Yes, this PR can now be merged. I will plan it for 5.1.
Intra-step concurrency or step-level concurrency (ie processing items concurrently within the same step) is different than inter-step / job-level concurrency (ie processing entire steps concurrently or in parallel within a flow). Those two types of concurrency are at different levels and should not be related in any way. This PR addresses inter-step concurrency, and I don't think the deprecation of throttle-limit is related to the handling of nested splits. The only case I see a possible entanglement is probably when the same task executor is used to run both step flows and concurrent threads of the same step. The throttle-limit is really an implementation detail of the I correctly read your comment that This PR addresses #3857 and will be merged in 5.1. However, the deprecation/removal of throttle-limit will be reviewed as part of #3950. |
Rebased and merged as 58a4134. Thanks again for yet another great contribution! |
@fmbenhassine Thanks for sharing your insight and clearly delineating the distinction between job-level and step-level concurrency. It's really appreciated! The throttle limit is indeed not directly related to this job-level concurreny issue. They are only connected in the broader sense that both are about controlling the maximum degree of concurrency in a job/step. From a user perspective, I think it is nice if I can reliably control concurrency by passing task executors with the desired maximal thread pool size. For the job level, this didn't work before this change as jobs experienced liveness failure when thread pools were too small or executed a far smaller number of steps in parallel than the maximal thread pool size. The ideal behavior is of course that there are no liveness failures in any case and that as many steps are executed in parallel as there are threads assigned to this purpose (and where the flow of the job allows it). The same should be true for the step-level analogously. Here the problem is currently more that it's not enough to provide a large thread pool if I want a high degree of concurrency because I still have to set the throttle limit or a custom The plans for the step level in #3950 are much more ambitious but also very welcome. Is there anything the community could do to assist? |
I'm wondering whether my reasoning about concurrency control above through thread pools still makes sense in the age of virtual threads. It may very well not! The threads for the job level don't do much work at all and the threads for the step level are usually much more IO bound than CPU bound. Thinking along this line could lead to the conclusion that the throttle limit should not be removed. It should rather be introduced for the job level too. Then I could use But despite the lively discussion in #1008, I'm not sure that the overhead of platform threads is a real-life problem for batch work loads. 😄 |
Fixes #3857 and allows to control the maximal number of concurrently running steps with the thread pool size of the
TaskExecutor
.Nested splits are explicitly handled in
SplitState
and do not require dedicated threads. The added test stalls if executed against main.@benas Do you see a chance to merge this PR (with requested changes, of course) such that it is included in Spring Batch 5.0?