-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Modify the mechanism to pause indexing #128405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The problem with the original implementation of the pause lock mechanism was as follows:
This is because the semaphore based approach relies on acquiring and releasing precise number of permits during activate and deactivate, which is not possible to synchronize. Let me try to explain with the the following scenario: |
…eLock Refresh branch
Pinging @elastic/es-distributed-indexing (Team:Distributed Indexing) |
Hi @ankikuma, I've created a changelog YAML for you. |
…csearch into 05232025/ModifyPauseLock Pull
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments.
I am ok to revert to this if you find it simpler. I see how it reduces the race condition surface, but there is still a race where if a thread deactivates throttling and at the same time another thread activates throttling, we risk running with no throttling at all.
I think that we could make activate/deactive throttling synchronized on a lock object instead to avoid all of this?
if (lock == pauseLockReference) { | ||
pauseLockReference.acquire(); | ||
try { | ||
while (pauseIndexing.getAcquire()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be:
while (pauseIndexing.getAcquire()) { | |
while (lock == pauseLockReference) { |
and thus we can avoid the extra boolean?
pauseLockReference.acquire(); | ||
try { | ||
// System.out.println("Deactivate pause"); | ||
pauseIndexing.setRelease(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid the pauseIndexing
boolean as suggested above, we need to set the lock = noop
no later than here.
I think we could simply this to:
lock = NOOP_LOCK;
pauseLockReference.acquire();
try {
// System.out.println("Deactivate pause");
pauseIndexing.setRelease(false);
Comment
To avoid the `pauseIndexing` boolean as suggested above, we need to set the `lock = noop` no later than here.
I think we could simply this to:
lock = NOOP_LOCK;
try (pauseLockReference.acquire()) {
pauseCondition.signalAll();
}
// System.out.println("Acquired pause indexing lock"); | ||
logger.trace("Acquired pause indexing lock"); | ||
} | ||
return pauseLockReference; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might as well release the lock immediately and return a noop releasable?
That would allow us to use try-with-resource when acquiring the pauseLockReference
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to see this carried through.
As we wake up after pause throttling, with the current mechanism, we'll only wake up one of the paused threads at a time until each of their indexing requests complete. Notice that await
only allows one thread at a time to come out of await
until the lock is released, since returning from await
reacquires the lock.
This means that if the situation is cured, but the first few indexing requests take "a long time", we are still essentially throttling to fewer threads for a while.
I'll LGTM it for now, since it is not that harmful, but would prefer to see this sorted in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had also thought about this problem that even after throttling gets deactivated, we still only let one indexing job pass at a time. But I reasoned that we were already doing that before I moved to the pause indexing option.
But I think I understand now your solution.
Thanks for reviewing @henningandersen . I made some changes based on your comments. |
…eLock Refresh branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
private final Lock pauseIndexingLock = new ReentrantLock(); | ||
private final Condition pauseCondition = pauseIndexingLock.newCondition(); | ||
private final ReleasableLock pauseLockReference = new ReleasableLock(pauseIndexingLock); | ||
private volatile AtomicBoolean pauseThrottling = new AtomicBoolean(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we name this and the method differently, like suspendThrottling
? pauseThrottling
sounds like a variable indicating that the "pause throttling" mechanism is enabled when instead it is that it is disabled ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
// System.out.println("Acquired pause indexing lock"); | ||
logger.trace("Acquired pause indexing lock"); | ||
} | ||
return pauseLockReference; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still like to see this carried through.
As we wake up after pause throttling, with the current mechanism, we'll only wake up one of the paused threads at a time until each of their indexing requests complete. Notice that await
only allows one thread at a time to come out of await
until the lock is released, since returning from await
reacquires the lock.
This means that if the situation is cured, but the first few indexing requests take "a long time", we are still essentially throttling to fewer threads for a while.
I'll LGTM it for now, since it is not that harmful, but would prefer to see this sorted in this PR.
} | ||
|
||
public Releasable acquireThrottle() { | ||
return lock.acquire(); | ||
if (lock == pauseLockReference) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We read lock
twice when doing regular throttling, once for the condition here and once below. Since it is volatile, it would be good to only read it once, i.e., copy it to a local variable:
var lock = this.lock;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But don't we need to read the latest value of lock inside the try block ?
This PR changes the mechanism to pause indexing which was introduced in #127173.
The original PR caused
IndexStatsIT#testThrottleStats
to fail. See #126359.