Support Session suspend and resume #2034


Open · wants to merge 1 commit into base: 4.x

Conversation

@rvansa commented Apr 9, 2025

This PR is loosely based on the discussion in https://lists.apache.org/thread/9sms1sk8fd739mp7699wrbj0vnd0kzd1

If an application wants to use OpenJDK CRaC, it must terminate all connections to nodes before a checkpoint. Here we expose a high-level API in SessionLifecycleManager without relying on CRaC itself. It is also clear that this poses no risk to applications that do not use the API.

Our current goal is to support CRaC checkpoints in Spring Boot applications. My first attempt in spring-projects/spring-boot#44505 was declined because the way the Cassandra Java Driver was accessed seemed too low-level for Spring Boot; I expect that with the API this PR introduces, the Spring Boot integration could simply invoke these methods without relying on driver internals.

I expect that in the future SessionLifecycleManager could also expose methods for hinting to the driver that all nodes have died and that it has to reconnect to a completely new node.
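For context, a minimal self-contained sketch of how a framework integration might drive such an API around a checkpoint/restore cycle. The suspend/resume method names and the RecordingSession stand-in are illustrative assumptions, not the driver's actual API; a real integration would register an org.crac Resource whose beforeCheckpoint/afterRestore callbacks delegate to these methods.

```java
import java.util.ArrayList;
import java.util.List;

public class CracSuspendSketch {
    /** Stand-in for the API proposed in this PR (method names are assumptions). */
    interface SessionLifecycleManager {
        void suspend(); // close all node connections before the checkpoint
        void resume();  // re-establish connections after restore
    }

    /** Minimal fake session that records lifecycle transitions. */
    static class RecordingSession implements SessionLifecycleManager {
        final List<String> events = new ArrayList<>();
        public void suspend() { events.add("suspend"); }
        public void resume()  { events.add("resume"); }
    }

    /** Mirrors the org.crac.Resource callback shape without depending on it. */
    static void checkpointRestoreCycle(SessionLifecycleManager session) {
        session.suspend();   // beforeCheckpoint(): terminate connections
        // ... the JVM image is written and later restored here ...
        session.resume();    // afterRestore(): reconnect to the cluster
    }

    public static void main(String[] args) {
        RecordingSession session = new RecordingSession();
        checkpointRestoreCycle(session);
        System.out.println(session.events); // prints [suspend, resume]
    }
}
```

The point of the sketch is that the application (or framework) drives the lifecycle, so code that never calls these methods is unaffected.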

@rvansa force-pushed the crac_suspend_support branch from d679372 to 7379588 on April 9, 2025 14:20
@rvansa force-pushed the crac_suspend_support branch from 7379588 to 701df2e on April 10, 2025 21:47
@rvansa (Author) commented Apr 22, 2025

Hi, is there anything I could improve about this PR? Are the CI failures known issues, or are they something I could address?

@lukasz-antoniak (Member) commented:

If the purpose of CRaC is to take a consistent snapshot of the system, how does it persist the memory state of the JVM process? I am thinking of all the in-flight requests whose responses the client application may still be waiting for. My initial thought is that we should wait for all of them to complete before considering the Cassandra part suspended.

@rvansa (Author) commented Apr 23, 2025

@lukasz-antoniak The necessity of waiting for in-flight requests depends on the use case. A checkpoint is certainly a disruptive operation, and we expect it to be executed when the application is in a quiescent state - for example, after removing the node from the load balancer. In other cases the checkpoint is executed in a staging environment after a build and warm-up load - though in that case we might need a way to reconfigure (add) nodes.
Generally speaking, it is also not a problem if a request fails; any networked system is expected to experience and handle failures.
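The drain step discussed above could be sketched like this (names and counter are hypothetical; the driver does not expose such a counter directly). The wait is bounded because, per the comment above, an occasional failed request is tolerable:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class InFlightDrain {
    // Hypothetical counter of outstanding requests, maintained by the caller.
    private final AtomicInteger inFlight = new AtomicInteger();

    public void onRequestStart() { inFlight.incrementAndGet(); }
    public void onRequestDone()  { inFlight.decrementAndGet(); }

    /**
     * Wait until no requests are outstanding, or give up after the timeout.
     * Returns false if requests were still in flight when the timeout expired;
     * the checkpoint can proceed anyway, treating those as tolerated failures.
     */
    public boolean awaitQuiescence(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (inFlight.get() > 0) {
            if (System.nanoTime() >= deadline) return false;
            Thread.onSpinWait(); // yield hint while spinning
        }
        return true;
    }
}
```

A suspend implementation could call awaitQuiescence with a short timeout before closing connections, getting best-effort draining without blocking the checkpoint indefinitely.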

@aratno (Contributor) commented Apr 28, 2025

There are two main areas I wanted to understand better:

  1. If a user has a complex application with lots of dependencies that manage state (connections to databases, local files, etc), would all those dependencies need to support CRaC? There’s a circular dependency between users expecting CRaC support and libraries providing it, and I’m wondering how you’re approaching that.
  2. How much of an improvement in startup time should a user expect when restoring their application from a snapshot? How does that compare to Leyden? My understanding is that Leyden will support AOT cache building without any application changes, while CRaC currently requires application support.

Regarding (1) above, you mentioned this on the mailing list:

Naturally it is possible to close the session object completely and create a new one, but the ideal solution would require no application changes beyond dependency upgrade.

I’m concerned that restoring a driver session from a checkpoint (rather than close + re-create) could be a source for hard-to-track bugs, due to stale topology metadata, in-progress queue state, etc. Users would also be limited in where they could restore their checkpoints, since driver internal state is dependent on the local datacenter, for example. But if a restored session re-creates connections, then that’s likely going to dominate start-up time and make the gains of CRaC less visible. How are you thinking about this trade-off?

@absurdfarce (Contributor) commented:

I don't claim any familiarity with the current work on CRaC, but I have some concerns here that aren't so different from what @lukasz-antoniak and @aratno raised above. The question with in-flight requests isn't simply whether the driver will retry them; we use counts of in-flight requests as a proxy for the load on a node in the default load balancing policy. The disconnect between the (potentially very stale) in-flight counts captured at snapshot time and the current state of the system could very easily lead to some strange LBP behavior.

You'd almost be better off somehow capturing the driver state after establishing a control connection but before any individual connections are established... but that brings you into the areas @aratno was referring to (or at least I think he was). The driver gathers a fair amount of metadata when it establishes a control connection to the cluster; if we include this data in any kind of snapshot state, it would have to be revalidated when we (re)connect anyway. But if we're already doing such a revalidation, is there a benefit to pre-loading it in the first place? So much of what the control connection does is built up from information it gets from the server... it just seems hard to imagine how much of it could be safely cached in a way that would give you clear performance wins.

I also haven't spent any time looking at what Project Leyden is doing, but it does seem like AOT class loading might be a reasonable way to get some level of performance gain without delving too far into the handling of information we get from the cluster. I'm very interested to hear how these approaches compare in your thinking, @rvansa.
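To make the staleness concern concrete, here is a toy illustration (not the driver's actual DefaultLoadBalancingPolicy code; names and numbers are invented) of a least-in-flight node choice driven by counters frozen at snapshot time:

```java
import java.util.Map;

public class StaleLoadSketch {
    /** Pick the node with the fewest in-flight requests (load proxy, as in the default LBP). */
    public static String leastLoaded(Map<String, Integer> inFlight) {
        return inFlight.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Counts captured in the snapshot: node2 looked busy at checkpoint time.
        Map<String, Integer> snapshot = Map.of("node1", 0, "node2", 12);
        // After restore the real load may be completely different, but until the
        // stale counters are reset or decay, they keep steering traffic one way.
        System.out.println(leastLoaded(snapshot)); // prints node1
    }
}
```

This is why resetting per-node statistics on suspend (as raised later in the thread) matters even if individual request failures are tolerated.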

@rvansa (Author) commented Apr 29, 2025

@aratno

  1. If you want to use CRaC, yes - none of the dependencies (as used) may prevent the checkpoint. I wouldn't call this a circular dependency, but CRaC adoption depends on the level of support within libraries/frameworks, and those communities are more motivated if CRaC is more prevalent. That's why we are talking to framework communities and actively pushing changes (as here) rather than expecting frameworks to jump on the train. See e.g. https://docs.azul.com/core/crac/crac-frameworks for an overview of which frameworks claim CRaC support - presence on the list does not mean 100% compatibility; usually only the more common setups are tested.
    Also, there are workarounds. For some simple use cases you can get by with [FD policies configuration](https://docs.azul.com/core/crac/fd-policies), and if the community wants to postpone the support, e.g. until the next major version, we publish an artifact from a fork with the fix - or, as in this case, I've created an artifact that you can drop into your dependencies. However, these are meant only as temporary workarounds.
    We hope that eventually most of the fixes will live in the libraries, transparent to the users. Naturally that is simpler in stateless apps.

  2. Yes, CRaC sits somewhere between GraalVM Native Image and Leyden. Leyden can certainly offer some speedup by assuming a closed-world application and moving some operations to build time, but it is not AOT compilation. It certainly does not save anything your application does during boot. In a nutshell, as it is more 'generic', it won't be able to go as far. It is up to the app developer to decide what level of improvement is sufficient and how much energy is worth putting in.
    If you want some third-party numbers, check out e.g. this Helidon blog post from Oracle.

I’m concerned that restoring a driver session from a checkpoint (rather than close + re-create) could be a source for hard-to-track bugs, due to stale topology metadata, in-progress queue state, etc. Users would also be limited in where they could restore their checkpoints, since driver internal state is dependent on the local datacenter, for example.

This is a valid concern. I would expect that stale metadata shouldn't affect correctness (distributed applications should tolerate that). Regrettably, I don't have enough insight into Cassandra to speak more concretely - I am roughly basing my expectations on Infinispan, as I spent a couple of years developing it in the past.

But if a restored session re-creates connections, then that’s likely going to dominate start-up time and make the gains of CRaC less visible.

The setup of connections is dominated by network latency, and with a local datacenter that means milliseconds, or low tens of milliseconds if multiple round trips are required for the handshake. Compare that to an overall startup time of seconds for a small application, and sometimes minutes for legacy leviathans. Anecdotally speaking, CRaC can restore an app from, say, a 200 MB image in 50-100 ms; if we're talking about 200 GB apps, this goes up to ~5 seconds.

@rvansa (Author) commented Apr 29, 2025

@absurdfarce Shouldn't the statistics that affect load balancing be reset when you force the nodes down? If that bit is missing, could you point me to the parts of the code I should adjust, or ideally a test that validates behaviour that uses them? Normally I would try to keep the PR minimal, but if this would be a severe problem I can try to address it.

I think the discussion is a bit vague while we don't have data to back up the claims. But we're touching on something @aratno mentioned above - CRaC needs all parts of the application to be ready. The application might not be built around the Cassandra driver; it could be just a small part of its interface to a larger system. The driver setup might not be a performance bottleneck at all, so maybe we're not saving much on this front, but we enable savings in a completely different part of the application. That's why my focus here would be correctness, not performance.

4 participants