Resist session race conditions using reference counting #1713

Merged
MKRhere merged 11 commits into v4 from feat-better-session on Mar 5, 2023

Conversation

Member

@MKRhere MKRhere commented Oct 7, 2022

Fixes #1372, fixes #1695, fixes #1799
Supersedes #1436

This is a fairly involved but well-commented solution. A short explanation follows.

Edit: This PR now also introduces defaultSession as an optional parameter to session({}). Unit tests have been updated.

Current state:

  1. Telegraf receives two messages from the same Telegram chat+user
  2. Telegraf reads session from store in both updates
  3. Both updates run through the middleware system
  4. When they're done, both write to the store, but neither knows that the other also performed an update, so the later write overwrites the earlier one and session data is lost
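
The race described in the steps above can be reproduced without Telegraf at all. What follows is only a minimal sketch under stated assumptions: `naiveHandle`, `demoLostUpdate`, the `delay` helper, and the `"42:42"` key are hypothetical names made up for illustration, and a plain `Map` stands in for the real session store.

```javascript
"use strict"

// Simulates async I/O latency (for example a database round trip).
const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

// Naive session handling (hypothetical): each update reads its own private
// copy of the session, runs the handler, then writes the copy back.
async function naiveHandle(store, key, handler) {
  await delay(5) // pretend the store read takes a while
  const session = { ...(store.get(key) ?? { count: 0 }) }
  await handler(session)
  store.set(key, session) // last writer wins
}

async function demoLostUpdate() {
  const store = new Map() // stand-in for a real session store
  const increment = async session => {
    await delay(10) // some work inside the middleware chain
    session.count += 1
  }
  // Two updates for the same chat+user arrive concurrently.
  await Promise.all([
    naiveHandle(store, "42:42", increment),
    naiveHandle(store, "42:42", increment),
  ])
  return store.get("42:42").count
}

demoLostUpdate().then(count => console.log("count after two updates:", count))
```

Both updates read the session before either one writes, so the final count is 1 rather than 2: one increment is lost.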

Solution:

  1. Receive two messages concurrently
  2. The first update to fetch the session caches it in memory as a reference plus a counter
  3. Every other update checks the cache, sees that the session is already in memory, and simply increments the counter and reuses the reference
  4. When each update finishes processing, the counter is decremented
  5. If the counter reaches zero, the reference object is cleared from the memory cache
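
The steps above can be sketched roughly as follows. This is not the code in this PR, just an illustration of the idea: `createRefCountedSessions`, `withSession`, `createDefault`, and the demo store are hypothetical names, and the store is assumed to expose only async `get` and `set`.

```javascript
"use strict"

const delay = ms => new Promise(resolve => setTimeout(resolve, ms))

// Hypothetical helper: wraps a store with a reference-counted in-memory cache.
function createRefCountedSessions(store, createDefault) {
  const cache = new Map() // key -> { ref: session object, counter: number }

  return async function withSession(key, handler) {
    let cached = cache.get(key)
    if (cached) {
      cached.counter++ // someone already holds the session: reuse the reference
    } else {
      const fetched = (await store.get(key)) ?? createDefault()
      // Another update may have filled the cache while we awaited the store;
      // if so, ignore our duplicate fetch and share the existing reference.
      cached = cache.get(key)
      if (cached) {
        cached.counter++
      } else {
        cached = { ref: fetched, counter: 1 }
        cache.set(key, cached)
      }
    }
    try {
      await handler(cached.ref)
    } finally {
      await store.set(key, cached.ref)
      // Last update out clears the reference from the memory cache.
      if (--cached.counter === 0) cache.delete(key)
    }
  }
}

async function demoRefCounted() {
  const data = new Map()
  const store = {
    get: async key => { await delay(5); return data.get(key) },
    // Copy on write to mimic serialisation into an external store.
    set: async (key, value) => { await delay(5); data.set(key, { ...value }) },
  }
  const withSession = createRefCountedSessions(store, () => ({ count: 0 }))
  const increment = async session => { await delay(10); session.count += 1 }
  await Promise.all([
    withSession("42:42", increment),
    withSession("42:42", increment),
  ])
  return data.get("42:42").count
}

demoRefCounted().then(count => console.log("count after two updates:", count))
```

Because both updates mutate the same object, both increments survive and the stored count ends up as 2.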

Problems:

  1. ⚠️VERY IMPORTANT!
    If an update is terminated halfway (because it threw an error), any changes it has already made are not rolled back in the memory cache. If other concurrent updates are using the reference, they will see the modified object, and if they complete successfully, the modified session will persist! We could make this behaviour more predictable by also persisting session in the error catch branch.
    Update: this behaviour has been changed so that session is always written back, regardless of whether the middleware chain completes or errors. This means assigning to session is now always nearly as good as having written to the store.
  2. getSessionKey is async in v4, so it still allows some racing, but the provided solution automatically corrects for this by ignoring duplicate store fetches if a reference is already cached when they return. In v5, getSessionKey should probably be made synchronous; counter-arguments welcome.
  3. This solution does not protect users from introducing race conditions of their own. We cannot prevent them from doing that, but we can document this as a possible footgun.
  4. Obviously, any solution relying on a synchronous in-memory cache does not scale horizontally. This solution will protect single-process session users only.
    • In the future (v5?), this may be expanded to use Worker threads / Web Workers where they synchronise using SharedArrayBuffer and Atomics. Will explore this if there is demand for such a thing.
    • Multi-process support (especially useful for FaaS and edge runtimes and probably more important than workers) could be implemented using db mutexes. But this also enforces sequentialisation across all processes for a given session key. Concurrency is easy; safe concurrency is hard. Maybe there are other solutions; we're not exploring them for now, but maybe in the future.
    • Meanwhile, we can recommend simply using a custom getSessionKey :: ctx -> ctx.chat.id for webhooks. Telegram ensures that webhook updates are serial for a given chatId already!
  5. Originally, session had to be an object for references to be shared; it could not be a primitive value. Telegraf already constrained this, because Scenes and Wizards cannot work otherwise. Update: this is no longer a technical requirement. Since this PR, session can be a primitive, but we're leaving the type-level constraint in place until someone asks for primitive sessions.
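
To make the first problem in the list above concrete, here is a minimal sketch of the "always write back" behaviour, assuming it can be modelled as a try/finally around the handler. `runWithWriteBack` and `demoWriteBackOnError` are hypothetical names, a `Map` stands in for the store, and none of this is the actual implementation in this PR.

```javascript
"use strict"

// Hypothetical wrapper: the session is written back whether the handler
// completes or throws, so partial changes are persisted predictably.
async function runWithWriteBack(store, key, handler) {
  const session = (await store.get(key)) ?? {}
  try {
    await handler(session)
  } finally {
    await store.set(key, session)
  }
}

async function demoWriteBackOnError() {
  const store = new Map() // stand-in for a real session store
  const failing = async session => {
    session.step = "started" // change made before the failure
    throw new Error("boom")
  }
  let caught
  try {
    await runWithWriteBack(store, "42:42", failing)
  } catch (error) {
    caught = error // the error still propagates to the caller
  }
  return { persisted: store.get("42:42"), error: caught.message }
}

demoWriteBackOnError().then(result => console.log(result))
```

The error still reaches the caller, but the assignment made before it (`step: "started"`) is already in the store, which is the trade-off described in item 1.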

If there are concerns about this solution subtly changing the behaviour of existing deprecated session, particularly in multi-process situations, I'm open to discussion about splitting this into an external module at @telegraf/session which users can opt-in to, with the naive implementation as the default.

However, I don't like the idea of permanently moving session out of telegraf core as previously discussed in #1436, because other core modules (Scenes and Wizards) and third-party modules inherently rely on sessions existing.

Testing

The original reproduction code posted in #1372 works as expected with the proposed version of session. Try it yourself by installing this branch:

npm i "telegraf/telegraf#feat-better-session"

"use strict"

const { Telegraf, session } = require("telegraf")
const createDebug = require("debug")

const bot = new Telegraf(process.env.BOT_TOKEN)
const debug = createDebug('test')

bot.use(session())
bot.on('message', async ctx => {
  if (ctx.session === undefined) {
    const { update_id } = ctx.update
    ctx.session = { update_id }
    debug('Defaulting session to', ctx.session)
  } else {
    debug('Session already set to', ctx.session)
  }
})

createDebug.enable('telegraf:client test')
bot.launch()

Further, unit tests have been added with synchronous and asynchronous session stores. They appear rock-solid, but if anyone wants to produce a breaking test, I'd be happy to fix the implementation to conform.

@trgwii trgwii left a comment

LGTM

Collaborator

@KnorpelSenf KnorpelSenf left a comment

This seems to break sessions that are primitive values, say when S is string.

@KnorpelSenf
Collaborator

Ah telegraf cannot even do this so nvm

Member

@wojpawlik wojpawlik left a comment

This is quite complex, I'll need to re-read it before approving

MKRhere added a commit that referenced this pull request Nov 8, 2022
* move debug to top-level
* don't check if cached is truthy, it must exist
@wojpawlik wojpawlik linked an issue Nov 10, 2022 that may be closed by this pull request
MKRhere added a commit that referenced this pull request Nov 18, 2022
* move debug to top-level
* don't check if cached is truthy, it must exist
@MKRhere MKRhere force-pushed the feat-better-session branch from 3e6a4e0 to b5c2986 Compare November 18, 2022 22:59
MKRhere added a commit that referenced this pull request Nov 22, 2022
* move debug to top-level
* don't check if cached is truthy, it must exist
@MKRhere MKRhere force-pushed the feat-better-session branch from 4958d39 to 7424224 Compare November 22, 2022 13:15
MKRhere added a commit that referenced this pull request Feb 12, 2023
* move debug to top-level
* don't check if cached is truthy, it must exist
@MKRhere MKRhere force-pushed the feat-better-session branch from 5a7382f to 74a5339 Compare February 12, 2023 12:39
@Viiprogrammer

This seems to break sessions that are primitive values, say when S is string.

This is not a breaking change because it was never said anywhere that sessions could be strings 😁

@MKRhere
Member Author

MKRhere commented Feb 12, 2023

This seems to break sessions that are primitive values, say when S is string.

Just to clarify, this is only a type-level constraint that has already existed on Telegraf, and is useful because scenes need session to be an object. However, since this PR, it is not a technical requirement, and sessions can be primitive. After discussing with @wojpawlik, we decided to leave the constraint be until someone asks for it.

@MKRhere MKRhere force-pushed the feat-better-session branch from c9bc6be to e4dfb70 Compare March 5, 2023 12:13
@MKRhere MKRhere merged commit 096df1c into v4 Mar 5, 2023
@MKRhere MKRhere deleted the feat-better-session branch March 5, 2023 12:33
@MKRhere
Member Author

MKRhere commented Mar 5, 2023

FINALLY!
