
@JamesVanBoxtel
Contributor

This commit implements the Glicko-2 rating system model designed by Mark Glickman.

Taken from Wikipedia:
“The main difference of Glicko is measurement of the "ratings reliability", called RD, for ratings deviation. The Reliability Deviation (RD) measures the accuracy of a player's rating, where the RD is equal to one standard deviation. For example, a player with a rating of 1500 and an RD of 50 has a real strength between 1400 and 1600 (two standard deviations from 1500) with 95% confidence. Twice (exact: 1.96) the RD is added and subtracted from their rating to calculate this range. After a game, the amount the rating changes depends on the RD: the change is smaller when the player's RD is low (since their rating is already considered accurate), and also when their opponent's RD is high (since the opponent's true rating is not well known, so little information is being gained). The RD itself decreases after playing a game, but it will increase slowly over time of inactivity.”
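
For illustration, a minimal sketch of the interval arithmetic described in the quote (the function name is made up; nothing here is taken from the PR's code):

-- Sketch: the 95% confidence interval implied by a rating and its RD,
-- using the 1.96 standard deviation factor from the quote above.
local function ratingInterval(rating, rd)
  local halfWidth = 1.96 * rd
  return rating - halfWidth, rating + halfWidth
end

print(ratingInterval(1500, 50)) -- roughly the 1400 to 1600 range from the quote (1402, 1598)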

I have tweaked the constants and run tests so that a rating is considered provisional when the rating deviation is greater than 125. This corresponds to playing ranked a lot and then not playing ranked for about 3 months. The client should be set up to mark the rating as provisional above that threshold to help the player know it isn't confident. Some interfaces put a question mark next to the rating. We should probably set a second threshold above which the rating isn't shown at all, maybe around 200 or 250.
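
As a sketch of the client-side rule this implies (the function name and the exact second threshold are assumptions from this discussion, not part of the PR):

-- Sketch: classify how a rating should be displayed based on its deviation.
-- 125 is the provisional threshold discussed above; 250 is one of the
-- suggested options for hiding the rating entirely and is not final.
local PROVISIONAL_RD = 125
local HIDDEN_RD = 250

local function ratingDisplayState(ratingDeviation)
  if ratingDeviation > HIDDEN_RD then
    return "hidden"       -- too uncertain to show at all
  elseif ratingDeviation > PROVISIONAL_RD then
    return "provisional"  -- e.g. shown with a question mark
  else
    return "established"
  end
end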
Players should also only gain about 2 rating points or so when playing similarly rated players with low RD. Another benefit of this system is that newcomers can establish their rating more easily and deterministically, and established players won't be affected as much by playing newcomers.

Collaborator

@Endaris Endaris left a comment


I think tweaking the system with the primary goal of sifting out inactive players via an inactivity period is not a good idea.
As far as I can tell the rating period is now 16h, which seems... fairly random?
Even players that play daily will frequently have their RD increase from "inactivity" this way.
I think if we want to remove inactive players from the display we have better ways than tweaking the ranking system, e.g. by using logic that is actually based on time, such as last login time or the time of the last ranked match played.
I think our tweaking of the ranking system should primarily focus on accuracy and creating a good experience.

I'm worried that if the system punishes even slight inactivity in ranked play, the ladder will just empty out and ratings won't be displayed for a huge number of players. That is not conducive to secondary interests like gauging another player's strength when looking for non-ranked matches, to see whether they would make for a fun matchup.

Is there a particular reason why the starting deviation is set to 250 instead of the maximum deviation of 350?
That seems counterintuitive, as having no information at all can hardly warrant more confidence than severely outdated information.


-- Step 7: Update to the new rating

local newRD = 1/sqrt(1/ratingDeviation^2 + 1/v)
Collaborator


The example ranking you generated shows players with an RD below 30.
In the paper on Glicko it is stated that

One practical problem with the Glicko system is that when a player competes very frequently, his/her rating stops changing appreciably which reflects that the RD is very small. This may sometimes prevent a player’s rating from changing substantially when the player is truly improving. I would therefore recommend that an RD never drop below a threshold value, such as 30, so that ratings can change appreciably even in a relatively short time.

For the Panel Attack ladder I believe this to be highly relevant: a small number of players who always keep ranked on will drive their RD low, compared to everyone who plays ranked only on a weekly or monthly basis.
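
For illustration, a minimal sketch of the floor Glickman recommends, applied after the normal RD update (the names are made up, not from this PR):

-- Sketch: clamp the rating deviation to a minimum after each update so that
-- very active players' ratings can still move appreciably.
local MIN_RATING_DEVIATION = 30

local function clampRatingDeviation(newRD)
  return math.max(newRD, MIN_RATING_DEVIATION)
end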

@JamesVanBoxtel
Contributor Author

As far as I can tell all the constants picked in this algorithm are left to the implementer to figure out what works best for their use cases.

I didn't pick the numbers solely to filter out inactivity.

When coming up with these numbers I have used the tests and data analysis to try to get a good balance of everything.

For example, the length of the rating period is left completely up in the air with not much guidance. However, this variable is the only way for RD to go up, so I have picked a value that I felt had a good balance of driving it up over time, but not too fast. As you said, if it goes up too slowly, then you can't climb the ranks very fast even though you improve a lot. On the other hand, if it goes up too fast, you will jump all over the place.

I have tried to make tests to verify this. If there is a case we think isn't working well we should document it and add a test.

How we display the leaderboard / ranking / provisional is another important issue but kind of separate from the actual model / rating deviations. We want the model correct first, then we can pick constants for display that match what we want after.

As for the starting / max deviation, I don't think I have strong data for the max. We could possibly set it to 250, or change the starting value, if we can prove it meets all the tests, but we should only really change things after we prove there's a problem with a test.

@Endaris
Collaborator

Endaris commented Jan 3, 2025

I suppose I'm mainly confused because the description of the original Glicko implicitly suggests that starting RD and maximum are identical.

The formula above ensures that an RD at the beginning of a rating period is never larger than 350, the RD for an unrated player.

And logically it also makes sense that the default rating for a new player has the lowest possible confidence because there is 0 information.

As far as I can tell, Glicko is largely geared to a certain value range via the inclusion of the ln(10)/400 factor that projects the ratings into the 3-4 digit space (173.7178 is the inverse of that).
So I think it's generally wise to stick to 1500 default rating and 350 max RD and starting RD == max RD.

Regarding RD scaling up: in Glicko there is the freely chosen constant c that can be used to control that.
From what I can tell that constant is basically replaced by the volatility, since for inactive players the new RD becomes
φ' = sqrt(φ² + σ²), and σ itself is the player's own volatility. So it seems that besides the duration of each rating period, we can primarily affect the speed at which players approach max RD by manipulating the default value for σ and, to some degree, τ.
This has the slightly unintuitive implication that players approach max RD faster or slower depending on their individual volatility, which practically probably does not matter, as eventually they hit max RD anyway and it is only driven down by playing more again.
In particular you implement a maxVolatility in this PR which caps the volatility and thus the rate at which a player's RD can decay toward the maximum.
Given the nature of Panel Attack, I think we could get volatile results relatively often, due to how easy it is to drop games from slip-ups even with a substantial skill gap. So I think restricting the maximum volatility to its start value may not be a good idea, and it may even be that we want the default volatility to be higher. It probably makes sense to run a simulation without capping volatility and then look at where the values end up, to gauge whether 0.06 is a sensible starting value or whether volatility converges towards a certain value range for players that frequently play ranked, and for players in general.
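
To make that concrete, here is a small sketch of how RD would grow over inactive rating periods under φ' = sqrt(φ² + σ²); the 173.7178 scale factor and the 350 cap follow the Glicko-2 paper, while the function name and the example numbers are just for illustration:

-- Sketch: RD after a number of inactive rating periods.
-- rd is on the points scale; the update itself runs on the Glicko-2
-- internal scale (divide by 173.7178).
local SCALE = 173.7178
local MAX_RD = 350

local function decayedRD(rd, volatility, inactivePeriods)
  local phi = rd / SCALE
  for _ = 1, inactivePeriods do
    phi = math.sqrt(phi * phi + volatility * volatility)
  end
  return math.min(phi * SCALE, MAX_RD)
end

-- e.g. a player at RD 50 with the default volatility of 0.06,
-- inactive for 20 rating periods:
print(decayedRD(50, 0.06, 20)) --> roughly 68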

For the rating period, the Glicko document mentions:

The Glicko system works best when the number of games in a rating period is moderate, say an average of 5-10 games per player in a rating period.

For our ladder that is unrealistic. I think our players of middling activity play perhaps 1-2 FT20 ranked sets per week, which would put the rating period at around 24h, which I think is good for the pragmatic reason that it is easy for players to understand when their RD increases.
That makes it all the more important to implement a minimum value for RD, so that those who play ranked more frequently are not punished, imo.
I suppose I'll try to toy around a bit with the tests you made to make some more specific suggestions.

Adding a test that currently fails: we shouldn't be able to get a negative rating.
Also added game counts for debug purposes
@JamesVanBoxtel
Contributor Author

JamesVanBoxtel commented Jan 4, 2025

Added a test for the negative rating case.

Apparently Lichess calculates ratings immediately after each game and has minimums and maximums on ratings.

So there are lots of different options we have for fixing this case.
https://github.com/lichess-org/lila/blob/7e7d7cc94e9e2ae7d6e58a1b06359826f7ab0da9/modules/rating/src/main/glicko2/RatingCalculator.scala#L205-L206

https://github.com/niklasf/liglicko2/blob/main/src/rating_system.rs

https://lichess.org/@/toadofsky/blog/lichess-ratings-are-not-glicko-2/x1bpwn38

When a player's rating isn't stable yet, doing tons of games in one rating period causes giant jumps.
For our use case it seems better to update ratings immediately after each game and to apply rating period changes in between games.
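
As a sketch of the clamping part of that approach (the bound values and names here are assumptions, not what Lichess or this PR actually use):

-- Sketch: clamp a freshly computed rating into a sane range so a single
-- lopsided result (or a bug) can never produce a negative or absurd rating.
local MIN_RATING = 100
local MAX_RATING = 4000

local function clampRating(rating)
  return math.max(MIN_RATING, math.min(MAX_RATING, rating))
end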
@JamesVanBoxtel
Contributor Author

After much discussion and tweaking I am pretty happy with these final results.

We are evaluating after each game, and the expected outcome vs actual outcome is very close.
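
For reference, a sketch of the classic Glicko expected-score formula that an "expected vs. actual outcome" comparison is based on, shown here in its Glicko-1 form; the PR's PlayerRating class may compute this differently internally:

-- Sketch: expected score of player 1 against player 2 (Glicko-1 form).
local q = math.log(10) / 400

local function expectedScore(rating1, rating2, rd2)
  local g = 1 / math.sqrt(1 + 3 * q * q * rd2 * rd2 / (math.pi * math.pi))
  return 1 / (1 + 10 ^ (-g * (rating1 - rating2) / 400))
end

-- e.g. a 1600 player vs a 1500 player with RD 50 is expected to score about 0.64:
print(expectedScore(1600, 1500, 50))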

Comment on lines 6 to 22
-- If starting RD is too high, or too many matches happen in one rating period, massive swings can happen.
-- This test is to explore that and come up with sane values.
local function testWeirdNumberStability()

  local player1 = PlayerRating(1273, 20)
  local player2 = PlayerRating(1500, 20)

  local wins = 12
  local totalGames = 25

  local updatedPlayer1 = player1:newRatingForRatingPeriodWithResults(player1:createSetResults(player2, wins, totalGames), 1)
  local updatedPlayer2 = player2:newRatingForRatingPeriodWithResults(player1:createSetResults(player1, totalGames-wins, totalGames), 1)

  assert(updatedPlayer1:getRating() > 1073)
  assert(updatedPlayer2:getRating() < 1500)
  assert(updatedPlayer2:getRating() > 1338)
end
Collaborator


This test seems off.
Not only does it not do what it says it does (instead RD is very small), but the calculation for updatedPlayer2 also seems to create incorrect set results.

This fixes all tests to be more accurate to a real world scenario where matches happen one at a time on both players before the player table is updated.
It also fixes that bug along with the bug of not updating the RD after the last game a player played.
As far as I know all feature work is done and tested well.
Collaborator

@Endaris Endaris left a comment


Alright actual code review time.

First of all, I think rating updates are clearly server domain, so even if these new classes don't have dependencies outside of common, they are not (or in my opinion should not be) a common format/class shared between client and server. The client should only ever get numbers for consumption and in that way depends on the ServerProtocol, not on Glicko itself. So I think they should go into the server directory instead: Glicko into server/lib and PlayerRating into server.

We need to add the license status of the Glicko library to COPYING as an unlicensed piece of code we're using in good faith, since it was posted in a place explicitly meant for sharing code intended for public use.

There are some things I don't like about the design of PlayerRating at the moment.

As far as I can gather from playGamesWithResults in PlayerRatingTests, the process is as follows:

1. We take one player's glicko.
2. We take a copy of the other player's glicko and attach the score.
3. We update the player's glicko with the above copy.
4. Then we take the other player's glicko.
5. We take a copy of the one player's previous glicko and attach the score.
6. We update the other player's glicko with the above copy.

I think it would be good if we added a static function that sums up that process, e.g.

---@param glicko1 Glicko
---@param glicko2 Glicko
---@param gameResult number any convention that makes it clear who between first and second arg won
---@return Glicko updatedGlicko1
---@return Glicko updatedGlicko2
function processGameResult(glicko1, glicko2, gameResult)

end

I think an API like that would be much more straightforward to use and easier to test.
It also allows elimination of some duplicate checks like the one against the ALLOWABLE_RATING_SPREAD.
Furthermore I think it can condense the complexity into one place which makes it easier to eliminate possible smells that result from the complexity being spread out over multiple functions.
I have not confirmed whether this is indeed a bug, but take the following example:
A new player 1 plays against a player 2 who had a low RD before going into a longer period of inactivity.
To me it looks like the RD increase from inactivity in the real world data test is only applied as an intermediate step when player 2's rating is updated.
Due to the order of operations, it seems possible that for the first game between players 1 and 2, the rating update for player 1 would use player 2's outdated low RD because it has not been updated yet. Only in successive games would it use the then-updated higher RD of player 2.

By condensing the update for both players into a single function, there is more control over the order of operations within the updates, and the update for inactivity can occur before the update for the game result, as in the sketch below.
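
A rough sketch of what such a combined, order-controlled update could look like; the field names (rating, rd, vol), the helper names and the explicit idle-period parameters are assumptions for illustration, the volatility update (step 5 of Glicko-2) is omitted for brevity, and the actual Glicko/PlayerRating classes in this PR will differ:

local SCALE = 173.7178  -- 400 / ln(10), the Glicko-2 scale factor
local MAX_RD = 350

-- Inflate RD for elapsed rating periods of inactivity (phi' = sqrt(phi^2 + sigma^2)).
-- Done first, so both players' game updates see the decayed, current RD of the opponent.
local function applyInactivity(p, inactivePeriods)
  local phi = p.rd / SCALE
  for _ = 1, inactivePeriods do
    phi = math.sqrt(phi * phi + p.vol * p.vol)
  end
  return { rating = p.rating, rd = math.min(phi * SCALE, MAX_RD), vol = p.vol }
end

local function g(phi)
  return 1 / math.sqrt(1 + 3 * phi * phi / (math.pi * math.pi))
end

-- Single-game update for one side, with volatility held constant; the rating
-- period inflation is assumed to happen separately between games (applyInactivity).
local function updateOne(p, opponent, score)
  local mu, phi = (p.rating - 1500) / SCALE, p.rd / SCALE
  local muJ, phiJ = (opponent.rating - 1500) / SCALE, opponent.rd / SCALE
  local E = 1 / (1 + math.exp(-g(phiJ) * (mu - muJ)))
  local v = 1 / (g(phiJ) ^ 2 * E * (1 - E))
  local newPhi = 1 / math.sqrt(1 / (phi * phi) + 1 / v)
  local newMu = mu + newPhi * newPhi * g(phiJ) * (score - E)
  return { rating = 1500 + newMu * SCALE, rd = newPhi * SCALE, vol = p.vol }
end

-- gameResult: 1 if the first player won, 0 if the second player won.
local function processGameResult(player1, player2, gameResult, idle1, idle2)
  local p1 = applyInactivity(player1, idle1)
  local p2 = applyInactivity(player2, idle2)
  -- Both updates read the already-decayed, pre-game values of the opponent.
  return updateOne(p1, p2, gameResult), updateOne(p2, p1, 1 - gameResult)
end

The point is only the ordering: inactivity decay is applied to both players first, and both game updates then read the decayed pre-game values of the opponent.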

Subsequently, PlayerRating:createGameResult can be eliminated and you get out of that weird naming conflict between privateNewRatingForResultWithElapsedRatingPeriod and newRatingForResultsAndElapsedRatingPeriod; you only keep the "private" version without the "private" in the name. I could be mistaken, but this "private" looks more like a way out of the name collision than an actual desire for it to be private.

I would also ask that you annotate the classes since that is something you asked for.

Comment on lines +61 to +81
-- Returns an array of wins vs total games
-- The wins / losses will be distributed by the ratio to try to balance it out.
-- Really only meant for testing.
function PlayerRating.createTestSetResults(player1WinCount, gameCount)

  assert(gameCount >= player1WinCount)

  local matchSet = {}
  for i = 1, gameCount - player1WinCount, 1 do
    matchSet[#matchSet+1] = 0
  end

  local step = gameCount / player1WinCount
  local position = 1
  for i = 1, player1WinCount, 1 do
    matchSet[math.round(position)] = 1
    position = position + step
  end

  return matchSet
end
Collaborator


Can we move this over to PlayerRatingTests as a raw test-code function like playGamesWithResults?

@Endaris Endaris added enhancement New feature or request Server-side labels Jan 8, 2025
@JamesVanBoxtel JamesVanBoxtel marked this pull request as draft April 13, 2025 15:11