Why the new British Rowing Personal Ranking Index is flawed
We asked Mel Harbour, one of the critics of the new Personal Ranking Index, to outline why he thinks the new system is flawed, and how to fix it.
British Rowing has recently been in the process of implementing (as part of its Competition Review) a new points and ranking system. Unfortunately, in doing so, the system they’ve designed has fundamental flaws. British Rowing maintains that this is because they are in the implementation phase of the new system and that it can all be resolved by tweaks. Sadly, this is not the case, and can never be – the maths underlying the system they’ve designed doesn’t work. This article will explain why.
Fair warning – I’m a mathematician, so some of the explanation will be quite technical. I will attempt to keep explanations in simple terms where possible though and illustrate with examples, thereby showing why it’s not fixable without starting again from scratch.
Also, to be clear, this article is not about BROE or BROE2, British Rowing’s systems for online entry. My day job is as a software development manager, and it would be possible for me to critique those systems and make some guesses about how they’ve been developed and how they could be improved. However, I don’t know the ins and outs of how it’s actually been done, and I wouldn’t want to comment based on assumptions, as I don’t think that would be fair.
There are a wide variety of flaws that people have pointed out with the new system. I could deal with them all here, but I’m not going to. I’m going to focus on the central problem with it. I believe that many of the other criticisms distract from the main flaw. Clearly, many of these problems with the way the system is being implemented do not help, but it’s important to get to the root of problems.
As the final part of this introduction, I want to also be clear that none of this article is intended as a criticism of those involved in designing and implementing the system. I’m sure that a lot of hard work has gone into it, and that people are doing their best. All I’m doing is calling out where the flaws in the system are and suggesting how it could be fixed. There have been rumours that some of the feedback being sent into British Rowing has become nasty and personal and I do not condone any such behaviour.
Mel Harbour studied mathematics and computer science at Cambridge University, where he was introduced to the sport through Peterhouse Boat Club and Cambridge University Lightweight Rowing Club. Mel now works as a Software Development Manager for Redgate Software and coaches in his spare time. You can follow Mel on Twitter @melharbour.
The old system
For some time, it’s been recognised that the points system used by British Rowing for domestic racing has had some severe limitations, with the result that much domestic racing fails to give competing athletes what they are looking for in a regatta. As a primer for those with little familiarity with the system, it used to run as follows:
- Qualifying wins at a regatta with a sufficient number of entries would score you either one or two points, depending on the size of the entry.
- There were then bands based on the number of points you had accrued.
- Extra criteria were added to ‘top up’ internationals to ensure they had a higher number of points, and to permit points to gradually drop if you had gone some time without a win.
I want to be very clear that I share British Rowing’s view that the old system was not fit for purpose and needed to be overhauled.
The new system
A few years ago, British Rowing kicked off a ‘Competition Review’, part of which was to look at the points system, with a view to creating a new system that would try to create closer, fairer racing at all levels. Essentially, if you’re a mid-standard athlete, it’s more fun to be racing other mid-standard athletes, so you can see whether the training you’re doing is paying off. Getting thrashed by an international athlete, or conversely thrashing a novice athlete, isn’t much fun.
There are various parts to the new system, but for the purpose of this post, I will focus on some of the key elements as follows:
- Points are now awarded for both regattas and head races. There is talk of looking at the weighting between head races and regattas, but while that will tweak the points awarded for each, it doesn’t fix the problems this article will demonstrate.
- You gain a point for each crew you beat.
- Your current points total is based on a certain number of recent races where you have gained points.
The net result of this is that they have designed a system that is…
Based on the number of athletes you beat, not the quality of those athletes
I put the emphasis in deliberately, as this is the new system’s key feature, and the key reason why it’s broken. To see why this is broken, let’s look at a simple example with two scenarios:
- A mid-standard sculler races in a head race with a large number of novice scullers. He (obviously) beats all of them by a comfortable margin. As a result, he gains a large number of ranking points.
- A high-national level sculler races in a small field, beating the other entrants, who are all of high-national level or international level. He gains a small number of points.
When the dust (splash?) has settled from these two races, we can easily see that the sculler in case 1 will be ranked far ahead of the sculler in case 2. Indeed, the gap may be so large that it is, in practical terms, impossible to overturn through future racing.
The important thing to notice is that no amount of tweaking the weightings of various events will ever fix that problem. Indeed it will also be compounded if athletes race with different frequencies.
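To make the two scenarios concrete, here is a minimal sketch in Python of a purely count-based scheme. It is an illustration only: the function name `count_based_points`, the window of five recent races and the field sizes are all assumptions made for the example, not British Rowing’s actual rules.

```python
def count_based_points(crews_beaten_per_race, window=5):
    """Total points under a count-based scheme.

    One point per crew beaten, summed over the most recent `window` races.
    The window size is a hypothetical value chosen for illustration.
    """
    recent = crews_beaten_per_race[-window:]
    return sum(recent)


# Case 1: a mid-standard sculler beats 40 novices in one head race.
mid_standard = count_based_points([40])

# Case 2: a high-national sculler wins five small, high-quality fields,
# beating 3 strong opponents each time.
high_national = count_based_points([3, 3, 3, 3, 3])

print(mid_standard, high_national)
```

Even after five wins against top opposition, the stronger sculler sits on 15 points against the mid-standard sculler’s 40. Changing the weighting per event rescales both totals but never brings the quality of the opposition into the calculation.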
How to fix it
Unfortunately, the answer to fixing the system isn’t easy. When you have a system that has fundamental flaws as described by the scenario above, there’s no amount of patching that will help.
But there could be hope! Plenty of research has been done over the years on how to rank people in lots of fields in such a way as to be able to get a pretty good read on their standard. I’ll explain one possible example here. Be warned, there’s a little bit of maths ahead – I’ll try and keep explaining it in simple terms though. What I’m about to describe is known as an Elo rating system. It was originally developed by Arpad Elo for use with chess rankings, but since the fundamental ideas are sound, it has further been applied to lots of other games and sports where ranking is required. It’s been used since at least 1960, so has quite a lot of pedigree, and plenty of opportunity for people to work out any flaws and correct for them.
We initially make a few simplifying assumptions – we can back these out later on:
- We assume that a good athlete over a short race is the same as a good athlete over a long race. This is a reasonable assumption since very few rowing events are short enough to physiologically qualify as ‘sprint’ events. As such, we will generate a single ranking per athlete.
- We will assume that if you put a crew together of the best individual athletes, that it will be the best crew. We know that this is not necessarily true in practice, but we’ll see that it isn’t actually a problem for the system I’m going to describe.
Given assumption 1, we will assign a number to each athlete. This is their ‘ranking points’. What we want is a scenario where if someone is a much better athlete, they have a much higher number of ranking points. If they have the same number of ranking points as another athlete, then they should expect to have a close race.
All athletes are inconsistent. Send a sculler out to row a 2k course (in flat conditions) on several successive days, and they will produce a range of times, distributed around their actual standard. In maths terms, this spread can be modelled as something called a probability density function. Now, we don’t have enough information to know what that function actually looks like, but in cases such as this it would be standard practice to use a normal distribution or a logistic distribution. There’s some complicated maths that explains why this is a good place to start, but we don’t need to go through it here.
The thing to have in your head is that every athlete has an ‘average standard time’ (represented by μ in the graph above). What we’re saying is that the times that an athlete actually produces will be distributed according to the graph. So we expect most of their times to be close to their average, with a few of them a long way from the average. Typically the best probability distribution to use is determined by looking at real results, which means that when designing the system you can take into account things like:
- More experienced athletes tend to be more consistent.
- There may be more chance of having a ‘bad’ row than a ‘good’ row relative to your average (or vice versa).
So what happens when you race someone (either side-by-side or in a head race)? You overlay the two probability density functions and look at where the two graphs overlap. In simple terms, the size of the overlap tells you the region where the weaker athlete (blue line in the chart below) beats the stronger athlete (orange line).
If a novice races an international, the distributions barely overlap at all. This matches reality where the novice has only a vanishingly small chance of winning. Put two people of similar standard together and there will be a more significant overlap. Importantly, you can actually calculate from the size of this overlap the probability that one of the athletes beats the other one.
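As a rough sketch of that calculation, suppose each athlete’s time is normally distributed and the two athletes perform independently of each other. The difference between their times is then also normally distributed, and the chance that one beats the other is the chance that this difference is negative. The function names and the example means and standard deviations (in seconds) below are assumptions chosen for illustration, not real data.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_a_beats_b(mean_a, sd_a, mean_b, sd_b):
    """Probability that athlete A records a lower (faster) time than athlete B.

    Assumes independent, normally distributed times. The difference
    (A's time minus B's time) is normal with mean (mean_a - mean_b) and
    variance (sd_a**2 + sd_b**2); A wins when that difference is below zero.
    """
    diff_mean = mean_a - mean_b
    diff_sd = math.sqrt(sd_a ** 2 + sd_b ** 2)
    return normal_cdf(-diff_mean / diff_sd)


# Two scullers of similar standard: a real contest.
print(prob_a_beats_b(450.0, 5.0, 452.0, 5.0))

# An international against a novice: almost no chance of an upset.
print(prob_a_beats_b(420.0, 4.0, 480.0, 8.0))
```

A logistic distribution could be substituted for the normal one; the principle of turning a gap in standard into a win probability stays the same.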
Why is that so significant? Well, if the athletes race multiple times, their wins should basically follow that pattern. If athlete A stands a 66% chance of winning, he should beat athlete B in 2 out of 3 of their races. And it turns out that you can, therefore, adjust the rating of the athletes as you go along so that it gradually puts everyone onto the correct standard.
How it works in practice:
- Win something? You’re going to gain some points.
- Come last? You’re going to lose some points.
- Somewhere in the middle? You could go up or down.
- How many points do you win/lose? That depends on the quality of your opposition. Beat someone good and you’re getting loads of points. Beat someone of a much lower starting standard than you and you’ll only pick up a few.
Essentially the system self-corrects, sorting out problems quickly. People would need to be seeded in at the start, which you can use to give people a kick into the right area of your ranking (e.g. start internationals on lots of points).
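For a single head-to-head race, the standard Elo update can be sketched as follows. The scale of 400 and the K-factor of 32 are conventions borrowed from chess and are used here purely as assumptions; suitable values for rowing would need to be chosen from real results.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Expected score (roughly, win probability) of A against B under Elo."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update(rating_a, rating_b, score_a, k=32.0):
    """Return the new (rating_a, rating_b) after one race.

    score_a is 1.0 if A won and 0.0 if A lost. Points move from the loser
    to the winner, and the amount depends on how surprising the result was.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# Evenly matched scullers: the winner takes a moderate number of points.
print(update(1500.0, 1500.0, 1.0))

# Beating someone rated far higher earns a lot of points...
print(update(1500.0, 1900.0, 1.0))

# ...while beating someone rated far lower earns very few.
print(update(1900.0, 1500.0, 1.0))
```

Because the winner’s gain is exactly the loser’s loss, the total number of points in the system stays constant and ratings drift towards each athlete’s true standard as results accumulate.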
A potentially useful quirk of the system is that you would actually be able to look at the difference in ranking points between two competitors and make an estimate of what the time gap between them should be over a given distance. So I could race someone I’ve never raced before, and have an idea of what might represent a good result for me (a bit like using a %Gold standard comparison). You would also gain an impression of where a person ranks overall (nationally). It would be possible to say whether a person was in the top 100, 200 etc. Potentially this would be seen as a mark of status in and of itself, as would the first time a person achieves a personal points tally above a certain level since it would genuinely show something about them having demonstrated their ability.
It should be noted that, as discussed, there are various factors which you can play with in order to control how the system works, including:
- The probability distribution that’s used.
- The scaling factor (called the K-factor) that you use to work out how many points to transfer from one athlete to another after a result.
- Whether you want to put in any ‘rating floors’ (once a person has been above a certain points threshold, they are prevented from dropping below again).
- Whether you want to put in rating floors based on other criteria (full internationals?).
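As an illustration of the rating-floor idea, the sketch below clamps an athlete’s new rating so that it cannot fall beneath the highest threshold they have ever reached. The function name `apply_floor` and the threshold values are hypothetical, chosen only to show the mechanism.

```python
def apply_floor(new_rating, peak_rating, thresholds=(1600.0, 1800.0, 2000.0)):
    """Clamp new_rating to the highest threshold the athlete has ever reached.

    peak_rating is the highest rating the athlete has held so far.
    The thresholds are illustrative values, not a proposal.
    """
    reached = [t for t in thresholds if peak_rating >= t]
    if not reached:
        return new_rating  # no floor earned yet
    return max(new_rating, max(reached))


print(apply_floor(1550.0, 1650.0))  # held at the 1600 floor
print(apply_floor(1550.0, 1500.0))  # never reached a floor, so unchanged
```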
Of course, the system needs to be able to cope with multiple participants – not every race is a match race. This is again fairly standard stuff. You make it so that each participant races every other participant and then adjust the scaling factors appropriately. In effect it’s a bit like running a pairs matrix!
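One possible way to sketch the pairwise approach is shown below: every finisher is treated as having beaten everyone placed behind them, and the K-factor is divided by the number of opponents so that a big field does not swing ratings more than a match race would. That particular scaling, the function name `update_field` and the default values are assumptions; other schemes exist.

```python
def update_field(ratings, finish_order, k=32.0, scale=400.0):
    """Update ratings after a race with several competitors.

    ratings: dict mapping competitor name to current rating.
    finish_order: list of names, fastest first.
    Each pair is scored as a head-to-head Elo result with a reduced K.
    """
    n = len(finish_order)
    if n < 2:
        return dict(ratings)  # nobody to compare against
    k_pair = k / (n - 1)
    deltas = {name: 0.0 for name in finish_order}
    for i, winner in enumerate(finish_order):
        for loser in finish_order[i + 1:]:
            gap = ratings[loser] - ratings[winner]
            expected_win = 1.0 / (1.0 + 10 ** (gap / scale))
            change = k_pair * (1.0 - expected_win)
            deltas[winner] += change
            deltas[loser] -= change
    updated = dict(ratings)
    for name, delta in deltas.items():
        updated[name] += delta
    return updated


field = {"A": 1500.0, "B": 1500.0, "C": 1500.0}
print(update_field(field, ["A", "B", "C"]))
```

With three equally rated scullers, the winner gains points, the last finisher loses the same amount, and the middle finisher’s win and loss cancel out, which matches the behaviour described in the list above.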
It’s quite likely that people will offer up reasons why the system as I’m outlining ‘couldn’t be implemented’. I’ll call out a few possible objections and illustrate why they’re not a problem.
- Members won’t understand how to calculate their points.
  - Arguably true, but reflect on whether members in general actually need to be able to calculate their own points. As I said earlier, they will easily understand the general principle that a win gains you points and a loss loses you points. The fine detail can be explained if people want, but won’t be needed by most.
  - Empirically, many sports that use Elo systems will have a lot of participants that don’t know how the mechanics of the system work. It doesn’t stop them.
- It will be complicated to calculate and implement in a system.
  - It won’t. It’s just an algorithm. In software terms, it breaks down into small chunks of calculation that can be easily unit tested, which has the added benefit of reducing the chance of problems developing as the system is forced to cope with more ‘real world’ scenarios.
  - It’s used so widely that there are actually off-the-shelf software libraries that implement it ready for use. There might need to be some work to customise for the specific case of rowing, but the fundamentals would already be done.
- It may work for single sculls, but won’t work for crew boats as well.
  - It just depends on statistics. If a crew full of the best individuals will statistically beat a crew of the less good individuals then the system will work.
- People will be able to game the system.
  - It’s possible to game any system. There’s no such thing as a perfect world.
  - That said, Elo’s been around a while and had people poking holes in it for a long time, and fixes have been worked out for most of the pathological behaviours.
- We’re already committed to the new system. We can’t turn back now.
  - This is effectively a variant on the sunk cost fallacy.
  - It’s patently obvious that a pause could be inserted practically instantly, by using a statement such as “we’re having a rethink, so we’re going to revert to using the old system until we’ve better understood the situation”. The old system may not work well, but at least it would be a ‘safe space’ in which flaws could be worked out.
If you’re interested in reading (much!) more about the business of calculating ratings, a good article to read is found here: http://www.moserware.com/2010/03/computing-your-skill.html
It’s written by somebody at Microsoft who was involved in their implementation of an improvement on top of Elo, explaining in a great deal of detail how these things work. In particular, it’s worth noting that they not only describe the method but, as I said earlier, provide the libraries necessary to use it.
You can find more of Mel Harbour’s technical analysis on his website.