Ramblings and ruminations on chess in SE Wisconsin, the USA and the World

First Results In…

… on the tournament with the BAP scoring system. (OK so it took me a while to find the event.)

As I suspected, the points favored Black dramatically, as Black outscored White 52-36 during the 6-round event. The actual game results worked out to +18-15=7 for White, meaning our earlier estimate of White and Black splitting most of the drawn games was the more accurate, which we must admit surprised us a little.

The draw proportion was down, at only 18%, but we still see the format as favoring Black slightly, with Black outscoring White by an average of 0.4 points per game.
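The arithmetic behind these figures can be checked with a short sketch. It assumes the BAP values implied by the discussion in the comments below (2 points for a White win, 3 for a Black win, and on a draw 1 point for Black and 0 for White); the function name `bap_totals` is made up for illustration.

```python
# Assumed BAP values (inferred from the comment thread, not stated in the post):
# White win = 2, Black win = 3, draw = 1 for Black and 0 for White.
WHITE_WIN, BLACK_WIN, BLACK_DRAW = 2, 3, 1


def bap_totals(white_wins, black_wins, draws):
    """Return (White BAP, Black BAP, White's conventional score in %, draw rate in %)."""
    games = white_wins + black_wins + draws
    white_bap = WHITE_WIN * white_wins
    black_bap = BLACK_WIN * black_wins + BLACK_DRAW * draws
    # Conventional scoring: 1 for a win, 1/2 for a draw.
    white_pct = 100 * (2 * white_wins + draws) / (2 * games)
    draw_pct = 100 * draws / games
    return white_bap, black_bap, white_pct, draw_pct


# The event's results from White's side: +18 -15 =7 (40 games).
white_bap, black_bap, white_pct, draw_pct = bap_totals(18, 15, 7)
print(white_bap, black_bap)          # 36 52
print(white_pct, draw_pct)           # 53.75 17.5
print((black_bap - white_bap) / 40)  # 0.4 (Black's BAP margin per game)
```

Under those assumed values the 52-36 split, the roughly 18% draw rate and White's roughly 54% conventional score all follow from the same +18-15=7 record.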

This edge can be controlled for by insisting on exactly balancing the colors for a tournament, something which is easy to do in a round robin event, but next to impossible to do in a swiss format. (There are other problems with using it in a swiss, not the least of which is that every player in the top score group for round 2 needs to play white to balance their colors.)

We’re not prepared to stick a fork in the idea, but we still see problems ahead. The most interesting aspect of the data we discovered is that had the event been scored “normally” White would have scored 54%, which is almost what White is expected to score under the old scoring system (55%). Not a surprising result if we postulate the “undrawn” games would split 50/50, but that isn’t what we expected to happen.

Thanks to Mr Ballard for running the event and giving us something to think about.

13 Responses to “First Results In…”

  1. Clint Ballard Says:

    Thanks for the thoughtful analysis here.

    The Oakland event had only 1 grandmaster draw and fittingly it was the game that clinched a norm for IM Milman. He had no problem getting 0 BAP to secure a norm. I have found that in tournaments that have non-BAP incentives (like Oakland and 2006 WA championship) combined with BAP based prizes, it doesn’t reflect the effect of BAP purely.

    I am personally responsible for boosting the black score at Oakland as my only win was with black, which meant I lost a lot of games as white. I haven’t analyzed in depth the Oakland results, but any serious analysis needs to factor player strength to eliminate the incredibly dominant effect difference in player strength has. No matter what the point system, a GM will usually win against experts with either color.

    My prediction of the true draw rate in chess being around 25% has been confirmed at the BAP events. With a stronger field, the 18% would have been a bit higher. It might take a super-GM tournament to achieve a 25% draw rate and in fact we have seen some non-BAP tournaments like that, though rarely. My question is why can’t all GM tournaments be that way?

    I guess that is my entire point with BAP. Chess played by GM’s should be drawn less than half the time it is now. There is now empirical data to support this theory.

    I never claimed that BAP scoring would be unbiased at anything lower than the GM level and it could well be biased a bit, even at the GM level. However, now that the 25% draw rate is something that seems pretty solid to assume, the remaining variable is the ratio of white wins vs. black wins.

    This depends on player strength and at the highest levels, white does seem to have a significant edge, but at lower levels it is much more balanced. So, the question becomes how important it is to make an unbiased point system. If it is a requirement, then we do need to revamp the current system as it is already biased. Personally, I think any new point system that has similar levels of biasing based on color is a viable replacement.

    BAP as it stands eliminates the type of draws that we don’t want to see. Also, contrary to some of the rumors out there, I am not trying to eliminate draws as that is a significant part of chess. BAP makes draws decisive, so that’s good enough. Every game has a winner and a loser. No silly grandmaster draws. With equal number of whites and blacks, whatever bias exists doesn’t matter, but it would be better to eliminate any large biasing.

    Another question is what the purpose of a point system should be. Whatever point system is used will reinforce whatever behavior it rewards. As a chess player that has not mastered the game, it is clear to me that I need to learn how to win every game that I can win, but there is no sense in losing games that could be drawn. I have adopted a “BAP mentality” in my own chess playing and it has most definitely improved my chess skills. More importantly, it makes chess a lot more fun and enjoyable for me. I used to hate playing black, especially against strong players. Now I look forward to every game as black. BAP has made me a better chess player.

    Clint

  2. Administrator Says:

    “I guess that is my entire point with BAP. Chess played by GM’s should be drawn less than half the time it is now. There is now empirical data to support this theory.”

    Not quite. It supports the theory that games which might have been agreed drawn, when played out, seem to split evenly between White and Black, something that actually can be argued as supporting the idea that the draw is an appropriate result in the first place. (If the games split evenly, then it implies the positions reached were also even.)

    “Another question is what the purpose of a point system should be. Whatever point system is used will reinforce whatever behavior it rewards.”

    Aye, and there’s the rub. As any new point system rolls out, strategies to “game the system” will naturally evolve. For example, under your system I can’t ever see a reason to offer a draw as White. The potential difficulty this may breed, as more events using this system are played, is that players with the White pieces will intentionally throw games. Think of it: If it doesn’t cost White anything to resign in a level position, what incentive is there to resist tossing an extra couple of points towards a player who might be paying you for it, or (more likely) be in a position to return the favor in a future event, in an example of “You scratch my back and I’ll scratch yours” behavior?

    I already see at events the top seeded players getting together and settling the order of finish in weekend events, something which usually leads to a good number of lifeless draws (at a recent event there were 3 IM’s, and none of them played a real game against each other, for example). I can see worthless decisive games being played under this system at a growing frequency, as players become more comfortable with it, and those games will be of no more value than the 15-move draw is today.

    Strategies will adapt to whatever metric we choose to use for determining winners. I’m becoming more and more convinced the only realistic option on this path is to eliminate the draw as a scoring result. A point for a win, and nothing for any other result.

    But I have a problem with that as well, and it’s similar to one of my problems with your system. These sorts of modifications have the side effect of penalizing a style of play. A player who plays a solid game with the White pieces gets penalized, because that style produces more drawn games than other styles. Petroff’s defense starts to look like a great line for Black.

    The main problem I have with your system is Black gets rewarded for achieving nothing positive. If, as is generally postulated, the perfect game of chess is drawn, then to reward one side for achieving the perfect result and not reward the other side for achieving the same level of perfection is fundamentally unfair.

    Perhaps we should adopt some of the old ideas: replay draws until a decisive result is reached, with a modern twist that clocks allow us by shortening the time controls of each successive game. This approach can work with one round/day events, but weekend events couldn’t use it, in all practicality.

    More and more I come to the conclusion that the best solution is to shorten the time control. Yes, it doesn’t allow for deep thought at the board, but how many players in a weekend event are even capable of deep thought at the chessboard? Generally, those players are playing in one game per day events already.

    For example, at the recent Melody Amber rapids, the draw percentage was about 44%, just over half what the “normal” draw rate is among top players. (Wijk aan Zee usually averages in the 60+% draw range.)

    No method is without its cost. (For invitationals, the easiest solution is not to invite players back who draw lots of games.) What price are we prepared to pay to reduce the short draw?

  3. Administrator Says:

    It strikes me as I read that last that I appear more than skeptical about Mr Ballard’s point system. This is not the case. As an organizer, nothing would please me more than to have a workable solution to lazy players. They ruin events for spectators and other players. More than once I’ve considered walking away from organizing because of short draws among good players (part of the reason I organize events is to see good chess games being played, and I feel cheated after nearly every event I run).

    Count me as simply skeptical, in that I hope it works but don’t think it will achieve the aim.

    The bottom line is chess cannot succeed as a sporting event so long as players can manipulate the prize fund at will, through the agreed draw. This may mean chess can never succeed as a sporting event. It’s already unlike most sporting competitions in that one side can simply quit before the game is over. Both boxing and curling allow for that, but neither one of them allows competitors to simply call it even and go home.

  4. Clint Ballard Says:

    The whole cheating issue is something that BAP does not deal with, my assumption is that collusion between players happens when dishonest players play each other. Cheaters will cheat, honest people won’t. While it is possible that some non-cheaters who are in a last round game as white will cross over and accept a bribe, this assumes that the white side has no chance to win in the last round, and if they are playing against a contender, they themselves are a contender and much less likely to accept a bribe.

    This whole “BAP will increase cheating” argument describes only a theoretical problem, and one that already exists, so I am not sure why it even counts against BAP. Hopefully, we can agree that a good test of a point system is its effect on honest players and that we don’t have to construct point systems based on minimizing the incentives for cheating. In any case, when there isn’t a big prize fund, cheating isn’t an issue and if the prize fund is big enough, the tax liability the winner has will complicate any prize sharing arrangements.

    People have speculated that due to the point system, the white players would go “crazy” and start tossing games for no reason. The evidence so far doesn’t show that at all, in fact, your analysis shows that the ratio of white wins, draws and black wins is virtually unchanged under BAP! Clearly, white is simply playing the way white should. Loss of rating points and not wanting to lose overrides any irrational desire to lose a game that they can draw as white. Would you toss a drawn game as white because it doesn’t affect your BAP score? You lose rating points, you lose to your opponent and your opponent gets 2 more points making your chances of prize money significantly less. There are a lot of reasons for white to fight for a draw instead of being silly.

    This means that BAP does not change the overall results, but it has eliminated the grandmaster draw. Since BAP is not used for ratings, whether white or black ended up with more or less BAP points seems to be moot. Who cares what the white vs. black ratio of BAP points was if the prizes go to the most deserving players.

    Faster controls are certainly one way to reduce the draw rate by increasing blunders. Personally, I like to play as close to a blunder-free game as possible, so I am biased against the speed chess solutions. The good players can just play to draw the normal time control and play the decisive games at the faster and faster time controls. Do we really want to have the world championship of chess decided by a game of speed chess? Tournament chess is not speed chess, though deciding tournament games by speed chess would probably be the best option if BAP didn’t exist.

    I am not sure what you continue to be skeptical about as BAP has achieved all of its primary goals and even some secondary goals. While current results do show black scoring more BAP points than white, it has not changed who wins the prize money. For chess to be a sport, there needs to be a winner and a loser for every game and BAP achieves this primary goal, since a draw is a small victory for black.

    You really need to see a BAP tournament in person to feel the difference it creates. Players look forward to their black games. There are no agreed draws, until it is getting close to insufficient material and clearly drawn. In fact, it makes all the draws the fighting draws that people say are an important part of chess. I agree and with BAP one quarter of the games are fighting draws and three quarters are decisive games.

    Why not tiptoe to the BAP side of things by holding a small BAP event so you can see firsthand what it’s like? Pairings are much fairer to the players. I can do the pairings for your event and provide the prize fund.

    Clint

    P.S. With a small prize fund, there is no cheating issue, so you can see the full effect of BAP with a modest winner take all prize. By having only one prize, it minimizes the desire to split the prize as the more people who share the first place prize, the smaller amount each person gets.

  5. Administrator Says:

    “The whole cheating issue is something that BAP does not deal with, my assumption is that collusion between players happens when dishonest players play each other. Cheaters will cheat, honest people won’t.”

    Yes…and no. There’s a grey area that many players feel comfortable moving in. Is it cheating to sit down during rounds 3&4 of a weekend 5-rounder and do “income negotiation?” Lots of players don’t feel that way, it appears. There’s an old saying “If the opportunity for murder always came along with the motive, who among us would escape hanging?”

    I think the vast majority of players are basically honest, yes. But even the honest succumb to temptation, and I see this as putting another temptation before them. So far I’ve seen crosstables for small round-robin affairs, but no swisses. No large population groups, only small test ones. I think the results are interesting, yes, but not interesting enough for me to venture in just yet. I’ll be keeping an eye on things, though. Count on it.

    As for the rest:

    “Who cares what the white vs. black ratio of BAP points was if the prizes go to the most deserving players.”

    Your rhetoric gets away from you here. Are you really meaning to say that a player who is more comfortable playing black is, by definition, “more deserving” than a player who is more comfortable with White?

    What I continue to be skeptical about is the fairness of rewarding people out of proportion to their achievement. If both players play perfectly only one gets rewarded. In a six-round event, a player who wins one black game and two whites will beat a player who wins all three whites. Before you respond that the first player did something more difficult than the second in winning with Black, note that the first player *also* did something weaker than the second, in that he failed to win one of his games with White; surely failing to hold an advantage is a worse fault than failing to overcome a disadvantage. A player who wins 2/3 with Black is equal to a player who wins all three Whites, even though that first player failed to even hold a draw in a single game with white?

    I’m a math geek, I need numbers. My theoretical analysis doesn’t make me optimistic about your system, but I’m well aware theory doesn’t always match practice, so I’ll be watching future events. Maybe I’ll come around to your POV, but not at the moment.

  6. Clint Ballard Says:

    All but one BAP tournament so far have been of the small swiss type, only one round robin. The larger the swiss, the easier the pairings will be to do.

    By “most deserving”, I mean the player that would have won using the old point system is expected to also win based on BAP points. While it is not guaranteed to be this way, all the events so far have come out this way.

    In your example, you are comparing two players in the middle of the field and in a swiss with around log base 2 rounds, there is very little resolution of the players in the middle. Neither of the players you mention would be in the running for prize money, so while one player would have bragging rights about having scored the same number of BAP points, this does not seem to me to be a flaw with BAP. If you are saying that a point system needs to rank all players in a tournament that is not a round robin, then I doubt there exists such a point system. The current point system doesn’t do it and neither does BAP.

    Why does BAP have to do things that the current point system doesn’t do? It makes all players play every game as hard as they can. The player that would have won using the normal system has been winning the BAP tournaments. White does not throw away draws recklessly and ends up with approximately the same overall performance if the old system is used. Current rating system is preserved. Cheating that currently exists does not seem to get worse. One reason is that a lot more players are in the running for the prizes in the last round and that has more than compensated for whatever increased temptations BAP creates for cheating.

    BAP makes chess a bit like tennis. If you lose the first game in tennis, it is either expected or really bad and you can’t fairly assess where you are until you play the second game. Other than prize money, BAP only affects pairings during the tournament and is therefore not something that I consider a “reward”. The prize money has gone to the players that would have won normally. Remember that the games are rated using normal methods.

    http://www.slugfest7.com/public/140.cfm has some math that will hopefully appeal to your math side. If the draw rate is 20%, then BAP is color neutral with a 1.5 white win to black win ratio. I hope we can agree that any point system that is less biased than the current one is actually more fair and at the GM level I would be surprised if the white win ratio isn’t around 1.5 to 1.

    What is the unfair reward that the player who scored 6 BAP with 2 black wins vs. 6 BAP with 3 white wins, when neither win any prizes and presumably the 3 white win player does better rating-wise?

    Clint

  7. Administrator Says:

    “BAP makes chess a bit like tennis.”

    I don’t see the analogy at all. I actually designed a tennis-like chess tournament once: Play proceeds at 1 round per day, a round consists of best of three matches, match consists of best of 12 games. Because of the duration, time control had to be rapid, used blitz. Never had the time to test it, though.

    “What is the unfair reward that the player who scored 6 BAP with 2 black wins vs. 6 BAP with 3 white wins, when neither win any prizes and presumably the 3 white win player does better rating-wise?”

    The points on the scoreboard, therefore the standing in the tournament. There are plenty of players who don’t play for prizes or rating points (Heck, if that were the reason I played, I would have quit long ago). If you insist on prizes, they could easily be in the running for a class prize.

    But since that annoys you, let’s use a different example: Player A has 2 black wins and 2 white wins (one white draw) going in to the last round. Player B was perfect with three black wins and two white wins. A gets Black and wins. A and B finish in a tie, despite the facts that A beat B (no big deal there) and also that A finished unbeaten, giving up a single draw while B lost a game! By any rational judgement I can understand, A had a better tournament than B, played better, produced a better score in terms of game results (both were perfect with Black while both only failed to win one game with white, which A drew while B lost) and yet finished tied with B.
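A minimal sketch of the tally in this example, assuming the BAP values used throughout the thread (White win 2, Black win 3, draw worth 1 to Black and 0 to White). The helper names `bap_score` and `classical_score` are hypothetical.

```python
def bap_score(white_wins=0, white_draws=0, black_wins=0, black_draws=0):
    """BAP total under the assumed values: a draw with White scores nothing."""
    return 2 * white_wins + 0 * white_draws + 3 * black_wins + 1 * black_draws


def classical_score(white_wins=0, white_draws=0, black_wins=0, black_draws=0):
    """Conventional total: 1 per win, 1/2 per draw, regardless of color."""
    return white_wins + black_wins + 0.5 * (white_draws + black_draws)


# Final records after six rounds; losses score zero in both systems, so they are omitted.
player_a = dict(white_wins=2, white_draws=1, black_wins=3)  # unbeaten
player_b = dict(white_wins=2, black_wins=3)                 # lost the last round with White

print(bap_score(**player_a), bap_score(**player_b))              # 13 13
print(classical_score(**player_a), classical_score(**player_b))  # 5.5 5.0
```

The two players tie on BAP, while conventional scoring separates them by half a point.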

    In your own Slugfest, the scoring system would have cost GM Akobian money by dropping him from what would otherwise have been a certain check via a 3-way tie for 2-4 and instead dropping him into a 4-way tie for 4-7. Most events offer a 2nd prize, though not all, but very few small weekend swisses offer a 4th place prize. If this system is applied to bigger events, then little things like 2nd place ties may start to matter to the players.

    Maybe this is all doom and gloom, I don’t know. I’ll keep watching for further info, though. Meanwhile, I’ll keep experimenting on my own with formats and prize structures.

  8. Clint Ballard Says:

    There are certainly rare cases that end up with strange results when BAP is used, however the same goes for any tournament that isn’t a double round-robin. Since you like specific numbers, I wrote a tournament simulation program to quantify what percentage of time various scoring systems end up with strange results.

    Mathematically, the purpose of a tournament is to determine the ranking of all the players that are in it. Chesswise, a double-round robin eliminates all the big variables that any other tournament format has, e.g. different opponents, colors, etc. Given a pairing system that has significantly fewer rounds than a DRR, random chance enters into the picture, so I ran iterations until the error rate converged. With an average of 50 players, in about 100 tournaments, the error rates converged to pretty close to the values that I got at thousands of iterations, so 100 tournaments seemed like a sufficient number to eliminate the errors.

    I randomly generate a set of players of specified Elo with a random number of rounds and then at the end of the tournament, the finishing ranks of the players are compared with their actual ranks. In some cases two players could have very similar ratings, or even exactly the same, so some amount of “error” is expected. Anyway, I don’t have all the kinks worked out yet, but initial results using thousands of different tournament configurations show an RR (not DRR) having an average error of around 0.3 ranks. Swiss using standard pairings ends up averaging around 3 ranks, with “basketball” swiss around 2 ranks. Since there are an average of 50 players, the error rate is 4% to 10%. “Basketball” swiss is where the top player in each point group plays the bottom player as opposed to the player in the middle. Just this change cut the error rate in half, so this seems like a very good thing to do.
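The difference between the two pairing styles within a single score group can be sketched as follows. This is only an illustration under simplifying assumptions: the group is already sorted from strongest to weakest, it has an even number of players, and color allocation and rematch avoidance are ignored. The function names are made up and are not taken from any pairing program.

```python
def midpoint_pairs(group):
    """Standard swiss fold: top half meets bottom half in order (1 vs n/2+1, 2 vs n/2+2, ...)."""
    if len(group) % 2:
        raise ValueError("this sketch assumes an even-sized score group")
    half = len(group) // 2
    return list(zip(group[:half], group[half:]))


def basketball_pairs(group):
    """'Basketball' style: top plays bottom, second plays second-from-bottom, and so on."""
    if len(group) % 2:
        raise ValueError("this sketch assumes an even-sized score group")
    half = len(group) // 2
    return list(zip(group[:half], reversed(group[half:])))


score_group = ["A", "B", "C", "D", "E", "F"]  # sorted strongest to weakest
print(midpoint_pairs(score_group))    # [('A', 'D'), ('B', 'E'), ('C', 'F')]
print(basketball_pairs(score_group))  # [('A', 'F'), ('B', 'E'), ('C', 'D')]
```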

    Now, when I used BAP to do the pairings, there was no statistically significant difference with the normal midpoint swiss pairing, but the error was a bit over 0.5 ranks higher when using the basketball method. The results are preliminary and not 100% confirmed, but so far:

    RR 0.3
    Swiss ~3
    BAP Swiss ~3
    Basketball Swiss ~2
    BAP Basketball Swiss ~2.5
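One plausible reading of the "ranks" figure in this list is the mean absolute difference between each player's rating-based rank and finishing rank. The sketch below assumes that definition, which the comment does not spell out; `average_rank_error` is a hypothetical name.

```python
def average_rank_error(true_order, finish_order):
    """Mean absolute gap between a player's true (rating) rank and finishing rank.

    Both arguments list the same players, best first. The exact metric used in the
    simulation is not given, so this definition is an assumption.
    """
    finish_rank = {player: rank for rank, player in enumerate(finish_order)}
    gaps = [abs(rank - finish_rank[player]) for rank, player in enumerate(true_order)]
    return sum(gaps) / len(gaps)


true_order = ["A", "B", "C", "D"]    # by rating
finish_order = ["B", "A", "C", "D"]  # by tournament result
print(average_rank_error(true_order, finish_order))  # 0.5
```

Averaging this value over many simulated tournaments would give one number per pairing system, which is how the figures in the list could be compared.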

    I was a bit surprised with the internal data that showed that the error rate was much smaller in the middle of the finishers vs. the top and bottom.

    I hope you can agree that looking at one specific tournament, or even one specific case is not really meaningful when the statistical variation is literally millions of possibilities. Maybe you have a better way of comparing pairing systems than my compare the average rank of each player? If so, I would be glad to implement that.

    With a statistically verifiable metric, we can objectively compare different point systems, pairing systems and see which one comes closest to the DRR ideal with minimum number of rounds.

    With basketball swiss, around 3% of the tournaments ended up with a winner that wouldn’t have won based on old fashioned points. But that means 97% of the time they are one and the same and that explains why we haven’t seen it in practice yet. I want to thank you for inspiring me to finally write the BAP pairing program! I still haven’t implemented the strict color balancing that I use when I run BAP tournaments, so it is quite possible that the BAP numbers will change. Now that I have a statistical verification framework, I can optimize the best pairing system possible and my feeling is that since BAP provides more information about the result, it will be possible to create a more accurate pairing system using BAP than the old point system.

    Clint

  9. Administrator Says:

    “Maybe you have a better way of comparing pairing systems than my compare the average rank of each player?”

    Outliers interest me far more than averages. In part because I learned long ago that outliers tell you more about your dataset than “normal” points do.

    But in this case it’s of practical concern as well. As an organizer, I have to field the complaints of the players who play in my events. And I’ve never met a player yet who was satisfied with “in your particular instance this isn’t correct, but on average this is correct.” They’re only concerned with their specific case. Especially where money is concerned.

    You’re also playing to a different audience than I. Only a few players who come to my events care about first prize, mainly because it’s a rare event that has more than 6-8 capable of winning it. The gulf between highest and lowest rated player is nearly always over 1000 rating points. When I award upset prizes they are routinely won by players with 700+ point upsets, something that would rarely be able to happen in your slugfests.

    I have to be concerned with prizes other than first, or I lose my players. I have to be able to explain what I’m doing to them and make them understand it, and right now I wouldn’t relish the idea of explaining to them I’m using a scoring method intended to reduce draws to 25% (especially since the draw rate in the last two events I organized was already below that ratio) but which may randomize the results in the middle of the table a little. It’s a price few of my players would be willing to pay.

  10. Clint Ballard Says:

    Not sure why you think that BAP randomizes the results in the middle, or that upsets don’t happen. I thought you were mathematical and that statistics would be relevant for you. Extensive computer simulations show that BAP or old way ends up being very close as far as accurately resolving the middle of the field. Also, 96% of the time, the BAP winner also had the most normal points.

    I have come up with a tournament format that is independent of BAP, but takes advantage of BAP and have described it on gmslugfest.com. The plan is to have a $25,000 first prize, which can be split N-ways in case of an N-way tie. The top 4 players with 3 rounds to go play a big money quad. The rest of the players are also all playing in quads for a $1500 prize.

    I now have a tournament simulation infrastructure so I can test a variety of point systems and pairing systems to see what their statistical behavior is. I can also see what the worst case scenarios are for BAP and non-BAP. So far, the BIGGEST factor that determines who wins the tournament is playing strength. Basically, the best chess players win tournaments, regardless of what scoring or pairing system is used. That being said, my interest is in minimizing any negative factors and maximizing the positive factors.

    Clint

  11. Administrator Says:

    “Not sure why you think that BAP randomizes the results in the middle, or that upsets don’t happen.”

    I never said upsets don’t happen.

    As for the first part, I first got that impression from your comment (#6 above, Jan 3):

    “In your example, you are comparing two players in the middle of the field and in a swiss with around log base 2 rounds, there is very little resolution of the players in the middle.”

    You say yourself there is little resolution in the middle, hence my “randomizes” characterization.

  12. Clint Ballard Says:

    My comment applied to both BAP and non-BAP as resolving power is very close between the two.

    The big danger in evaluating point systems and pairing systems is that common sense isn’t always right.

    In fact, in comment #8 I said:

    “I was a bit surprised with the internal data that showed that the error rate was much smaller in the middle of the finishers vs. the top and bottom”

    After I got the actual simulation data, it became clear that the middle actually has the best resolution, while the top and bottom have the worst. Same for BAP and non-BAP. Optimal pairing algorithms are very complicated beasts and the smallest change in pairing preference changes the average rank difference. This cuts both ways as it means with the proper evaluation function of different pairings, it is possible to significantly improve the result. It is also possible to significantly hurt things too.

    Anyway, my simulation results show that BAP vs. non-BAP changes things about as much as giving the higher rated player white instead of black, when both players have no color imbalance, i.e. very little. What this means is that I will be able to easily create a BAP pairing program that achieves BETTER metrics than the current swiss. I already have it so it gives everybody equal colors without losing any resolution. While it won’t be possible to get double round robin accuracy with log2 rounds, I can at least optimize it to be as good as it can get.

    BAP does not hurt tournament outcome accuracy because higher rated players do better, regardless of the point system.

    Clint

  13. David law Says:

    BAP is not a zero-sum point system. The effect of white drawing has a 2 BAP …. So maybe we need to revise the essential heart of our scoring system.
