The Idea
I have always liked organizing tournaments, and I have always appreciated the creativity of devising new formats to accommodate constraints of time, number of participants and so on.
During summer 2016, one big tournament was the European championship of association football, Euro 2016. The tournament format was known, but the rules for promoting teams from the group stage to the knockout stage were quite complicated.
This reminded me of something my professor of "algorithms and data structures 1" said at university: "are you sure that the knockout format, often used in tournaments, rewards the best player?". Reflecting on it, I could not recall any well-known study about the effectiveness of tournament formats in rewarding the best participant. Such studies may well exist, but I never saw them, although I saw a lot of discussions about tournaments (for physical sports, board games and video games).
Therefore I decided to do my own study, since I estimated that this would be faster than searching for existing studies. It would also give me experience that I would not get from reading existing studies without replicating them.
The question I want to answer is:
Given a pool of teams sorted by strength, where every team has a chance to beat every other team, and given a tournament format, how often do the weaker teams win and how often do the strongest teams win?
Which tournament formats?
There are plenty of tournament formats, so which ones to test? All of them? I wish! To start, I focused on a few that are often used; more may come later.
 Double round robin. Often used in national tournaments (see soccer), with every team playing against every other team twice.
 Single round robin. Often used in competitions with many participants split into small groups. It is like the double round robin, but the teams play against each other only once.
 Knockout format. Often used in "show tournaments" with many participants or a short run time. Here each match between two teams was modeled as a best of 1, 3 or any odd number of games, to guarantee a winner.
 More may come, like the Swiss system, which seems pretty interesting.
Note that tournament formats in general can be used to compare any type of objects, especially when their value is variable and the range of variation is not known.
Usages
(I focus on soccer since it is the first sport in many countries around the globe):
 Double round robin: many national soccer leagues for clubs; the NBA (basketball, USA) uses a double or even quadruple round robin during the regular season.
 Single round robin: short international tournaments, like the group stages of international soccer tournaments; the NFL (American football, USA).
 Knockout, best of 1: tennis tournaments (Wimbledon), the soccer World Cup.
 Knockout, best of 3 (or best of 2 with tiebreaks): international soccer competitions for clubs.
 Knockout, best of many: NBA playoffs.
Assumptions and starting conditions
The assumptions for the computation are the following:
 No draws: a team either wins or loses a game. Draws may still happen in terms of final score, when multiple teams end up with the same top score.
 A strength index, borrowed from the idea behind Arpad Elo's formula, determines the results. See below.
 In a comparison one needs a reference. The reference here is the double round robin result, which is used as the benchmark for the other formats.
 Each team should have a chance to beat every other team; the chance may be small, but it should not be zero.
For the particular starting conditions, refer to the individual computation cases.
The base strength, based on the idea of the Elo formula
How can one simulate the strength of teams or participants? A possible way is to use the idea underlying Arpad Elo's formula for rating chess players. So one assigns to each team or participant not an "estimated strength", as the Elo formula does, but a base strength that may then be modified in every game.
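A minimal sketch of how a single game could be decided under this idea. The names `effective_strength` and `play_game` are mine, and the modifier draw here is a hypothetical stand-in for the weighted table described in the next section:

```python
import random

def effective_strength(base, rng=random):
    # Hypothetical stand-in for the weighted modifier table described
    # in the next section: the base strength shifted by a bounded,
    # random amount in steps of 25 between -400 and +400.
    return base + rng.choice(range(-400, 401, 25))

def play_game(base_a, base_b, rng=random):
    """Return 'A' or 'B': the higher effective strength wins. The model
    has no draws, so equal effective strengths are simply redrawn."""
    while True:
        sa = effective_strength(base_a, rng)
        sb = effective_strength(base_b, rng)
        if sa != sb:
            return 'A' if sa > sb else 'B'
```

With this shape, a team whose base strength is more than 800 points below its opponent's can never win, which is consistent with the upset check discussed later.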
Strength modifiers
Observing real-world competitions, one can see that upsets happen from time to time: a stronger team loses against a weaker team, and teams with similar strength produce uncertain results. To model this on top of a fixed base strength for each team, I devised strength modifiers. The idea is: in a given game the base strength of a team is subject to (limited) variation, so in every game a team may produce a different performance, which is always a modification of its base strength.
Then I added another layer to the model to capture the frequency of the variations: certain modifiers tend to apply less often than others. In the end I wanted a Gaussian-like distribution, where small modifiers (that is, a team's strength in a game stays close to its base strength) are selected more often than large ones.
To simplify the computations I approximated the Gaussian distribution with a triangular distribution, using probability tokens (or weights) modeled with integers instead of fractional values.
The idea is captured by the following matrix. In the third column, one can see the modifiers of the base strength.
The second column assigns the probability weight of each modifier. So the modifier -400 has a weight of 25 and the modifier -375 has a weight of 50. One can see that the modifiers around 0 have more weight. Moreover, the modifiers around 0 have weights that are very similar to one another (the percentage difference is small) compared to the difference in weight between modifiers around zero and modifiers larger than 200 in absolute value.
The first column is the running sum of the weights, from top to bottom.
This sum is used to determine the modifier: a random number NR is drawn uniformly between 1 and 7225 (7225 being the last value of the cumulative sums), then compared to the values in the first column from top to bottom. The first value that is greater than or equal to NR determines the modifier to use.
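The table can be reconstructed from the text: modifiers from -400 to +400 in steps of 25, with the weight of a modifier m equal to 25 * (17 - |m| / 25), which indeed sums to 7225. A sketch of the lookup, with names of my own choosing:

```python
import random

# Reconstructed modifier table: -400..+400 in steps of 25, weights
# forming a triangle that peaks at 0 (weight 425) and sums to 7225.
MODIFIERS = list(range(-400, 401, 25))
WEIGHTS = [25 * (17 - abs(m) // 25) for m in MODIFIERS]

# First column of the matrix: running sum of the weights, top to bottom.
CUMULATIVE = []
total = 0
for w in WEIGHTS:
    total += w
    CUMULATIVE.append(total)   # last entry is 7225

def modifier_for(nr):
    """Scan the first column top to bottom; the first cumulative value
    greater than or equal to nr selects the modifier."""
    for cum, mod in zip(CUMULATIVE, MODIFIERS):
        if cum >= nr:
            return mod

def draw_modifier(rng=random):
    # NR drawn uniformly in 1..7225, as in the article.
    return modifier_for(rng.randint(1, 7225))
```

For instance, `modifier_for(2100)` lands on the cumulative value 2275, matching the worked example below.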
For example:
 given a team with base strength 1400
 random_cumulative_weight = rounded_to_int( 7225 * random ) + 1 # say it is 2100
 the first value in the matrix greater than or equal to 2100 is 2275, returning a modifier of -100
 compute: team_strength + modifier
 resulting team strength for the match: 1300
To see whether a team has a chance to beat a stronger one, an easy check is to give the weaker team the biggest strength modifier and the stronger team the smallest one. Therefore, using the matrix above, as long as the difference in base strength is smaller than 800 points, a weaker team has a chance to beat the stronger one.
Computing platform
Although the algorithms can be adapted to several programming languages, since I have a passion for mathematics (although I am quite rusty compared to years ago) and I consider some calculators a good mathematical environment, I decided to develop the simulation on my HP 50g.
Among the different programming languages available for the 50g I used userRPL. A side effect of the project is that I also refreshed my knowledge of userRPL commands, stack manipulation, optimizations, math functions and other procedures related to the 50g. One can see the development in this thread on the forum of hpmuseum.org.
While it may be unusual to use such an environment (the HP 50g) for computations, it is also true that the limited resources of the 50g push the developer to find better solutions. With modern computers one uses, for many computations that are not really that big, plenty of resources in terms of computational power, RAM and storage.
The 50g, for example, has only around 230 KB of usable RAM. This makes the developer reflect a bit on what can be done.
Observations and optimization to speed up the computation
Another limitation of the 50g is that userRPL is pretty slow in certain operations, due to an emulation layer (the userRPL interpreter was developed for another CPU). The first obstacle I found was that the procedure accessing the matrix of probability weights and strength modifiers was slowing down the entire execution: with that procedure, 1000 iterations of the double round robin format would have required one month of uninterrupted computation.
So I engaged with the community on the hpmuseum.org forum. The community is pretty amazing and some users, DavidM in particular, helped me with ideas to overcome the limitations, like substituting the table lookup with a formula that returns approximately the strength modifier given a value from the first column (cumulative weights) of the matrix seen before.
Several formulas were developed; one can see the process starting from post #100 in the hpmuseum.org topic. They were obtained with the best-fit procedures of Excel or of the 50g itself, applied to columns 1 and 3 of the matrix above. Mostly power fits were used.
I also derived the "exact" formula relating the cumulative weights to the strength modifier, so I could check the quality of the approximate formulas, although without applying super rigorous comparisons. The exact formula appears in the images below, scaled by a factor of 25; more information is given (until I report it here) in the discussion linked above. The exact formula was not used in the simulations because it was too bulky. For its derivation, see this image: http://i.imgur.com/vXdXvzk.jpg
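For reference, an exact inversion can also be derived directly from the reconstructed table (this is my own derivation, not the article's formula from the images, which is scaled by 25 and may look different). On the rising half the cumulative weight after k rows is 25k(k+1)/2, which a square root inverts; the falling half follows by symmetry:

```python
from math import ceil, floor, sqrt

def modifier_exact(nr):
    """Closed-form lookup: cumulative weight nr (1..7225) -> modifier.
    Derived from cum(k) = 25*k*(k+1)/2 on the rising half of the
    triangle, and the mirror image of that on the falling half."""
    if not 1 <= nr <= 7225:
        raise ValueError("nr must be in 1..7225")
    if nr <= 3825:  # rising half: modifiers -400..0
        k = ceil((-1 + sqrt(1 + 8 * nr / 25)) / 2)
        return -400 + (k - 1) * 25
    # falling half: count full rows of weight remaining below nr
    j = floor((-1 + sqrt(1 + 8 * (7225 - nr) / 25)) / 2)
    return 400 - j * 25
```

This reproduces the worked example above (cumulative weight 2100 gives modifier -100) and agrees with the table at both ends of the range.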
The approximate formula compared to the exact formula, multiplied by a factor of 25. The input of the formula is X, the "cumulative weight"; the output is the strength modifier. The formulas shown are valid for the second part of the triangle, between the values 3826 and 7225. The two functions plotted in the range 3826-7225 produce the following graph.
Pretty close to one another. For the range between 1 and 3825, the formulas (approximate, exact) are:
Producing the following graph, where the two formulas are mostly identical.
Until further evaluation of the approximation functions (there are several in the linked topic), the current choice is the approximate functions reported in the pictures above, for the given ranges.
Results
If not stated otherwise, the modifiers and their probability weights are the ones given above, in the matrix.
Double round robin
This is the reference for the comparisons in this article. Each team meets every other team twice. Scores are collected, and then a winner emerges (or multiple winners, in case of equal top scores).
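The format can be sketched as follows. The `play_game` helper is a hypothetical stand-in for the article's modifier mechanism; `double_round_robin` is my own name:

```python
import random
from itertools import permutations

def play_game(sa, sb, rng=random):
    # Hypothetical single-game model: base strength plus a bounded random
    # modifier (a stand-in for the article's weighted modifier table).
    while True:
        ea = sa + rng.choice(range(-400, 401, 25))
        eb = sb + rng.choice(range(-400, 401, 25))
        if ea != eb:
            return ea > eb  # True if the first team wins (no draws)

def double_round_robin(strengths, rng=random):
    """Each team meets every other team twice; returns the index (or
    indices, in case of equal top scores) of the top scorer(s)."""
    scores = [0] * len(strengths)
    # each ordered pair once == each unordered pairing played twice
    for a, b in permutations(range(len(strengths)), 2):
        if play_game(strengths[a], strengths[b], rng):
            scores[a] += 1
        else:
            scores[b] += 1
    top = max(scores)
    return [i for i, s in enumerate(scores) if s == top]
```

Running this many times and counting how often each index wins gives the percentages reported below.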
16 teams, base strength 1225-1600
Participating teams' strengths: 1225, 1250, 1275, 1300, 1325, 1350, 1375, 1400, 1425, 1450, 1475, 1500, 1525, 1550, 1575, 1600
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 34892 seconds
16 teams, base strength 1246-1936
Participating teams' strengths: 1246, 1292, 1338, 1384, 1430, 1476, 1522, 1568, 1614, 1660, 1706, 1752, 1798, 1844, 1890, 1936
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 32228 seconds
Single round robin
Like the double round robin, but the teams meet just once.
16 teams, base strength 1225-1600
Participating teams' strengths: 1225, 1250, 1275, 1300, 1325, 1350, 1375, 1400, 1425, 1450, 1475, 1500, 1525, 1550, 1575, 1600
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 16674 seconds
16 teams, base strength 1246-1936
Participating teams' strengths: 1246, 1292, 1338, 1384, 1430, 1476, 1522, 1568, 1614, 1660, 1706, 1752, 1798, 1844, 1890, 1936
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: about 18000 seconds
With a larger difference in base strength, the single round robin is quite effective at rewarding the stronger teams.
Triple round robin
Like the round robin, but the teams meet three times.
16 teams, base strength 1225-1600
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 47398 seconds
16 teams, base strength 1246-1936
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: about 47355 seconds
With a larger difference in base strength, the triple round robin is quite effective at rewarding the stronger teams.
Knockout best of 1
Starting from a random pairing of the teams, the winning teams stay and meet another winner, until only one team is left.
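A sketch of this bracket logic, again with a hypothetical `play_game` standing in for the article's modifier mechanism:

```python
import random

def play_game(sa, sb, rng=random):
    # Hypothetical single-game model (stand-in for the weighted table).
    while True:
        ea = sa + rng.choice(range(-400, 401, 25))
        eb = sb + rng.choice(range(-400, 401, 25))
        if ea != eb:
            return ea > eb  # True if the first team wins

def knockout_best_of_1(strengths, rng=random):
    """Random first-round pairing; winners advance round by round until
    one team is left. Assumes the number of teams is a power of two."""
    bracket = list(range(len(strengths)))
    rng.shuffle(bracket)
    while len(bracket) > 1:
        winners = []
        for a, b in zip(bracket[0::2], bracket[1::2]):
            winners.append(a if play_game(strengths[a], strengths[b], rng) else b)
        bracket = winners
    return bracket[0]  # index of the champion
```

Note that, unlike the round robin, a single bad game here eliminates a team outright, which is why this format turns out to be the most uncertain.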
16 teams, base strength 1225-1600
Participating teams' strengths: 1225, 1250, 1275, 1300, 1325, 1350, 1375, 1400, 1425, 1450, 1475, 1500, 1525, 1550, 1575, 1600
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 3219 seconds
16 teams, base strength 1246-1936
Participating teams' strengths: 1246, 1292, 1338, 1384, 1430, 1476, 1522, 1568, 1614, 1660, 1706, 1752, 1798, 1844, 1890, 1936
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 3480 seconds
With large differences in base strength between the teams, the knockout tournament tends to reward the stronger teams more.
Knockout best of 3
Like the knockout tournament, but for every match 3 games are played; the team that wins more games continues in the tournament.
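The match rule generalizes to any odd number of games. A sketch, with a hypothetical `play_game` as before:

```python
import random

def play_game(sa, sb, rng=random):
    # Hypothetical single-game model (stand-in for the weighted table).
    while True:
        ea = sa + rng.choice(range(-400, 401, 25))
        eb = sb + rng.choice(range(-400, 401, 25))
        if ea != eb:
            return ea > eb  # True if the first team wins

def best_of(n, sa, sb, rng=random):
    """Play all n games (n odd), as the article's model does, and return
    True if the first team wins the majority. Stopping as soon as one
    team reaches (n + 1) // 2 wins would produce the same winner."""
    wins = sum(play_game(sa, sb, rng) for _ in range(n))
    return wins > n // 2
```

Plugging `best_of` into the bracket in place of a single game gives the best-of-3, best-of-5 and best-of-7 variants below.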
16 teams, base strength 1225-1600
Participating teams' strengths: 1225, 1250, 1275, 1300, 1325, 1350, 1375, 1400, 1425, 1450, 1475, 1500, 1525, 1550, 1575, 1600
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 6646 seconds
16 teams, base strength 1246-1936
Participating teams' strengths: 1246, 1292, 1338, 1384, 1430, 1476, 1522, 1568, 1614, 1660, 1706, 1752, 1798, 1844, 1890, 1936
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 6667 seconds
Knockout best of 5
16 teams, base strength 1225-1600
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 9838 seconds
16 teams, base strength 1246-1936
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 9832 seconds
Knockout best of 7
16 teams, base strength 1225-1600
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 13059 seconds
16 teams, base strength 1246-1936
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
Execution time on the 50g: 13048 seconds
All together
Single RR, double RR, knockout bo1 and bo3; 16 teams, base strength 1225-1600
After 1000 tournaments the results, as a percentage of wins over 1000 tournaments, are the following.
From the graph one can clearly see that the round robin format rewards stronger teams, even more so when more games between the teams are played. The knockout format is more uncertain, rewarding strong and medium-strong teams, although there too, more games per match help to reward the stronger teams.
All in all, if one wants to reward the best teams, it is advisable to use either a round robin format or a knockout tournament with many games per match (maybe best of 5 or more; I hope to add the data about that).
The often used knockout formats based on "best of 1", "best of 2" (with tiebreaks) or "best of 3" for every match seem not so suitable to reward the strongest teams at the end of the tournament.
General observations
The more matches between the teams, the more the stronger teams are rewarded; see the double round robin, or knockout formats with many games per match. When the difference in base strength between the teams is large, the stronger teams win more decisively, in both round robin and knockout tournaments, although knockout tournaments remain more uncertain than round robin tournaments.
It is interesting to note that a single round robin is almost as effective as any knockout format up to best of 7. So one may use that type of tournament instead of a knockout format with many games.
Code
The code can be found (for the moment) following this link.
Further work that can be done
Further analysis can be done: expanding or reducing the number of teams, changing the teams' base strengths, changing the modifiers' weights, changing the values of the modifiers, and many other combinations. It may also be possible to justify the final results entirely with probability theory, without any simulation.
Other than that, other tournament formats can be compared.
Computing platform observations
Surely, for all this work it would be better to use faster platforms. The HP 50g with userRPL is a great portable computing environment for mathematics, but it is not so suitable for intensive computations. I would say it is good to do the first computations on the calculator, since the limited system pushes one to optimize the procedures a bit, but then port them to other platforms suitable for mathematical computations (or to the HP 50g C solution through hpgcc, see link1 and link2, which is pretty fast).
Side Observations
Formalizing new questions about concepts (note 1) that are often used but not analyzed (like using a tournament format without questioning its effectiveness) may happen quite late compared to when one starts using those concepts. That is both good and bad.
It is good because, when we do not question a concept enough, we still use it to get things done, although maybe with a partially improper application; it is bad because we may miss an entire world of information about the concept, because we just use it without checking its qualities. With more information we could use the concept in a more proper way.
Side side Observations
I'm not embedding images here on wikidot but I'm using imgur; this creates more effort if I want to save a source copy of this article, but I hope that both services will last long enough without, in the meantime, banning my contributions.
Note 1: by concept here I mean tournament formats, mathematical formulas, philosophical ideas, business models, libraries for a programming language, and so on.