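In the paper Rank Analysis of Incomplete Block Designs I: The Method of Paired Comparisons, Ralph Bradley and Milton Terry of Virginia Tech developed a logistic model for paired comparisons. Because the data are pairwise, the model applies to any one-on-one "comparison", and sports contests are the most common application. The model assumes that when i plays j, the probability that i wins is \(\Pi_{ij}\), and it is written as:
\[\operatorname{logit}(\Pi_{ij})=\beta_{i}-\beta_{j}, \qquad \Pi_{ij}=\Pr(i \text{ beats } j)\]
Inverting the logit gives:
\[\Pi_{ij} = \frac{\exp(\beta_{i}-\beta_{j})}{1+\exp(\beta_{i}-\beta_{j})}\]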
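As a minimal sketch of the formula above, the win probability can be written as a one-line R function. The name `bt_prob` is hypothetical and chosen only for illustration; `plogis()` is base R's inverse logit.

```r
# Bradley-Terry win probability for player i against player j,
# given abilities beta_i and beta_j on the log-odds scale.
# `bt_prob` is a hypothetical helper name, not part of any package.
bt_prob <- function(beta_i, beta_j) {
  d <- beta_i - beta_j
  plogis(d)  # identical to exp(d) / (1 + exp(d))
}

p_equal  <- bt_prob(0, 0)  # equal abilities
p_ahead  <- bt_prob(1, 0)  # i is one log-odds unit stronger than j
p_behind <- bt_prob(0, 1)  # the mirror-image matchup
print(c(equal = p_equal, ahead = p_ahead, behind = p_behind))
```

Only the difference between the two abilities matters, which is why the two mirrored probabilities sum to 1.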
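Who Is the GOAT of Tennis?
From 2003, when Roger Federer won his first Grand Slam title at The Championships, Wimbledon, through the 2021 US Open, Federer, Rafael Nadal, Novak Djokovic, and Andy Murray dominated men's tennis. Over those 18 years the four of them won 63 of 73 Grand Slam tournaments and also monopolized the ATP Masters events, which earned them the nickname Big Four.
With Murray sidelined by injury in recent years, the Big Four has gradually become the Big Three. Although Federer, Nadal, and Djokovic are nearing retirement, their grip on the tour has not loosened: each holds 20 Grand Slam titles and more than 80 ATP titles, and together they have long monopolized the world No. 1 ranking. Who the GOAT of tennis is has therefore been a favorite topic among fans. We can look for an answer in the head-to-head records among the four.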
Win \ Lose | Murray | Djokovic | Nadal | Federer |
---|---|---|---|---|
Murray | 0 | 11 | 7 | 11 |
Djokovic | 25 | 0 | 30 | 27 |
Nadal | 17 | 28 | 0 | 24 |
Federer | 14 | 23 | 16 | 0 |
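The table above is taken directly from the ATP website and covers matches through the end of the 2021 US Open. Each row is the winner and each column is the loser. We can rebuild the same table in R: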
> winner<-c("Andy Murray", "Novak Djokovic", "Rafael Nadal", "Roger Federer")
> loser<-c("Andy Murray", "Novak Djokovic", "Rafael Nadal", "Roger Federer")
> table<-list(Win=winner, Lose=loser)
> table<-expand.grid(table)
> data<-c(0,25,17,14,11,0,28,23,7,30,0,16,11,27,24,0)
> crosstab<-cbind(table, data)
> tennis<-xtabs(data~Win+Lose, crosstab)
> tennis
Lose
Win Andy Murray Novak Djokovic Rafael Nadal Roger Federer
Andy Murray 0 11 7 11
Novak Djokovic 25 0 30 27
Rafael Nadal 17 28 0 24
Roger Federer 14 23 16 0
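Next, load the BradleyTerry2 package (install it first if needed) and use countsToBinomial() to convert the table into paired-comparison form. Choosing two players out of four gives \(\binom{4}{2}=6\) matchups.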
> library(BradleyTerry2)
> Head2Head<-countsToBinomial(tennis)
> names(Head2Head)[3:4]<-c("Win", "Lose")
> Head2Head
player1 player2 Win Lose
1 Andy Murray Novak Djokovic 11 25
2 Andy Murray Rafael Nadal 7 17
3 Andy Murray Roger Federer 11 14
4 Novak Djokovic Rafael Nadal 30 28
5 Novak Djokovic Roger Federer 27 23
6 Rafael Nadal Roger Federer 24 16
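To make the conversion step concrete, here is a minimal base-R sketch that builds the same six rows by hand. It illustrates the idea only and is not the package's implementation; the names `wins`, `pair_idx`, and `h2h` are hypothetical, and unlike the real output the player columns here are plain character vectors.

```r
players <- c("Andy Murray", "Novak Djokovic", "Rafael Nadal", "Roger Federer")

# Rows are winners, columns are losers (same counts as the table above).
wins <- matrix(c( 0, 11,  7, 11,
                 25,  0, 30, 27,
                 17, 28,  0, 24,
                 14, 23, 16,  0),
               nrow = 4, byrow = TRUE,
               dimnames = list(Win = players, Lose = players))

# All choose(4, 2) = 6 unordered pairs, one pair per row.
pair_idx <- t(combn(players, 2))

h2h <- data.frame(
  player1 = pair_idx[, 1],
  player2 = pair_idx[, 2],
  win1 = wins[pair_idx],          # times player1 beat player2
  win2 = wins[pair_idx[, 2:1]]    # times player2 beat player1
)
print(h2h)
```

Indexing the matrix with a two-column character matrix looks up each (winner, loser) cell by name, so reversing the columns gives the opposite result of the same matchup.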
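Overall, Djokovic leads Murray, Nadal, and Federer 25-11, 30-28, and 27-23 respectively. His record against Nadal is close to even, so that edge is slight. Because Djokovic is the current world No. 1, we use him as the reference category when fitting the model: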
> model<-BTm(cbind(Win, Lose), player1, player2, formula=~player, id="player", refcat="Novak Djokovic", data=Head2Head)
> summary(model)
Call:
BTm(outcome = cbind(Win, Lose), player1 = player1, player2 = player2,
formula = ~player, id = "player", refcat = "Novak Djokovic",
data = Head2Head)
Deviance Residuals:
1 2 3 4 5 6
-0.2046 -0.2674 0.4786 0.3421 -0.5346 0.2242
Coefficients:
Estimate Std. Error z value Pr(>|z|)
playerAndy Murray -0.74728 0.25236 -2.961 0.00306 **
playerRafael Nadal 0.02088 0.20831 0.100 0.92017
playerRoger Federer -0.31241 0.21610 -1.446 0.14827
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 12.2482 on 6 degrees of freedom
Residual deviance: 0.7955 on 3 degrees of freedom
AIC: 30.825
Number of Fisher Scoring iterations: 3
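With Djokovic as the reference, the estimated log-odds coefficients for the other three players are -0.747 for Murray, 0.021 for Nadal, and -0.312 for Federer, and only Murray's is statistically significant. In the Bradley-Terry model, each parameter is a player's "ability" (ability parameter) relative to the reference player. Compared with Djokovic, Murray and Federer both trail in the head-to-head, while Nadal is essentially level. Only the Murray vs. Djokovic contrast is statistically significant, so on the current record (even though these matches are the full population rather than a sample) Djokovic is clearly ahead of Murray, but the data do not statistically separate him from Nadal or Federer.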
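Since the model is a logistic regression on ability differences, the same estimates can be sketched with base R's `glm()`: each matchup gets +1 in player1's column and -1 in player2's column, and the reference player's column is dropped. This is an illustrative reconstruction under that assumption, not a description of the package's internals; `X`, `X_ref`, and `h2h` are hypothetical names.

```r
players <- c("Andy Murray", "Novak Djokovic", "Rafael Nadal", "Roger Federer")

# The six matchups from the head-to-head table (indices refer to `players`).
h2h <- data.frame(
  p1   = c(1, 1, 1, 2, 2, 3),
  p2   = c(2, 3, 4, 3, 4, 4),
  win1 = c(11, 7, 11, 30, 27, 24),
  win2 = c(25, 17, 14, 28, 23, 16)
)

# Design matrix: +1 for player1, -1 for player2, 0 elsewhere.
X <- matrix(0, nrow = nrow(h2h), ncol = length(players),
            dimnames = list(NULL, players))
X[cbind(seq_len(nrow(h2h)), h2h$p1)] <- 1
X[cbind(seq_len(nrow(h2h)), h2h$p2)] <- -1

# Dropping the reference column fixes Djokovic's ability at 0.
X_ref <- X[, players != "Novak Djokovic"]

# No intercept: only ability differences enter the linear predictor.
fit <- glm(cbind(win1, win2) ~ X_ref - 1, family = binomial, data = h2h)
abilities <- setNames(coef(fit), colnames(X_ref))
print(round(abilities, 3))
```

If the sketch is right, the printed values should agree with the Estimate column of the summary above up to rounding.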
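BTabilities() prints the players' "ability" table directly, that is, the parameter estimates of the Bradley-Terry model. Because Djokovic is the reference category, his parameter is 0: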
> BTabilities(model)
ability s.e.
Andy Murray -0.74728269 0.2523606
Novak Djokovic 0.00000000 0.0000000
Rafael Nadal 0.02087824 0.2083145
Roger Federer -0.31240554 0.2160995
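With these log-odds parameters, we can apply the inverse logit to get Djokovic's estimated probability of beating each of the other three:
- vs. Murray: \(\frac{\exp(0-(-0.747))}{1+\exp(0-(-0.747))}\approx 68\%\)
- vs. Nadal: \(\frac{\exp(0-0.021)}{1+\exp(0-0.021)}\approx 49\%\)
- vs. Federer: \(\frac{\exp(0-(-0.312))}{1+\exp(0-(-0.312))}\approx 58\%\)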
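The three probabilities above can be reproduced in a few lines of base R. The ability values are copied from the BTabilities() output; the variable names are hypothetical.

```r
# Abilities copied from the BTabilities() output (Djokovic is the reference).
abilities <- c("Andy Murray"    = -0.74728269,
               "Novak Djokovic" =  0,
               "Rafael Nadal"   =  0.02087824,
               "Roger Federer"  = -0.31240554)

rivals <- setdiff(names(abilities), "Novak Djokovic")

# Inverse logit of (Djokovic's ability minus each rival's ability).
p_djokovic <- plogis(abilities[["Novak Djokovic"]] - abilities[rivals])
print(round(100 * p_djokovic))  # percentages, rounded to whole numbers
```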
To change the model's reference category, use update() with the refcat argument. For example, to switch the reference from Djokovic to Nadal:
> update(model, refcat="Rafael Nadal")
Bradley Terry model fit by glm.fit
Call: BTm(outcome = cbind(Win, Lose), player1 = player1, player2 = player2,
formula = ~player, id = "player", refcat = "Rafael Nadal",
data = Head2Head)
Coefficients:
playerAndy Murray playerNovak Djokovic playerRoger Federer
-0.76816 -0.02088 -0.33328
Degrees of Freedom: 6 Total (i.e. Null); 3 Residual
Null Deviance: 12.25
Residual Deviance: 0.7955 AIC: 30.83
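Changing the reference category only shifts every ability by the same constant, so the fit itself (deviance, AIC) is unchanged. A minimal sketch, using the ability values printed earlier and hypothetical variable names:

```r
# Abilities with Djokovic as the reference (from the BTabilities() output).
abilities <- c("Andy Murray"    = -0.74728269,
               "Novak Djokovic" =  0,
               "Rafael Nadal"   =  0.02087824,
               "Roger Federer"  = -0.31240554)

# Re-express everything relative to Nadal by subtracting his ability.
shifted <- abilities - abilities[["Rafael Nadal"]]
print(round(shifted, 5))

# Win probabilities depend only on differences, so they are unaffected.
p_before <- plogis(abilities[["Novak Djokovic"]] - abilities[["Roger Federer"]])
p_after  <- plogis(shifted[["Novak Djokovic"]] - shifted[["Roger Federer"]])
print(c(before = p_before, after = p_after))
```

The shifted values should match the coefficients printed by update() above, with Nadal now at 0.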
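Hard vs. Clay vs. Grass
Tennis is played on hard, clay, and grass courts, and different players excel on different surfaces: Nadal is known as the King of Clay, and former world No. 1 Pete Sampras was regarded as the master of grass. We list the Big Four's head-to-head records by surface and analyze how they perform on each: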
Hard | | | | |
---|---|---|---|---|
Win \ Lose | Murray | Djokovic | Nadal | Federer |
Murray | 0 | 8 | 5 | 10 |
Djokovic | 20 | 0 | 20 | 20 |
Nadal | 7 | 7 | 0 | 9 |
Federer | 12 | 18 | 11 | 0 |
Clay | | | | |
---|---|---|---|---|
Win \ Lose | Murray | Djokovic | Nadal | Federer |
Murray | 0 | 1 | 2 | 0 |
Djokovic | 5 | 0 | 8 | 4 |
Nadal | 7 | 19 | 0 | 14 |
Federer | 0 | 4 | 2 | 0 |
Grass | | | | |
---|---|---|---|---|
Win \ Lose | Murray | Djokovic | Nadal | Federer |
Murray | 0 | 2 | 0 | 1 |
Djokovic | 0 | 0 | 2 | 3 |
Nadal | 3 | 2 | 0 | 1 |
Federer | 2 | 1 | 3 | 0 |
Build the head-to-head tables for the three surfaces in R:
> data_hard<-c(0,20,7,12,8,0,7,18,5,20,0,11,10,20,9,0)
> data_clay<-c(0,5,7,0,1,0,19,4,2,8,0,2,0,4,14,0)
> data_grass<-c(0,0,3,2,2,0,2,1,0,2,0,3,1,3,1,0)
> crosstab_hard<-cbind(table, data_hard)
> crosstab_clay<-cbind(table, data_clay)
> crosstab_grass<-cbind(table, data_grass)
> hard<-xtabs(data_hard~Win+Lose, crosstab_hard)
> clay<-xtabs(data_clay~Win+Lose, crosstab_clay)
> grass<-xtabs(data_grass~Win+Lose, crosstab_grass)
> hard
Lose
Win Andy Murray Novak Djokovic Rafael Nadal Roger Federer
Andy Murray 0 8 5 10
Novak Djokovic 20 0 20 20
Rafael Nadal 7 7 0 9
Roger Federer 12 18 11 0
> clay
Lose
Win Andy Murray Novak Djokovic Rafael Nadal Roger Federer
Andy Murray 0 1 2 0
Novak Djokovic 5 0 8 4
Rafael Nadal 7 19 0 14
Roger Federer 0 4 2 0
> grass
Lose
Win Andy Murray Novak Djokovic Rafael Nadal Roger Federer
Andy Murray 0 2 0 1
Novak Djokovic 0 0 2 3
Rafael Nadal 3 2 0 1
Roger Federer 2 1 3 0
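Convert each table into paired-comparison form. On clay, Murray and Federer never met, so that matchup does not appear: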
> (Head2Head_hard<-countsToBinomial(hard))
player1 player2 win1 win2
1 Andy Murray Novak Djokovic 8 20
2 Andy Murray Rafael Nadal 5 7
3 Andy Murray Roger Federer 10 12
4 Novak Djokovic Rafael Nadal 20 7
5 Novak Djokovic Roger Federer 20 18
6 Rafael Nadal Roger Federer 9 11
> (Head2Head_clay<-countsToBinomial(clay))
player1 player2 win1 win2
1 Andy Murray Novak Djokovic 1 5
2 Andy Murray Rafael Nadal 2 7
3 Novak Djokovic Rafael Nadal 8 19
4 Novak Djokovic Roger Federer 4 4
5 Rafael Nadal Roger Federer 14 2
> (Head2Head_grass<-countsToBinomial(grass))
player1 player2 win1 win2
1 Andy Murray Novak Djokovic 2 0
2 Andy Murray Rafael Nadal 0 3
3 Andy Murray Roger Federer 1 2
4 Novak Djokovic Rafael Nadal 2 2
5 Novak Djokovic Roger Federer 3 1
6 Rafael Nadal Roger Federer 1 3
Fit the Bradley-Terry model for hard courts:
> model_hard<-BTm(cbind(win1, win2), player1, player2, formula=~player, id="player", refcat="Novak Djokovic", data=Head2Head_hard)
> summary(model_hard)
Call:
BTm(outcome = cbind(win1, win2), player1 = player1, player2 = player2,
formula = ~player, id = "player", refcat = "Novak Djokovic",
data = Head2Head_hard)
Deviance Residuals:
1 2 3 4 5 6
-0.2625 -0.4658 0.6247 0.7177 -0.7688 0.4106
Coefficients:
Estimate Std. Error z value Pr(>|z|)
playerAndy Murray -0.8073 0.2975 -2.713 0.00666 **
playerRafael Nadal -0.7419 0.3007 -2.467 0.01362 *
playerRoger Federer -0.3560 0.2613 -1.362 0.17305
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 12.6625 on 6 degrees of freedom
Residual deviance: 1.9509 on 3 degrees of freedom
AIC: 29.136
Number of Fisher Scoring iterations: 3
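In the hard-court model, Djokovic has the edge over all three rivals, and the gaps to Murray and Nadal are statistically significant, which reflects his dominance at the Australian Open. His hard-court edge over Federer is less clear. Next, fit the clay-court model: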
> model_clay<-BTm(cbind(win1, win2), player1, player2, formula=~player, id="player", refcat="Novak Djokovic", data=Head2Head_clay)
> summary(model_clay)
Call:
BTm(outcome = cbind(win1, win2), player1 = player1, player2 = player2,
formula = ~player, id = "player", refcat = "Novak Djokovic",
data = Head2Head_clay)
Deviance Residuals:
1 2 3 4 5
-0.7023 0.6636 0.1272 -0.7475 0.6968
Coefficients:
Estimate Std. Error z value Pr(>|z|)
playerAndy Murray -0.8959 0.6795 -1.318 0.1874
playerRafael Nadal 0.9188 0.3742 2.456 0.0141 *
playerRoger Federer -0.5316 0.5258 -1.011 0.3119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 20.592 on 5 degrees of freedom
Residual deviance: 1.994 on 2 degrees of freedom
AIC: 20.849
Number of Fisher Scoring iterations: 4
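In the clay model, Nadal's coefficient relative to Djokovic is 0.9188, so Djokovic's implied probability of beating Nadal on clay is:
\(\frac{\exp(0-0.9188)}{1+\exp(0-0.9188)}\approx 29\%\)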
This shows Nadal's dominance on clay.
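To get a rough sense of the uncertainty behind that clay figure, one option is a Wald interval built from the printed estimate and standard error and then mapped to the probability scale. This is an added illustration, not something the analysis above computes, and with so few matches the normal approximation is only a rough guide. Variable names are hypothetical.

```r
# Nadal's clay coefficient and standard error, copied from summary(model_clay).
est <- 0.9188
se  <- 0.3742

# 95% Wald interval on the log-odds scale (assumes approximate normality).
z <- qnorm(0.975)
ci_logodds <- est + c(-1, 1) * z * se

# Map to the probability that Nadal beats Djokovic on clay.
p_nadal    <- plogis(est)
ci_prob    <- plogis(ci_logodds)
p_djokovic <- 1 - p_nadal  # the complementary probability

print(round(c(estimate = p_nadal, lower = ci_prob[1], upper = ci_prob[2]), 3))
print(round(p_djokovic, 3))
```

Because the interval on the log-odds scale excludes 0, the corresponding probability interval excludes 50%, in line with the significant coefficient in the summary.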
Grass tells a different story. Fit the grass-court model:
> model_grass<-BTm(cbind(win1, win2), player1, player2, formula=~player, id="player", refcat="Novak Djokovic", data=Head2Head_grass)
> summary(model_grass)
Call:
BTm(outcome = cbind(win1, win2), player1 = player1, player2 = player2,
formula = ~player, id = "player", refcat = "Novak Djokovic",
data = Head2Head_grass)
Deviance Residuals:
1 2 3 4 5 6
1.9307 -1.6614 -0.1280 0.1063 1.1271 -1.0230
Coefficients:
Estimate Std. Error z value Pr(>|z|)
playerAndy Murray -0.4314 0.8774 -0.492 0.623
playerRafael Nadal 0.1063 0.7489 0.142 0.887
playerRoger Federer 0.1063 0.7489 0.142 0.887
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9.3643 on 6 degrees of freedom
Residual deviance: 8.8323 on 3 degrees of freedom
AIC: 21.868
Number of Fisher Scoring iterations: 4
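Grass is the surface with the fewest meetings among the Big Four, so none of the coefficients is statistically significant. Nadal and Federer have the same estimated coefficient (0.1063), so the model rates Djokovic's chances against each of them on grass as about the same.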
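As a compact recap, the sketch below converts Nadal's coefficient from each surface model into Djokovic's implied win probability against him. The coefficients are copied from the three summaries above; the names are hypothetical, and the grass value rests on very few matches.

```r
# Nadal's ability relative to Djokovic in each surface model
# (copied from summary(model_hard), summary(model_clay), summary(model_grass)).
nadal_coef <- c(hard = -0.7419, clay = 0.9188, grass = 0.1063)

# Djokovic is the reference (ability 0), so his win probability
# against Nadal is the inverse logit of (0 - Nadal's coefficient).
p_djokovic_vs_nadal <- plogis(0 - nadal_coef)
print(round(p_djokovic_vs_nadal, 2))
```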