Do winning decks in MTG World Championships have a particular card balance?

author: Paolo Saracco

date: 2024-05-14

Executive summary

Magic: The Gathering (MTG) is a well-known collectible card game created by Richard Garfield. The MTG World Championship (Worlds) is among the most important tournaments in the game, offering cash prizes of up to $100,000 to the winners. We ask whether winning decks in the Standard format share a particular combination of card types. We answer the question in the negative by comparing the number of Land, Creature, Instant and Sorcery, and Other cards in the top 4 decks of each World Championship from 1994 to 2022.

Libraries

In [ ]:
import requests
import re
from bs4 import BeautifulSoup
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
from statistics import mean
import scipy.stats as ss
import os

Data set

To collect the data we need, we scrape the website www.mtgtop8.com and save the result in a CSV file called raw_magic.csv.

In [ ]:
folder = os.getcwd()
data_dir = Path(folder, "data")
Path(data_dir).mkdir(parents=True, exist_ok=True)
data_file = Path(data_dir, "raw_magic.csv")

if not data_file.exists() :
    print(f"{data_file} does not exist. Creating it...")

    players = pd.DataFrame(columns = ["Deck", "Player", "Format", "Event", "Level", "Rank", "Date"])
    decks = pd.DataFrame()

    for year in range(1994,2023,1) :
        # we skip year 2023 because there are decks with empty name field and
        # the command < data = row.text.strip().split("\n\t\t  ") >
        # produces a list which is shorter than the others,
        # running into the error
        # < ValueError: cannot set a row with mismatched columns >
        print(f"I am scraping year {year}")

        my_status = 200
        deck_rows = ["some rows..."]
        page = 1

        while my_status == 200 and deck_rows != [] :
            # gets the content of the page corresponding to
            # searching with the parameters given below
            url = "https://www.mtgtop8.com/"
            parameters = {"current_page" : f"{page}", 
                        "event_titre" : "world",
                        "format" : "ST",
                        "compet_check[P]" : "1",
                        "date_start" : f"01/01/{year}",
                        "date_end" : f"31/12/{year}"}
            response = requests.get(url + "search", params = parameters)
            my_status = response.status_code

            if my_status == 200 :

                soup = BeautifulSoup(response.text, "html5lib")

                deck_rows = soup.find_all("tr", class_ = "hover_tr")

                if deck_rows == [] :
                    print(f"Finished with year {year}. Moving to the next...")

                else :
                    # extract the table on the given page
                    # and concatenates it to the "players"
                    for row in deck_rows :
                        data = row.text.strip().split("\n\t\t  ")
                        players.loc[len(players)] = data

                    # extracts the links to the decks on the given page
                    deck_links = soup.find_all("a", href = re.compile(".*&d=.*"))

                    # for each link, scrape the info about the deck
                    # and concatenate it to "decks"
                    deck_cards = []

                    for link in deck_links :
                        print(f"I am scraping year {year}, deck : {link.text}")
                        new_deck_url = url + link["href"]
                        response_new_deck = requests.get(new_deck_url)
                        soup_new_deck = BeautifulSoup(response_new_deck.text, "html5lib")
                        details_new_deck = soup_new_deck.find_all("div", class_ = "O14")
                        new_deck = {}
                        for elem in details_new_deck :
                            text_string = elem.text
                            text_string_list = text_string.split()
                            name = ""
                            value = 0
                            for text in text_string_list :
                                if text.isdigit() :
                                    value = int(text)
                                else :
                                    name = name + text + "_"
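                            # plain key instead of an f-string with nested double quotes,
                            # which is a syntax error before Python 3.12
                            new_deck[name.strip("_")] = value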

                        deck_cards.append(pd.DataFrame(new_deck, index = [0]))
                
                    deck_cards = pd.concat(deck_cards, ignore_index=True)
                    decks = pd.concat([decks, deck_cards], ignore_index=True)

                    page += 1

            else :
                print(f"Status code of page {page} for year {year} : {my_status}.")

    df = pd.concat([players,decks], axis=1)
    df.to_csv(data_file, index=False)
    print("File created.")

else :
    print("File already exists. Scraping skipped.")

df = pd.read_csv(data_file)
File already exists. Scraping skipped.
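
The header-parsing step inside the scraping loop can be isolated and tried on its own. The following is a minimal sketch of that step: the function name parse_deck_header and the sample strings are hypothetical, chosen only to reproduce column names that appear in the scraped data.

```python
def parse_deck_header(text):
    """Split a header such as '24 LANDS (29)' into a column name and a count."""
    name_parts = []
    value = 0
    for token in text.split():
        if token.isdigit():
            value = int(token)         # the purely numeric token is the card count
        else:
            name_parts.append(token)   # every other token is part of the column name
    return "_".join(name_parts), value


# Hypothetical header strings (not copied from mtgtop8)
print(parse_deck_header("20 LANDS"))               # ('LANDS', 20)
print(parse_deck_header("20 INSTANTS and SORC."))  # ('INSTANTS_and_SORC.', 20)
print(parse_deck_header("24 LANDS (29)"))          # ('LANDS_(29)', 24)
```

The last call illustrates why a parenthesised number ends up in the column name rather than in the value: "(29)" is not recognised as a digit string.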

Data cleaning

In [ ]:
df.head()
Out[ ]:
Deck Player Format Event Level Rank Date LANDS CREATURES INSTANTS_and_SORC. OTHER_SPELLS SIDEBOARD LANDS_(29) LANDS_(26) LANDS_(27) LANDS_(28)
0 RG Aggro Bertrand Lestrée Standard Worlds 1994 (Milwaukee) * NaN 2 01/08/94 20.0 12.0 20.0 8.0 0.0 NaN NaN NaN NaN
1 Wug Control Zak Dolan Standard Worlds 1994 (Milwaukee) * NaN 1 01/08/94 15.0 11.0 15.0 19.0 0.0 NaN NaN NaN NaN
2 Zoo Dominic Symens Standard Worlds 1994 (Milwaukee) * NaN 3-4 01/08/94 16.0 12.0 19.0 13.0 0.0 NaN NaN NaN NaN
3 Zoo Cyrille de Foucaud Standard Worlds 1994 (Milwaukee) * NaN 3-4 01/08/94 23.0 16.0 20.0 7.0 0.0 NaN NaN NaN NaN
4 Rack Control Alexander Blumke Standard Worlds 1995 (Seattle) * NaN 1 01/08/95 25.0 6.0 18.0 14.0 0.0 NaN NaN NaN NaN
In [ ]:
data = df.copy()
data = data.drop(labels = ["Level","SIDEBOARD",'Format'], axis = 1)
data["Date"] = data["Date"].apply(lambda date : datetime.datetime.strptime(date, "%d/%m/%y"))
print("\n".join(data.Event.unique()))
Worlds 1994 (Milwaukee) *
Worlds 1995 (Seattle) *
Worlds 1996 (Seattle)
Worlds 1997 (Seattle) *
Worlds 1998 (Seattle)
Worlds 1999 (Yokohama)
Worlds 2000 (Brussels)
Worlds 2001 (Toronto)
Worlds 2002 (Sydney)
Worlds 2003 (Berlin)
Worlds 2004 (San Francisco)
Worlds 2005 (Yokohama)
Worlds 2006 (Paris)
Worlds 2007 (New York)
Worlds 2008 (Memphis)
Worlds 2009 (Rome)
Worlds 2010 (Chiba)
Worlds 2011 - Top 8
Worlds 2011 - Undefeated Standard
Worlds Standard (15 points and original)
2012 World Magic Cup
2013 World Championship
2014 World Magic Cup
2014 World Championship - Day 2
World Magic Cup 2015
2015 World Championship
2016 World Championship
2017 World Magic Cup
Worlds 2017
2018 World Magic Cup
Worlds 2018 (Las Vegas)
World Championship XXVI Honolulu
Magic World Championship XXVII (2021)
Magic World Championship XXVIII

First of all, we are only interested in the individual World Championship rankings.

From Wikipedia:

In 2012, the Magic World Championship structure was drastically altered alongside changes to the ranking system used in Magic: The Gathering. The individual World Championship was changed from a Pro Tour-sized event to a sixteen-player event, which was called the Magic Players Championship (though the tournament reverted to being called the World Championship for 2013). The team event, formerly held alongside the individual event, took place before the individual tournament and was contested by four-player teams instead of the previous three-player teams.

Team World Championship: The Team World Championship consists of three-player teams, with each team representing one country. Players that are eligible to play in the Team World Championship are the first, second and third place players at a country's National Championship.

World Magic Cup: In 2012, the Team World Championship became a single separate event called the 'World Magic Cup'. This national team event consists of four-player teams representing selected countries. The four players eligible to play in each national team will be the three winners of World Magic Cup qualifiers and the National Champion (the player finished with most Pro Points in previous season) of the country. The National Champion is the highest ranked player from that country in that year's rankings.

This means that we do not consider the Pro Tours, including the 2012 Pro Tour that replaced the World Championship, and that we drop the rows corresponding to the World Magic Cup.

Moreover, in 2011: The eighteenth Magic World Championship was held from 17 to 20 November in the Fort Mason Center in San Francisco, the same site that already hosted the 2004 World Championship. The tournament consisted of six rounds of Standard on Thursday, two Innistrad Booster Drafts of three rounds each on Friday, and six rounds of Modern on Saturday. This would be the first World Championship to feature the new Modern format. On Sunday, the Top 8 players played against each other in elimination rounds, using the Standard decks they played on Thursday.

We are only interested in the Top 8.

In [ ]:
data = data[~data['Event'].str.contains('Cup')]
data = data[~data['Event'].str.contains('Undefeated')]
data = data[~data['Event'].str.contains('15 points')]
print("\n".join(data.Event.unique()))
Worlds 1994 (Milwaukee) *
Worlds 1995 (Seattle) *
Worlds 1996 (Seattle)
Worlds 1997 (Seattle) *
Worlds 1998 (Seattle)
Worlds 1999 (Yokohama)
Worlds 2000 (Brussels)
Worlds 2001 (Toronto)
Worlds 2002 (Sydney)
Worlds 2003 (Berlin)
Worlds 2004 (San Francisco)
Worlds 2005 (Yokohama)
Worlds 2006 (Paris)
Worlds 2007 (New York)
Worlds 2008 (Memphis)
Worlds 2009 (Rome)
Worlds 2010 (Chiba)
Worlds 2011 - Top 8
2013 World Championship
2014 World Championship - Day 2
2015 World Championship
2016 World Championship
Worlds 2017
Worlds 2018 (Las Vegas)
World Championship XXVI Honolulu
Magic World Championship XXVII (2021)
Magic World Championship XXVIII
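
As a side note, the three filters can also be expressed as a single regular expression. The sketch below runs on a small frame built from five of the event names listed above; the variable names events, unwanted and kept are illustrative.

```python
import pandas as pd

events = pd.DataFrame({"Event": [
    "Worlds 2011 - Top 8",
    "Worlds 2011 - Undefeated Standard",
    "Worlds Standard (15 points and original)",
    "2012 World Magic Cup",
    "2013 World Championship",
]})

# One pattern covering the three substrings filtered out above
unwanted = r"Cup|Undefeated|15 points"
kept = events[~events["Event"].str.contains(unwanted, regex=True)]

print(kept["Event"].tolist())
# ['Worlds 2011 - Top 8', '2013 World Championship']
```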

Then we check how the rankings appear in the data.

In [ ]:
print(data.Rank.unique())
['2' '1' '3-4' '5-8' 'Day 1 undefeated' 'Other' nan '9' '11' '10' '6' '16'
 '8' '12' '13' '7' '15' '14' '5' '4' '3' '17-32' '27' '20' '19' '31' '26'
 '17' '22' '24' '25' '28' '21' '18' '29' '32' '23']

By inspecting the Wikipedia page, we realise that in recent years the elimination stage on the last day is a Top 4 only, played in the Standard format:

The 2016 World Championship was held from 1–4 September in Seattle. The formats used in the competition were Eldritch Moon-Shadows over Innistrad Booster Draft for rounds 1–3, Standard for rounds 4–7, Eldritch Moon-Shadows over Innistrad Booster Draft for rounds 8–10, Modern for rounds 11–14, and Standard for the Top 4.

The 2017 World Championship was held from 6–8 October in Boston. For the 2017 World Championship Wizards of the Coast decided to simplify the invitation criteria, awarding a large portion of invites to the players with the most Pro Points in the 2016–17 Pro Tour season. The formats used in the competition were Ixalan Booster Draft for rounds 1–3, Standard for rounds 4–7, Ixalan Booster Draft for rounds 8–10, Standard for rounds 11–14, and Standard for the Top 4.

By virtue of this observation, we keep only the decks that finished in the top 4 of each year and drop everything else.

All the Day 1 undefeated decks ranked 9th or lower, since they only appear in years where the rankings 1 to 8 are already assigned:

In [ ]:
print('Dates in which we have Day 1 undefeated rankings :')
for date in data[data['Rank'] == 'Day 1 undefeated'].Date.unique() :
    print(date.date())
    print('rankings : ' + str(data[data['Date'].dt.year == date.year].Rank.unique()))
Dates in which we have Day 1 undefeated rankings :
1997-08-17
rankings : ['1' '2' '3-4' '5-8' 'Day 1 undefeated']
1999-08-08
rankings : ['Day 1 undefeated' '5-8' 'Other' '3-4' '2' '1']
2001-08-12
rankings : ['1' '2' '3-4' '5-8' 'Day 1 undefeated']
2002-08-18
rankings : ['3-4' '5-8' '1' '2' 'Day 1 undefeated']
2003-08-10
rankings : ['1' '5-8' '3-4' '2' 'Day 1 undefeated']
2004-09-05
rankings : ['1' '2' '3-4' '5-8' 'Day 1 undefeated']
2005-12-04
rankings : ['1' '2' '3-4' '5-8' 'Day 1 undefeated']
2006-12-03
rankings : ['1' '2' '3-4' '5-8' 'Day 1 undefeated']
2007-12-06
rankings : ['Day 1 undefeated' nan '5-8' '1' '3-4' '2']
2008-12-14
rankings : ['5-8' '3-4' '2' '1' 'Day 1 undefeated']
2009-11-19
rankings : ['3-4' 'Day 1 undefeated' '2' '5-8' '1']
2010-12-12
rankings : ['2' '1' '5-8' '3-4' 'Day 1 undefeated' 'Other']

Since we are interested only in the top 4 decks, we can safely drop these rows. Concerning the NaN's, instead:

In [ ]:
print('Dates in which we have NaN rankings :')
for date in data[pd.isna(data['Rank'])].Date.unique() :
    print(date.date())
    print('rankings : ' + str(data[data['Date'].dt.year == date.year].Rank.unique()))
Dates in which we have NaN rankings :
2007-12-06
rankings : ['Day 1 undefeated' nan '5-8' '1' '3-4' '2']
2016-09-02
rankings : ['Other' '1' '2' nan '3-4']
2018-09-23
rankings : [nan '3-4' '1' '2']

All the NaN's ranked 5th or lower, so we can safely drop them as well. Concerning the Other rankings:

In [ ]:
print('Dates in which we have Other rankings :')
for date in data[data['Rank'] == 'Other'].Date.unique() :
    print(date.date())
    print('rankings : ' + str(data[data['Date'].dt.year == date.year].Rank.unique()))
Dates in which we have Other rankings :
1999-08-08
rankings : ['Day 1 undefeated' '5-8' 'Other' '3-4' '2' '1']
2010-12-12
rankings : ['2' '1' '5-8' '3-4' 'Day 1 undefeated' 'Other']
2016-09-02
rankings : ['Other' '1' '2' nan '3-4']
2017-10-07
rankings : ['1' '2' '3-4' 'Other']

Again, all of them ranked 5th or lower, so we can safely drop them.
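
The three inspection loops above differ only in the condition that selects the rows. A minimal sketch of a shared helper follows; the function name ranks_in_years_with and the toy frame are hypothetical and only mimic the structure of the real data.

```python
import pandas as pd


def ranks_in_years_with(frame, mask):
    """For every year with at least one row selected by `mask`,
    list all the rankings recorded in that year."""
    years = frame.loc[mask, "Date"].dt.year.unique()
    return {
        int(year): frame.loc[frame["Date"].dt.year == year, "Rank"].unique().tolist()
        for year in years
    }


# Toy data for illustration only
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2016-09-02"] * 4 + ["2017-10-07"] * 3),
    "Rank": ["1", "2", "3-4", "Other", "1", "2", "3-4"],
})

print(ranks_in_years_with(toy, toy["Rank"] == "Other"))
# {2016: ['1', '2', '3-4', 'Other']}
```

On the real data, the masks would be data['Rank'] == 'Day 1 undefeated', pd.isna(data['Rank']) and data['Rank'] == 'Other'.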

Finally, the rankings for the year 2014 have some issues: we correct them by hand, following the Wikipedia page.

The final standings were as follows:

  • Shahar Shenhar, Israel (2013 World Champion)
  • Patrick Chapin, United States (Pro Tour Journey into Nyx winner)
  • Yuuya Watanabe, Japan (Pro Point leader Japan)
  • Kentaro Yamamoto, Japan (8th most Pro Points of otherwise unqualified)
In [ ]:
# np.nan instead of np.NaN: the latter alias was removed in NumPy 2.0
data.loc[data['Date'].dt.year == 2014,'Rank'] = np.nan
data.loc[(data['Date'].dt.year == 2014) & (data['Player'] == 'Shahar Shenhar'),'Rank'] = '1'
data.loc[(data['Date'].dt.year == 2014) & (data['Player'] == 'Patrick Chapin'),'Rank'] = '2'
data.loc[(data['Date'].dt.year == 2014) & (data['Player'] == 'Yuuya Watanabe'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2014) & (data['Player'] == 'Kentaro Yamamoto'),'Rank'] = '4'

We can now proceed to select only the observations of interest:

In [ ]:
data = data[data['Rank'].isin(['1', '2', '3', '4', '3-4'])]

One final issue comes from two-faced cards: from a certain point on, some decks contain cards with two faces, one of which is a land. For these decks the scraped data report in parentheses the maximum number of lands the deck can count on, which generates new columns.

In [ ]:
print(", ".join(data.columns))
Deck, Player, Event, Rank, Date, LANDS, CREATURES, INSTANTS_and_SORC., OTHER_SPELLS, LANDS_(29), LANDS_(26), LANDS_(27), LANDS_(28)

To begin with, we consider these two-faced cards as non-lands.

In [ ]:
lands_columns = [col for col in data.columns if "LANDS" in col]

for i in data.index :
    for col in lands_columns :
        if not pd.isna(data.loc[i,col]) : 
            data.loc[i,"Lands"] = data.loc[i,col]

data = data.drop(labels = lands_columns, axis = 1)
data = data.rename(columns = {"CREATURES" : "Creatures",
                          "INSTANTS_and_SORC." : "Instants_Sorceries", 
                          "OTHER_SPELLS" : "Other_spells"})

Commenting out the previous cell and uncommenting the following one treats them as lands instead.

In [ ]:
# for i in data.index :
#     if not pd.isna(data.loc[i,'LANDS_(26)']) :
#         data.loc[i,"Lands"] = 26
#     elif not pd.isna(data.loc[i,'LANDS_(27)']) :
#         data.loc[i,"Lands"] = 27
#     elif not pd.isna(data.loc[i,'LANDS_(28)']) :
#         data.loc[i,"Lands"] = 28
#     elif not pd.isna(data.loc[i,'LANDS_(29)']) :
#         data.loc[i,"Lands"] = 29
#     else :
#         data.loc[i,"Lands"] = data.loc[i,"LANDS"]

# lands_columns = [col for col in data.columns if "LANDS" in col]

# data = data.drop(labels = lands_columns, axis = 1)
# data = data.rename(columns = {"CREATURES" : "Creatures",
#                           "INSTANTS_and_SORC." : "Instants_Sorceries", 
#                           "OTHER_SPELLS" : "Other_spells"})
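
Both interpretations can also be written without looping over the index. The sketch below uses a toy frame with hypothetical values and assumes, as in the scraped data, that each row has exactly one non-missing lands column and that the number in parentheses in the column name is the maximum number of lands.

```python
import re

import numpy as np
import pandas as pd

# Toy data: the second deck has 24 printed lands and at most 29 (values are hypothetical)
toy = pd.DataFrame({
    "LANDS":      [20.0, np.nan],
    "LANDS_(29)": [np.nan, 24.0],
})
lands_columns = [col for col in toy.columns if "LANDS" in col]

# Option 1: keep the printed count (two-faced cards counted as non-lands).
# bfill(axis=1) pulls the first non-missing value of each row into the first column.
as_non_lands = toy[lands_columns].bfill(axis=1).iloc[:, 0]


# Option 2: use the maximum in parentheses whenever it is available.
def max_lands(row):
    for col in lands_columns:
        match = re.search(r"\((\d+)\)", col)
        if match and pd.notna(row[col]):
            return float(match.group(1))
    return row["LANDS"]


as_lands = toy.apply(max_lands, axis=1)

print(as_non_lands.tolist())  # [20.0, 24.0]
print(as_lands.tolist())      # [20.0, 29.0]
```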

To proceed with the data cleaning phase, let us look for additional NaN values.

In [ ]:
pd.isna(data).sum(axis=0)
Out[ ]:
Deck                   0
Player                 0
Event                  0
Rank                   0
Date                   0
Creatures              8
Instants_Sorceries     1
Other_spells          18
Lands                  1
dtype: int64

Missing values in Creatures, Instants_Sorceries and Other_spells might not be an issue (see later), but missing Lands are:

In [ ]:
data[np.isnan(data['Lands'])]
Out[ ]:
    Deck                    Player        Event                    Rank  Date        Creatures  Instants_Sorceries  Other_spells  Lands
10  (Red Artifact Control)  Mark Justice  Worlds 1995 (Seattle) *  3-4   1995-08-01        NaN                 NaN          60.0    NaN

Inspecting mtgtop8 reveals that the composition of this deck is unknown. We are forced to drop it.

In [ ]:
data = data[~np.isnan(data['Lands'])]
pd.isna(data).sum(axis=0)
Out[ ]:
Deck                   0
Player                 0
Event                  0
Rank                   0
Date                   0
Creatures              7
Instants_Sorceries     0
Other_spells          18
Lands                  0
dtype: int64

In all the other cases, the NaN value stands for 0:

In [ ]:
data[pd.isna(data['Creatures'])]
Out[ ]:
Deck Player Event Rank Date Creatures Instants_Sorceries Other_spells Lands
5 WR Control Marc Hernandez Worlds 1995 (Seattle) * 2 1995-08-01 NaN 15.0 27.0 20.0
76 Wake Daniel Zink Worlds 2003 (Berlin) 1 2003-08-10 NaN 26.0 7.0 27.0
82 Wake Tuomo Nieminen Worlds 2003 (Berlin) 3-4 2003-08-10 NaN 25.0 9.0 26.0
83 Wake Jin Okamoto Worlds 2003 (Berlin) 2 2003-08-10 NaN 27.0 7.0 26.0
90 KCI Manuel Bevand Worlds 2004 (San Francisco) 3-4 2004-09-05 NaN 18.0 23.0 19.0
102 Enduring Ideal Akira Asahara Worlds 2005 (Yokohama) 3-4 2005-12-04 NaN 17.0 20.0 23.0
514 Izzet Control Ondřej Stráský Magic World Championship XXVII (2021) 4 2021-10-05 NaN 37.0 NaN 23.0
In [ ]:
data[pd.isna(data['Other_spells'])]
Out[ ]:
Deck Player Event Rank Date Creatures Instants_Sorceries Other_spells Lands
20 Black Five Jakub Slemr Worlds 1997 (Seattle) * 1 1997-08-17 25.0 14.0 NaN 22.0
21 URw Aggro Janosch Kuhn Worlds 1997 (Seattle) * 2 1997-08-17 8.0 30.0 NaN 22.0
49 Ice Age.dec Dominik Hothow Worlds 2000 (Brussels) 3-4 2000-08-06 25.0 11.0 NaN 24.0
57 Machinehead Tom Van de Logt Worlds 2001 (Toronto) 1 2001-08-12 18.0 18.0 NaN 24.0
67 Psychatog Diego Ostrovich Worlds 2002 (Sydney) 3-4 2002-08-18 8.0 28.0 NaN 24.0
71 Psychatog Carlos Romão Worlds 2002 (Sydney) 1 2002-08-18 8.0 28.0 NaN 24.0
72 Psychatog Dave Humpherys Worlds 2002 (Sydney) 3-4 2002-08-18 9.0 26.0 NaN 25.0
73 BurningTog Mark Ziegner Worlds 2002 (Sydney) 2 2002-08-18 13.0 23.0 NaN 24.0
78 UG Madness David Humpherys Worlds 2003 (Berlin) 3-4 2003-08-10 18.0 20.0 NaN 22.0
89 Goblins Ryou Ogura Worlds 2004 (San Francisco) 3-4 2004-09-05 33.0 4.0 NaN 23.0
111 Boros Nicholas Lovett Worlds 2006 (Paris) 3-4 2006-12-03 23.0 16.0 NaN 21.0
192 Quick' n Toast Jamie Parke Worlds 2008 (Memphis) 2 2008-12-14 11.0 23.0 NaN 26.0
291 UWx Midrange Shahar Shenhar 2013 World Championship 1 2013-08-01 10.0 24.0 NaN 26.0
412 Collected Bant Shota Yasooka 2016 World Championship 3-4 2016-09-02 25.0 9.0 NaN 26.0
437 Temur Energy William Jensen Worlds 2017 1 2017-10-07 23.0 15.0 NaN 22.0
480 Red Deck Wins Benjamin Stark Worlds 2018 (Las Vegas) 3-4 2018-09-23 27.0 11.0 NaN 22.0
514 Izzet Control Ondřej Stráský Magic World Championship XXVII (2021) 4 2021-10-05 NaN 37.0 NaN 23.0
515 Izzet Control Yuta Takahashi Magic World Championship XXVII (2021) 1 2021-10-05 8.0 32.0 NaN 20.0
In [ ]:
data = data.fillna(0)
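
One way to check that reading NaN as 0 is plausible is to verify that the four counts of each deck still add up to a legal main deck. The sketch below does so for three rows copied from the tables above (indices 5, 76 and 20), under the assumption that a Standard main deck contains at least 60 cards.

```python
import numpy as np
import pandas as pd

# Three rows copied from the tables above
sample = pd.DataFrame({
    "Player": ["Marc Hernandez", "Daniel Zink", "Jakub Slemr"],
    "Creatures": [np.nan, np.nan, 25.0],
    "Instants_Sorceries": [15.0, 26.0, 14.0],
    "Other_spells": [27.0, 7.0, np.nan],
    "Lands": [20.0, 27.0, 22.0],
})
card_columns = ["Creatures", "Instants_Sorceries", "Other_spells", "Lands"]

# Total number of main-deck cards once NaN is read as 0
totals = sample[card_columns].fillna(0).sum(axis=1)

print(totals.tolist())         # [62.0, 60.0, 61.0]
print((totals >= 60).all())    # True
```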

To conclude, we need to adjust the rankings:

In [ ]:
data.value_counts(subset = ["Rank"])
Out[ ]:
Rank
3-4     43
1       27
2       27
3        5
4        5
Name: count, dtype: int64

There are 43 decks with the ambiguous ranking 3-4. We assign them the ranks 3 and 4 by hand, following Wikipedia (checked on 13/05/2024), and we also correct two typos in the player names.

In [ ]:
data.loc[(data['Date'].dt.year == 1994) & (data['Player'] == 'Dominic Symens'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 1994) & (data['Player'] == 'Cyrille de Foucaud'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 1995) & (data['Player'] == 'Henry Stern'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 1996) & (data['Player'] == 'Henry Stern'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 1996) & (data['Player'] == 'Olle RÃ¥de'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 1997) & (data['Player'] == 'Paul McCabe'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 1997) & (data['Player'] == 'Svend Sparre Geertsen'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 1998) & (data['Player'] == 'Jon Finkel'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 1998) & (data['Player'] == 'Raphael Levy'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 1999) & (data['Player'] == 'Raffaele Lo Moro'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 1999) & (data['Player'] == 'Matt Linde'),'Rank'] = '4'

data.loc[(data['Date'].dt.year == 2000) & (data['Player'] == 'Dominik Hothow'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2000) & (data['Player'] == 'Benedikt Klauser'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2001) & (data['Player'] == 'Antoine Ruel'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2001) & (data['Player'] == 'Andrea Santin'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2002) & (data['Player'] == 'Diego Ostrovich'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2002) & (data['Player'] == 'Dave Humpherys'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2003) & (data['Player'] == 'Tuomo Nieminen'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2003) & (data['Player'] == 'David Humpherys'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2004) & (data['Player'] == 'Ryou Ogura'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2004) & (data['Player'] == 'Manuel Bevand'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2005) & (data['Player'] == 'Tomohiro Kaji'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2005) & (data['Player'] == 'Akira Asahara'),'Rank'] = '4'

data.loc[(data['Date'].dt.year == 2006) & (data['Player'] == 'Nicholas Lovett'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2006) & (data['Player'] == 'Gabriel Nassif'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2007) & (data['Player'] == 'Gabriel Nassif'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2007) & (data['Player'] == 'Kotaro Otsuka'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2008) & (data['Player'] == 'Tsuyoshi Ikeda'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2008) & (data['Player'] == 'Hannes Kerem'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2009) & (data['Player'] == 'Terry Soh'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2009) & (data['Player'] == 'Bram Snepvangers'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2010) & (data['Player'] == 'Paulo Vitor Damo da Rosa'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2010) & (data['Player'] == 'Love Janse'),'Rank'] = '4'

data.loc[(data['Date'].dt.year == 2011) & (data['Player'] == 'Conley Woods'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2011) & (data['Player'] == 'David Caplan'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2013) & (data['Player'] == 'Ben Stark'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2013) & (data['Player'] == 'Josh Utter-Leyton'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2016) & (data['Player'] == 'Oliver Tiu'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2016) & (data['Player'] == 'Shota Yasooka'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2017) & (data['Player'] == 'Josh Utter-leyton'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2017) & (data['Player'] == 'Kelvin Chew'),'Rank'] = '4'
data.loc[(data['Date'].dt.year == 2018) & (data['Player'] == 'Benjamin Stark'),'Rank'] = '3'
data.loc[(data['Date'].dt.year == 2018) & (data['Player'] == 'Shahar Shenhar'),'Rank'] = '4'

data.loc[(data['Date'].dt.year == 2013) & (data['Player'] == 'Ben Stark'),'Player'] = 'Benjamin Stark'
data.loc[(data['Date'].dt.year == 2017) & (data['Player'] == 'Josh Utter-leyton'),'Player'] = 'Josh Utter-Leyton'

data['Rank'] = data['Rank'].astype('category')
data.value_counts(subset = ["Rank"])
Out[ ]:
Rank
1       27
2       27
3       27
4       26
Name: count, dtype: int64
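
The long list of assignments above could equally be driven by a lookup table. A minimal sketch on the three 1994 rows follows; the dictionary manual_ranks is illustrative and contains only two of the entries.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": pd.to_datetime(["1994-08-01"] * 3),
    "Player": ["Zak Dolan", "Dominic Symens", "Cyrille de Foucaud"],
    "Rank": ["1", "3-4", "3-4"],
})

# (year, player) -> corrected rank; only two entries shown
manual_ranks = {
    (1994, "Dominic Symens"): "3",
    (1994, "Cyrille de Foucaud"): "4",
}

keys = zip(toy["Date"].dt.year.tolist(), toy["Player"].tolist())
# Keep the existing rank whenever no manual correction is listed
toy["Rank"] = [manual_ranks.get(key, rank) for key, rank in zip(keys, toy["Rank"])]

print(toy["Rank"].tolist())  # ['1', '3', '4']
```

Keeping the corrections in one dictionary makes it easier to compare them against the source at a glance.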

We can now work with our cleaned data set:

In [ ]:
data.to_csv("data/magic.csv", index=False)

folder = os.getcwd()
cleaned_data = Path(folder, "data", "magic.csv")
data = pd.read_csv(cleaned_data)

data["Date"] = data["Date"].apply(lambda date : datetime.datetime.strptime(date, "%Y-%m-%d"))
data['Rank'] = data['Rank'].astype('category')

Exploratory data analysis

In [ ]:
data.head()
Out[ ]:
   Deck          Player              Event                      Rank  Date        Creatures  Instants_Sorceries  Other_spells  Lands
0  RG Aggro      Bertrand Lestrée    Worlds 1994 (Milwaukee) *  2     1994-08-01       12.0                20.0           8.0   20.0
1  Wug Control   Zak Dolan           Worlds 1994 (Milwaukee) *  1     1994-08-01       11.0                15.0          19.0   15.0
2  Zoo           Dominic Symens      Worlds 1994 (Milwaukee) *  3     1994-08-01       12.0                19.0          13.0   16.0
3  Zoo           Cyrille de Foucaud  Worlds 1994 (Milwaukee) *  4     1994-08-01       16.0                20.0           7.0   23.0
4  Rack Control  Alexander Blumke    Worlds 1995 (Seattle) *    1     1995-08-01        6.0                18.0          14.0   25.0
In [ ]:
data.describe()
Out[ ]:
                                Date   Creatures  Instants_Sorceries  Other_spells       Lands
count                            107  107.000000          107.000000    107.000000  107.000000
mean   2008-04-10 00:53:49.906542080   15.355140           15.084112      6.242991   23.485981
min              1994-08-01 00:00:00    0.000000            2.000000      0.000000   15.000000
25%              2001-02-07 12:00:00    9.000000            9.000000      2.000000   22.000000
50%              2007-12-06 00:00:00   16.000000           15.000000      4.000000   24.000000
75%              2015-08-30 00:00:00   22.500000           19.500000      8.000000   25.000000
max              2022-10-28 00:00:00   33.000000           37.000000     28.000000   28.000000
std                              NaN    8.427552            7.817640      6.032680    2.400332

A first interesting question is how the composition of the decks evolves over time.

In [ ]:
data_mean_byYear = data[['Date','Creatures','Instants_Sorceries','Other_spells','Lands']].groupby('Date', observed=False).mean()
figure, axis = plt.subplots(2, 2, figsize=(12,8)) 

# the x axis is the event date; the y axis is the mean count of the given card type
axis[0,0].plot(data_mean_byYear["Lands"], 'r.-', label='Lands')
axis[0,0].set(xlabel='Date',ylabel='Lands')
axis[0,1].plot(data_mean_byYear["Creatures"], 'b.-', label='Creatures')
axis[0,1].set(xlabel='Date',ylabel='Creatures')
axis[1,0].plot(data_mean_byYear["Instants_Sorceries"], 'g.-', label='Instants & Sorceries')
axis[1,0].set(xlabel='Date',ylabel='Instants & Sorceries')
axis[1,1].plot(data_mean_byYear["Other_spells"], 'y.-', label='Other spells')
axis[1,1].set(xlabel='Date',ylabel='Other spells')

plt.show()
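[Figure: 2×2 grid of line plots showing the mean number of Lands, Creatures, Instants & Sorceries and Other spells in the top 4 decks at each event date.]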

Another interesting question is how the composition of the winning decks evolves over time.

In [ ]:
Champs = data[data["Rank"] == 1]
Champs = Champs.reset_index().drop(['index','Rank'], axis=1)

figure, axis = plt.subplots(2, 2, figsize=(12,8)) 

# the x axis is the event date; the y axis is the count in the winning deck
axis[0,0].plot(Champs["Date"], Champs["Lands"], 'r.-', label='Lands')
axis[0,0].set(xlabel='Date',ylabel='Lands')
axis[0,1].plot(Champs["Date"], Champs["Creatures"], 'b.-', label='Creatures')
axis[0,1].set(xlabel='Date',ylabel='Creatures')
axis[1,0].plot(Champs["Date"], Champs["Instants_Sorceries"], 'g.-', label='Instants & Sorceries')
axis[1,0].set(xlabel='Date',ylabel='Instants & Sorceries')
axis[1,1].plot(Champs["Date"], Champs["Other_spells"], 'y.-', label='Other spells')
axis[1,1].set(xlabel='Date',ylabel='Other spells')

plt.show()
[Figure: 2×2 grid of line plots showing the number of Lands, Creatures, Instants & Sorceries and Other spells in the winning deck of each event.]

The surprising value of 15 Lands in the first winning deck can be explained by the fact that it played Moxes, artifacts (hence, Other spells) which provide one mana of a certain color. We also observe two peaks in the Other spells count, in 2000 and 2020. The year 2000 was the year of the Tinker deck, built mainly around the Sorcery 'Tinker' and a large number of artifacts.

Then we use violin plots to visualise the distribution of the four groups.

In [ ]:
figure, axis = plt.subplots(2, 2, figsize=(12,8)) 

sns.violinplot(x='Rank', y='Lands', data=data, fill=False, inner='quart', ax=axis[0,0])
sns.stripplot(x='Rank', y='Lands', data=data, alpha=0.3, color='navy', jitter=0.05, ax=axis[0,0])
axis[0,0].set(xlabel='Rank',ylabel='Lands')
sns.violinplot(x='Rank', y='Creatures', data=data, fill=False, inner='quart', ax=axis[1,0])
sns.stripplot(x='Rank', y='Creatures', data=data, alpha=0.3, color='navy', jitter=0.05, ax=axis[1,0])
axis[1,0].set(xlabel='Rank',ylabel='Creatures')
sns.violinplot(x='Rank', y='Instants_Sorceries', data=data, fill=False, inner='quart', ax=axis[0,1])
sns.stripplot(x='Rank', y='Instants_Sorceries', data=data, alpha=0.3, color='navy', jitter=0.05, ax=axis[0,1])
axis[0,1].set(xlabel='Rank',ylabel='Instants & Sorceries')
sns.violinplot(x='Rank', y='Other_spells', data=data, fill=False, inner='quart', ax=axis[1,1])
sns.stripplot(x='Rank', y='Other_spells', data=data, alpha=0.3, color='navy', jitter=0.05, ax=axis[1,1])
axis[1,1].set(xlabel='Rank',ylabel='Other spells')

plt.show()
[Figure: 2×2 grid of violin plots with overlaid strip plots of Lands, Instants & Sorceries, Creatures and Other spells by Rank.]

Apparently, there is no difference, on average, in the composition of the decks across ranks. A box plot supports this impression and gives a more reasonable picture: the smoothed violin plots extend below zero, although a deck cannot contain a negative number of cards.

In [ ]:
figure, axis = plt.subplots(2, 2, figsize=(12,8)) 

sns.boxplot(x='Rank', y='Lands', data=data, ax=axis[0,0])
axis[0,0].set(xlabel='Rank',ylabel='Lands')
sns.boxplot(x='Rank', y='Creatures', data=data, ax=axis[1,0])
axis[1,0].set(xlabel='Rank',ylabel='Creatures')
sns.boxplot(x='Rank', y='Instants_Sorceries', data=data, ax=axis[0,1])
axis[0,1].set(xlabel='Rank',ylabel='Instants & Sorceries')
sns.boxplot(x='Rank', y='Other_spells', data=data, ax=axis[1,1])
axis[1,1].set(xlabel='Rank',ylabel='Other spells')

plt.show()
[Figure: 2×2 grid of box plots of Lands, Instants & Sorceries, Creatures and Other spells by Rank.]

Statistical tests for equality of the means

To begin with, we want to determine whether there is a statistically significant difference between the number of Lands in decks that ranked first, second, third or fourth, on average.

We have two ways to do this. The first one is to divide the data on the number of Lands into four groups corresponding to the four rankings and pass them to an ANOVA test. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalizes the t-test beyond two means.

However, the analysis of variance rests on a linear model, which makes the following assumptions about the probability distribution of the responses:

  • Independence of observations: this is an assumption of the model that simplifies the statistical analysis.
  • Normality: the residuals are normally distributed. In our case, this means that the data in each group are normally distributed.
  • Equality (or "homogeneity") of variances, called homoscedasticity: the variance of the data should be the same in every group.

The second way is to repeatedly use Welch's t-test, which allows comparing the means of populations with different variances but still assumes normality.
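
To make the ANOVA statistic concrete: F is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it by hand on three small made-up groups (not taken from the data set) and compares the result with scipy.stats.f_oneway.

```python
import numpy as np
import scipy.stats as ss

# Made-up groups, for illustration only
groups = [np.array([20.0, 22.0, 24.0]),
          np.array([21.0, 23.0, 25.0]),
          np.array([23.0, 24.0, 25.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), len(all_values)

# Sum of squares between groups and within groups
between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Ratio of the two mean squares, with k - 1 and n - k degrees of freedom
f_manual = (between / (k - 1)) / (within / (n - k))

f_scipy, p_scipy = ss.f_oneway(*groups)
print(f_manual, np.isclose(f_manual, f_scipy))
```

Here the between-group sum of squares is 6 and the within-group one is 18, so F = (6/2)/(18/6) = 1.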

In either case, we need to split the data of interest into 4 groups according to the ranking:

In [ ]:
# For each card type, build a list of four Series: one per final rank (1 to 4).
Lands_groups = [data[data['Rank'] == rank]['Lands'] for rank in range(1, 5)]
Creature_groups = [data[data['Rank'] == rank]['Creatures'] for rank in range(1, 5)]
IandS_groups = [data[data['Rank'] == rank]['Instants_Sorceries'] for rank in range(1, 5)]
Others_groups = [data[data['Rank'] == rank]['Other_spells'] for rank in range(1, 5)]

Then let us check the homoscedasticity assumption by printing a table of variances:

In [ ]:
# population variance (ddof=0, as with np.var) of each card type within each rank
data.groupby('Rank',observed=False)[['Creatures','Instants_Sorceries','Other_spells','Lands']].var(ddof=0)
Out[ ]:
Creatures Instants_Sorceries Other_spells Lands
Rank
1 60.222222 55.945130 52.644719 6.320988
2 51.283951 50.320988 39.283951 5.319616
3 86.320988 69.135802 20.463649 6.052126
4 76.019231 64.226331 29.321006 4.940828

The variances differ noticeably across ranks (for Other_spells, from about 20.5 to about 52.6), so we cannot rely on homoscedasticity. We may still perform the ANOVA test, for the sake of seeing the outcomes.
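
As a complement to reading the table of variances by eye, a formal test of equal variances could be run before the ANOVA. The following is a minimal sketch using Levene's test from SciPy. It runs on synthetic groups, since the scraped data set is not reproduced here: the names `spreads` and `groups` and all the numbers are assumptions made for illustration, not values from this analysis.

```python
import numpy as np
import scipy.stats as ss

# Hypothetical stand-in for the four rank groups of one card type:
# synthetic counts with different spreads, NOT the scraped data.
rng = np.random.default_rng(0)
spreads = [7.0, 6.0, 4.5, 5.5]
groups = [rng.normal(loc=10.0, scale=s, size=27) for s in spreads]

# Levene's test: the null hypothesis is that all groups share the same variance.
# center='median' (the SciPy default) is the Brown-Forsythe variant, which is
# less sensitive to departures from normality than center='mean'.
w_stat, p_val = ss.levene(*groups, center='median')
print(f'Levene W: {w_stat:.3f}, p value: {p_val:.3f}')
```

In the notebook, one would presumably pass a list such as `Lands_groups` in place of the synthetic `groups`. A small p value would indicate unequal variances. A large one does not prove that the variances are equal.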

In [ ]:
long_data = pd.melt(frame=data,
                    id_vars=['Deck','Player','Event','Rank','Date'],
                    value_vars=['Creatures','Instants_Sorceries','Other_spells','Lands'],
                    var_name='Type',
                    value_name='Count')

for card_type, type_data in long_data.groupby('Type',observed=False):
    # one sample of counts per rank, in rank order
    samples = [counts for _, counts in type_data.groupby('Rank',observed=False)['Count']]
    f_val, p_val = ss.f_oneway(*samples)
    print(f'Type: {card_type},\nF value: {f_val:.3f},\np value: {p_val:.3f}\n')
Type: Creatures,
F value: 0.988,
p value: 0.401

Type: Instants_Sorceries,
F value: 0.389,
p value: 0.761

Type: Lands,
F value: 0.258,
p value: 0.855

Type: Other_spells,
F value: 0.549,
p value: 0.650

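All four p values are well above 0.05, so the ANOVA detects no significant difference between ranks for any card type. We could apply a t-test to confirm the conclusion, but there is an additional issue to address: in general, the data are not normally distributed, as the following Shapiro-Wilk tests show.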

In [ ]:
for name_group in long_data.groupby(by=['Type','Rank'], observed=False):
    samples = [condition for condition in name_group[1]['Count']]
    w_val, p_val = ss.shapiro(samples)
    print(f'Type: {name_group[0][0]},\nRank: {name_group[0][1]}\nW value: {w_val:.3f},\np value: {p_val:.3f}\n')
Type: Creatures,
Rank: 1
W value: 0.916,
p value: 0.031

Type: Creatures,
Rank: 2
W value: 0.960,
p value: 0.372

Type: Creatures,
Rank: 3
W value: 0.956,
p value: 0.295

Type: Creatures,
Rank: 4
W value: 0.911,
p value: 0.028

Type: Instants_Sorceries,
Rank: 1
W value: 0.953,
p value: 0.256

Type: Instants_Sorceries,
Rank: 2
W value: 0.977,
p value: 0.799

Type: Instants_Sorceries,
Rank: 3
W value: 0.948,
p value: 0.196

Type: Instants_Sorceries,
Rank: 4
W value: 0.934,
p value: 0.097

Type: Lands,
Rank: 1
W value: 0.874,
p value: 0.004

Type: Lands,
Rank: 2
W value: 0.908,
p value: 0.020

Type: Lands,
Rank: 3
W value: 0.905,
p value: 0.017

Type: Lands,
Rank: 4
W value: 0.957,
p value: 0.344

Type: Other_spells,
Rank: 1
W value: 0.824,
p value: 0.000

Type: Other_spells,
Rank: 2
W value: 0.860,
p value: 0.002

Type: Other_spells,
Rank: 3
W value: 0.922,
p value: 0.044

Type: Other_spells,
Rank: 4
W value: 0.797,
p value: 0.000

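At level 0.05, the Shapiro-Wilk test rejects normality in 9 of the 16 groups, including all four Other_spells groups. The following histograms, overlaid with fitted normal densities, confirm what we already observed.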

In [ ]:
figure, axis = plt.subplots(4,4, figsize=(12,12)) 

for i,group in enumerate([Lands_groups,Creature_groups,IandS_groups,Others_groups]) :
    for j in range(4) :
        axis[i,j].hist(group[j], density=True)
        mu, std = ss.norm.fit(group[j]) 
        xmin, xmax = axis[i,j].get_xlim()
        x = np.linspace(xmin, xmax, 100)
        p = ss.norm.pdf(x, mu, std)
        axis[i,j].plot(x, p, 'k', linewidth=2)
        axis[i,j].get_yaxis().set_ticks([])

axis[0,0].set(ylabel='Lands', title='Rank 1')
axis[0,1].set(title='Rank 2')
axis[0,2].set(title='Rank 3')
axis[0,3].set(title='Rank 4')
axis[1,0].set(ylabel='Creatures')
axis[2,0].set(ylabel='Instants & Sorceries')
axis[3,0].set(ylabel='Others')

plt.show()

The departure from normality might be due to the low number of observations or to a trend we cannot identify at present. Nevertheless, we expect our data to be approximately normally distributed (possibly with unequal variances), and thus we apply Welch's t-test as discussed.

In [ ]:
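# NOTE: the index i selects both the card type and the reference rank, so each
# card type is compared for a single reference rank (rank i+1) against the other
# three ranks, not for all six pairs of ranks.
for i,group in enumerate([Creature_groups,IandS_groups,Others_groups,Lands_groups]) :
    for j in range(4) :
        if i != j :
            t_val, p_val = ss.ttest_ind(group[i], group[j], equal_var = False)
            print(f'Type : {data.columns[i+5]}\nRank {i+1} vs Rank {j+1}\nT statistic : {t_val:.3f}\np value : {p_val:.3f}\n')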
Type : Creatures
Rank 1 vs Rank 2
T statistic : 0.215
p value : 0.831

Type : Creatures
Rank 1 vs Rank 3
T statistic : -0.608
p value : 0.546

Type : Creatures
Rank 1 vs Rank 4
T statistic : -1.368
p value : 0.177

Type : Instants_Sorceries
Rank 2 vs Rank 1
T statistic : 0.018
p value : 0.985

Type : Instants_Sorceries
Rank 2 vs Rank 3
T statistic : -0.156
p value : 0.877

Type : Instants_Sorceries
Rank 2 vs Rank 4
T statistic : 0.844
p value : 0.403

Type : Other_spells
Rank 3 vs Rank 1
T statistic : -0.861
p value : 0.394

Type : Other_spells
Rank 3 vs Rank 2
T statistic : -1.124
p value : 0.267

Type : Other_spells
Rank 3 vs Rank 4
T statistic : -0.121
p value : 0.904

Type : Lands
Rank 4 vs Rank 1
T statistic : -0.026
p value : 0.980

Type : Lands
Rank 4 vs Rank 2
T statistic : -0.261
p value : 0.795

Type : Lands
Rank 4 vs Rank 3
T statistic : 0.595
p value : 0.555
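
The output above pairs each card type with a single reference rank (rank 1 for Creatures, rank 2 for Instants_Sorceries, and so on). A sketch of how every pair of ranks could be tested for every card type is given below. It uses synthetic groups, so the dictionary `groups_by_type` and all numbers in it are assumptions for illustration and do not reproduce the results of this analysis.

```python
from itertools import combinations

import numpy as np
import scipy.stats as ss

# Hypothetical stand-in for the rank groups of each card type
# (synthetic values, NOT the scraped data).
rng = np.random.default_rng(0)
groups_by_type = {
    'Creatures': [rng.normal(18.0, 8.0, size=27) for _ in range(4)],
    'Lands': [rng.normal(23.0, 2.5, size=27) for _ in range(4)],
}

results = []
for card_type, groups in groups_by_type.items():
    # combinations(range(4), 2) yields each unordered pair of ranks exactly once
    for a, b in combinations(range(4), 2):
        # equal_var=False selects Welch's t-test instead of the pooled-variance test
        t_val, p_val = ss.ttest_ind(groups[a], groups[b], equal_var=False)
        results.append((card_type, a + 1, b + 1, t_val, p_val))
        print(f'Type : {card_type}\nRank {a + 1} vs Rank {b + 1}\n'
              f'T statistic : {t_val:.3f}\np value : {p_val:.3f}\n')
```

With the notebook's variables, `groups_by_type` would presumably map each card type to `Creature_groups`, `IandS_groups`, `Others_groups` and `Lands_groups`, giving six comparisons per card type. Running this many tests raises the chance of a spurious small p value, so individual results should be read with that in mind.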

Conclusions¶

We conclude that, in the tests performed, there is no statistically significant difference at level alpha = 0.05 in the composition of the decks across the four ranks: every ANOVA and Welch p value reported above exceeds 0.05. The conclusion does not change if we consider the double-faced cards Land/non-Land as Lands. A further analysis may take into account colors or the frequency of single cards in a deck.

Appendix¶

It could be interesting to see how the number of Lands, Creatures, Instants & Sorceries, and Other spells evolved over time for all rankings (and not just the top one).

In [ ]:
g = sns.relplot(kind='line', 
                data=data, 
                x='Date', 
                y='Lands', 
                hue='Rank', 
                col='Rank', 
                col_wrap=2, 
                legend = False).set_axis_labels("Year", "Lands")
In [ ]:
g = sns.relplot(kind='line', 
                data=data, 
                x='Date', 
                y='Creatures', 
                hue='Rank', 
                col='Rank', 
                col_wrap=2, 
                legend = False).set_axis_labels("Year", "Creatures")
In [ ]:
g = sns.relplot(kind='line', 
                data=data, 
                x='Date', 
                y='Instants_Sorceries', 
                hue='Rank', 
                col='Rank', 
                col_wrap=2, 
                legend = False).set_axis_labels("Year", "Instants & Sorceries")
In [ ]:
g = sns.relplot(kind='line', 
                data=data, 
                x='Date', 
                y='Other_spells', 
                hue='Rank', 
                col='Rank', 
                col_wrap=2, 
                legend = False).set_axis_labels("Year", "Others")