NHL Draft Player Analysis Project Overview:¶

Objective:¶

Explore the recent success of the Tampa Bay Lightning in the NHL and investigate their drafting strategies to uncover trends that have contributed to their superior roster.

Background:¶

The Tampa Bay Lightning has emerged as a frontrunner in the NHL over recent years, bagging multiple Stanley Cups and maintaining a consistent record in the regular season. Deducing the variables that have fueled their success, especially in player drafting, can offer pivotal insights for other teams aspiring to elevate their game.

Data Collection & Cleaning:¶

  1. Web Scraping for Data Retrieval: Used Python to scrape the NHL official website and Hockey Reference to gather data on player variables like position, height, weight, birth details, and current team.

  2. Data Storage: Loaded the scraped data into a MySQL database.

  3. Data Manipulation: Conducted data cleaning in MySQL, fixed the date format issues, handled missing values, and extracted useful information.

  4. Integration: Used a left join operation to combine datasets for comprehensive analysis.

  5. Feature Engineering: Derived new columns such as the age of players from their birthdates and created a unique identifier combining the initials and the last name of the players for joining datasets on player names.

  6. Handling Anomalies: Updated some players' data manually in the database to account for discrepancies arising from name changes or usage of middle names.

Key Code Excerpts:¶

Code snippets provided for web scraping from NHL websites, transforming the scraped data, and SQL commands for creating, updating, and joining tables.

Insights Gained:¶

Through this comprehensive data collection and cleaning process, we have a rich dataset that can potentially help deduce:

  • The average age, height, or weight of players drafted by the Tampa Bay Lightning versus other teams.
  • Trends in player selection based on birthplace or nationality.
  • If the drafting strategies have evolved over the years.
  • Any correlation between the team's performance and the attributes of their drafted players.

This analysis can be instrumental for teams to refine their drafting strategies and for analysts and enthusiasts to understand the dynamics behind building a successful NHL team.

In [13]:
# List of URLs to scrape data from
urls = [
    'https://www.nhl.com/ducks/roster/2021',
    'https://www.nhl.com/coyotes/roster/2021',
    'https://www.nhl.com/bruins/roster/2021',
    'https://www.nhl.com/sabres/roster/2021',
    'https://www.nhl.com/flames/roster/2021',
    'https://www.nhl.com/hurricanes/roster/2021',
    'https://www.nhl.com/blackhawks/roster/2021',
    'https://www.nhl.com/avalanche/roster/2021',
    'https://www.nhl.com/bluejackets/roster/2021',
    'https://www.nhl.com/stars/roster/2021',
    'https://www.nhl.com/redwings/roster/2021',
    'https://www.nhl.com/oilers/roster/2021',
    'https://www.nhl.com/panthers/roster/2021',
    'https://www.nhl.com/kings/roster/2021',
    'https://www.nhl.com/wild/roster/2021',
    'https://www.nhl.com/canadiens/roster/2021',
    'https://www.nhl.com/predators/roster/2021',
    'https://www.nhl.com/devils/roster/2021',
    'https://www.nhl.com/islanders/roster/2021',
    'https://www.nhl.com/rangers/roster/2021',
    'https://www.nhl.com/senators/roster/2021',
    'https://www.nhl.com/flyers/roster/2021',
    'https://www.nhl.com/penguins/roster/2021',
    'https://www.nhl.com/sharks/roster/2021',
    'https://www.nhl.com/kraken/roster/2021',
    'https://www.nhl.com/blues/roster/2021',
    'https://www.nhl.com/lightning/roster/2021',
    'https://www.nhl.com/mapleleafs/roster/2021',
    'https://www.nhl.com/canucks/roster/2021',
    'https://www.nhl.com/goldenknights/roster/2021',
    'https://www.nhl.com/capitals/roster/2021',
    'https://www.nhl.com/jets/roster/2021'
]
In [14]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time 


# Initialize an empty list to store all the data
all_data1 = []

# Loop through each URL and scrape the data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    tables = soup.find_all('table')

    # Assuming you want to extract data from table 2 (Forwards)
    table_forwards1 = tables[2]

    data_list_forwards1 = []
    headers_forwards1 = []

    for row in table_forwards1.find_all('tr'):
        if not headers_forwards1:
            headers_forwards1 = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data1 = [data.text.strip() for data in row.find_all('td')]
            data_list_forwards1.append(row_data1)

    # Assuming you want to extract data from table 5 (Defense)
    table_defense1 = tables[5]

    data_list_defense1 = []
    headers_defense1 = [header.text.strip() for header in table_defense1.find_all('th')]

    for row in table_defense1.find_all('tr'):
        if not headers_defense1:
            table_defense1 = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data1 = [data.text.strip() for data in row.find_all('td')]
            data_list_defense1.append(row_data1)

    # Assuming you want to extract data from table 8 (Goalies)
    table_goalies1 = tables[8]

    data_list_goalies1 = []
    headers_goalies1 = []

    for row in table_goalies1.find_all('tr'):
        if not headers_goalies1:
            headers_goalies1 = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data1 = [data.text.strip() for data in row.find_all('td')]
            data_list_goalies1.append(row_data1)

    # Create DataFrames for each table and add them to the 'all_data' list
    df_forwards1 = pd.DataFrame(data_list_forwards1, columns=headers_forwards1)
    df_forwards1['current_team'] = url.split('/')[-3]
    all_data1.append(df_forwards1)

    df_defense1 = pd.DataFrame(data_list_defense1, columns=headers_defense1)
    df_defense1['current_team'] = url.split('/')[-3]
    all_data1.append(df_defense1)

    df_goalies1 = pd.DataFrame(data_list_goalies1, columns=headers_goalies1)
    df_goalies1['current_team'] = url.split('/')[-3]
    all_data1.append(df_goalies1)
    
    # Add a sleep command to avoid overwhelming the server
    time.sleep(20)  # Sleep for 3 seconds before making the next request

# Concatenate all the DataFrames in the 'all_data' list into a single DataFrame
Variables_df_combined1 = pd.concat(all_data1, ignore_index=True)

# Drop rows with None values
Variables_df_combined1 = Variables_df_combined1.dropna()

# Reset the DataFrame index
Variables_df_combined1 = Variables_df_combined1.reset_index(drop=True)


# Display the updated DataFrame
print(Variables_df_combined1)
       # Pos  Sh         Ht   Wt                  Born  \
0     12   C   L  6'1"6' 1"  205  08/10/94Aug 10, 1994   
1     39   C   R  6'0"6' 0"  200   02/04/92Feb 4, 1992   
2     44  LW   L  6'2"6' 2"  210   01/08/99Jan 8, 1999   
3     44  LW   L  6'1"6' 1"  220  02/22/91Feb 22, 1991   
4     15   C   R  6'4"6' 4"  220  05/10/85May 10, 1985   
...   ..  ..  ..        ...  ...                   ...   
1207  54   D   L  6'4"6' 4"  219  01/24/99Jan 24, 1999   
1208  88   D   L  6'0"6' 0"  195  07/16/91Jul 16, 1991   
1209  64   D   L  6'7"6' 7"  242  05/26/98May 26, 1998   
1210  31   G  --  6'1"6' 1"  183   07/06/95Jul 6, 1995   
1211  37   G  --  6'4"6' 4"  216  05/19/93May 19, 1993   

                  Birthplace current_team  
0     Staten Island, NY, USA        ducks  
1           Markham, ON, CAN        ducks  
2         Longueuil, QC, CAN        ducks  
3           LaSalle, QC, CAN        ducks  
4            Regina, SK, CAN        ducks  
...                      ...          ...  
1207        Saginaw, MN, USA         jets  
1208      St. Cloud, MN, USA         jets  
1209       Waterloo, ON, CAN         jets  
1210       Edmonton, AB, CAN         jets  
1211       Commerce, MI, USA         jets  

[1212 rows x 8 columns]
In [15]:
Variables_df_combined1.describe()
Out[15]:
# Pos Sh Ht Wt Born Birthplace current_team
count 1212 1212 1212 1212 1212 1212 1212 1212
unique 97 5 3 16 88 987 615 32
top 17 D L 6'2"6' 2" 190 08/14/92Aug 14, 1992 Toronto, ON, CAN canadiens
freq 29 372 678 221 71 4 33 46
In [16]:
# Count the distinct current teams
team_count = Variables_df_combined1['current_team'].nunique()

# Display the count of distinct current teams
print("Number of distinct current teams:", team_count)
Number of distinct current teams: 32
In [17]:
Variables_df_combined1['Ht'] = Variables_df_combined1['Ht'].str.extract(r"(\d+' \d+\")")
In [18]:
# Extract the 'MM/DD/YY' part from the 'Born' column using str.extract
Variables_df_combined1['Born'] = Variables_df_combined1['Born'].str.extract(r'(\d{2}/\d{2}/\d{2})')

Variables_df_combined1['Birthplace'] = Variables_df_combined1['Birthplace'].str.replace(',','_').str.replace(' ','')

# Step 2: Separate 'Birthplace' into 'Location' and 'Nationality'
Variables_df_combined1[['Birthplace', 'Nationality']] = Variables_df_combined1['Birthplace'].str.rsplit('_', n=1, expand=True)

Variables_df_combined1
Out[18]:
# Pos Sh Ht Wt Born Birthplace current_team Nationality
0 12 C L 6' 1" 205 08/10/94 StatenIsland_NY ducks USA
1 39 C R 6' 0" 200 02/04/92 Markham_ON ducks CAN
2 44 LW L 6' 2" 210 01/08/99 Longueuil_QC ducks CAN
3 44 LW L 6' 1" 220 02/22/91 LaSalle_QC ducks CAN
4 15 C R 6' 4" 220 05/10/85 Regina_SK ducks CAN
... ... ... ... ... ... ... ... ... ...
1207 54 D L 6' 4" 219 01/24/99 Saginaw_MN jets USA
1208 88 D L 6' 0" 195 07/16/91 St.Cloud_MN jets USA
1209 64 D L 6' 7" 242 05/26/98 Waterloo_ON jets CAN
1210 31 G -- 6' 1" 183 07/06/95 Edmonton_AB jets CAN
1211 37 G -- 6' 4" 216 05/19/93 Commerce_MI jets USA

1212 rows × 9 columns

In [19]:
Variables_df_combined1 = Variables_df_combined1.drop('#', axis=1)
print(Variables_df_combined1)
     Pos  Sh     Ht   Wt      Born       Birthplace current_team Nationality
0      C   L  6' 1"  205  08/10/94  StatenIsland_NY        ducks         USA
1      C   R  6' 0"  200  02/04/92       Markham_ON        ducks         CAN
2     LW   L  6' 2"  210  01/08/99     Longueuil_QC        ducks         CAN
3     LW   L  6' 1"  220  02/22/91       LaSalle_QC        ducks         CAN
4      C   R  6' 4"  220  05/10/85        Regina_SK        ducks         CAN
...   ..  ..    ...  ...       ...              ...          ...         ...
1207   D   L  6' 4"  219  01/24/99       Saginaw_MN         jets         USA
1208   D   L  6' 0"  195  07/16/91      St.Cloud_MN         jets         USA
1209   D   L  6' 7"  242  05/26/98      Waterloo_ON         jets         CAN
1210   G  --  6' 1"  183  07/06/95      Edmonton_AB         jets         CAN
1211   G  --  6' 4"  216  05/19/93      Commerce_MI         jets         USA

[1212 rows x 8 columns]
In [20]:
Variables_df_combined1.to_csv('NHL_Player_Variables_2021.csv',index=False)
In [21]:
def scrape_data_from_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    tables = soup.find_all('table')

    # Assuming you want to extract data from table 1 (Forwards)
    table_forwards = tables[1]

    data_list_forwards = []
    headers_forwards = []

    for row in table_forwards.find_all('tr'):
        if not headers_forwards:
            headers_forwards = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data = [data.text.strip() for data in row.find_all('td')]
            data_list_forwards.append(row_data)

    # Assuming you want to extract data from table 4 (Defense)
    table_defense = tables[4]

    data_list_defense = []
    headers_defense = []

    for row in table_defense.find_all('tr'):
        if not headers_defense:
            headers_defense = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data = [data.text.strip() for data in row.find_all('td')]
            data_list_defense.append(row_data)

    # Assuming you want to extract data from table 7 (Goalies)
    table_goalies = tables[7]

    data_list_goalies = []
    headers_goalies = []

    for row in table_goalies.find_all('tr'):
        if not headers_goalies:
            headers_goalies = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data = [data.text.strip() for data in row.find_all('td')]
            data_list_goalies.append(row_data)

    # Create DataFrames for each table
    df_forwards = pd.DataFrame(data_list_forwards, columns=headers_forwards)
    df_defense = pd.DataFrame(data_list_defense, columns=headers_defense)
    df_goalies = pd.DataFrame(data_list_goalies, columns=headers_goalies)

    # Concatenate the DataFrames
    df_combined = pd.concat([df_forwards, df_defense, df_goalies], ignore_index=True)

    # Remove any row with all None values (empty rows)
    df_combined.dropna(how='all', inplace=True)
    
    # Reset the DataFrame index
    df_combined = df_combined.reset_index(drop=True)

    # Clean the 'Player' column
    df_combined['Player'] = df_combined['Player'].str.replace(r'\n', ' ', regex=True).str.strip()
    df_combined['Player'] = df_combined['Player'].apply(lambda x: x.replace('(A)', '').strip())
    df_combined['Player'] = df_combined['Player'].apply(lambda x: x.replace('(C)', '').strip())
    df_combined['Player'] = df_combined['Player'].apply(lambda x: x.replace('*', '').strip())
    df_combined['Player'] = df_combined['Player'].apply(lambda x: ' '.join(x.split()[1:]))

    
    return df_combined


# Initialize an empty list to store all the DataFrames
all_dataframes = []

# Loop through each URL and scrape the data
for url in urls:
    df_combined = scrape_data_from_url(url)
    all_dataframes.append(df_combined)

    # Add a sleep command to avoid overwhelming the server
    time.sleep(3)  # Sleep for 3 seconds before making the next request

# Concatenate all the DataFrames into a single DataFrame
df_combined = pd.concat(all_dataframes, ignore_index=True)

# Display the combined DataFrame
print(df_combined)
                   Player
0        Zach Aston-Reese
1             Sam Carrick
2             Max Comtois
3     Nicolas Deslauriers
4            Ryan Getzlaf
...                   ...
1207        Dylan Samberg
1208         Nate Schmidt
1209        Logan Stanley
1210          Eric Comrie
1211    Connor Hellebuyck

[1212 rows x 1 columns]
In [22]:
df_combined.describe()
Out[22]:
Player
count 1212
unique 1124
top Scott Wedgewood
freq 3
In [23]:
df_combined.to_csv('NHL_2021_Player_Names.csv',index=False)
In [24]:
from datetime import datetime

def get_draft_data(urls):
    all_dataframes = []
    current_year = datetime.now().year
    
    for url in urls:
        # Send an HTTP request to the URL
        response = requests.get(url)

        # Check the status code of the response
        if response.status_code == 200:
            # Parse the HTML content
            soup = BeautifulSoup(response.content, "html.parser")

            # Find the remaining table (assuming there is only one table left) and extract its data into a list
            table = soup.find('table')

            # Check if the table is found before proceeding with data extraction
            if table:
                table_data = []
                for row in table.find_all('tr')[1:]:
                    cols = row.find_all(['th', 'td'])
                    cols = [col.text.strip() for col in cols]
                    table_data.append(cols)

                # Create a DataFrame from the extracted data
                columns = table_data[0]
                data = table_data[1:]
                df = pd.DataFrame(data, columns=columns)

                # Find the index of the rows with the repeated column headers
                repeated_headers_index = df[df['Year'].str.contains('Year')].index

                # Drop the rows with the repeated column headers
                df = df.drop(repeated_headers_index)

                # Reset the index of the DataFrame
                df.reset_index(drop=True, inplace=True)

                # Drop the columns that are unneeded
                columns_to_drop = ['Draft', 'Lg']
                df.drop(columns=columns_to_drop, inplace=True)

                # Add the 'Current_Team' column and set its value to the team name extracted from the URL
                Drafted_Team = url.split("/")[-2]
                df['Drafted_Team'] = Drafted_Team
                
                # Add the 'Year' column as an integer
                df['Year'] = df['Year'].astype(int)
                
                # Filter out rows with 'Year' values 26 or more years ago from the current year (BIG JOE)
                df = df[df['Year'] >= current_year - 26]

                # Append the DataFrame to the list
                all_dataframes.append(df)

                # Sleep for 5 seconds before accessing the next URL
                time.sleep(5)
        else:
            print("Failed to get the webpage:", url)

    # Combine all DataFrames into one
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    return combined_df

# Define the URLs of the webpages containing the tables
urls = [
    "https://www.hockey-reference.com/teams/ANA/draft.html",
    "https://www.hockey-reference.com/teams/PHX/draft.html",
    "https://www.hockey-reference.com/teams/BOS/draft.html",
    "https://www.hockey-reference.com/teams/BUF/draft.html",
    "https://www.hockey-reference.com/teams/CGY/draft.html",
    "https://www.hockey-reference.com/teams/CAR/draft.html",
    "https://www.hockey-reference.com/teams/CHI/draft.html",
    "https://www.hockey-reference.com/teams/COL/draft.html",
    "https://www.hockey-reference.com/teams/CBJ/draft.html",
    "https://www.hockey-reference.com/teams/DAL/draft.html",
    "https://www.hockey-reference.com/teams/DET/draft.html",
    "https://www.hockey-reference.com/teams/EDM/draft.html",
    "https://www.hockey-reference.com/teams/FLA/draft.html",
    "https://www.hockey-reference.com/teams/LAK/draft.html",
    "https://www.hockey-reference.com/teams/MIN/draft.html",
    "https://www.hockey-reference.com/teams/MTL/draft.html",
    "https://www.hockey-reference.com/teams/NSH/draft.html",
    "https://www.hockey-reference.com/teams/NJD/draft.html",
    "https://www.hockey-reference.com/teams/NYI/draft.html",
    "https://www.hockey-reference.com/teams/NYR/draft.html",
    "https://www.hockey-reference.com/teams/OTT/draft.html",
    "https://www.hockey-reference.com/teams/PHI/draft.html",
    "https://www.hockey-reference.com/teams/PIT/draft.html",
    "https://www.hockey-reference.com/teams/SJS/draft.html",
    "https://www.hockey-reference.com/teams/SEA/draft.html",
    "https://www.hockey-reference.com/teams/STL/draft.html",
    "https://www.hockey-reference.com/teams/TBL/draft.html",
    "https://www.hockey-reference.com/teams/TOR/draft.html",
    "https://www.hockey-reference.com/teams/VAN/draft.html",
    "https://www.hockey-reference.com/teams/VEG/draft.html",
    "https://www.hockey-reference.com/teams/WSH/draft.html",
    "https://www.hockey-reference.com/teams/WPG/draft.html"
]

# Get the DataFrame with the extracted and combined data from all URLs
df_combined = get_draft_data(urls)

# Display the combined DataFrame
print(df_combined)
      Year Rd. Overall           Player Pos                    Amateur Team  \
0     2023   1       2     Leo Carlsson   C                 Orebro (Sweden)   
1     2023   2      33    Nico Myatovic  LW      Seattle Thunderbirds (WHL)   
2     2023   2      59   Carey Terrance   C               Erie Otters (OHL)   
3     2023   2      60     Damian Clara   G      Farjestad Jr. (Sweden-Jr.)   
4     2023   3      65    Coulson Pitre  RW           Flint Firebirds (OHL)   
...    ...  ..     ...              ...  ..                             ...   
6316  1999   6     159  Yuri Dobryshkin  RW  Krylia Sovetov Moskva (Russia)   
6317  1999   7     188     Stephen Baby  RW       Green Bay Gamblers (USHL)   
6318  1999   8     217    Garnet Exelby   D          Saskatoon Blades (WHL)   
6319  1999   9     245    Tommi Santala   C               Jokerit (Finland)   
6320  1999   9     246  Raymond DiLauro   D             St. Lawrence (ECAC)   

       GP  G   A PTS  +/-  PIM GP W  L  T/O SV% GAA Drafted_Team  
0                                                            ANA  
1                                                            ANA  
2                                                            ANA  
3                                                            ANA  
4                                                            ANA  
...   ... ..  ..  ..  ...  ... .. .. ..  ..  ..  ..          ...  
6316                                                         WPG  
6317                                                         WPG  
6318  408  7  43  50  -28  584                               WPG  
6319   63  2   7   9   -7   46                               WPG  
6320                                                         WPG  

[6321 rows x 19 columns]
In [25]:
# List of all the drafted teams from the URLs
all_teams = [
    "ANA", "PHX", "BOS", "BUF", "CGY", "CAR", "CHI", "COL", "CBJ", "DAL", "DET", "EDM", "FLA", "LAK",
    "MIN", "MTL", "NSH", "NJD", "NYI", "NYR", "OTT", "PHI", "PIT", "SJS", "SEA", "STL", "TBL", "TOR",
    "VAN", "VEG", "WSH", "WPG"
]

# Get the unique drafted teams from the DataFrame
unique_teams_in_df = df_combined['Drafted_Team'].unique()

# Find the missing drafted teams
missing_teams = [team for team in all_teams if team not in unique_teams_in_df]

# Display the missing teams
print("Missing drafted teams:", missing_teams)
Missing drafted teams: []
In [26]:
minimum_year = df_combined['Year'].min()
print(minimum_year)
1997
In [27]:
# Drop the specified columns from the DataFrame
columns_to_drop = ['GP', 'G', 'A', 'PTS', '+/-', 'PIM', 'W', 'L', 'T/O', 'SV%', 'GAA']
df_combined.drop(columns=columns_to_drop, inplace=True)
In [28]:
# Save the combined DataFrame to a CSV file named "draft_data.csv"
df_combined.to_csv("NHL_draft_2023_data.csv", index=False)
In [29]:
# Filter rows where the player name is "Maxime"
maxime_df = df_combined[df_combined["Player"] == "Joe Thornton"]

# Display the DataFrame containing rows where the player name is "Maxime"
print(maxime_df)
     Year Rd. Overall        Player Pos          Amateur Team Drafted_Team
591  1997   1       1  Joe Thornton   C  Soo Greyhounds (OHL)          BOS

MYSQL CODE


CREATE DATABASE IF NOT EXISTS nhl_2021_data;

USE nhl_2021_data;

CREATE TABLE nhl_variables ( ID INT AUTO_INCREMENT PRIMARY KEY, Pos VARCHAR(3), Sh VARCHAR(5), Ht VARCHAR(40), Wt INT, Born VARCHAR(255), Birthplace VARCHAR(255), current_team VARCHAR(255), Nationality VARCHAR(255) );

LOAD DATA INFILE "C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/NHL_Player_Variables_2021.csv" INTO TABLE nhl_variables FIELDS TERMINATED BY ',' IGNORE 1 LINES (Pos, Sh, Ht, Wt, Born, Birthplace, current_team, Nationality);

-- Add column of age ALTER TABLE nhl_2021_data.nhl_variables ADD COLUMN Age INT;

-- Disable safe update mode to allow UPDATE without WHERE clause SET SQL_SAFE_UPDATES = 0;

-- Update the 'Born' column to the correct date format 'YYYY-MM-DD' UPDATE nhl_variables SET Born = DATE_FORMAT(STR_TO_DATE(Born, '%m/%d/%y'), '%Y-%m-%d'); -- Calculate and update the 'Age' column for each row UPDATE nhl_variables SET Age = FLOOR(DATEDIFF(CURDATE(), Born) / 365);

-- Remove double quotes from the 'Ht' column UPDATE nhl_variables SET Ht = REPLACE(Ht, '"', '');

SELECT * from nhl_variables

CREATE TABLE nhl_players ( ID INT AUTO_INCREMENT PRIMARY KEY, Player VARCHAR(255) );

LOAD DATA INFILE "C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/NHL_2021_Player_Names.csv" INTO TABLE nhl_players FIELDS TERMINATED BY ',' IGNORE 1 LINES (Player);

-- Join both tables so name is now with correct stats of players CREATE TABLE nhl_combined AS SELECT p.ID, p.Player, v.Pos, v.Sh, v.Ht, v.Wt, v.Born, v.Birthplace, v.current_team, v.Nationality, v.age, CONCAT(LEFT(p.Player, 3), ' ', SUBSTRING_INDEX(p.Player, ' ', -1)) AS Player_Name_Initial_Last FROM nhl_players p LEFT JOIN nhl_variables v ON p.ID = v.ID;

CREATE TABLE nhl_draft ( Year INT, Rd INT, Overall INT, Player VARCHAR(255), Pos VARCHAR(10), Amateur_Team VARCHAR(255), Drafted_Team VARCHAR(255) );

LOAD DATA INFILE "C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/NHL_draft_2023_data.csv" INTO TABLE nhl_draft FIELDS TERMINATED BY ',' IGNORE 1 LINES;

-- Add new column for player name ALTER TABLE nhl_draft ADD COLUMN Player_Name_InitialLast VARCHAR(255);

-- Create new value to join on cause names were a bit off SET SQL_SAFE_UPDATES = 0; UPDATE nhl_draft SET Player_Name_InitialLast = CONCAT(LEFT(Player, 3), ' ', SUBSTRING_INDEX(Player, ' ', -1))

SELECT * FROM nhl_draft

-- Join tables on names CREATE TABLE nhl_combined1 AS SELECT c.ID, c.Player, c.Pos, c.Sh, c.Ht, c.Wt, c.Born, c.Birthplace, c.age, c.current_team, c.Nationality, d.Year as draft_year, d.Rd, d.Overall, d.Amateur_Team, d.Drafted_Team FROM nhl_combined c LEFT JOIN nhl_draft d ON c.Player_Name_Initial_Last = d.Player_Name_InitialLast;

-- Update all the mismatch names with correct values (some players went by middle names or changed names ex. Joe went by Joeseph) UPDATE nhl_combined1 SET draft_year = 2021, Rd = 2, Overall = 60, Pos = 'D', Amateur_Team = 'EHC Biel-Bienne (Swiss)', Drafted_Team = 'PHX' WHERE Player = 'J.J. Moser';

UPDATE nhl_combined1 SET draft_year = 2020, Rd = 2, Overall = 34, Pos = 'RW', Amateur_Team = 'EHC Munchen (Germany)', Drafted_Team = 'BUF' WHERE Player = 'JJ Peterka';

UPDATE nhl_combined1 SET draft_year = 2009, Rd = 4, Overall = 95, Pos = 'G', Amateur_Team = 'Montreal Juniors (QMJHL)', Drafted_Team = 'LAK' WHERE Player = 'J-F Berube';

UPDATE nhl_combined1 SET draft_year = 2018, Rd = 1, Overall = 30, Pos = 'C', Amateur_Team = 'Drummondville Voltigeurs (QMJHL)', Drafted_Team = 'DET' WHERE Player = 'Joe Veleno';

UPDATE nhl_combined1 SET draft_year = 2014, Rd = 1, Overall = 19, Pos = 'D', Amateur_Team = 'Sarnia Sting (OHL)', Drafted_Team = 'TBL' WHERE Player = 'Tony DeAngelo';

UPDATE nhl_combined1 SET draft_year = 2017, Rd = 4, Overall = 103, Pos = 'D', Amateur_Team = 'Waterloo Black Hawks (USHL)', Drafted_Team = 'LAK' WHERE Player = 'Mikey Anderson';

UPDATE nhl_combined1 SET draft_year = 2015, Rd = 3, Overall = 85, Pos = 'C', Amateur_Team = 'Waterloo Black Hawks (USHL)', Drafted_Team = 'NSH' WHERE Player = 'Tommy Novak';

UPDATE nhl_combined1 SET draft_year = 1996, Rd = 3, Overall = 56, Pos = 'D', Amateur_Team = 'HK Dukla Trencin Jr. (Slovakia Jr.)', Drafted_Team = 'NYI' WHERE Player = 'Zdeno Chara';

UPDATE nhl_combined1 SET draft_year = 2017, Rd = 6, Overall = 185, Pos = 'C', Amateur_Team = 'Ottawa 67s (OHL)', Drafted_Team = 'SJS' WHERE Player = 'Sasha Chmelevski';

UPDATE nhl_combined1 SET draft_year = 2014, Rd = 7, Overall = 210, Pos = 'D', Amateur_Team = 'Ottawa 67s (OHL)', Drafted_Team = 'LAK' WHERE Player = 'Jake Middleton';

UPDATE nhl_combined1 SET draft_year = 2013, Rd = 7, Overall = 202, Pos = 'LW', Amateur_Team = 'Frolunda HC Jr. (Sweden Jr.)', Drafted_Team = 'TOR' WHERE Player = 'Andreas Johnsson';

UPDATE nhl_combined1 SET draft_year = 2019, Rd = 6, Overall = 61, Pos = 'D', Amateur_Team = 'Ottawa 67s (OHL)', Drafted_Team = 'NJD' WHERE Player = 'Nikita Okhotiuk';

UPDATE nhl_combined1 SET draft_year = 2012, Rd = 1, Overall = 23, Pos = 'D', Amateur_Team = 'Dubuque Fighting Saints (USHL)', Drafted_Team = 'FLA' WHERE Player = 'Mike Matheson';

UPDATE nhl_combined1 SET draft_year = 2016, Rd = 3, Overall = 73, Pos = 'RW', Amateur_Team = 'USA U-18 Development Team (USDP/USHL)', Drafted_Team = 'NJD' WHERE Player = 'Joey Anderson';

UPDATE nhl_combined1 SET draft_year = 2008, Rd = 4, Overall = 114, Pos = 'D', Amateur_Team = 'Saginaw Spirit (OHL)', Drafted_Team = 'CGY' WHERE Player = 'TJ Brodie';

UPDATE nhl_combined1 SET draft_year = 2014, Rd = 5, Overall = 129, Pos = 'LW', Amateur_Team = 'Sioux Falls Stampede (USHL)', Drafted_Team = 'WPG' WHERE Player = 'CJ Suess';

-- Fill the undrafted players with correct values UPDATE nhl_combined1 SET Drafted_Team = 'Undrafted' WHERE Drafted_Team IS NULL;

-- Identify null values SELECT * from nhl_combined1 where Drafted_Team is NULL;

-- Delete Duplicates DELETE nhl_combined1 FROM nhl_combined1 JOIN ( SELECT Player, MIN(ID) AS min_id FROM nhl_combined1 GROUP BY Player HAVING COUNT(*) > 1 ) duplicates ON nhl_combined1.Player = duplicates.Player WHERE nhl_combined1.ID > duplicates.min_id;

-- Delete the incorrect entries caused by multiple entries with same name from draft_table. I manually checked for data validation DELETE FROM nhl_combined1 WHERE (Drafted_Team = 'CHI' AND Player = 'Craig Anderson') OR (Drafted_Team = 'NYI' AND Player = 'Sebastian Aho') OR (Drafted_Team = 'ANA' AND Player = 'Frederik Andersen') OR (Drafted_Team = 'COL' AND Player = 'Jonas Johansson') OR (Drafted_Team = 'ANA' AND Player = 'Brendan Perlini') OR (Drafted_Team = 'EDM' AND Player = 'Matt Roy') OR (Drafted_Team = 'COL' AND Player = 'Josh Anderson') OR (Drafted_Team = 'NJD' AND Player = 'Josh Anderson') OR (Drafted_Team = 'CAR' AND Player = 'Erik Karlsson') OR (Drafted_Team = 'DET' AND Player = "Ryan O'Reilly") OR (Drafted_Team = 'VAN' AND Player = "Nick Jensen") OR (Drafted_Team = 'LAK' AND Player = "Jack Hughes") OR (Drafted_Team = 'OTT' AND Player = "Colton White") OR (Drafted_Team = 'NJD' AND Player = "Colin White") OR (draft_year = '2022' AND Player = "Elias Pettersson") OR (Drafted_Team = 'VAN' AND Player = "Nathan Smith") OR (Drafted_Team = 'LAK' AND Player = "Anders Lee") OR (Drafted_Team = 'OTT' AND Player = "Vince Dunn");

Select * from nhl_combined1 where player = "Vince Dunn";

-- Count the occurrences of players with duplicate entries. select player, Count(ID) from nhl_combined1 group by player having count(*) > 1;

-- Count the number of players drafted by each team and display the results in descending order of player count. SELECT Drafted_Team, COUNT(*) AS Player_Count FROM nhl_combined1 GROUP BY Drafted_Team Order By Player_Count DESC;

image.png

-- Count the number of players from each current team who were not drafted and display the results in descending order of player count. SELECT current_team, Drafted_Team, COUNT(*) AS Player_Count FROM nhl_combined1 WHERE Drafted_Team = 'Undrafted' GROUP BY current_team ORDER BY Player_Count DESC;

image.png

-- Calculate average draft round, average age, average height (in inches), and average weight for each current team and display the results in descending order of average draft round. SELECT current_team AS Team, AVG(Rd) AS Avg_Round, AVG(Age) AS Avg_Age, AVG(12 * CAST(SUBSTRING_INDEX(Ht, ''', 1) AS UNSIGNED) + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(Ht, ''', -1), '"', 1) AS UNSIGNED)) AS Avg_Height_Inches, AVG(Wt) AS Avg_Weight FROM nhl_combined1 GROUP BY current_team ORDER BY Avg_Round DESC;

image.png image.png image.png

Conclusion:¶

Colorado utilized the undrafted market for role players while relying on low-round draft picks (high talent) players to be playmakers to win the Stanley Cup. Also, these tables show that successful teams rely heavily on drafting well to develop NHL talent, as shown with WSH, TBL, and STL (recent Cup winners). These teams do not have as many draft picks as other teams, but their scouts have produced NHL roster draft talent.

Looking at the season standings, teams like Florida Panthers, New York Rangers, Carolina Hurricanes, Toronto Maple Leafs, and Minnesota Wild have higher average rounds, but they have found success in finding value in these rounds. However, these teams have had difficulty making it to the playoff finals.

The statistics show a relatively competitive balance across teams, with a relatively small difference in the average round, age, height, and weight.

-- Count the number of players from each amateur team and display the results in descending order of player count. SELECT Amateur_Team, COUNT(*) as Player_Count_ATeam FROM nhl_combined1 GROUP BY Amateur_Team ORDER BY Player_Count_ATeam DESC;

image.png

For a young hockey talent, these are my recommendations:¶

  • From America, consider joining the USA U-18 Development Team (USDP/USHL) for enhanced exposure and top-tier development opportunities.
  • From Canada, explore options with top teams like London Knights, Soo Greyhounds, and Guelph Storm, known for producing NHL talent.
  • From Sweden, consider joining Frolunda HC Jr. for further development and growth in your hockey career.

-- Count the number of players for each nationality and display the results in descending order of player count. SELECT Nationality, COUNT(*) as Player_Count_Nat FROM nhl_combined1 GROUP BY Nationality ORDER BY Player_Count_Nat DESC;

image.png

In [ ]: