Explore the recent success of the Tampa Bay Lightning in the NHL and investigate their drafting strategies to uncover trends that have contributed to their superior roster.
The Tampa Bay Lightning have emerged as a frontrunner in the NHL in recent years, winning multiple Stanley Cups and posting consistently strong regular-season records. Identifying the variables that have fueled their success, especially in player drafting, can offer pivotal insights for other teams aspiring to elevate their game.
Web Scraping for Data Retrieval: Used Python to scrape the NHL official website and Hockey Reference to gather data on player variables like position, height, weight, birth details, and current team.
Data Storage: Loaded the scraped data into a MySQL database.
Data Manipulation: Cleaned the data in MySQL: fixed date-format issues, handled missing values, and extracted useful fields.
Integration: Used a left join operation to combine datasets for comprehensive analysis.
Feature Engineering: Derived new columns such as the age of players from their birthdates and created a unique identifier combining the initials and the last name of the players for joining datasets on player names.
Handling Anomalies: Updated some players' data manually in the database to account for discrepancies arising from name changes or usage of middle names.
Code snippets provided for web scraping from NHL websites, transforming the scraped data, and SQL commands for creating, updating, and joining tables.
Through this comprehensive data collection and cleaning process, we have a rich dataset for analyzing drafting patterns.
This analysis can be instrumental for teams looking to refine their drafting strategies, and for analysts and enthusiasts who want to understand the dynamics behind building a successful NHL roster.
# List of URLs to scrape data from
urls = [
'https://www.nhl.com/ducks/roster/2021',
'https://www.nhl.com/coyotes/roster/2021',
'https://www.nhl.com/bruins/roster/2021',
'https://www.nhl.com/sabres/roster/2021',
'https://www.nhl.com/flames/roster/2021',
'https://www.nhl.com/hurricanes/roster/2021',
'https://www.nhl.com/blackhawks/roster/2021',
'https://www.nhl.com/avalanche/roster/2021',
'https://www.nhl.com/bluejackets/roster/2021',
'https://www.nhl.com/stars/roster/2021',
'https://www.nhl.com/redwings/roster/2021',
'https://www.nhl.com/oilers/roster/2021',
'https://www.nhl.com/panthers/roster/2021',
'https://www.nhl.com/kings/roster/2021',
'https://www.nhl.com/wild/roster/2021',
'https://www.nhl.com/canadiens/roster/2021',
'https://www.nhl.com/predators/roster/2021',
'https://www.nhl.com/devils/roster/2021',
'https://www.nhl.com/islanders/roster/2021',
'https://www.nhl.com/rangers/roster/2021',
'https://www.nhl.com/senators/roster/2021',
'https://www.nhl.com/flyers/roster/2021',
'https://www.nhl.com/penguins/roster/2021',
'https://www.nhl.com/sharks/roster/2021',
'https://www.nhl.com/kraken/roster/2021',
'https://www.nhl.com/blues/roster/2021',
'https://www.nhl.com/lightning/roster/2021',
'https://www.nhl.com/mapleleafs/roster/2021',
'https://www.nhl.com/canucks/roster/2021',
'https://www.nhl.com/goldenknights/roster/2021',
'https://www.nhl.com/capitals/roster/2021',
'https://www.nhl.com/jets/roster/2021'
]
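Since every roster URL above follows the same pattern, the list can also be generated from the team slugs instead of typed out by hand. A minimal sketch (the slug list is taken directly from the URLs above; `roster_urls` is a hypothetical helper, not part of the original script):

```python
# Team slugs taken from the roster URLs above; the season year is a parameter.
TEAM_SLUGS = [
    "ducks", "coyotes", "bruins", "sabres", "flames", "hurricanes",
    "blackhawks", "avalanche", "bluejackets", "stars", "redwings", "oilers",
    "panthers", "kings", "wild", "canadiens", "predators", "devils",
    "islanders", "rangers", "senators", "flyers", "penguins", "sharks",
    "kraken", "blues", "lightning", "mapleleafs", "canucks", "goldenknights",
    "capitals", "jets",
]

def roster_urls(year=2021):
    """Build the roster URL for every team for the given season."""
    return [f"https://www.nhl.com/{slug}/roster/{year}" for slug in TEAM_SLUGS]
```

Calling `roster_urls(2021)` reproduces the 32-entry list above.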
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
# Initialize an empty list to store all the data
all_data1 = []
# Loop through each URL and scrape the data
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    tables = soup.find_all('table')

    # Table 2 on each roster page holds the Forwards
    table_forwards1 = tables[2]
    data_list_forwards1 = []
    headers_forwards1 = []
    for row in table_forwards1.find_all('tr'):
        if not headers_forwards1:
            headers_forwards1 = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data1 = [data.text.strip() for data in row.find_all('td')]
            data_list_forwards1.append(row_data1)

    # Table 5 holds the Defense
    table_defense1 = tables[5]
    data_list_defense1 = []
    headers_defense1 = []
    for row in table_defense1.find_all('tr'):
        if not headers_defense1:
            headers_defense1 = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data1 = [data.text.strip() for data in row.find_all('td')]
            data_list_defense1.append(row_data1)

    # Table 8 holds the Goalies
    table_goalies1 = tables[8]
    data_list_goalies1 = []
    headers_goalies1 = []
    for row in table_goalies1.find_all('tr'):
        if not headers_goalies1:
            headers_goalies1 = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data1 = [data.text.strip() for data in row.find_all('td')]
            data_list_goalies1.append(row_data1)

    # Create DataFrames for each table and add them to the 'all_data1' list
    df_forwards1 = pd.DataFrame(data_list_forwards1, columns=headers_forwards1)
    df_forwards1['current_team'] = url.split('/')[-3]
    all_data1.append(df_forwards1)
    df_defense1 = pd.DataFrame(data_list_defense1, columns=headers_defense1)
    df_defense1['current_team'] = url.split('/')[-3]
    all_data1.append(df_defense1)
    df_goalies1 = pd.DataFrame(data_list_goalies1, columns=headers_goalies1)
    df_goalies1['current_team'] = url.split('/')[-3]
    all_data1.append(df_goalies1)

    # Pause between requests to avoid overwhelming the server
    time.sleep(20)
# Concatenate all the DataFrames in the 'all_data1' list into a single DataFrame
Variables_df_combined1 = pd.concat(all_data1, ignore_index=True)
# Drop rows with None values
Variables_df_combined1 = Variables_df_combined1.dropna()
# Reset the DataFrame index
Variables_df_combined1 = Variables_df_combined1.reset_index(drop=True)
# Display the updated DataFrame
print(Variables_df_combined1)
       #  Pos  Sh         Ht   Wt                  Born  \
0     12    C   L  6'1"6' 1"  205  08/10/94Aug 10, 1994
1     39    C   R  6'0"6' 0"  200   02/04/92Feb 4, 1992
2     44   LW   L  6'2"6' 2"  210   01/08/99Jan 8, 1999
3     44   LW   L  6'1"6' 1"  220  02/22/91Feb 22, 1991
4     15    C   R  6'4"6' 4"  220  05/10/85May 10, 1985
...   ..   ..  ..        ...  ...                   ...
1207  54    D   L  6'4"6' 4"  219  01/24/99Jan 24, 1999
1208  88    D   L  6'0"6' 0"  195  07/16/91Jul 16, 1991
1209  64    D   L  6'7"6' 7"  242  05/26/98May 26, 1998
1210  31    G  --  6'1"6' 1"  183   07/06/95Jul 6, 1995
1211  37    G  --  6'4"6' 4"  216  05/19/93May 19, 1993

                  Birthplace current_team
0     Staten Island, NY, USA        ducks
1           Markham, ON, CAN        ducks
2         Longueuil, QC, CAN        ducks
3           LaSalle, QC, CAN        ducks
4             Regina, SK, CAN        ducks
...                      ...          ...
1207        Saginaw, MN, USA         jets
1208      St. Cloud, MN, USA         jets
1209       Waterloo, ON, CAN         jets
1210        Edmonton, AB, CAN        jets
1211       Commerce, MI, USA         jets

[1212 rows x 8 columns]
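The loop above spaces requests out with a fixed `time.sleep`, but transient network failures would still abort the whole scrape. A small retry helper is one way to guard against that; this is a sketch, not part of the original script, and `fetch_with_retry` takes the fetch function as a parameter so any callable (e.g. `requests.get` plus a status check) can be plugged in:

```python
import time

def fetch_with_retry(fetch, url, retries=3, backoff=2.0):
    """Call fetch(url) up to `retries` times, sleeping between attempts.

    `fetch` is any callable that raises on failure. Returns the first
    successful result, or re-raises the last error if all attempts fail.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # in practice, catch requests.RequestException
            last_error = exc
            time.sleep(backoff * (attempt + 1))  # linear backoff between tries
    raise last_error
```

In the scraping loop this would wrap the `requests.get(url)` call, leaving the rest of the parsing unchanged.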
Variables_df_combined1.describe()
|        | #    | Pos  | Sh  | Ht        | Wt  | Born                 | Birthplace       | current_team |
|--------|------|------|-----|-----------|-----|----------------------|------------------|--------------|
| count  | 1212 | 1212 | 1212 | 1212     | 1212 | 1212                | 1212             | 1212         |
| unique | 97   | 5    | 3   | 16        | 88  | 987                  | 615              | 32           |
| top    | 17   | D    | L   | 6'2"6' 2" | 190 | 08/14/92Aug 14, 1992 | Toronto, ON, CAN | canadiens    |
| freq   | 29   | 372  | 678 | 221       | 71  | 4                    | 33               | 46           |
# Count the distinct current teams
team_count = Variables_df_combined1['current_team'].nunique()
# Display the count of distinct current teams
print("Number of distinct current teams:", team_count)
Number of distinct current teams: 32
# Keep only the spaced height format (e.g. 6' 2") from the duplicated 'Ht' value
Variables_df_combined1['Ht'] = Variables_df_combined1['Ht'].str.extract(r"(\d+' \d+\")")
# Extract the 'MM/DD/YY' part from the 'Born' column using str.extract
Variables_df_combined1['Born'] = Variables_df_combined1['Born'].str.extract(r'(\d{2}/\d{2}/\d{2})')
Variables_df_combined1['Birthplace'] = Variables_df_combined1['Birthplace'].str.replace(',','_').str.replace(' ','')
# Step 2: Separate 'Birthplace' into 'Location' and 'Nationality'
Variables_df_combined1[['Birthplace', 'Nationality']] = Variables_df_combined1['Birthplace'].str.rsplit('_', n=1, expand=True)
Variables_df_combined1
|      | #  | Pos | Sh | Ht    | Wt  | Born     | Birthplace      | current_team | Nationality |
|------|----|-----|----|-------|-----|----------|-----------------|--------------|-------------|
| 0    | 12 | C   | L  | 6' 1" | 205 | 08/10/94 | StatenIsland_NY | ducks        | USA         |
| 1    | 39 | C   | R  | 6' 0" | 200 | 02/04/92 | Markham_ON      | ducks        | CAN         |
| 2    | 44 | LW  | L  | 6' 2" | 210 | 01/08/99 | Longueuil_QC    | ducks        | CAN         |
| 3    | 44 | LW  | L  | 6' 1" | 220 | 02/22/91 | LaSalle_QC      | ducks        | CAN         |
| 4    | 15 | C   | R  | 6' 4" | 220 | 05/10/85 | Regina_SK       | ducks        | CAN         |
| ...  | ...| ... | ...| ...   | ... | ...      | ...             | ...          | ...         |
| 1207 | 54 | D   | L  | 6' 4" | 219 | 01/24/99 | Saginaw_MN      | jets         | USA         |
| 1208 | 88 | D   | L  | 6' 0" | 195 | 07/16/91 | St.Cloud_MN     | jets         | USA         |
| 1209 | 64 | D   | L  | 6' 7" | 242 | 05/26/98 | Waterloo_ON     | jets         | CAN         |
| 1210 | 31 | G   | -- | 6' 1" | 183 | 07/06/95 | Edmonton_AB     | jets         | CAN         |
| 1211 | 37 | G   | -- | 6' 4" | 216 | 05/19/93 | Commerce_MI     | jets         | USA         |

1212 rows × 9 columns
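The roster pages render each height and birthdate twice (e.g. `6'2"6' 2"`), which is why the cleanup above extracts one canonical form with a regex. The same patterns can be exercised on plain strings with only the standard library; the sample values below come from the output above, and the helper names are illustrative, not from the original script:

```python
import re

HT_PATTERN = re.compile(r"(\d+' \d+\")")           # keeps the spaced form, e.g. 6' 2"
BORN_PATTERN = re.compile(r"(\d{2}/\d{2}/\d{2})")  # keeps the MM/DD/YY form

def clean_height(raw):
    """Extract the spaced height form from the doubled site value."""
    match = HT_PATTERN.search(raw)
    return match.group(1) if match else None

def clean_born(raw):
    """Extract the MM/DD/YY date from the doubled site value."""
    match = BORN_PATTERN.search(raw)
    return match.group(1) if match else None

def split_birthplace(raw):
    """Mirror the Birthplace cleanup: underscores, no spaces, nationality split off."""
    compact = raw.replace(",", "_").replace(" ", "")
    place, nationality = compact.rsplit("_", 1)
    return place, nationality
```

These mirror the `str.extract`, `str.replace`, and `str.rsplit` calls applied to the DataFrame columns above.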
Variables_df_combined1 = Variables_df_combined1.drop('#', axis=1)
print(Variables_df_combined1)
     Pos  Sh     Ht   Wt      Born       Birthplace current_team Nationality
0      C   L  6' 1"  205  08/10/94  StatenIsland_NY        ducks         USA
1      C   R  6' 0"  200  02/04/92       Markham_ON        ducks         CAN
2     LW   L  6' 2"  210  01/08/99     Longueuil_QC        ducks         CAN
3     LW   L  6' 1"  220  02/22/91       LaSalle_QC        ducks         CAN
4      C   R  6' 4"  220  05/10/85        Regina_SK        ducks         CAN
...   ..  ..    ...  ...       ...              ...          ...         ...
1207   D   L  6' 4"  219  01/24/99       Saginaw_MN         jets         USA
1208   D   L  6' 0"  195  07/16/91      St.Cloud_MN         jets         USA
1209   D   L  6' 7"  242  05/26/98      Waterloo_ON         jets         CAN
1210   G  --  6' 1"  183  07/06/95      Edmonton_AB         jets         CAN
1211   G  --  6' 4"  216  05/19/93      Commerce_MI         jets         USA

[1212 rows x 8 columns]
Variables_df_combined1.to_csv('NHL_Player_Variables_2021.csv',index=False)
def scrape_data_from_url(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    tables = soup.find_all('table')

    # Table 1 holds the Forwards
    table_forwards = tables[1]
    data_list_forwards = []
    headers_forwards = []
    for row in table_forwards.find_all('tr'):
        if not headers_forwards:
            headers_forwards = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data = [data.text.strip() for data in row.find_all('td')]
            data_list_forwards.append(row_data)

    # Table 4 holds the Defense
    table_defense = tables[4]
    data_list_defense = []
    headers_defense = []
    for row in table_defense.find_all('tr'):
        if not headers_defense:
            headers_defense = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data = [data.text.strip() for data in row.find_all('td')]
            data_list_defense.append(row_data)

    # Table 7 holds the Goalies
    table_goalies = tables[7]
    data_list_goalies = []
    headers_goalies = []
    for row in table_goalies.find_all('tr'):
        if not headers_goalies:
            headers_goalies = [header.text.strip() for header in row.find_all('th')]
        else:
            row_data = [data.text.strip() for data in row.find_all('td')]
            data_list_goalies.append(row_data)

    # Create DataFrames for each table
    df_forwards = pd.DataFrame(data_list_forwards, columns=headers_forwards)
    df_defense = pd.DataFrame(data_list_defense, columns=headers_defense)
    df_goalies = pd.DataFrame(data_list_goalies, columns=headers_goalies)

    # Concatenate the DataFrames
    df_combined = pd.concat([df_forwards, df_defense, df_goalies], ignore_index=True)

    # Remove any row with all None values (empty rows)
    df_combined.dropna(how='all', inplace=True)

    # Reset the DataFrame index
    df_combined = df_combined.reset_index(drop=True)

    # Clean the 'Player' column: strip newlines, captaincy markers, and asterisks
    df_combined['Player'] = df_combined['Player'].str.replace(r'\n', ' ', regex=True).str.strip()
    df_combined['Player'] = df_combined['Player'].apply(lambda x: x.replace('(A)', '').strip())
    df_combined['Player'] = df_combined['Player'].apply(lambda x: x.replace('(C)', '').strip())
    df_combined['Player'] = df_combined['Player'].apply(lambda x: x.replace('*', '').strip())
    # Drop the leading jersey number, keep the rest of the name
    df_combined['Player'] = df_combined['Player'].apply(lambda x: ' '.join(x.split()[1:]))

    return df_combined
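The `Player` cleanup above strips captaincy markers, asterisks, and the leading jersey number. The same chain of replacements can be checked on plain strings; the function below mirrors the lambda chain, and the sample names in the test are hypothetical, not from the scraped data:

```python
def clean_player(raw):
    """Mirror the Player-column cleanup: drop (A)/(C) markers and asterisks,
    collapse whitespace, and remove the leading jersey number."""
    name = raw.replace("\n", " ").strip()
    for marker in ("(A)", "(C)", "*"):
        name = name.replace(marker, "").strip()
    # The first whitespace-separated token is the jersey number; keep the rest.
    return " ".join(name.split()[1:])
```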
# Initialize an empty list to store all the DataFrames
all_dataframes = []
# Loop through each URL and scrape the data
for url in urls:
    df_combined = scrape_data_from_url(url)
    all_dataframes.append(df_combined)
    # Pause between requests to avoid overwhelming the server
    time.sleep(3)
# Concatenate all the DataFrames into a single DataFrame
df_combined = pd.concat(all_dataframes, ignore_index=True)
# Display the combined DataFrame
print(df_combined)
                   Player
0        Zach Aston-Reese
1             Sam Carrick
2             Max Comtois
3     Nicolas Deslauriers
4            Ryan Getzlaf
...                   ...
1207        Dylan Samberg
1208         Nate Schmidt
1209        Logan Stanley
1210          Eric Comrie
1211    Connor Hellebuyck

[1212 rows x 1 columns]
df_combined.describe()
|        | Player          |
|--------|-----------------|
| count  | 1212            |
| unique | 1124            |
| top    | Scott Wedgewood |
| freq   | 3               |
df_combined.to_csv('NHL_2021_Player_Names.csv',index=False)
from datetime import datetime
def get_draft_data(urls):
    all_dataframes = []
    current_year = datetime.now().year
    for url in urls:
        # Send an HTTP request to the URL
        response = requests.get(url)
        # Check the status code of the response
        if response.status_code == 200:
            # Parse the HTML content
            soup = BeautifulSoup(response.content, "html.parser")
            # Find the draft table (each page has a single table) and extract its data
            table = soup.find('table')
            # Check that the table was found before extracting data
            if table:
                table_data = []
                for row in table.find_all('tr')[1:]:
                    cols = row.find_all(['th', 'td'])
                    cols = [col.text.strip() for col in cols]
                    table_data.append(cols)
                # Create a DataFrame from the extracted data
                columns = table_data[0]
                data = table_data[1:]
                df = pd.DataFrame(data, columns=columns)
                # Find the rows where the column headers repeat mid-table
                repeated_headers_index = df[df['Year'].str.contains('Year')].index
                # Drop the repeated header rows
                df = df.drop(repeated_headers_index)
                # Reset the index of the DataFrame
                df.reset_index(drop=True, inplace=True)
                # Drop the unneeded columns
                columns_to_drop = ['Draft', 'Lg']
                df.drop(columns=columns_to_drop, inplace=True)
                # Add the 'Drafted_Team' column, set to the team code extracted from the URL
                Drafted_Team = url.split("/")[-2]
                df['Drafted_Team'] = Drafted_Team
                # Cast the 'Year' column to integer
                df['Year'] = df['Year'].astype(int)
                # Keep only drafts from the last 26 years
                df = df[df['Year'] >= current_year - 26]
                # Append the DataFrame to the list
                all_dataframes.append(df)
                # Sleep for 5 seconds before accessing the next URL
                time.sleep(5)
        else:
            print("Failed to get the webpage:", url)
    # Combine all DataFrames into one
    combined_df = pd.concat(all_dataframes, ignore_index=True)
    return combined_df
# Define the URLs of the webpages containing the tables
urls = [
"https://www.hockey-reference.com/teams/ANA/draft.html",
"https://www.hockey-reference.com/teams/PHX/draft.html",
"https://www.hockey-reference.com/teams/BOS/draft.html",
"https://www.hockey-reference.com/teams/BUF/draft.html",
"https://www.hockey-reference.com/teams/CGY/draft.html",
"https://www.hockey-reference.com/teams/CAR/draft.html",
"https://www.hockey-reference.com/teams/CHI/draft.html",
"https://www.hockey-reference.com/teams/COL/draft.html",
"https://www.hockey-reference.com/teams/CBJ/draft.html",
"https://www.hockey-reference.com/teams/DAL/draft.html",
"https://www.hockey-reference.com/teams/DET/draft.html",
"https://www.hockey-reference.com/teams/EDM/draft.html",
"https://www.hockey-reference.com/teams/FLA/draft.html",
"https://www.hockey-reference.com/teams/LAK/draft.html",
"https://www.hockey-reference.com/teams/MIN/draft.html",
"https://www.hockey-reference.com/teams/MTL/draft.html",
"https://www.hockey-reference.com/teams/NSH/draft.html",
"https://www.hockey-reference.com/teams/NJD/draft.html",
"https://www.hockey-reference.com/teams/NYI/draft.html",
"https://www.hockey-reference.com/teams/NYR/draft.html",
"https://www.hockey-reference.com/teams/OTT/draft.html",
"https://www.hockey-reference.com/teams/PHI/draft.html",
"https://www.hockey-reference.com/teams/PIT/draft.html",
"https://www.hockey-reference.com/teams/SJS/draft.html",
"https://www.hockey-reference.com/teams/SEA/draft.html",
"https://www.hockey-reference.com/teams/STL/draft.html",
"https://www.hockey-reference.com/teams/TBL/draft.html",
"https://www.hockey-reference.com/teams/TOR/draft.html",
"https://www.hockey-reference.com/teams/VAN/draft.html",
"https://www.hockey-reference.com/teams/VEG/draft.html",
"https://www.hockey-reference.com/teams/WSH/draft.html",
"https://www.hockey-reference.com/teams/WPG/draft.html"
]
# Get the DataFrame with the extracted and combined data from all URLs
df_combined = get_draft_data(urls)
# Display the combined DataFrame
print(df_combined)
      Year Rd. Overall           Player Pos                    Amateur Team  \
0     2023   1       2     Leo Carlsson   C                 Orebro (Sweden)
1     2023   2      33    Nico Myatovic  LW      Seattle Thunderbirds (WHL)
2     2023   2      59   Carey Terrance   C               Erie Otters (OHL)
3     2023   2      60     Damian Clara   G      Farjestad Jr. (Sweden-Jr.)
4     2023   3      65    Coulson Pitre  RW           Flint Firebirds (OHL)
...    ...  ..     ...              ...  ..                             ...
6316  1999   6     159  Yuri Dobryshkin  RW  Krylia Sovetov Moskva (Russia)
6317  1999   7     188     Stephen Baby  RW       Green Bay Gamblers (USHL)
6318  1999   8     217    Garnet Exelby   D          Saskatoon Blades (WHL)
6319  1999   9     245    Tommi Santala   C               Jokerit (Finland)
6320  1999   9     246  Raymond DiLauro   D             St. Lawrence (ECAC)

       GP  G   A PTS  +/- PIM GP  W  L T/O SV% GAA Drafted_Team
0                                                           ANA
1                                                           ANA
2                                                           ANA
3                                                           ANA
4                                                           ANA
...   ... ..  ..  ..  ...  ... .. .. ..  ..  ..  ..         ...
6316                                                        WPG
6317                                                        WPG
6318  408  7  43  50  -28  584                              WPG
6319   63  2   7   9   -7   46                              WPG
6320                                                        WPG

[6321 rows x 19 columns]
# List of all the drafted teams from the URLs
all_teams = [
"ANA", "PHX", "BOS", "BUF", "CGY", "CAR", "CHI", "COL", "CBJ", "DAL", "DET", "EDM", "FLA", "LAK",
"MIN", "MTL", "NSH", "NJD", "NYI", "NYR", "OTT", "PHI", "PIT", "SJS", "SEA", "STL", "TBL", "TOR",
"VAN", "VEG", "WSH", "WPG"
]
# Get the unique drafted teams from the DataFrame
unique_teams_in_df = df_combined['Drafted_Team'].unique()
# Find the missing drafted teams
missing_teams = [team for team in all_teams if team not in unique_teams_in_df]
# Display the missing teams
print("Missing drafted teams:", missing_teams)
Missing drafted teams: []
minimum_year = df_combined['Year'].min()
print(minimum_year)
1997
# Drop the specified columns from the DataFrame
columns_to_drop = ['GP', 'G', 'A', 'PTS', '+/-', 'PIM', 'W', 'L', 'T/O', 'SV%', 'GAA']
df_combined.drop(columns=columns_to_drop, inplace=True)
# Save the combined DataFrame to a CSV file named "draft_data.csv"
df_combined.to_csv("NHL_draft_2023_data.csv", index=False)
# Sanity check: filter rows where the player name is "Joe Thornton"
thornton_df = df_combined[df_combined["Player"] == "Joe Thornton"]
# Display the matching rows
print(thornton_df)
     Year Rd. Overall        Player Pos          Amateur Team Drafted_Team
591  1997   1       1  Joe Thornton   C  Soo Greyhounds (OHL)          BOS
CREATE DATABASE IF NOT EXISTS nhl_2021_data;
USE nhl_2021_data;
CREATE TABLE nhl_variables (
    ID INT AUTO_INCREMENT PRIMARY KEY,
    Pos VARCHAR(3),
    Sh VARCHAR(5),
    Ht VARCHAR(40),
    Wt INT,
    Born VARCHAR(255),
    Birthplace VARCHAR(255),
    current_team VARCHAR(255),
    Nationality VARCHAR(255)
);
LOAD DATA INFILE "C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/NHL_Player_Variables_2021.csv"
INTO TABLE nhl_variables
FIELDS TERMINATED BY ','
IGNORE 1 LINES
(Pos, Sh, Ht, Wt, Born, Birthplace, current_team, Nationality);
-- Add an Age column
ALTER TABLE nhl_2021_data.nhl_variables ADD COLUMN Age INT;
-- Disable safe update mode to allow UPDATE without a WHERE clause
SET SQL_SAFE_UPDATES = 0;
-- Convert the 'Born' column to the 'YYYY-MM-DD' date format
UPDATE nhl_variables
SET Born = DATE_FORMAT(STR_TO_DATE(Born, '%m/%d/%y'), '%Y-%m-%d');

-- Calculate and populate the 'Age' column for each row
UPDATE nhl_variables
SET Age = FLOOR(DATEDIFF(CURDATE(), Born) / 365);
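One caveat with two-digit years: `STR_TO_DATE` with `%y` pivots the century (MySQL maps 00-69 to 2000-2069 and 70-99 to 1970-1999), which works here because no active player was born before 1970. Python's `strptime` applies a very similar pivot, so the conversion and the `FLOOR(DATEDIFF(...) / 365)` age arithmetic can be sanity-checked outside the database. A sketch (the helper names are illustrative):

```python
from datetime import date, datetime

def born_to_date(born):
    """Parse MM/DD/YY the way the SQL STR_TO_DATE('%m/%d/%y') call does.

    Python's %y pivot (69-99 -> 1900s, 00-68 -> 2000s) is close enough to
    MySQL's (70-99 -> 1900s, 00-69 -> 2000s) for NHL birth years.
    """
    return datetime.strptime(born, "%m/%d/%y").date()

def age_in_years(born, today=None):
    """Mirror FLOOR(DATEDIFF(CURDATE(), Born) / 365) from the SQL above."""
    today = today or date.today()
    return (today - born).days // 365
```

Note that dividing by 365 slightly overstates age across leap years; it matches the SQL, but an exact birthday comparison would be more precise.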
-- Remove double quotes from the 'Ht' column
UPDATE nhl_variables SET Ht = REPLACE(Ht, '"', '');
SELECT * FROM nhl_variables;
CREATE TABLE nhl_players (
    ID INT AUTO_INCREMENT PRIMARY KEY,
    Player VARCHAR(255)
);
LOAD DATA INFILE "C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/NHL_2021_Player_Names.csv"
INTO TABLE nhl_players
FIELDS TERMINATED BY ','
IGNORE 1 LINES
(Player);
-- Join both tables so each name is paired with the correct player stats
CREATE TABLE nhl_combined AS
SELECT
    p.ID, p.Player,
    v.Pos, v.Sh, v.Ht, v.Wt, v.Born, v.Birthplace, v.current_team, v.Nationality, v.Age,
    CONCAT(LEFT(p.Player, 3), ' ', SUBSTRING_INDEX(p.Player, ' ', -1)) AS Player_Name_Initial_Last
FROM nhl_players p
LEFT JOIN nhl_variables v ON p.ID = v.ID;
CREATE TABLE nhl_draft (
    Year INT,
    Rd INT,
    Overall INT,
    Player VARCHAR(255),
    Pos VARCHAR(10),
    Amateur_Team VARCHAR(255),
    Drafted_Team VARCHAR(255)
);
LOAD DATA INFILE "C:/ProgramData/MySQL/MySQL Server 8.0/Uploads/NHL_draft_2023_data.csv"
INTO TABLE nhl_draft
FIELDS TERMINATED BY ','
IGNORE 1 LINES;
-- Add a new column for the shortened player name
ALTER TABLE nhl_draft ADD COLUMN Player_Name_InitialLast VARCHAR(255);
-- Build the join key, because the raw names don't always match exactly
SET SQL_SAFE_UPDATES = 0;
UPDATE nhl_draft
SET Player_Name_InitialLast = CONCAT(LEFT(Player, 3), ' ', SUBSTRING_INDEX(Player, ' ', -1));
SELECT * FROM nhl_draft;
-- Join the tables on the shortened names
CREATE TABLE nhl_combined1 AS
SELECT
    c.ID, c.Player, c.Pos, c.Sh, c.Ht, c.Wt, c.Born, c.Birthplace, c.Age,
    c.current_team, c.Nationality,
    d.Year AS draft_year, d.Rd, d.Overall, d.Amateur_Team, d.Drafted_Team
FROM nhl_combined c
LEFT JOIN nhl_draft d
    ON c.Player_Name_Initial_Last = d.Player_Name_InitialLast;
-- Update the mismatched names with correct values (some players go by middle names or changed names, e.g. Joseph going by Joe)
UPDATE nhl_combined1 SET draft_year = 2021, Rd = 2, Overall = 60, Pos = 'D', Amateur_Team = 'EHC Biel-Bienne (Swiss)', Drafted_Team = 'PHX' WHERE Player = 'J.J. Moser';
UPDATE nhl_combined1 SET draft_year = 2020, Rd = 2, Overall = 34, Pos = 'RW', Amateur_Team = 'EHC Munchen (Germany)', Drafted_Team = 'BUF' WHERE Player = 'JJ Peterka';
UPDATE nhl_combined1 SET draft_year = 2009, Rd = 4, Overall = 95, Pos = 'G', Amateur_Team = 'Montreal Juniors (QMJHL)', Drafted_Team = 'LAK' WHERE Player = 'J-F Berube';
UPDATE nhl_combined1 SET draft_year = 2018, Rd = 1, Overall = 30, Pos = 'C', Amateur_Team = 'Drummondville Voltigeurs (QMJHL)', Drafted_Team = 'DET' WHERE Player = 'Joe Veleno';
UPDATE nhl_combined1 SET draft_year = 2014, Rd = 1, Overall = 19, Pos = 'D', Amateur_Team = 'Sarnia Sting (OHL)', Drafted_Team = 'TBL' WHERE Player = 'Tony DeAngelo';
UPDATE nhl_combined1 SET draft_year = 2017, Rd = 4, Overall = 103, Pos = 'D', Amateur_Team = 'Waterloo Black Hawks (USHL)', Drafted_Team = 'LAK' WHERE Player = 'Mikey Anderson';
UPDATE nhl_combined1 SET draft_year = 2015, Rd = 3, Overall = 85, Pos = 'C', Amateur_Team = 'Waterloo Black Hawks (USHL)', Drafted_Team = 'NSH' WHERE Player = 'Tommy Novak';
UPDATE nhl_combined1 SET draft_year = 1996, Rd = 3, Overall = 56, Pos = 'D', Amateur_Team = 'HK Dukla Trencin Jr. (Slovakia Jr.)', Drafted_Team = 'NYI' WHERE Player = 'Zdeno Chara';
UPDATE nhl_combined1 SET draft_year = 2017, Rd = 6, Overall = 185, Pos = 'C', Amateur_Team = 'Ottawa 67s (OHL)', Drafted_Team = 'SJS' WHERE Player = 'Sasha Chmelevski';
UPDATE nhl_combined1 SET draft_year = 2014, Rd = 7, Overall = 210, Pos = 'D', Amateur_Team = 'Ottawa 67s (OHL)', Drafted_Team = 'LAK' WHERE Player = 'Jake Middleton';
UPDATE nhl_combined1 SET draft_year = 2013, Rd = 7, Overall = 202, Pos = 'LW', Amateur_Team = 'Frolunda HC Jr. (Sweden Jr.)', Drafted_Team = 'TOR' WHERE Player = 'Andreas Johnsson';
UPDATE nhl_combined1 SET draft_year = 2019, Rd = 6, Overall = 61, Pos = 'D', Amateur_Team = 'Ottawa 67s (OHL)', Drafted_Team = 'NJD' WHERE Player = 'Nikita Okhotiuk';
UPDATE nhl_combined1 SET draft_year = 2012, Rd = 1, Overall = 23, Pos = 'D', Amateur_Team = 'Dubuque Fighting Saints (USHL)', Drafted_Team = 'FLA' WHERE Player = 'Mike Matheson';
UPDATE nhl_combined1 SET draft_year = 2016, Rd = 3, Overall = 73, Pos = 'RW', Amateur_Team = 'USA U-18 Development Team (USDP/USHL)', Drafted_Team = 'NJD' WHERE Player = 'Joey Anderson';
UPDATE nhl_combined1 SET draft_year = 2008, Rd = 4, Overall = 114, Pos = 'D', Amateur_Team = 'Saginaw Spirit (OHL)', Drafted_Team = 'CGY' WHERE Player = 'TJ Brodie';
UPDATE nhl_combined1 SET draft_year = 2014, Rd = 5, Overall = 129, Pos = 'LW', Amateur_Team = 'Sioux Falls Stampede (USHL)', Drafted_Team = 'WPG' WHERE Player = 'CJ Suess';
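The manual updates above are needed because the join key (first three characters plus last name) diverges whenever a player's registered name differs between sources, e.g. a middle name or nickname. A small sketch of the key, mirroring the SQL `CONCAT(LEFT(Player, 3), ' ', SUBSTRING_INDEX(Player, ' ', -1))` expression, shows one such mismatch:

```python
def join_key(player):
    """Mirror CONCAT(LEFT(Player, 3), ' ', SUBSTRING_INDEX(Player, ' ', -1))."""
    return player[:3] + " " + player.rsplit(" ", 1)[-1]
```

"Joe Veleno" on the roster page and "Joseph Veleno" in the draft table produce different keys, so the LEFT JOIN misses the row and the record has to be patched by hand.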
-- Identify players with no draft record after the join
SELECT * FROM nhl_combined1 WHERE Drafted_Team IS NULL;

-- Mark those players as undrafted
UPDATE nhl_combined1 SET Drafted_Team = 'Undrafted' WHERE Drafted_Team IS NULL;
-- Delete duplicates, keeping the lowest-ID row for each player
DELETE nhl_combined1
FROM nhl_combined1
JOIN (
    SELECT Player, MIN(ID) AS min_id
    FROM nhl_combined1
    GROUP BY Player
    HAVING COUNT(*) > 1
) duplicates ON nhl_combined1.Player = duplicates.Player
WHERE nhl_combined1.ID > duplicates.min_id;
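The duplicate-removal logic (keep the row with the lowest ID per player) can be expressed in plain Python for a quick check. This is a sketch: the rows here are hypothetical `(ID, Player)` pairs, not the full table:

```python
def dedupe_keep_min_id(rows):
    """Given (ID, Player) pairs, keep only the lowest-ID row per player,
    mirroring the DELETE ... JOIN (SELECT Player, MIN(ID) ...) statement."""
    best = {}
    for row_id, player in rows:
        if player not in best or row_id < best[player]:
            best[player] = row_id
    return sorted((row_id, player) for player, row_id in best.items())
```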
-- Delete the incorrect entries caused by different players sharing a name in the draft table (manually checked for data validation)
DELETE FROM nhl_combined1
WHERE (Drafted_Team = 'CHI' AND Player = 'Craig Anderson')
   OR (Drafted_Team = 'NYI' AND Player = 'Sebastian Aho')
   OR (Drafted_Team = 'ANA' AND Player = 'Frederik Andersen')
   OR (Drafted_Team = 'COL' AND Player = 'Jonas Johansson')
   OR (Drafted_Team = 'ANA' AND Player = 'Brendan Perlini')
   OR (Drafted_Team = 'EDM' AND Player = 'Matt Roy')
   OR (Drafted_Team = 'COL' AND Player = 'Josh Anderson')
   OR (Drafted_Team = 'NJD' AND Player = 'Josh Anderson')
   OR (Drafted_Team = 'CAR' AND Player = 'Erik Karlsson')
   OR (Drafted_Team = 'DET' AND Player = "Ryan O'Reilly")
   OR (Drafted_Team = 'VAN' AND Player = 'Nick Jensen')
   OR (Drafted_Team = 'LAK' AND Player = 'Jack Hughes')
   OR (Drafted_Team = 'OTT' AND Player = 'Colton White')
   OR (Drafted_Team = 'NJD' AND Player = 'Colin White')
   OR (draft_year = '2022' AND Player = 'Elias Pettersson')
   OR (Drafted_Team = 'VAN' AND Player = 'Nathan Smith')
   OR (Drafted_Team = 'LAK' AND Player = 'Anders Lee')
   OR (Drafted_Team = 'OTT' AND Player = 'Vince Dunn');
SELECT * FROM nhl_combined1 WHERE Player = 'Vince Dunn';
-- Count the occurrences of players with duplicate entries
SELECT Player, COUNT(ID)
FROM nhl_combined1
GROUP BY Player
HAVING COUNT(*) > 1;
-- Count the number of players drafted by each team, in descending order of player count
SELECT Drafted_Team, COUNT(*) AS Player_Count
FROM nhl_combined1
GROUP BY Drafted_Team
ORDER BY Player_Count DESC;
-- Count the undrafted players on each current team, in descending order of player count
SELECT current_team, Drafted_Team, COUNT(*) AS Player_Count
FROM nhl_combined1
WHERE Drafted_Team = 'Undrafted'
GROUP BY current_team
ORDER BY Player_Count DESC;
-- Calculate average draft round, age, height (in inches), and weight for each current team, in descending order of average draft round
SELECT
    current_team AS Team,
    AVG(Rd) AS Avg_Round,
    AVG(Age) AS Avg_Age,
    AVG(12 * CAST(SUBSTRING_INDEX(Ht, "'", 1) AS UNSIGNED)
        + CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(Ht, "'", -1), '"', 1) AS UNSIGNED)) AS Avg_Height_Inches,
    AVG(Wt) AS Avg_Weight
FROM nhl_combined1
GROUP BY current_team
ORDER BY Avg_Round DESC;
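The height expression above splits a value like `6' 1` on the quote characters and computes `12 * feet + inches`. The same conversion in Python is handy for validating the SQL output (a sketch; `height_to_inches` is an illustrative helper, not part of the original pipeline):

```python
def height_to_inches(ht):
    """Convert a height like 6' 1" to total inches,
    mirroring the 12 * feet + inches expression in the SQL."""
    feet, _, rest = ht.partition("'")
    inches = rest.replace('"', "").strip()
    return 12 * int(feet) + int(inches)
```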
Colorado used the undrafted market for role players while relying on high draft picks (early-round, high-talent players) as playmakers to win the Stanley Cup. These tables also show that successful teams rely heavily on drafting well to develop NHL talent, as with WSH, TBL, and STL (recent Cup winners): these teams do not hold as many picks as others, but their scouts have consistently turned picks into NHL roster players.
Looking at the season standings, teams like the Florida Panthers, New York Rangers, Carolina Hurricanes, Toronto Maple Leafs, and Minnesota Wild have higher average draft rounds, yet have found value in those later rounds. These teams, however, have had difficulty advancing to the playoff finals.
Overall, the statistics show a competitive balance across teams, with relatively small differences in average round, age, height, and weight.
-- Count the number of players from each amateur team, in descending order of player count
SELECT Amateur_Team, COUNT(*) AS Player_Count_ATeam
FROM nhl_combined1
GROUP BY Amateur_Team
ORDER BY Player_Count_ATeam DESC;
-- Count the number of players of each nationality, in descending order of player count
SELECT Nationality, COUNT(*) AS Player_Count_Nat
FROM nhl_combined1
GROUP BY Nationality
ORDER BY Player_Count_Nat DESC;