As a data analyst, you have been tasked with developing a predictive model that can accurately forecast the popularity of a given song based on its underlying audio features. This task holds significant potential for music industry professionals, including record label executives, music producers, and artists, who are constantly seeking innovative ways to leverage data-driven insights and optimize their creative and promotional strategies. By accurately identifying the key audio features that are most influential in predicting song popularity, this model can aid music professionals in crafting successful new compositions and implementing more effective marketing campaigns.
To gain access to the Spotify API, follow these steps:
Go to https://developer.spotify.com/dashboard -> Log in to Spotify -> Create an app -> Click Settings -> View client secret -> Copy the client ID and client secret into the "cid" and "secret" variables in the code.
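Since pasting the secret directly into the notebook makes it easy to leak, one option is to read both values from environment variables instead. The following is a minimal sketch, not part of the original notebook: the helper name load_spotify_credentials is hypothetical, and the variable names SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET are assumed here because they are the names used in Spotipy's documentation.

```python
import os


def load_spotify_credentials(env=os.environ):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper for illustration; the variable names are an assumption.
    """
    cid = env.get("SPOTIPY_CLIENT_ID")
    secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not cid or not secret:
        raise RuntimeError(
            "Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET before running the notebook."
        )
    return cid, secret
```

The returned pair could then be assigned to cid and secret in place of the hard-coded "xx" placeholders.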
This code gathers data on music tracks from the Spotify API using the Spotipy library.
The code starts by searching for tracks released in the year 2023 and retrieving their artist name, track name, popularity, and track ID. This data is stored in separate lists.
Then, the track IDs are used to retrieve additional data on the audio features of each track, including danceability, energy, and loudness, using the sp.audio_features() function provided by the Spotipy library. The results are stored in a list called rows.
Once both sets of data have been collected, they are merged into a single Pandas dataframe using the pd.merge() function. The resulting dataset contains the artist name, track name, popularity, track ID, and various audio features for each 2023 track returned by the Spotify API search.
Data Cleaning and Preprocessing Tasks: These functions clean and prepare the data before it is fed into a machine learning model. The clean_data(df) function cleans the dataset by imputing missing values and removing duplicates, inconsistent rows, and invalid values. The non_scale_clean_data(df) function performs the same cleaning process as clean_data(df) but does not scale the numerical columns. The preprocess_data(df) function applies Isolation Forest to remove outliers, scales the numerical columns, and one-hot encodes the categorical columns. The select_features(X, y) function performs feature scaling and one-hot encoding for categorical features.
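As a rough illustration of the cleaning steps described above, here is a simplified sketch. It is not the notebook's actual clean_data(df): the function name clean_data_sketch is hypothetical, median imputation is swapped in as a simpler stand-in for whatever imputer the notebook uses, scaling is left out, and the validity ranges (popularity in 0 to 100, danceability in 0 to 1) are assumptions based on how Spotify reports those fields.

```python
import pandas as pd


def clean_data_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Simplified cleaning: de-duplicate, drop invalid rows, impute missing numerics."""
    # Remove duplicate tracks, keeping the first occurrence of each track_id
    out = df.drop_duplicates(subset="track_id").copy()

    # Drop rows with out-of-range values (assumed valid ranges, see lead-in);
    # a missing danceability is allowed here because it is imputed below
    valid = out["popularity"].between(0, 100) & (
        out["danceability"].between(0, 1) | out["danceability"].isna()
    )
    out = out[valid].copy()

    # Impute remaining missing numeric values with the column median
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    return out.reset_index(drop=True)


# Tiny made-up frame: one duplicate, one invalid popularity, one missing value
raw = pd.DataFrame({
    "track_id": ["a", "a", "b", "c", "d"],
    "popularity": [90, 90, 50, 150, 70],
    "danceability": [0.5, 0.5, None, 0.4, 0.7],
})
cleaned = clean_data_sketch(raw)
print(cleaned)
```

Doing the range check before imputation keeps invalid rows from influencing the medians used to fill the gaps.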
Comparing Three Song Popularity Prediction Models: The first model is a linear regression model that predicts the popularity of a song based on its features, such as danceability, energy, and tempo. It uses Ridge regression with hyperparameter tuning and is evaluated using mean squared error and R^2.
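The general shape of such a regression workflow can be sketched as follows. This is an illustration on synthetic data, not the notebook's model: the feature matrix, the alpha grid, and the use of GridSearchCV for the tuning step are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for audio features (X) and popularity (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 50 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(scale=2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Tune the regularization strength alpha with 5-fold cross-validation
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the held-out test set
pred = search.predict(X_test)
mse = mean_squared_error(y_test, pred)
r2 = r2_score(y_test, pred)
print("best alpha:", search.best_params_["alpha"])
print("MSE:", round(mse, 3), "R^2:", round(r2, 3))
```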
The second model is a binary classification model that predicts whether a song is popular or not based on its features. It preprocesses the data using a pipeline that includes scaling numerical features and one-hot encoding categorical features, performs feature selection using a Random Forest Classifier, and handles data imbalance using SMOTE. It then uses logistic regression as the classification algorithm, tunes hyperparameters using GridSearchCV, and evaluates performance using cross-validation and accuracy score.
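A stripped-down version of this kind of pipeline might look like the sketch below. It runs on synthetic numeric data and is not the notebook's code: the data, the grid over C, and the step names are assumptions, and the one-hot encoding and SMOTE steps are omitted to keep the example dependent on scikit-learn only. With SMOTE, imbalanced-learn's own Pipeline class would typically be used so that resampling is applied only while fitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 5 numeric features, only the first two carry signal
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=400) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale -> keep features a random forest ranks above average -> logistic regression
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Tune the inverse regularization strength C with 5-fold cross-validation
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, grid.predict(X_test))
print("best C:", grid.best_params_["clf__C"])
print("test accuracy:", round(test_accuracy, 3))
```

Keeping scaling and feature selection inside the pipeline means they are refit on each cross-validation fold, which avoids leaking information from the validation split into the preprocessing.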
The third model uses Auto-sklearn to automatically select the best classification algorithm and hyperparameters for predicting whether a song is popular or not based on its features. It preprocesses the data by scaling numerical features and one-hot encoding categorical features and uses cross-validation to find the best model.
Analyzing Model Performance: The model's performance is evaluated using the Receiver Operating Characteristic (ROC) curve, which plots the true positive rate (TPR) against the false positive rate (FPR) for different threshold values. The area under the ROC curve (AUC) is calculated as a measure of the model's discriminatory power. The permutation importance is then computed to identify the features that contribute most to the model's performance. Finally, the weights of each feature and the summary statistics of the model are printed, and the leaderboard and the details of each model's performance are displayed.
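To make the ROC and AUC definitions concrete, the following sketch computes both by hand for four made-up predictions. The function names roc_points and auc_trapezoid are hypothetical, and the sketch assumes both classes are present in y_true; in the notebook the same quantities come from scikit-learn's roc_curve and auc.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping a threshold over the scores."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    fpr, tpr = [0.0], [0.0]
    # Lower the threshold step by step; a sample is predicted positive if score >= thr
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        tpr.append(tp / pos)
        fpr.append(fp / neg)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )


y_true = [0, 0, 1, 1]            # 1 = popular, 0 = not popular
scores = [0.1, 0.4, 0.35, 0.8]   # predicted probability of "popular"
fpr, tpr = roc_points(y_true, scores)
area = auc_trapezoid(fpr, tpr)
print(list(zip(fpr, tpr)))
print("AUC:", area)  # 0.75 for this example
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of popular above non-popular songs.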
Because my computer could not run the autosklearn model directly, I ran it from the Bash shell in Windows Subsystem for Linux (WSL), which I set up by following the procedure outlined in https://www.wikihow.com/Install-Linux.
Limitations of this dataset include its limited scope: it contains only a specific set of features for each track, such as popularity, danceability, and energy. Other important information, such as lyrics, release date, or album name, is not included. Additionally, the dataset contains only tracks returned by a search for the year 2023, so tracks from earlier years and decades, which may be relevant in certain analyses, are not covered.
Another limitation is that the dataset may be biased towards certain genres or artists due to factors such as popularity or market trends. This can limit the generalizability of any conclusions drawn from this dataset. Additionally, the dataset encompasses only top-rated popular songs, so undiscovered songs fall outside the scope of the study.
import logging
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, IsolationForest
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix, mean_squared_error, r2_score, roc_curve, auc, roc_auc_score
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
import autosklearn.classification
from sklearn.inspection import permutation_importance
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
import ydata_profiling
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
# Replace the placeholders with your own Spotify client ID and client secret
cid = "xx"
secret = "xx"
client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
import timeit
# create a function to retrieve tracks
def retrieve_tracks(year, limit):
    """Page through Spotify search results for `year`, 50 tracks per request."""
    offset = 0
    results = []
    while True:
        track_results = sp.search(q=f'year:{year}', type='track', limit=50, offset=offset)
        items = track_results['tracks']['items']
        results.extend(items)
        offset += len(items)
        # stop on a short (final) page or once `limit` tracks have been collected
        if len(items) < 50 or offset >= limit:
            break
    return results
start = timeit.default_timer()
# retrieve tracks for year 2023, limit of 1000
tracks = retrieve_tracks(year=2023, limit=1000)
# extract relevant information and store in lists
artist_name = [track['artists'][0]['name'] for track in tracks]
track_name = [track['name'] for track in tracks]
track_id = [track['id'] for track in tracks]
popularity = [track['popularity'] for track in tracks]
stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)
Time to run this code (in seconds): 7.474733628999957
print('number of elements in the track_id list:', len(track_id))
number of elements in the track_id list: 1000
df_tracks = pd.DataFrame({'artist_name':artist_name,'track_name':track_name,'track_id':track_id,'popularity':popularity})
print(df_tracks.shape)
df_tracks.head()
(1000, 4)
| | artist_name | track_name | track_id | popularity |
---|---|---|---|---|
0 | PinkPantheress | Boy's a liar Pt. 2 | 6AQbmUe0Qwf5PZnt4HmTXv | 97 |
1 | Miley Cyrus | Flowers | 0yLdNVWF3Srea0uzk55zFn | 100 |
2 | Morgan Wallen | Last Night | 59uQI0PADDKeE6UZDTJEe8 | 89 |
3 | Morgan Wallen | Last Night | 7K3BhSpAxZBznislvUMVtn | 88 |
4 | Morgan Wallen | Thinkin’ Bout Me | 0PAcdVzhPO4gq1Iym9ESnK | 86 |
# again measuring the time
start = timeit.default_timer()
# empty list, batchsize and the counter for None results
rows = []
batchsize = 100
None_counter = 0
for i in range(0, len(df_tracks['track_id']), batchsize):
    batch = df_tracks['track_id'][i:i+batchsize]
    feature_results = sp.audio_features(batch)
    for t in feature_results:
        # tracks without audio features come back as None
        if t is None:
            None_counter = None_counter + 1
        else:
            rows.append(t)
print('Number of tracks where no audio features were available:', None_counter)
stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)
Number of tracks where no audio features were available: 2
Time to run this code (in seconds): 3.023385842000039
df_audio_features = pd.DataFrame.from_dict(rows,orient='columns')
print("Shape of the dataset:", df_audio_features.shape)
df_audio_features.head()
Shape of the dataset: (998, 18)
| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.696 | 0.809 | 5 | -8.254 | 1 | 0.0500 | 0.2520 | 0.000128 | 0.2480 | 0.857 | 132.962 | audio_features | 6AQbmUe0Qwf5PZnt4HmTXv | spotify:track:6AQbmUe0Qwf5PZnt4HmTXv | https://api.spotify.com/v1/tracks/6AQbmUe0Qwf5... | https://api.spotify.com/v1/audio-analysis/6AQb... | 131013 | 4 |
1 | 0.707 | 0.681 | 0 | -4.325 | 1 | 0.0668 | 0.0632 | 0.000005 | 0.0322 | 0.646 | 117.999 | audio_features | 0yLdNVWF3Srea0uzk55zFn | spotify:track:0yLdNVWF3Srea0uzk55zFn | https://api.spotify.com/v1/tracks/0yLdNVWF3Sre... | https://api.spotify.com/v1/audio-analysis/0yLd... | 200455 | 4 |
2 | 0.517 | 0.675 | 6 | -5.382 | 1 | 0.0357 | 0.4590 | 0.000000 | 0.1510 | 0.518 | 203.853 | audio_features | 59uQI0PADDKeE6UZDTJEe8 | spotify:track:59uQI0PADDKeE6UZDTJEe8 | https://api.spotify.com/v1/tracks/59uQI0PADDKe... | https://api.spotify.com/v1/audio-analysis/59uQ... | 163855 | 4 |
3 | 0.492 | 0.675 | 6 | -5.456 | 1 | 0.0389 | 0.4670 | 0.000000 | 0.1420 | 0.478 | 203.759 | audio_features | 7K3BhSpAxZBznislvUMVtn | spotify:track:7K3BhSpAxZBznislvUMVtn | https://api.spotify.com/v1/tracks/7K3BhSpAxZBz... | https://api.spotify.com/v1/audio-analysis/7K3B... | 163855 | 4 |
4 | 0.656 | 0.757 | 3 | -5.775 | 0 | 0.0308 | 0.4920 | 0.000000 | 0.1170 | 0.429 | 139.971 | audio_features | 0PAcdVzhPO4gq1Iym9ESnK | spotify:track:0PAcdVzhPO4gq1Iym9ESnK | https://api.spotify.com/v1/tracks/0PAcdVzhPO4g... | https://api.spotify.com/v1/audio-analysis/0PAc... | 177388 | 4 |
columns_to_drop = ['analysis_url','track_href','type','uri']
df_audio_features.drop(columns_to_drop, axis=1,inplace=True)
df_audio_features.rename(columns={'id': 'track_id'}, inplace=True)
df_audio_features.shape
(998, 14)
# merge both dataframes
# the 'inner' method will make sure that we only keep track IDs present in both datasets
df = pd.merge(df_tracks,df_audio_features,on='track_id',how='inner')
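# NOTE: this prints the shape of df_audio_features (before the merge); use df.shape to see the merged dataframe
print("Shape of the dataset:", df_audio_features.shape)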
df.head()
Shape of the dataset: (998, 14)
| | artist_name | track_name | track_id | popularity | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | PinkPantheress | Boy's a liar Pt. 2 | 6AQbmUe0Qwf5PZnt4HmTXv | 97 | 0.696 | 0.809 | 5 | -8.254 | 1 | 0.0500 | 0.2520 | 0.000128 | 0.2480 | 0.857 | 132.962 | 131013 | 4 |
1 | Miley Cyrus | Flowers | 0yLdNVWF3Srea0uzk55zFn | 100 | 0.707 | 0.681 | 0 | -4.325 | 1 | 0.0668 | 0.0632 | 0.000005 | 0.0322 | 0.646 | 117.999 | 200455 | 4 |
2 | Morgan Wallen | Last Night | 59uQI0PADDKeE6UZDTJEe8 | 89 | 0.517 | 0.675 | 6 | -5.382 | 1 | 0.0357 | 0.4590 | 0.000000 | 0.1510 | 0.518 | 203.853 | 163855 | 4 |
3 | Morgan Wallen | Last Night | 7K3BhSpAxZBznislvUMVtn | 88 | 0.492 | 0.675 | 6 | -5.456 | 1 | 0.0389 | 0.4670 | 0.000000 | 0.1420 | 0.478 | 203.759 | 163855 | 4 |
4 | Morgan Wallen | Thinkin’ Bout Me | 0PAcdVzhPO4gq1Iym9ESnK | 86 | 0.656 | 0.757 | 3 | -5.775 | 0 | 0.0308 | 0.4920 | 0.000000 | 0.1170 | 0.429 | 139.971 | 177388 | 4 |
# Save the modified DataFrame to a CSV file
df.to_csv('spotifyfeatures2023v04.csv', index=False)
report = ydata_profiling.ProfileReport(df, title="Pandas Profiling Report")
report