Part 2: Python What Makes a Hit? Analyzing Spotify’s Top Songs of 2023

Hi there, if you missed part 1 where I dove into SQL insights, check it out here.

In part 2 it’s all about Python and digging deep to pull insights.

Purpose: Prepare and clean the dataset for deeper analysis and visualization, then explore what makes a hit song across platforms like Spotify, Apple, and Shazam.

Key Tasks:

• Handle missing values and normalize column names

• Convert columns to correct datatypes

• Engineer new features such as release_date

• Explore relationships between features(e.g., valence, danceability, energy) and the number of streams.

Skills & Tools

Pandas, NumPy, Matplotlib/Seaborn, feature engineering, data wrangling, exploratory data analysis

At first glance there are several columns, artists, and tracks. I noticed there are charts for Spotify, Apple, etc. so I want to better understand what makes these hits.

Data Cleaning & Preparation

import pandas as pd

spotify = pd.read_csv(“spotify-2023.csv”, encoding=”ISO-8859-1″)
spotify

First goal is to take a look at the dataset and see what pops out. The various playlists for Spotify, Apple, and others stands out. Also, take a look at the track names, artists, and release years. I wonder if danceability, energy, and other features have an impact on whether or not a song is in the playlist?

Let’s dive deeper!

Using the describe method for mean, max, min, standard deviation, etc.

spotify.describe()

standard describe summary

summary = spotify.describe().T # Transpose so columns are rows

helps spot outliers early on. More readability for count, mean, min, etc.

Adding extra info: range (max – min)

summary[‘range’] = summary[‘max’] – summary[‘min’]
summary

Identify null values

spotify.isnull().sum()

If there are missing values, I’m changing it to 0 meaning it doesn’t show up in the chart.

spotify[‘in_shazam_charts’] = spotify[‘in_shazam_charts’].fillna(0)

Checking the values of the key column before changing the missing values

spotify[‘key’].unique()

array(['B', 'C#', 'F', 'A', 'D', 'F#', nan, 'G#', 'G', 'E', 'A#', 'D#'],
dtype=object)

Verifying that missing values are gone

spotify['key'].unique()

array(['B', 'C#', 'F', 'A', 'D', 'F#', 'G#', 'G', 'E', 'A#', 'D#'],
dtype=object)

Checking the data types. If I were doing machine learning, I'd convert the features used to numerical values.

Combining year, month, and day into one data time column.

spotify[‘release_date’] = pd.to_datetime(spotify.rename(columns={‘released_year’: ‘year’, ‘released_month’: ‘month’, ‘released_day’: ‘day’})[[‘year’, ‘month’, ‘day’]])

spotify.dtypes

spotify[‘streams’] = pd.to_numeric(spotify[‘streams’], errors=’coerce’)

Checking the unique values in the mode column

spotify[‘mode’].unique()

array(['Major', 'Minor'], dtype=object)

spotify['in_deezer_playlists'] = spotify['in_deezer_playlists'].str.replace(',', '', regex=True).astype(int)

spotify['in_shazam_charts'] = spotify['in_shazam_charts'].str.replace(',', '', regex=True)

# filling missing values with 0

spotify['in_shazam_charts'] = spotify['in_shazam_charts'].fillna('0').astype(int)

For readability I'm changing the column name from artist(s)_name to artist_name.

spotify = spotify.rename(columns={'artist(s)_name': 'artist_name'})

spotify['artist_name']

It’s important to check the dataset throughout to ensure accuracy.

I then checked for any duplicates, there are none.

spotify.duplicated().sum()

0

I then checked the spelling of the artists’ names.

spotify[‘artist_name’].head(10)

Exploratory Data Analysis

After some data cleaning it’s time for exploratory data analysis to check for correlations and strong features.

import matplotlib.pyplot as plt

Plot histogram

spotify[‘streams’].plot(kind=’hist’, bins=20)
plt.title(‘Distribution of Song Streams’)
plt.xlabel(‘Number of Streams’)
plt.ylabel(‘Count of Songs’)
plt.show()

Notice how it’s right skewed. This means that the mean is greater than the median.

Plot histogram

spotify[‘energy_%’].plot(kind=’hist’, bins=20)
plt.title(‘Energy Level of the Song’)
plt.xlabel(‘Energy Level’)
plt.ylabel(‘Count of Songs’)
plt.show()

Some skew to the left. The mean is less than the median.

Plot histogram

spotify[‘danceability_%’].plot(kind=’hist’, bins=20)
plt.title(‘Danceability of the Song’)
plt.xlabel(‘Danceability Level’)
plt.ylabel(‘Count of Songs’)
plt.show()

Danceability is more evenly spread across tracks.

import matplotlib.pyplot as plt

List of columns I want to plot

numeric_features = [
‘danceability_%’, ‘energy_%’, ‘valence_%’,
‘acousticness_%’, ‘instrumentalness_%’,
‘liveness_%’, ‘speechiness_%’, ‘bpm’, ‘streams’
]

Looping through each feature and plot

for feature in numeric_features:
plt.figure(figsize=(6,4))
spotify[feature].plot(kind=’hist’, bins=20, edgecolor=’black’) # more bins for detail
plt.title(f'{feature} Distribution’)
plt.xlabel(feature)
plt.ylabel(‘Frequency’)
plt.show()

This gave me various plots for danceability, energy, acousticeness, and more.

I then created a heat map to check for any correlations.

The results:

– Energy vs Danceability = 0.14 which is a Slight positive correlation

– Danceability vs Valence = 0.13 which is a Slight positive correlation

– Stream vs Danceability = -0.11 which is a Weak negative correlation

import seaborn as sns
import matplotlib.pyplot as plt

Select numerical columns for correlation

numerical_cols = [‘streams’, ‘energy_%’, ‘danceability_%’, ‘valence_%’]

Calculate correlation matrix

corr_matrix = spotify[numerical_cols].corr()

Plot heatmap

sns.heatmap(corr_matrix, annot=True, cmap=’coolwarm’)
plt.title(‘Correlation Heatmap of Song Features’)
plt.show()

Interesting that some songs are from the 30s and 40s! Only one song is from prior to 1940. Notice how Spotify has more songs on playlists than Apple.

avg_playlist_per_release_date = spotify.groupby(‘release_date’)[[‘in_spotify_playlists’, ‘in_apple_playlists’]].mean()
avg_playlist_per_release_date

Let’s dig into song/songs from the 30s. Interesting, there is one. It’s in 323 playlists in Spotify!

spotify[spotify[‘released_year’] < 1940]

The most popular release month is January

The least popular release month is August

Highest_Month = spotify.groupby(‘released_month’)[‘track_name’].count().sort_values(ascending=False)

Lowest_Month = spotify.groupby(‘released_month’)[‘track_name’].count().sort_values(ascending=True)

print(‘Highest Month’)
print(Highest_Month)
print(‘Lowest Month’)
print(Lowest_Month)

The most popular day is the first of the month.

Highest_Day = spotify.groupby(‘released_day’)[‘track_name’].count().sort_values(ascending=False).head(5)
Highest_Day

The least popular day is the 26th of each month.

Lowest_day = spotify.groupby(‘released_day’)[‘track_name’].count().sort_values(ascending=True).head(5)
Lowest_day

Next Steps

I’ll have part 3 which is all about Tableau! See the data come to life in an engaging format. I’ll have it linked here when it’s available.

Decide whether or not to do machine learning. If machine learning is the next step I’d encode categorical variable (make them numerical such as key, mode, etc).

Explore time-based playlist trends in more detail.

Build predictive models to determine which features (danceability, valence, playlists, etc). best explain high stream counts.

Lindsay