In this notebook, we will analyze the Global YouTube Statistics 2023 and draw conclusions based on the dataset.
from matplotlib.ticker import ScalarFormatter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import pandas as pd
Data preprocessing
df = pd.read_csv('/kaggle/input/global-youtube-statistics-2023/Global YouTube Statistics.csv',encoding = 'latin-1', index_col=0)
df.sample(7)
rank | Youtuber | subscribers | video views | category | Title | uploads | Country | Abbreviation | channel_type | video_views_rank | … | subscribers_for_last_30_days | created_year | created_month | created_date | Gross tertiary education enrollment (%) | Population | Unemployment rate | Urban_population | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
781 | Zee Bangla | 14200000 | 1.142879e+10 | Entertainment | Zee Bangla | 132398 | India | IN | Entertainment | 352.0 | … | 200000.0 | 2008.0 | Feb | 26.0 | 28.1 | 1.366418e+09 | 5.36 | 471031528.0 | 20.593684 | 78.962880 |
752 | Lilly Singh | 14500000 | 3.517662e+09 | Comedy | Lilly Singh | 1064 | Canada | CA | Comedy | 2297.0 | … | NaN | 2010.0 | Oct | 29.0 | 68.9 | 3.699198e+07 | 5.56 | 30628482.0 | 56.130366 | -106.346771 |
545 | Doggy Doggy Cartoons | 16800000 | 6.518419e+09 | Entertainment | Doggy Doggy Cartoons | 0 | NaN | NaN | NaN | 4057944.0 | … | NaN | 2018.0 | Nov | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN |
57 | HAR PAL GEO | 44600000 | 4.113905e+10 | Entertainment | HAR PAL GEO | 100755 | Pakistan | PK | Entertainment | 20.0 | … | 1300000.0 | 2008.0 | Jan | 2.0 | 9.0 | 2.165653e+08 | 4.45 | 79927762.0 | 30.375321 | 69.345116 |
147 | Dream | 31700000 | 2.930015e+09 | Gaming | Dream | 116 | United States | US | Games | 2986.0 | … | 200000.0 | 2014.0 | Feb | 8.0 | 88.2 | 3.282395e+08 | 14.70 | 270663028.0 | 37.090240 | -95.712891 |
187 | Shemaroo Movies | 28200000 | 7.600741e+09 | Entertainment | Shemaroo Movies | 3009 | India | IN | Film | 721.0 | … | 500000.0 | 2011.0 | Mar | 1.0 | 28.1 | 1.366418e+09 | 5.36 | 471031528.0 | 20.593684 | 78.962880 |
263 | KSI | 24100000 | 6.002167e+09 | Entertainment | KSI | 1252 | United Kingdom | GB | Music | 1053.0 | … | NaN | 2009.0 | Jul | 25.0 | 60.0 | 6.683440e+07 | 3.85 | 55908316.0 | 55.378051 | -3.435973 |
Note that the dataset contains many columns:
df.columns.tolist()
['Youtuber', 'subscribers', 'video views', 'category', 'Title', 'uploads', 'Country', 'Abbreviation', 'channel_type', 'video_views_rank', 'country_rank', 'channel_type_rank', 'video_views_for_the_last_30_days', 'lowest_monthly_earnings', 'highest_monthly_earnings', 'lowest_yearly_earnings', 'highest_yearly_earnings', 'subscribers_for_last_30_days', 'created_year', 'created_month', 'created_date', 'Gross tertiary education enrollment (%)', 'Population', 'Unemployment rate', 'Urban_population', 'Latitude', 'Longitude']
Our focus should be on extracting the essential columns while omitting finer details like Abbreviation, Latitude, Longitude, and others.
Shortening column names contributes to a more streamlined data handling process.
columns_to_drop = [
    'Abbreviation', 'created_month', 'created_date', 'Gross tertiary education enrollment (%)',
    'Unemployment rate', 'Urban_population', 'Latitude', 'category', 'lowest_yearly_earnings',
    'Longitude', 'video_views_for_the_last_30_days', 'lowest_monthly_earnings', 'highest_yearly_earnings',
    'highest_monthly_earnings', 'Population', 'country_rank', 'channel_type_rank'
]
df['Average_yearly_earnings'] = (df['lowest_yearly_earnings']+df['highest_yearly_earnings'])/2
df.drop(columns=columns_to_drop, inplace=True)
# 'Youtuber', 'Title', and 'Country' already have the desired names,
# so only the remaining columns need an entry in the mapping.
new_column_names = {
    'subscribers': 'Subs',
    'video views': 'Views',
    'uploads': 'Uploads',
    'channel_type': 'Type',
    'video_views_rank': 'Views_Rank',
    'subscribers_for_last_30_days': 'Subs_Last_30Days',
    'created_year': 'Created_Year'
}
df.rename(columns=new_column_names, inplace=True)
It looks way better than before. Although we shortened column names, we are still able to recognize each meaning easily.
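One caveat with `DataFrame.rename`: keys in the mapping that match no column are skipped silently by default, so a typo or a case mismatch goes unnoticed. A minimal sketch on a toy frame (the data and the `mapping` name are illustrative, not taken from the dataset) shows how `errors='raise'` surfaces such keys:

```python
import pandas as pd

# Toy frame with two of the original column names (values are made up).
toy = pd.DataFrame({"Youtuber": ["A", "B"], "subscribers": [10, 20]})

# 'youtuber' (lowercase) does not exist in the frame.
mapping = {"youtuber": "Youtuber", "subscribers": "Subs"}

# Default behaviour: the unknown key is ignored, the valid one is applied.
renamed = toy.rename(columns=mapping)
print(renamed.columns.tolist())

# With errors='raise', pandas reports the unknown key instead of skipping it.
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Checking the mapping this way once, while building it, is usually enough; the default lenient mode can then be used in the final cell.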
Missing values
Now, we must ensure data integrity and reliability. It is vital for accurate analyses and conclusions.
def check_missing_values(column):
    # Share of missing entries in the given column
    nan_percentage = df[column].isnull().sum() / df[column].size
    print(f'"{column}" column consists of {nan_percentage:.2%} missing values.')

for column in df.columns:
    check_missing_values(column)
"Youtuber" column consists of 0.00% missing values.
"Subs" column consists of 0.00% missing values.
"Views" column consists of 0.00% missing values.
"Title" column consists of 0.00% missing values.
"Uploads" column consists of 0.00% missing values.
"Country" column consists of 12.26% missing values.
"Type" column consists of 3.02% missing values.
"Views_Rank" column consists of 0.10% missing values.
"Subs_Last_30Days" column consists of 33.87% missing values.
"Created_Year" column consists of 0.50% missing values.
"Average_yearly_earnings" column consists of 0.00% missing values.
It appears that most columns in the dataset are filled, except for “Country” (12.26% missing) and “Subs_Last_30Days” (33.87% missing).
Keeping an eye on these missing value patterns will help our future analysis.
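The same per-column shares can also be computed in one vectorized step. This is a sketch on a toy frame (the values are made up), not a replacement for the loop above: `isna()` yields a boolean frame, and the column-wise mean of booleans is the fraction of missing entries.

```python
import pandas as pd

# Toy frame: half of the "Country" entries are missing.
toy = pd.DataFrame({
    "Country": ["India", None, "Brazil", None],
    "Subs": [100, 200, 300, 400],
})

# Mean of a boolean column equals the share of True values, i.e. missing cells.
missing_share = toy.isna().mean()

for column, share in missing_share.items():
    print(f'"{column}" column consists of {share:.2%} missing values.')
```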
Data types
Let’s see the sample to gain a better understanding of the data types.
formatted_data = []
column_name_width = 20
column_value_width = 25
for column_name, column_value in df.loc[1].items():
    column_dtype = df[column_name].dtype
    formatted_data.append(f"{column_name.ljust(column_name_width)}{str(column_value).ljust(column_value_width)}{column_dtype}")
sample_output = "\n".join(formatted_data)
print(sample_output)
Youtuber            T-Series                 object
Subs                245000000                int64
Views               228000000000.0           float64
Title               T-Series                 object
Uploads             20082                    int64
Country             India                    object
Type                Music                    object
Views_Rank          1.0                      float64
Subs_Last_30Days    2000000.0                float64
Created_Year        2006.0                   float64
Average_yearly_earnings57600000.0               float64
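It looks like the data types are already quite appropriate for the given columns. Two details remain: we fill the missing values in “Created_Year” (with the median) and “Type” (with 'Unknown'), then store “Created_Year” as an integer.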
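median_created_year = df['Created_Year'].median()
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and may not apply to df.
df['Created_Year'] = df['Created_Year'].fillna(median_created_year)
df['Type'] = df['Type'].fillna('Unknown')
df['Created_Year'] = df['Created_Year'].astype(int)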
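The order of these steps matters: a float column that still contains NaN cannot be cast to a plain integer dtype. A small sketch on a made-up series (the values are illustrative) shows the failure and the fill-then-cast sequence:

```python
import pandas as pd

# Made-up creation years with one missing entry.
years = pd.Series([2006.0, None, 2012.0, 2015.0], name="Created_Year")

# Casting while a NaN is present fails, because NaN has no integer representation.
try:
    years.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True
print(cast_failed)

# Fill with the median first, then cast.
filled = years.fillna(years.median()).astype(int)
print(filled.tolist())
```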
Data Analysis
Top Youtubers
Who are the ten most subscribed YouTubers? Let’s visualize this information with a WordCloud!
wordcloud = WordCloud(width=600, height=400, background_color='white').generate(' '.join(df['Youtuber'][:10]))
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
It’s evident that T-Series and YouTube Movies have taken the lead. MrBeast stands out for creative content and philanthropic endeavors, while the list also features children-oriented channels like Cocomelon, Kids Diana Show, Like Nastya, and Vlad and Niki, catering to a diverse range of audiences.
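Joining the names with spaces means multi-word channel names are split into separate words in the cloud. One possible alternative, sketched here with a toy frame (the names and counts are placeholders, not dataset values), is to build a name-to-weight dictionary so each channel name stays intact:

```python
import pandas as pd

# Toy stand-in for the top rows; names and subscriber counts are illustrative only.
top = pd.DataFrame({
    "Youtuber": ["Channel One", "Kids Example Show", "Solo"],
    "Subs": [300, 200, 100],
})

# Map each full channel name to a numeric weight (here: subscriber count).
frequencies = {name: int(subs) for name, subs in zip(top["Youtuber"], top["Subs"])}
print(frequencies)
```

A dictionary like this can be passed to `WordCloud.generate_from_frequencies` in place of the joined string given to `generate`, so each name is drawn as one unit sized by its weight.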
Top Countries
Let’s find out which country has the most YouTubers.
country_counts = df['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Youtuber_Count']
fig = px.choropleth(
    country_counts,
    locations='Country',
    locationmode='country names',
    color='Youtuber_Count',
    hover_name='Country',
    title='Number of YouTubers by Country',
    color_continuous_scale='Viridis',
)
fig.show()
[Figure: choropleth map titled “Number of YouTubers by Country”; color scale “Youtuber_Count”, ticks from 50 to 300]
It looks like the country with the most YouTubers is the United States, having 313 YouTubers. India comes second with 168 YouTubers, and Brazil ranks third with 62 YouTubers.
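Keep in mind that `value_counts()` leaves out missing values by default, so the channels without a country (about 12% of the rows, per the missing-value check) never reach the map. A short sketch on made-up data shows the difference `dropna=False` makes:

```python
import pandas as pd

# Made-up country column with two missing entries.
countries = pd.Series(["United States", "India", None, "United States", None])

# Default: missing entries are excluded from the counts.
counted = countries.value_counts()

# dropna=False adds a row for the missing entries, making the gap visible.
counted_with_missing = countries.value_counts(dropna=False)

print(int(counted.sum()), int(counted_with_missing.sum()))
```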
Channel creation year
What are the creation years of the most subscribed YouTube channels? Let’s check it out!
BTW: Possible outlier in the data: https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023/discussion/433640
plt.figure(figsize=(10, 6))
sns.histplot(data=df, x='Created_Year', bins=50, kde=True)
plt.xlabel('Year', fontsize=12)
plt.ylabel('No. created channels', fontsize=12)
plt.title('Distribution of created channels', fontsize=14)
plt.xticks(range(2005, 2023, 1))
plt.xlim(2005, 2022)
plt.show()
It is evident that:
- Channels created in earlier years had more time to gather subscribers, so years such as 2006, 2011, and 2014 contribute more channels to this list of the most subscribed than later years like 2020, 2021, and 2022.
- With time, YouTube’s platform has become more saturated with creators, intensifying competition for subscribers and views; the steady decline since 2015 in the number of channels that made the list shows how hard it has become to join the most subscribed.
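The histogram counts channels per creation year; it does not show subscriber numbers. To look at both side by side, one option is a grouped summary. The sketch below uses a toy frame (values are made up) and assumes the renamed columns `Created_Year` and `Subs`:

```python
import pandas as pd

# Toy data: creation year and subscriber count per channel (illustrative values).
toy = pd.DataFrame({
    "Created_Year": [2006, 2006, 2014, 2014, 2021],
    "Subs": [50, 30, 20, 40, 15],
})

# For each year: how many channels made the list, and their median subscriber count.
per_year = toy.groupby("Created_Year")["Subs"].agg(channels="count", median_subs="median")
print(per_year)
```

Running the same aggregation on `df` would show whether older cohorts really have higher subscriber counts, or are simply more numerous in the list.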
Top YouTube Channel Types
Let’s find out the top trending channel types on YouTube!
categories_counts = df['Type'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(categories_counts.head(10), labels=None, autopct='%1.1f%%', startangle=140)
plt.legend(categories_counts.head(10).index, loc='upper right')
plt.title('Top 10 YouTube Channel Types', fontsize=14)
plt.axis('equal')
plt.show()
We can see that:
- Entertainment and Music seem to be the most popular top channel types, having higher counts.
- People and Games have a significant number of top channels.
- Tech has the lowest count among the provided categories, suggesting that there are fewer top YouTube channels focused on technology-related content.
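Because the pie is drawn from `head(10)` only, its percentages are shares of the top 10 types rather than of all channels. A possible adjustment, sketched on made-up data with an illustrative `top_n`, folds the remaining types into an "Other" slice so the shares refer to the whole dataset:

```python
import pandas as pd

# Made-up channel types: 4 Music, 3 Entertainment, 2 Games, 1 Tech.
types = pd.Series(["Music"] * 4 + ["Entertainment"] * 3 + ["Games"] * 2 + ["Tech"])

counts = types.value_counts()
top_n = 2  # illustrative; the notebook plots the top 10

top = counts.head(top_n)
# Fold everything outside the top N into a single "Other" entry.
with_other = pd.concat([top, pd.Series({"Other": counts.iloc[top_n:].sum()})])

# Shares now add up to 1 over all channels, not just the top N.
shares = with_other / with_other.sum()
print(shares.round(2))
```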
Average Yearly Earnings
Who are the top earners among YouTube channels on a yearly basis?
top_10_earners = df.sort_values(by="Average_yearly_earnings", ascending=False).head(10)
plt.figure(figsize=(10, 6))
plt.bar(top_10_earners["Youtuber"], top_10_earners["Average_yearly_earnings"], color='gold', alpha=0.7)
plt.title("Top Earners Among YouTube Channels (Yearly Basis)", fontsize=14)
plt.ylabel("Average Yearly Earnings (USD)", fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
It’s pretty amazing that the top earner makes around $87 million a year (the midpoint of the lowest and highest yearly estimates). That’s seriously impressive!
Earnings & Uploads
Are earnings connected in some way with the number of uploads?
correlation = df["Average_yearly_earnings"].corr(df["Uploads"])
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x="Uploads", y="Average_yearly_earnings")
plt.title("Correlation between Uploads and Earnings", fontsize=14)
plt.xlabel("Number of Uploads", fontsize=12)
plt.ylabel("Average Yearly Earnings (USD)", fontsize=12)
plt.xlim(-2500, df["Uploads"].max()+5000)
plt.ylim(0, df["Average_yearly_earnings"].max())
plt.show()
print(f'Correlation is : {correlation:.2f}')
Correlation is : 0.17
The correlation of 0.17 suggests that there’s a mild tendency for higher earnings when there are more uploads, but the connection is not strong.
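In other words, uploading more does not by itself mean earning more. The graph shows channels with few uploads among the high earners, which suggests that factors the dataset does not measure, such as video quality and how well the content holds the audience’s attention, matter more than how many videos you upload.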
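The value above is a Pearson correlation, which measures linear association and is sensitive to extreme values. As a complementary check, a rank-based (Spearman) correlation can be computed as the Pearson correlation of the ranks. The sketch below uses made-up numbers, not the dataset:

```python
import pandas as pd

# Made-up data: earnings rise with uploads, but not linearly (one extreme value).
uploads = pd.Series([1, 2, 3, 4, 5])
earnings = pd.Series([1, 4, 9, 16, 100])

# Pearson: strength of the linear relationship.
pearson = uploads.corr(earnings)

# Spearman: Pearson correlation applied to the ranks of both series.
spearman = uploads.rank().corr(earnings.rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If the two values differ a lot on the real columns, the relationship is probably monotonic but not linear, or a few channels dominate the result.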
Biggest Monthly Subs Increase
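# Assign the result back; fillna(inplace=True) on a column selection may not update df in recent pandas.
df['Subs_Last_30Days'] = df['Subs_Last_30Days'].fillna(0)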
top_10_df = df.sort_values(by='Subs_Last_30Days', ascending=False).head(10)
plt.figure(figsize=(10, 6))
plt.bar(top_10_df['Youtuber'], top_10_df['Subs_Last_30Days'], color='r')
plt.ylabel('Monthly Subs Increase', fontsize=12)
plt.title('Top 10 YouTubers with Biggest Monthly Subs Increase', fontsize=14)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
These top 10 YouTubers likely gained so many new subscribers because they offer unique and trending content that appeals to a wide audience. Channels like MrBeast and Jess No Limit use attention-grabbing stunts, while lesser-known ones like DaFuq!?Boom! and BeatboxJCOP found their niche. Their creativity and engaging strategies probably played a big role in their rapid growth.
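As a side note, the ranking itself does not require replacing missing values with 0. A sketch of one alternative, on a toy frame with placeholder channel names, drops the rows without a value and takes the largest entries directly, leaving the original column untouched:

```python
import pandas as pd

# Toy frame with one missing monthly gain (names and numbers are placeholders).
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D"],
    "Subs_Last_30Days": [200000.0, None, 1300000.0, 500000.0],
})

# Drop rows without a value, then keep the two largest gains.
top_2 = toy.dropna(subset=["Subs_Last_30Days"]).nlargest(2, "Subs_Last_30Days")
print(top_2)
```

This keeps "no data" distinguishable from "zero growth" in any later analysis of the column.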
Conclusions
- T-Series and YouTube Movies are prominent leaders, while MrBeast shines for creativity and philanthropy. Children-focused channels like Cocomelon, Kids Diana Show, Like Nastya, and Vlad and Niki cater to diverse audiences.
- The United States holds the highest number of YouTubers (313), followed by India and Brazil. Channels established earlier benefit from accumulated time, resulting in more subscribers.
- YouTube’s increasing saturation creates fierce competition for views and subscribers. Fewer channels created after 2015 appear among the most subscribed, a sign of how hard it has become to gain a significant following.
- Entertainment and Music categories dominate top channels, with significant presence from People and Games. Tech channels are fewer, indicating less focus on technology content.
- The top earner makes around $87 million a year. The channels with the biggest monthly subscriber gains, such as MrBeast and Jess No Limit, likely grew through unique, trending content and attention-grabbing strategies, while niche channels found success through creativity.
- The weak correlation (0.17) between uploads and earnings suggests that upload frequency alone does not drive earnings; video quality and audience engagement likely matter more, reinforcing the importance of captivating content for sustained growth.