Global YouTube Statistics Analysis 2023

In this notebook, we will analyze the Global YouTube Statistics 2023 and draw conclusions based on the dataset.

from matplotlib.ticker import ScalarFormatter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import pandas as pd

Data preprocessing

df = pd.read_csv('/kaggle/input/global-youtube-statistics-2023/Global YouTube Statistics.csv',encoding = 'latin-1', index_col=0)
df.sample(7)

	Youtuber	subscribers	video views	category	Title	uploads	Country	Abbreviation	channel_type	video_views_rank	…	subscribers_for_last_30_days	created_year	created_month	created_date	Gross tertiary education enrollment (%)	Population	Unemployment rate	Urban_population	Latitude	Longitude
rank
781	Zee Bangla	14200000	1.142879e+10	Entertainment	Zee Bangla	132398	India	IN	Entertainment	352.0	…	200000.0	2008.0	Feb	26.0	28.1	1.366418e+09	5.36	471031528.0	20.593684	78.962880
752	Lilly Singh	14500000	3.517662e+09	Comedy	Lilly Singh	1064	Canada	CA	Comedy	2297.0	…	NaN	2010.0	Oct	29.0	68.9	3.699198e+07	5.56	30628482.0	56.130366	-106.346771
545	Doggy Doggy Cartoons	16800000	6.518419e+09	Entertainment	Doggy Doggy Cartoons	0	NaN	NaN	NaN	4057944.0	…	NaN	2018.0	Nov	11.0	NaN	NaN	NaN	NaN	NaN	NaN
57	HAR PAL GEO	44600000	4.113905e+10	Entertainment	HAR PAL GEO	100755	Pakistan	PK	Entertainment	20.0	…	1300000.0	2008.0	Jan	2.0	9.0	2.165653e+08	4.45	79927762.0	30.375321	69.345116
147	Dream	31700000	2.930015e+09	Gaming	Dream	116	United States	US	Games	2986.0	…	200000.0	2014.0	Feb	8.0	88.2	3.282395e+08	14.70	270663028.0	37.090240	-95.712891
187	Shemaroo Movies	28200000	7.600741e+09	Entertainment	Shemaroo Movies	3009	India	IN	Film	721.0	…	500000.0	2011.0	Mar	1.0	28.1	1.366418e+09	5.36	471031528.0	20.593684	78.962880
263	KSI	24100000	6.002167e+09	Entertainment	KSI	1252	United Kingdom	GB	Music	1053.0	…	NaN	2009.0	Jul	25.0	60.0	6.683440e+07	3.85	55908316.0	55.378051	-3.435973

Note that the dataset contains many columns:

df.columns.tolist()

['Youtuber',
 'subscribers',
 'video views',
 'category',
 'Title',
 'uploads',
 'Country',
 'Abbreviation',
 'channel_type',
 'video_views_rank',
 'country_rank',
 'channel_type_rank',
 'video_views_for_the_last_30_days',
 'lowest_monthly_earnings',
 'highest_monthly_earnings',
 'lowest_yearly_earnings',
 'highest_yearly_earnings',
 'subscribers_for_last_30_days',
 'created_year',
 'created_month',
 'created_date',
 'Gross tertiary education enrollment (%)',
 'Population',
 'Unemployment rate',
 'Urban_population',
 'Latitude',
 'Longitude']

Our focus should be on extracting the essential columns while omitting finer details like Latitude, Longitude, Urban_population, and others.

Shortening column names contributes to a more streamlined data handling process.

columns_to_drop = [
    'Abbreviation', 'created_month', 'created_date', 'Gross tertiary education enrollment (%)',
    'Unemployment rate', 'Urban_population', 'Latitude', 'category', 'lowest_yearly_earnings',
    'Longitude', 'video_views_for_the_last_30_days', 'lowest_monthly_earnings', 'highest_yearly_earnings',
    'highest_monthly_earnings', 'Population', 'country_rank', 'channel_type_rank'
]

df['Average_yearly_earnings'] = (df['lowest_yearly_earnings']+df['highest_yearly_earnings'])/2

df.drop(columns=columns_to_drop, inplace=True)

# 'Youtuber', 'Title', and 'Country' already have the desired names,
# so only the remaining columns need an entry.
new_column_names = {
    'subscribers': 'Subs',
    'video views': 'Views',
    'uploads': 'Uploads',
    'channel_type': 'Type',
    'video_views_rank': 'Views_Rank',
    'subscribers_for_last_30_days': 'Subs_Last_30Days',
    'created_year': 'Created_Year'
}

df.rename(columns=new_column_names, inplace=True)
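One thing to keep in mind with this step: `DataFrame.rename` ignores mapping keys that do not match an existing column, so a misspelled or differently cased key does nothing and raises no warning. Below is a minimal sketch on a made-up two-column frame (the `toy` data is an assumption for illustration, not part of the dataset) showing how `errors="raise"` surfaces such keys:

```python
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"Youtuber": ["A", "B"], "subscribers": [10, 20]})

# By default, a key that matches no column ("youtuber") is silently ignored.
renamed = toy.rename(columns={"youtuber": "Name", "subscribers": "Subs"})
print(renamed.columns.tolist())

# With errors="raise", pandas raises a KeyError for unknown keys instead.
try:
    toy.rename(columns={"youtuber": "Name"}, errors="raise")
    unknown_key_detected = False
except KeyError:
    unknown_key_detected = True
print(unknown_key_detected)
```

Passing `errors="raise"` is a cheap way to confirm that every key in a rename mapping actually matched a column.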

It looks way better than before. Although we shortened column names, we are still able to recognize each meaning easily.

Missing values

Now, we must ensure data integrity and reliability. It is vital for accurate analyses and conclusions.

def check_missing_values(column):
    nan_percentage = df[column].isnull().sum() / df[column].size
    print(f'"{column}" column consists of {nan_percentage:.2%} missing values.')

for column in df.columns:
    check_missing_values(column)

"Youtuber" column consists of 0.00% missing values.
"Subs" column consists of 0.00% missing values.
"Views" column consists of 0.00% missing values.
"Title" column consists of 0.00% missing values.
"Uploads" column consists of 0.00% missing values.
"Country" column consists of 12.26% missing values.
"Type" column consists of 3.02% missing values.
"Views_Rank" column consists of 0.10% missing values.
"Subs_Last_30Days" column consists of 33.87% missing values.
"Created_Year" column consists of 0.50% missing values.
"Average_yearly_earnings" column consists of 0.00% missing values.

Most columns in the dataset are complete or nearly so. The notable exceptions are “Country” (12.26% missing) and “Subs_Last_30Days” (33.87% missing), while “Type” is missing 3.02% of its values.

Keeping an eye on these missing value patterns will help our future analysis.
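The same per-column percentages can also be computed without a loop. The sketch below uses a small made-up frame (the `toy` values are assumptions) and relies on the fact that the mean of a boolean column equals the share of `True` values:

```python
import pandas as pd

# Toy frame with some missing countries (hypothetical values).
toy = pd.DataFrame({
    "Country": ["India", None, "Brazil", None],
    "Subs": [100, 200, 300, 400],
})

# isna() yields a boolean frame; its column means are the missing shares.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share.map("{:.2%}".format))
```

Sorting in descending order puts the columns that need the most attention first.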

Data types

Let’s look at a sample row to gain a better understanding of the data types.

formatted_data = []
# Make the name column wide enough for the longest column name
column_name_width = max(len(name) for name in df.columns) + 2
column_value_width = 25

for column_name, column_value in df.loc[1].items():
    column_dtype = df[column_name].dtype
    formatted_data.append(f"{column_name.ljust(column_name_width)}{str(column_value).ljust(column_value_width)}{column_dtype}")

sample_output = "\n".join(formatted_data)
print(sample_output)

Youtuber                 T-Series                 object
Subs                     245000000                int64
Views                    228000000000.0           float64
Title                    T-Series                 object
Uploads                  20082                    int64
Country                  India                    object
Type                     Music                    object
Views_Rank               1.0                      float64
Subs_Last_30Days         2000000.0                float64
Created_Year             2006.0                   float64
Average_yearly_earnings  57600000.0               float64

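The data types are already mostly appropriate for the given columns. Two details are worth changing: “Created_Year” is stored as float64 because it contains missing values, so we fill those with the median year and convert the column to integers, and missing channel types are labeled 'Unknown'.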

median_created_year = df['Created_Year'].median()

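# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['Created_Year'] = df['Created_Year'].fillna(median_created_year)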
df['Type'] = df['Type'].fillna('Unknown')

df['Created_Year'] = df['Created_Year'].astype(int)
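To illustrate why the fill has to happen before the cast, here is a small sketch on a made-up series (the `years` values are assumptions). It also shows pandas' nullable `Int64` dtype as one possible alternative that keeps missing years explicit instead of imputing them:

```python
import pandas as pd

# A year column with one missing value (hypothetical values).
years = pd.Series([2006.0, None, 2011.0, 2014.0], name="Created_Year")
print(years.dtype)  # float, because NaN cannot be stored in a plain integer column

# Approach used above: impute the median, then cast to a plain integer dtype.
filled = years.fillna(years.median()).astype(int)
print(filled.tolist())

# Alternative: the nullable integer dtype keeps the gap as <NA>.
nullable = years.astype("Int64")
print(nullable.isna().sum())
```

Which option fits better depends on whether later steps can handle missing values; the histogram further down works with either.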

Data Analysis

Top Youtubers

Who are the ten most subscribed YouTubers? Let’s visualize this information with a WordCloud!

wordcloud = WordCloud(width=600, height=400, background_color='white').generate(' '.join(df['Youtuber'][:10]))

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

It’s evident that T-Series and YouTube Movies have taken the lead. MrBeast stands out for creative content and philanthropic endeavors, while the list also features children-oriented channels like Cocomelon, Kids Diana Show, Like Nastya, and Vlad and Niki, catering to a diverse range of audiences.

Top Countries

Let’s find out which country has the most YouTubers.

country_counts = df['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Youtuber_Count']

fig = px.choropleth(
    country_counts,
    locations='Country',
    locationmode='country names',
    color='Youtuber_Count', 
    hover_name='Country',
    title='Number of YouTubers by Country',
    color_continuous_scale='Viridis',
)

fig.show()

[Figure: choropleth map "Number of YouTubers by Country"; color scale: Youtuber_Count, 50 to 300]

It looks like the country with the most YouTubers is the United States, having 313 YouTubers. India comes second with 168 YouTubers, and Brazil ranks third with 62 YouTubers.
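Keep in mind that `value_counts` skips missing values by default, so the channels without a country (12.26% of the dataset, per the missing-value check) do not appear on the map. A minimal sketch on made-up data (the `toy` frame is an assumption) of how to make that gap visible:

```python
import pandas as pd

# Toy country column with two missing entries (hypothetical values).
toy = pd.DataFrame(
    {"Country": ["United States", "India", None, "United States", None]}
)

counted = toy["Country"].value_counts()                   # NaN excluded by default
with_missing = toy["Country"].value_counts(dropna=False)  # NaN gets its own row

n_unmapped = int(toy["Country"].isna().sum())
print(counted.to_dict())
print(f"{n_unmapped} of {len(toy)} channels have no country and would not be drawn")
```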

Channel creation year

What are the creation years of the most subscribed YouTube channels? Let’s check it out!

Note: the data may contain an outlier, as discussed here: https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023/discussion/433640

plt.figure(figsize=(10, 6))

sns.histplot(data=df, x='Created_Year', bins=50, kde=True)

plt.xlabel('Year', fontsize=12)
plt.ylabel('No. created channels', fontsize=12)
plt.title('Distribution of created channels', fontsize=14)

plt.xticks(range(2005, 2023, 1))

plt.xlim(2005, 2022)

plt.show()

It is evident that:

Channels created in earlier years have had more time to gather subscribers, so more of the top channels date from years such as 2006, 2011, and 2014 than from later years like 2020, 2021, and 2022.
With time, YouTube’s platform has become more saturated with creators, intensifying competition for subscribers and views, while a consistent decline since 2015 in the number of top channels created adds to the challenge of becoming the most subscribed channel.
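The counts behind such a histogram can also be tabulated directly, which makes it easier to compare individual years. The sketch below uses made-up years (the `toy` values, including the implausible 1970 entry, are assumptions) and drops anything before 2005, the year YouTube was founded, as one simple way to handle an outlier:

```python
import pandas as pd

# Toy creation years, including one implausible value (hypothetical data).
toy = pd.DataFrame({"Created_Year": [1970, 2006, 2006, 2011, 2014, 2014, 2021]})

YOUTUBE_LAUNCH_YEAR = 2005  # YouTube was founded in 2005

# Keep only years in which a YouTube channel could have been created.
plausible = toy[toy["Created_Year"] >= YOUTUBE_LAUNCH_YEAR]

# Number of channels per creation year, in chronological order.
channels_per_year = plausible["Created_Year"].value_counts().sort_index()
print(channels_per_year.to_dict())
```

Filtering the rows is a different choice from the `plt.xlim` call above, which only hides the outlier from view while leaving it in the data.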

Top YouTube Channel Types

Let’s find out the top trending channel types on YouTube!

categories_counts = df['Type'].value_counts()

plt.figure(figsize=(8, 6))

plt.pie(categories_counts.head(10), labels=None, autopct='%1.1f%%', startangle=140)

plt.legend(categories_counts.head(10).index, loc='upper right')
plt.title('Top 10 YouTube Channel Types', fontsize=14)
plt.axis('equal') 

plt.show()

We can see that:

Entertainment and Music are the most common types among the top channels.
People and Games also account for a significant number of top channels.
Tech has the lowest count among the ten types shown, suggesting that there are fewer top YouTube channels focused on technology-related content.
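When reading the pie chart, note that its percentages are relative to the ten types plotted, not to all channels. A minimal sketch on made-up data (the `types` series and `top_n = 2` are assumptions chosen to keep the example small) comparing the two ways of computing shares:

```python
import pandas as pd

# Toy channel types (hypothetical data).
types = pd.Series(["Entertainment"] * 5 + ["Music"] * 3 + ["Games"] + ["Tech"])
counts = types.value_counts()

top_n = 2  # the post uses 10; 2 keeps the toy example small
top = counts.head(top_n)
other = counts.iloc[top_n:].sum()  # everything outside the top N

share_of_top_only = top / top.sum()  # what autopct shows for a top-N pie
share_of_all = top / counts.sum()    # share among all channels

print(share_of_top_only.round(3).to_dict())
print(share_of_all.round(3).to_dict(), "Other:", round(other / counts.sum(), 3))
```

Adding an explicit "Other" slice is one way to make a top-N pie chart add up to the whole dataset.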

Average Yearly Earnings

Who are the top earners among YouTube channels on a yearly basis?

top_10_earners = df.sort_values(by="Average_yearly_earnings", ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.bar(top_10_earners["Youtuber"], top_10_earners["Average_yearly_earnings"], color='gold', alpha=0.7)
plt.title("Top Earners Among YouTube Channels (Yearly Basis)", fontsize=14)
plt.ylabel("Average Yearly Earnings (USD)", fontsize=12)

plt.xticks(rotation=45)
plt.tight_layout()

plt.show()

It’s pretty amazing that the top earner is estimated to make around $87 million a year. That’s seriously impressive!
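Remember that "Average_yearly_earnings" is the midpoint of an estimated range, so a channel with a very wide range can rank highly even though its real earnings are uncertain. A minimal sketch on made-up numbers (the `toy` frame is an assumption) of the midpoint calculation and the ranking:

```python
import pandas as pd

# Toy earnings ranges (hypothetical values).
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C"],
    "lowest_yearly_earnings": [100.0, 0.0, 40.0],
    "highest_yearly_earnings": [300.0, 0.0, 1000.0],
})

# Midpoint of the estimated range, as computed during preprocessing.
toy["Average_yearly_earnings"] = (
    toy["lowest_yearly_earnings"] + toy["highest_yearly_earnings"]
) / 2

# nlargest is a compact alternative to sort_values(...).head(n).
top = toy.nlargest(2, "Average_yearly_earnings")
print(top[["Youtuber", "Average_yearly_earnings"]])
```

In this toy data, channel C ranks first mainly because of its wide range, which is the kind of case worth double-checking in the real chart.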

Earnings & Uploads

Are earnings connected in some way to the number of uploads?

correlation = df["Average_yearly_earnings"].corr(df["Uploads"])

plt.figure(figsize=(10, 6))

sns.scatterplot(data=df, x="Uploads", y="Average_yearly_earnings")

plt.title("Correlation between Uploads and Earnings", fontsize=14)
plt.xlabel("Number of Uploads", fontsize=12)
plt.ylabel("Average Yearly Earnings (USD)", fontsize=12)

plt.xlim(-2500, df["Uploads"].max()+5000)
plt.ylim(0, df["Average_yearly_earnings"].max())

plt.show()

print(f'Correlation is : {correlation:.2f}')

Correlation is : 0.17

The correlation of 0.17 suggests that there’s a mild tendency for higher earnings when there are more uploads, but the connection is not strong.

The graph shows that channels with relatively few uploads can still earn a lot. This suggests that factors other than upload volume, such as video quality and how well the content holds the audience’s attention, matter more for earnings than how many videos you upload.
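The value printed above is the Pearson coefficient, the default for `Series.corr`, which measures linear association and can be pulled around by a few extreme channels. As a hedged complement, the sketch below computes a rank-based (Spearman) correlation as Pearson on ranks; the `uploads` and `earnings` values are made up for illustration:

```python
import pandas as pd

# Toy data: earnings rise with uploads, but not linearly (hypothetical values).
uploads = pd.Series([1, 2, 3, 4, 5])
earnings = pd.Series([1, 4, 9, 16, 100])

# Pearson: strength of the linear relationship (pandas default).
pearson = uploads.corr(earnings)

# Spearman: Pearson correlation of the ranks, which captures any
# monotonic relationship and is less sensitive to extreme values.
spearman = uploads.rank().corr(earnings.rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

If the two coefficients differ noticeably on the real data, that would hint that outliers or a non-linear pattern are shaping the Pearson value.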

Biggest Monthly Subs Increase

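# Treat missing values as zero growth; assign back instead of using inplace=True on a column
df['Subs_Last_30Days'] = df['Subs_Last_30Days'].fillna(0)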

top_10_df = df.sort_values(by='Subs_Last_30Days', ascending=False).head(10)

plt.figure(figsize=(10, 6))

plt.bar(top_10_df['Youtuber'], top_10_df['Subs_Last_30Days'], color='r')

plt.ylabel('Subscribers Gained in Last 30 Days', fontsize=12)
plt.title('Top 10 YouTubers with Biggest Monthly Subs Increase', fontsize=14)

plt.xticks(rotation=45, ha='right')
plt.tight_layout()

plt.show()

These top 10 YouTubers likely gained so many new subscribers because they offer unique and trending content that appeals to a wide audience. Channels like MrBeast and Jess No Limit use attention-grabbing stunts, while lesser-known ones like DaFuq!?Boom! and BeatboxJCOP found their niche. Their creativity and engaging strategies probably played a big role in their rapid growth.
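Since about a third of the "Subs_Last_30Days" values are missing (33.87%), filling them with 0 treats unknown growth as no growth. That does not change a top-10 ranking, but it would distort averages. A minimal sketch on made-up data (the `toy` frame is an assumption) that ranks only the channels with a known value:

```python
import pandas as pd

# Toy growth figures with one unknown value (hypothetical data).
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D"],
    "Subs_Last_30Days": [300000.0, None, 100000.0, 0.0],
})

# Keep unknown growth separate from a genuine zero.
known = toy.dropna(subset=["Subs_Last_30Days"])
n_unknown = int(toy["Subs_Last_30Days"].isna().sum())

top = known.sort_values("Subs_Last_30Days", ascending=False).head(2)
print(top["Youtuber"].tolist())
print(f"{n_unknown} channel(s) excluded because growth is unknown")
```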

Conclusions

T-Series and YouTube Movies are prominent leaders, while MrBeast shines for creativity and philanthropy. Children-focused channels like Cocomelon, Kids Diana Show, Like Nastya, and Vlad and Niki cater to diverse audiences.
The United States holds the highest number of YouTubers (313), followed by India and Brazil. Channels established earlier benefit from accumulated time, resulting in more subscribers.
YouTube’s increasing saturation creates fierce competition for views and subscribers. The declining number of top channels created since 2015 adds to the challenge of gaining a significant following.
Entertainment and Music categories dominate top channels, with significant presence from People and Games. Tech channels are fewer, indicating less focus on technology content.
The top earner is estimated to make around $87 million a year. The channels with the biggest monthly subscriber gains rely on unique and trending content: MrBeast, Jess No Limit, and others used attention-grabbing strategies, while niche channels found success through creativity.
Video quality and audience engagement have a stronger impact on earnings than mere upload frequency, reinforcing the importance of captivating content for sustained growth.

Published by Writer1 on August 23, 2023
