In this notebook, we will analyze the Global YouTube Statistics 2023 dataset and draw conclusions from it.

from matplotlib.ticker import ScalarFormatter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
import pandas as pd

Data preprocessing

df = pd.read_csv('/kaggle/input/global-youtube-statistics-2023/Global YouTube Statistics.csv',encoding = 'latin-1', index_col=0)
df.sample(7)
rank | Youtuber | subscribers | video views | category | Title | uploads | Country | Abbreviation | channel_type | video_views_rank | ... | subscribers_for_last_30_days | created_year | created_month | created_date | Gross tertiary education enrollment (%) | Population | Unemployment rate | Urban_population | Latitude | Longitude
781 | Zee Bangla | 14200000 | 1.142879e+10 | Entertainment | Zee Bangla | 132398 | India | IN | Entertainment | 352.0 | ... | 200000.0 | 2008.0 | Feb | 26.0 | 28.1 | 1.366418e+09 | 5.36 | 471031528.0 | 20.593684 | 78.962880
752 | Lilly Singh | 14500000 | 3.517662e+09 | Comedy | Lilly Singh | 1064 | Canada | CA | Comedy | 2297.0 | ... | NaN | 2010.0 | Oct | 29.0 | 68.9 | 3.699198e+07 | 5.56 | 30628482.0 | 56.130366 | -106.346771
545 | Doggy Doggy Cartoons | 16800000 | 6.518419e+09 | Entertainment | Doggy Doggy Cartoons | 0 | NaN | NaN | NaN | 4057944.0 | ... | NaN | 2018.0 | Nov | 11.0 | NaN | NaN | NaN | NaN | NaN | NaN
57 | HAR PAL GEO | 44600000 | 4.113905e+10 | Entertainment | HAR PAL GEO | 100755 | Pakistan | PK | Entertainment | 20.0 | ... | 1300000.0 | 2008.0 | Jan | 2.0 | 9.0 | 2.165653e+08 | 4.45 | 79927762.0 | 30.375321 | 69.345116
147 | Dream | 31700000 | 2.930015e+09 | Gaming | Dream | 116 | United States | US | Games | 2986.0 | ... | 200000.0 | 2014.0 | Feb | 8.0 | 88.2 | 3.282395e+08 | 14.70 | 270663028.0 | 37.090240 | -95.712891
187 | Shemaroo Movies | 28200000 | 7.600741e+09 | Entertainment | Shemaroo Movies | 3009 | India | IN | Film | 721.0 | ... | 500000.0 | 2011.0 | Mar | 1.0 | 28.1 | 1.366418e+09 | 5.36 | 471031528.0 | 20.593684 | 78.962880
263 | KSI | 24100000 | 6.002167e+09 | Entertainment | KSI | 1252 | United Kingdom | GB | Music | 1053.0 | ... | NaN | 2009.0 | Jul | 25.0 | 60.0 | 6.683440e+07 | 3.85 | 55908316.0 | 55.378051 | -3.435973

Note that the dataset contains many columns:

df.columns.tolist()
['Youtuber',
 'subscribers',
 'video views',
 'category',
 'Title',
 'uploads',
 'Country',
 'Abbreviation',
 'channel_type',
 'video_views_rank',
 'country_rank',
 'channel_type_rank',
 'video_views_for_the_last_30_days',
 'lowest_monthly_earnings',
 'highest_monthly_earnings',
 'lowest_yearly_earnings',
 'highest_yearly_earnings',
 'subscribers_for_last_30_days',
 'created_year',
 'created_month',
 'created_date',
 'Gross tertiary education enrollment (%)',
 'Population',
 'Unemployment rate',
 'Urban_population',
 'Latitude',
 'Longitude']

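Our focus should be on the essential columns, so we drop finer details such as Abbreviation, Latitude, Longitude, and the country-level statistics. Before dropping the earnings columns, we combine the lowest and highest yearly estimates into a single Average_yearly_earnings column.

Shortening the remaining column names makes the data easier to handle.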

columns_to_drop = [
    'Abbreviation','created_month', 'created_date','Gross tertiary education enrollment (%)',
    'Unemployment rate', 'Urban_population', 'Latitude', 'category', 'lowest_yearly_earnings',
    'Longitude', 'video_views_for_the_last_30_days', 'lowest_monthly_earnings', 'highest_yearly_earnings',
       'highest_monthly_earnings', 'Population', 'country_rank', 'channel_type_rank'
]

df['Average_yearly_earnings'] = (df['lowest_yearly_earnings']+df['highest_yearly_earnings'])/2

df.drop(columns=columns_to_drop, inplace=True)

# 'Youtuber', 'Title' and 'Country' already have the desired names,
# so only the columns that actually change are listed here.
new_column_names = {
    'subscribers': 'Subs',
    'video views': 'Views',
    'uploads': 'Uploads',
    'channel_type': 'Type',
    'video_views_rank': 'Views_Rank',
    'subscribers_for_last_30_days': 'Subs_Last_30Days',
    'created_year': 'Created_Year'
}

df.rename(columns=new_column_names, inplace=True)
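
A side note on rename: by default, pandas skips mapping keys that do not match any column, so a mistyped key goes unnoticed. The sketch below uses a made-up two-column frame (the values and the lowercase 'youtuber' key are only for illustration) to show the default behavior and the stricter errors="raise" option.

```python
import pandas as pd

# Toy frame with two of the original column names (values are made up).
toy = pd.DataFrame({"Youtuber": ["A", "B"], "video views": [10, 20]})

# 'youtuber' (lowercase) is not a column, so it is skipped without any warning.
mapping = {"youtuber": "Youtuber", "video views": "Views"}
renamed = toy.rename(columns=mapping)
print(renamed.columns.tolist())

# With errors="raise", the same mapping fails fast on the unmatched key.
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Using errors="raise" while developing a notebook is one way to catch a key that silently does nothing.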

The frame is now much leaner. Although we shortened the column names, the meaning of each one is still easy to recognize.

Missing values

Now, we must ensure data integrity and reliability. It is vital for accurate analyses and conclusions.

def check_missing_values(column):
    nan_percentage = df[column].isnull().sum() / df[column].size
    print(f'"{column}" column consists of {nan_percentage:.2%} missing values.')

for column in df.columns:
    check_missing_values(column)
"Youtuber" column consists of 0.00% missing values.
"Subs" column consists of 0.00% missing values.
"Views" column consists of 0.00% missing values.
"Title" column consists of 0.00% missing values.
"Uploads" column consists of 0.00% missing values.
"Country" column consists of 12.26% missing values.
"Type" column consists of 3.02% missing values.
"Views_Rank" column consists of 0.10% missing values.
"Subs_Last_30Days" column consists of 33.87% missing values.
"Created_Year" column consists of 0.50% missing values.
"Average_yearly_earnings" column consists of 0.00% missing values.

Most columns are complete or nearly so. The notable gaps are “Subs_Last_30Days” (33.87% missing) and “Country” (12.26% missing), followed by smaller gaps in “Type” (3.02%), “Created_Year” (0.50%), and “Views_Rank” (0.10%).

Keeping an eye on these missing value patterns will help our future analysis.
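
The same report can also be produced without a loop. This is a minimal sketch on a made-up frame (the toy values are assumptions, not rows from the dataset): isna() marks missing cells, and the column-wise mean of those booleans is the share of missing values.

```python
import pandas as pd

# Toy frame: half of the 'Country' values are missing (values are made up).
toy = pd.DataFrame({
    "Country": ["India", None, "Brazil", None],
    "Subs": [100, 200, 300, 400],
})

# Mean of the boolean mask = fraction of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)

for column, share in missing_share.items():
    print(f'"{column}": {share:.2%} missing')
```

Sorting the result puts the most incomplete columns first, which is handy when a frame has many columns.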

Data types

Let’s see the sample to gain a better understanding of the data types.

formatted_data = []
# Wide enough for the longest column name ('Average_yearly_earnings', 23 characters)
column_name_width = 25
column_value_width = 25

for column_name, column_value in df.loc[1].items():
    column_dtype = df[column_name].dtype
    formatted_data.append(f"{column_name.ljust(column_name_width)}{str(column_value).ljust(column_value_width)}{column_dtype}")

sample_output = "\n".join(formatted_data)
print(sample_output)
Youtuber                 T-Series                 object
Subs                     245000000                int64
Views                    228000000000.0           float64
Title                    T-Series                 object
Uploads                  20082                    int64
Country                  India                    object
Type                     Music                    object
Views_Rank               1.0                      float64
Subs_Last_30Days         2000000.0                float64
Created_Year             2006.0                   float64
Average_yearly_earnings  57600000.0               float64

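The data types are mostly appropriate already. Created_Year is stored as a float only because it contains missing values, so we fill those with the median year and cast the column to int. Missing channel types are labeled 'Unknown'.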

median_created_year = df['Created_Year'].median()

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Created_Year'] = df['Created_Year'].fillna(median_created_year)
df['Type'] = df['Type'].fillna('Unknown')

df['Created_Year'] = df['Created_Year'].astype(int)
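
To see why the year column arrives as a float, here is a small sketch on a made-up series (the three values are assumptions for illustration). It shows the median-fill approach used above next to one possible alternative, the nullable "Int64" dtype, which keeps missing values instead of replacing them.

```python
import pandas as pd

# A missing value forces an otherwise integer column to float64.
years = pd.Series([2006, None, 2014])
print(years.dtype)

# Approach used above: fill with the median, then cast to a plain integer.
filled = years.fillna(years.median()).astype(int)
print(filled.tolist())

# Alternative: the nullable integer dtype keeps the gap as <NA>.
nullable = years.astype("Int64")
print(nullable.isna().sum())
```

Filling with the median is convenient for plotting, but it does invent a creation year for channels where none was recorded; the nullable dtype avoids that at the cost of carrying missing values forward.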

Data Analysis

Top Youtubers

Who are the ten most subscribed YouTubers? Let’s visualize this information with a WordCloud!

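top_10 = df.nlargest(10, 'Subs')
# Weight each channel name by its subscriber count and keep multi-word names intact
frequencies = dict(zip(top_10['Youtuber'], top_10['Subs']))
wordcloud = WordCloud(width=600, height=400, background_color='white').generate_from_frequencies(frequencies)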

plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

It’s evident that T-Series and YouTube Movies have taken the lead. MrBeast stands out for creative content and philanthropic endeavors, while the list also features children-oriented channels like Cocomelon, Kids Diana Show, Like Nastya, and Vlad and Niki, catering to a diverse range of audiences.

Top Countries

Let’s find out which country has the most YouTubers.

country_counts = df['Country'].value_counts().reset_index()
country_counts.columns = ['Country', 'Youtuber_Count']

fig = px.choropleth(
    country_counts,
    locations='Country',
    locationmode='country names',
    color='Youtuber_Count', 
    hover_name='Country',
    title='Number of YouTubers by Country',
    color_continuous_scale='Viridis',
)

fig.show()

[Figure: choropleth map “Number of YouTubers by Country”; color scale: Youtuber_Count]

It looks like the country with the most YouTubers is the United States, having 313 YouTubers. India comes second with 168 YouTubers, and Brazil ranks third with 62 YouTubers.
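
Keep in mind that value_counts() ignores missing values by default, so the channels without a recorded country (about 12% in the check above) do not appear on the map at all. A minimal sketch on a made-up series shows the difference; the country list here is an assumption for illustration only.

```python
import pandas as pd

# Toy series with two missing countries (values are made up).
countries = pd.Series([
    "United States", "India", None, "India",
    "United States", "United States", None,
])

without_missing = countries.value_counts()            # default: NaN is dropped
with_missing = countries.value_counts(dropna=False)   # NaN gets its own row

print(without_missing.sum(), with_missing.sum())
```

Passing dropna=False makes the size of the "unknown" group visible before interpreting the ranking.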

Channel creation year

What are the creation years of the most subscribed YouTube channels? Let’s check it out!

Note: the data may contain an outlier in the creation year; see the discussion at https://www.kaggle.com/datasets/nelgiriyewithana/global-youtube-statistics-2023/discussion/433640

plt.figure(figsize=(10, 6))

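# discrete=True centers one bar on each year instead of splitting the range into 50 arbitrary bins
sns.histplot(data=df, x='Created_Year', discrete=True, kde=True)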

plt.xlabel('Year', fontsize=12)
plt.ylabel('No. created channels', fontsize=12)
plt.title('Distribution of created channels', fontsize=14)

plt.xticks(range(2005, 2023, 1))

plt.xlim(2005, 2022)

plt.show()

The histogram suggests that:

  • Channels created in earlier years had more time to gather subscribers: years such as 2006, 2011, and 2014 contribute more channels to the top list than later years like 2020, 2021, and 2022.
  • With time, YouTube’s platform has become more saturated with creators, intensifying competition for subscribers and views. The number of top channels created per year has declined steadily since about 2015, which illustrates how hard it has become to reach the most subscribed list.
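
Because YouTube launched in 2005, any earlier creation year is worth a second look. The sketch below, on a made-up frame, flags such rows and counts the remaining channels per year; the 1970 value and the YOUTUBE_LAUNCH_YEAR constant are assumptions introduced for illustration, not values taken from the dataset.

```python
import pandas as pd

YOUTUBE_LAUNCH_YEAR = 2005  # assumption used as a plausibility threshold

# Toy frame; the 1970 entry stands in for a suspicious record (made up).
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D"],
    "Created_Year": [2006, 1970, 2014, 2006],
})

# Rows whose creation year predates the platform itself.
suspicious = toy[toy["Created_Year"] < YOUTUBE_LAUNCH_YEAR]
print(suspicious)

# Channels per year, with the suspicious rows excluded.
plausible = toy[toy["Created_Year"] >= YOUTUBE_LAUNCH_YEAR]
per_year = plausible["Created_Year"].value_counts().sort_index()
print(per_year)
```

Filtering the rows is more explicit than hiding them with plt.xlim, because the excluded records stay visible in the notebook.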

Top YouTube Channel Types

Let’s find out the top trending channel types on YouTube!

categories_counts = df['Type'].value_counts()

plt.figure(figsize=(8, 6))

plt.pie(categories_counts.head(10), labels=None, autopct='%1.1f%%', startangle=140)

plt.legend(categories_counts.head(10).index, loc='upper right')
plt.title('Top 10 YouTube Channel Types', fontsize=14)
plt.axis('equal') 

plt.show()

We can see that:

  • Entertainment and Music are the most common channel types among the top channels.
  • People and Games also account for a significant number of top channels.
  • Tech has the lowest count among the ten types shown, suggesting that fewer top YouTube channels focus on technology-related content.
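
One detail about reading the pie chart: because only the ten largest types are plotted, the percentages are shares of that top-ten subset, not of all channels. A minimal sketch with made-up counts (and a top 2 instead of a top 10, to keep it short) shows how the two kinds of share differ.

```python
import pandas as pd

# Toy channel types (made up): 4 Music, 3 Entertainment, 2 Games, 1 Tech.
types = pd.Series(["Music"] * 4 + ["Entertainment"] * 3 + ["Games"] * 2 + ["Tech"])

counts = types.value_counts()
top = counts.head(2)

share_of_top = top / top.sum()     # what a pie chart of `top` displays
share_of_all = top / counts.sum()  # share among all channels

print(share_of_top.round(3).to_dict())
print(share_of_all.round(3).to_dict())
```

When the omitted types are small the two numbers are close, but it is worth stating which one a chart shows.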

Average Yearly Earnings

Who are the top earners among YouTube channels on a yearly basis?

top_10_earners = df.sort_values(by="Average_yearly_earnings", ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.bar(top_10_earners["Youtuber"], top_10_earners["Average_yearly_earnings"], color='gold', alpha=0.7)
plt.title("Top Earners Among YouTube Channels (Yearly Basis)", fontsize=14)
plt.ylabel("Average Yearly Earnings (USD)", fontsize=12)

plt.xticks(rotation=45)
plt.tight_layout()

plt.show()

It’s pretty amazing that the top YouTuber is estimated to earn around $87 million a year (the midpoint of the lowest and highest estimates). That’s seriously impressive!

Earnings & Uploads

Are earnings connected in some way with the number of uploads?

correlation = df["Average_yearly_earnings"].corr(df["Uploads"])

plt.figure(figsize=(10, 6))

sns.scatterplot(data=df, x="Uploads", y="Average_yearly_earnings")

plt.title("Correlation between Uploads and Earnings", fontsize=14)
plt.xlabel("Number of Uploads", fontsize=12)
plt.ylabel("Average Yearly Earnings (USD)", fontsize=12)

plt.xlim(-2500, df["Uploads"].max()+5000)
plt.ylim(0, df["Average_yearly_earnings"].max())

plt.show()

print(f'Correlation is : {correlation:.2f}')
Correlation is : 0.17

The correlation of 0.17 suggests that there’s a mild tendency for higher earnings when there are more uploads, but the connection is not strong.
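
For reference, Series.corr computes the Pearson coefficient by default. The sketch below spells that formula out in plain Python; the pearson helper and the sample lists are illustrative assumptions, not part of the notebook.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of covariance and variances cancel, so sums are enough.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear, increasing
print(pearson([1, 2, 3], [3, 2, 1]))         # perfectly linear, decreasing
print(pearson([1, 2, 3, 4], [1, 3, 2, 4]))   # positive but imperfect
```

The coefficient only measures linear association, and a few extreme channels can move it noticeably, so a value like 0.17 is best read together with the scatter plot.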

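With a correlation this weak, upload count alone explains little of the variation in earnings. Factors that this dataset does not measure, such as video quality and audience engagement, likely matter more: if your videos capture your audience’s attention, you can earn more regardless of how many videos you upload.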

Biggest Monthly Subs Increase

# Assign the result back instead of calling fillna(inplace=True) on a column selection
df['Subs_Last_30Days'] = df['Subs_Last_30Days'].fillna(0)

top_10_df = df.sort_values(by='Subs_Last_30Days', ascending=False).head(10)

plt.figure(figsize=(10, 6))

plt.bar(top_10_df['Youtuber'], top_10_df['Subs_Last_30Days'], color='r')

plt.ylabel('Subscribers Gained in Last 30 Days', fontsize=12)
plt.title('Top 10 YouTubers with Biggest Monthly Subs Increase', fontsize=14)

plt.xticks(rotation=45, ha='right')
plt.tight_layout()

plt.show()

These ten channels gained the most subscribers over the last 30 days, likely because they offer unique and trending content that appeals to a wide audience. Channels like MrBeast and Jess No Limit use attention-grabbing stunts, while lesser-known ones like DaFuq!?Boom! and BeatboxJCOP found their niche. The dataset itself does not explain the growth, but creativity and engaging strategies plausibly played a big role.
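
A note on the ranking step: replacing missing values with 0 records "unknown growth" as "no growth" for every later use of the column. As an alternative sketch, nlargest picks the top rows while leaving the missing entries untouched. The frame below is made up for illustration.

```python
import pandas as pd

# Toy frame: channel B has no recorded 30-day growth (values are made up).
toy = pd.DataFrame({
    "Youtuber": ["A", "B", "C", "D"],
    "Subs_Last_30Days": [300000.0, None, 8000000.0, 100000.0],
})

# Rows with missing growth are not ranked ahead of real values.
top = toy.nlargest(2, "Subs_Last_30Days")
print(top)

# The original column still distinguishes "unknown" from zero.
print(toy["Subs_Last_30Days"].isna().sum())
```

Both approaches give the same top 10 here; the difference only matters if the column is reused later, for example to compute an average growth.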

Conclusions

  • T-Series and YouTube Movies are prominent leaders, while MrBeast shines for creativity and philanthropy. Children-focused channels like Cocomelon, Kids Diana Show, Like Nastya, and Vlad and Niki cater to diverse audiences.
  • The United States holds the highest number of YouTubers (313), followed by India and Brazil. Channels established earlier benefit from accumulated time, resulting in more subscribers.
  • YouTube’s increasing saturation creates fierce competition for views and subscribers. Fewer of today’s top channels were created after 2015, which reflects how hard it has become to gain a significant following.
  • Entertainment and Music categories dominate top channels, with significant presence from People and Games. Tech channels are fewer, indicating less focus on technology content.
  • The top earner makes an estimated $87 million per year. The channels with the biggest monthly subscriber gains, such as MrBeast and Jess No Limit, likely grew through unique and trending content, while niche channels found success through creativity.
  • Upload count correlates only weakly with earnings (0.17), which suggests that video quality and audience engagement matter more than mere upload frequency, reinforcing the importance of captivating content for sustained growth.

