Machine Learning Project 7: Machine Learning Engineer Salary in 2024

Introduction

Hello there! Welcome to our blog where we’re delving into the captivating realm of Machine Learning Engineer salaries for 2024. Interested in knowing how much these tech wizards are earning? You’ve come to the perfect place.

Machine Learning is currently one of the most sought-after fields in the tech industry. Whether you’re considering a career in this field or simply curious about the pay scale, we’ve got all the exciting details.

Here’s what we’ll cover:

  • Salary Ranges: We’ll provide a breakdown of the average salaries you can expect.
  • Industry Demand: Discover which sectors are offering top dollar for AI talent.
  • Location, Location, Location: Learn how geography can impact your paycheck.

So grab a cup of coffee and get cozy, because we’re about to unravel everything you need to know about Machine Learning Engineer salaries in 2024. Let’s dive in!


Machine Learning Engineer Salary in 2024 EDA

Dataset Info

Description of the features in the dataset:

  • work_year: The year in which the salary data was collected (e.g., 2024).
  • experience_level: The level of experience of the employee (e.g., MI for Mid-Level).
  • employment_type: The type of employment (e.g., FT for Full-Time).
  • job_title: The title of the job (e.g., Data Scientist).
  • salary: The salary amount.
  • salary_currency: The currency in which the salary is denominated (e.g., USD for US Dollars).
  • salary_in_usd: The salary amount converted to US Dollars.
  • employee_residence: The country of residence of the employee (e.g., AU for Australia).
  • remote_ratio: The ratio indicating the level of remote work (0 for no remote work).
  • company_location: The location of the company (e.g., AU for Australia).
  • company_size: The size of the company (e.g., S for Small).
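
The abbreviated codes become easier to work with once they are expanded into readable labels. Here is a minimal sketch of such lookup tables (illustrative only; they mirror the mappings used in the Data transformation step near the end of this post):

# Illustrative lookup tables for the abbreviated codes described above
experience_levels = {'EN': 'Entry-level', 'MI': 'Mid-level', 'SE': 'Senior-level', 'EX': 'Executive-level'}
employment_types = {'FT': 'Full-time', 'PT': 'Part-time', 'CT': 'Contract', 'FL': 'Freelance'}
company_sizes = {'S': 'Small', 'M': 'Medium', 'L': 'Large'}

print(experience_levels['MI'], '/', employment_types['FT'], '/', company_sizes['S'])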

Import Dependencies

import warnings
warnings.filterwarnings("ignore")

import os
import squarify
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import clear_output
from wordcloud import WordCloud
# Verify input 
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
/kaggle/input/machine-learning-engineer-salary-in-2024/salaries.csv

Load Dataset

Dataset Link: https://www.kaggle.com/code/maskara31/machine-learning-engineer-salary-in-2024-eda

data_salary = pd.read_csv('/kaggle/input/machine-learning-engineer-salary-in-2024/salaries.csv')
# show dataset
data_salary.head()
   work_year experience_level employment_type             job_title  salary salary_currency  salary_in_usd employee_residence  remote_ratio company_location company_size
0       2024               MI              FT        Data Scientist  120000             USD         120000                 AU             0               AU            S
1       2024               MI              FT        Data Scientist   70000             USD          70000                 AU             0               AU            S
2       2024               MI              CT        Data Scientist  130000             USD         130000                 US             0               US            M
3       2024               MI              CT        Data Scientist  110000             USD         110000                 US             0               US            M
4       2024               MI              FT  Data Science Manager  240000             USD         240000                 US             0               US            M
# Verify dataset info

data_salary.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16494 entries, 0 to 16493
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           16494 non-null  int64 
 1   experience_level    16494 non-null  object
 2   employment_type     16494 non-null  object
 3   job_title           16494 non-null  object
 4   salary              16494 non-null  int64 
 5   salary_currency     16494 non-null  object
 6   salary_in_usd       16494 non-null  int64 
 7   employee_residence  16494 non-null  object
 8   remote_ratio        16494 non-null  int64 
 9   company_location    16494 non-null  object
 10  company_size        16494 non-null  object
dtypes: int64(4), object(7)
memory usage: 1.4+ MB

Dataset has 16494 rows and 11 columns

Visualization

Top 10 Job Titles with Highest Salaries

top_job_titles = data_salary.groupby('job_title')['salary_in_usd'].median().sort_values(ascending=False).head(10)

plt.figure(figsize=(10, 6))
top_job_titles.plot(kind='bar', color='blue')
plt.title('Top 10 Job Titles with Highest Salaries')
plt.xlabel('Job Title')
plt.ylabel('Median Salary (USD)')
plt.xticks(rotation=45, ha='right')
plt.show()
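
One caveat with ranking job titles by median salary: titles that appear only a handful of times can dominate the chart. A hedged variation that keeps only titles with a reasonable number of postings might look like this (the threshold of 20 postings is an arbitrary illustration):

# Keep only job titles with at least 20 postings before ranking by median salary
counts = data_salary['job_title'].value_counts()
common_titles = counts[counts >= 20].index
top_common = (data_salary[data_salary['job_title'].isin(common_titles)]
              .groupby('job_title')['salary_in_usd']
              .median()
              .sort_values(ascending=False)
              .head(10))

top_common.plot(kind='bar', color='blue', figsize=(10, 6))
plt.title('Top 10 Job Titles with Highest Median Salaries (min. 20 postings)')
plt.xlabel('Job Title')
plt.ylabel('Median Salary (USD)')
plt.xticks(rotation=45, ha='right')
plt.show()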

Top 15 Job titles by Word Cloud

top_15_job_titles = data_salary['job_title'].value_counts().head(15)

title_counts = dict(top_15_job_titles)

wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(title_counts)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Top 15 Job Titles')
plt.show()

Salaries Distribution

plt.figure(figsize=(10, 6))
sns.histplot(data=data_salary, x='salary_in_usd', kde=True)
plt.title('Salaries Distribution')
plt.xlabel('Salary (USD)')
plt.ylabel('Frequency')
plt.show()
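
Because the salary distribution is strongly right-skewed, a log-scaled x-axis can make the bulk of the distribution easier to read. A quick optional sketch:

# Same histogram on a logarithmic x-axis to spread out the long right tail
plt.figure(figsize=(10, 6))
sns.histplot(data=data_salary, x='salary_in_usd', log_scale=True)
plt.title('Salaries Distribution (log scale)')
plt.xlabel('Salary (USD, log scale)')
plt.ylabel('Frequency')
plt.show()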

Salaries Distribution by company size

plt.figure(figsize=(10, 6))
sns.violinplot(data=data_salary, x='company_size', y='salary_in_usd')
plt.title('Salaries Distribution by company size')
plt.xlabel('Company size')
plt.ylabel('Salary (USD)')
plt.show()

Distribution of Salaries by Experience Level

plt.subplots(figsize=(10, 6))
sns.set_color_codes("pastel")
sns.barplot(x='experience_level', y='salary_in_usd', data=data_salary, order=['EN', 'MI', 'SE', 'EX'])  # explicit order so the tick labels match the bars
plt.title('Salary Distribution by experience level')
plt.xlabel('Experience level')
plt.ylabel('Salary (USD)')
plt.show()

Salaries Distribution by Experience Level and Employment Type

plt.figure(figsize=(12, 6))
sns.set_color_codes("pastel")
sns.barplot(x='experience_level', y='salary_in_usd', hue='employment_type', data=data_salary,)
plt.title('Salary Distribution by Experience Level and Employment Type')
plt.xlabel('Experience Level')
plt.ylabel('Salary (USD)')
plt.legend(title='Employment Type')
plt.show()

Average salaries by level of experience over the years

plt.figure(figsize=(10, 6))
sns.lineplot(data=data_salary, x='work_year', y='salary_in_usd', hue='experience_level', hue_order=['EN', 'MI', 'SE', 'EX'], estimator='mean', ci=None)
plt.title('Average salaries by level of experience over the years')
plt.xlabel('Year')
plt.ylabel('Salary (USD)')
plt.legend(title='Experience level')
plt.show()
plt.figure(figsize=(10, 6))
sns.lineplot(data=data_salary, x='work_year', y='salary_in_usd', estimator='mean', ci=None)
plt.title('Salary trends over the years')
plt.xlabel('Year')
plt.ylabel('Salary (USD)')
plt.show()

Average Salaries by Company Size Over the Years

data_salary.groupby(['work_year', 'company_size'])['salary_in_usd'].mean().unstack().plot(kind='line', marker='o', figsize=(10, 6))
plt.title('Average Salaries by Company Size Over the Years')
plt.xlabel('Year')
plt.ylabel('Average salary (USD)')
plt.legend(title='Company size')
plt.show()

2024 Data and AI Profession Salary Insights

Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score

from scipy.cluster import hierarchy

Data Exploration

df = pd.read_csv('/kaggle/input/machine-learning-engineer-salary-in-2024/salaries.csv')
df.head()
   work_year experience_level employment_type             job_title  salary salary_currency  salary_in_usd employee_residence  remote_ratio company_location company_size
0       2024               MI              FT        Data Scientist  120000             USD         120000                 AU             0               AU            S
1       2024               MI              FT        Data Scientist   70000             USD          70000                 AU             0               AU            S
2       2024               MI              CT        Data Scientist  130000             USD         130000                 US             0               US            M
3       2024               MI              CT        Data Scientist  110000             USD         110000                 US             0               US            M
4       2024               MI              FT  Data Science Manager  240000             USD         240000                 US             0               US            M
df.shape
(16494, 11)
df.columns
Index(['work_year', 'experience_level', 'employment_type', 'job_title',
       'salary', 'salary_currency', 'salary_in_usd', 'employee_residence',
       'remote_ratio', 'company_location', 'company_size'],
      dtype='object')
df.dtypes
work_year              int64
experience_level      object
employment_type       object
job_title             object
salary                 int64
salary_currency       object
salary_in_usd          int64
employee_residence    object
remote_ratio           int64
company_location      object
company_size          object
dtype: object
df.isnull().sum()
work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

EDA

df.describe()
          work_year        salary  salary_in_usd  remote_ratio
count  16494.000000  1.649400e+04   16494.000000  16494.000000
mean    2023.224991  1.637878e+05  149713.575725     32.044986
std        0.713405  3.406017e+05   68516.136918     46.260201
min     2020.000000  1.400000e+04   15000.000000      0.000000
25%     2023.000000  1.020000e+05  101517.500000      0.000000
50%     2023.000000  1.422000e+05  141300.000000      0.000000
75%     2024.000000  1.873422e+05  185900.000000    100.000000
max     2024.000000  3.040000e+07  800000.000000    100.000000
df.describe(include='object')
       experience_level employment_type      job_title salary_currency employee_residence company_location company_size
count             16494           16494          16494           16494              16494            16494        16494
unique                4               4            155              23                 88               77            3
top                  SE              FT  Data Engineer             USD                 US               US            M
freq              10652           16414           3456           15254              14427            14478        15268

Univariate Analysis

def annotate_axes(ax, axis='y'):
    for p in ax.patches:
        if axis == 'y':
            ax.annotate(format(p.get_height(), '.0f'),
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center', va='center',
                        xytext=(0, 9),
                        textcoords='offset points')
        elif axis == 'x':
            ax.annotate(format(p.get_width(), '.0f'),
                        (p.get_width(), p.get_y() + p.get_height() / 2.),
                        ha='center', va='center',
                        xytext=(9, 0),
                        textcoords='offset points')
def plot_count_graph(col: str, title: str, xlabel: str, ylabel: str, _df: pd.DataFrame, axis: str, 
                     figsize=(10, 6), ordered: bool = False, asc: bool = False, 
                     sort_by: str = 'count'):
    plt.figure(figsize=figsize)
    
    order = _df[col].unique()
    if ordered:
        if sort_by == 'count':
            order = _df[col].value_counts().sort_values(ascending=asc).index
        elif sort_by == 'values':
            order = sorted(_df[col].unique(), reverse=not asc)
            
    if axis == 'x':
        ax = sns.countplot(data=_df, x=col, palette='Pastel1', order=order)
        annotate_axes(ax, axis='y')
        plt.grid(axis='y')
    elif axis == 'y':
        ax = sns.countplot(data=_df, y=col, palette='Pastel1', order=order)
        annotate_axes(ax, axis='x')
        plt.grid(axis='x')
        
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(rotation=45)
    plt.show()
plot_count_graph(col='work_year', 
                 title='Year Distribution', 
                 xlabel='Year', 
                 ylabel='Count', 
                 _df=df, 
                 axis='x', 
                 ordered=True, 
                 asc=True, 
                 sort_by='values')
plot_count_graph(col='experience_level', 
                 title='Experience Level Distribution', 
                 xlabel='Level of Experience', 
                 ylabel='Count', 
                 _df=df, 
                 axis='x')
plot_count_graph(col='employment_type', 
                 title='Employment Type Distribution', 
                 xlabel='Employment Type', 
                 ylabel='Count', 
                 _df=df, 
                 axis='x')
plot_count_graph(col='job_title', 
                 title='Job Title Distribution', 
                 xlabel='Job Title', 
                 ylabel='Count', 
                 _df=df, 
                 axis='y', 
                 figsize=(12, 24),
                 ordered=True, 
                 sort_by='count')
plot_count_graph(col='salary_currency', 
                 title='Salary Currencies', 
                 xlabel='Salary Currencies', 
                 ylabel='Count', 
                 _df=df, 
                 ordered=True,
                 asc=True,
                 axis='x')
plot_count_graph(col='company_location', 
                 title='Company Locations', 
                 xlabel='Location', 
                 ylabel='Count', 
                 _df=df, 
                 ordered=True,
                 asc=False,
                 figsize=(10, 16),
                 axis='y')
plot_count_graph(col='employee_residence', 
                 title='Residence Distribution', 
                 xlabel='Residence', 
                 ylabel='Count', 
                 _df=df, 
                 ordered=True,
                 asc=False,
                 figsize=(10, 16),
                 axis='y')
plot_count_graph(col='company_size', 
                 title='Company Size Distribution', 
                 xlabel='Size', 
                 ylabel='Count', 
                 _df=df, 
                 ordered=True,
                 asc=False,
                 sort_by='values',
                 axis='x')
plot_count_graph(col='remote_ratio', 
                 title='Remote Ratio Distribution', 
                 xlabel='Ratio', 
                 ylabel='Count', 
                 _df=df, 
                 ordered=True,
                 asc=True,
                 sort_by='values',
                 axis='x')
plt.figure(figsize=(8, 6))
sns.histplot(data=df, x='salary', kde=True, color='skyblue', edgecolor='black')
plt.title('Distribution of Salary')
plt.grid(True)
plt.show()
plt.figure(figsize=(8, 3))
sns.histplot(data=df, x='salary_in_usd', kde=True, color='skyblue', edgecolor='black')
plt.title('Distribution of Salary in USD')
plt.grid(True)
plt.show()
plt.figure(figsize=(16, 3))
sns.boxplot(data=df, x='salary_in_usd', color='skyblue')
plt.title('Box Plot of Salary in USD')
plt.grid(True)
plt.show()

Bivariate Analysis

plt.figure(figsize=(16, 8))
sns.boxplot(data=df, y='salary_in_usd', x='experience_level', palette='Pastel1')
plt.title('Distribution of Salary in USD Across Different Experience Levels')
plt.grid(True)
plt.show()
plt.figure(figsize=(16, 8))
sns.boxplot(data=df, y='salary_in_usd', x='employment_type', palette='Pastel1')
plt.title('Distribution of Salary in USD Across Different Employment Type')
plt.grid(True)
plt.show()
plt.figure(figsize=(16, 8))
sns.boxplot(data=df, y='salary_in_usd', x='company_size', palette='Pastel1')
plt.title('Distribution of Salary in USD Across Different Company Size')
plt.grid(True)
plt.show()
plt.figure(figsize=(20, 40))
sns.boxplot(data=df, x='salary_in_usd', y='job_title', palette='Pastel1')
plt.title('Distribution of Salary in USD Across Different Job Titles')
plt.grid(True)
plt.show()
plt.figure(figsize=(20, 40))
sns.boxplot(data=df, x='salary_in_usd', y='company_location', palette='Pastel1', order=df['company_location'].value_counts().index)
plt.title('Distribution of Salary in USD Across Different Locations')
plt.grid(True)
plt.show()
stacked_data = df[~(df['company_location'] == 'US')].groupby(['company_location', 'employment_type']).size().unstack()

# Plot stacked bar graph
stacked_data.plot(kind='bar', stacked=True, figsize=(20, 10), colormap='Pastel1')
plt.title('Distribution of Employment Type Across Different Company Locations (Excluding US)')
plt.xlabel('Company Location')
plt.ylabel('Count')
plt.grid(True)
plt.legend(title='Employment Type')
plt.show()
stacked_data = df[(df['company_location'] == 'US')].groupby(['company_location', 'employment_type']).size().unstack()

# Plot stacked bar graph
stacked_data.plot(kind='bar', stacked=True, figsize=(20, 10), colormap='Pastel1')
plt.title('Distribution of Employment Type Across Different Company Locations (US Only)')
plt.xlabel('Company Location')
plt.ylabel('Count')
plt.grid(True)
plt.legend(title='Employment Type')
plt.show()
stacked_data = df[(df['employment_type']=='FT')].groupby(['experience_level', 'employment_type']).size().unstack()

# Plot stacked bar graph
stacked_data.plot(kind='bar', stacked=True, figsize=(20, 10), colormap='Pastel1')
plt.title('Distribution of Full-Time Employment Across Experience Levels')
plt.xlabel('Experience Level')
plt.ylabel('Count')
plt.grid(True)
plt.legend(title='Employment Type')
plt.show()
stacked_data = df[~(df['employment_type']=='FT')].groupby(['experience_level', 'employment_type']).size().unstack()

# Plot stacked bar graph
stacked_data.plot(kind='bar', stacked=True, figsize=(20, 10), colormap='Pastel1')
plt.title('Distribution of Employment Type Across Experience Levels (Excluding Full-Time)')
plt.xlabel('Experience Level')
plt.ylabel('Count')
plt.grid(True)
plt.legend(title='Employment Type')
plt.show()
le = LabelEncoder()
categorical_columns = ['experience_level', 'employment_type', 'job_title', 
                       'salary_currency', 'employee_residence', 'company_location', 'company_size']

df_encoded = df.copy().drop(columns=categorical_columns)
for column in categorical_columns:
    df_encoded[column + '_encoded'] = le.fit_transform(df[column])

p_corr = df_encoded.corr('spearman')
sns.set(rc={'figure.figsize':(20, 20)})
sns.heatmap(p_corr, annot=True)
<Axes: >
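
If the heatmap is hard to scan, the same Spearman matrix can be reduced to a ranked list of correlations with the target column. A small sketch using the p_corr matrix computed above:

# Rank features by the absolute value of their Spearman correlation with salary_in_usd
salary_corr = (p_corr['salary_in_usd']
               .drop('salary_in_usd')
               .sort_values(key=abs, ascending=False))
print(salary_corr)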
stacked_data = df.groupby(['job_title', 'experience_level']).size().unstack()
stacked_data['Total'] = stacked_data.sum(axis=1)
stacked_data = stacked_data.sort_values(by='Total', ascending=True)

stacked_data.drop(columns='Total').plot(kind='barh', stacked=True, figsize=(20, 40), colormap='Pastel1')
plt.title('Distribution of Experience Levels Across Job Titles')
plt.xlabel('Count')
plt.ylabel('Job Title')
plt.grid(True)
plt.legend(title='Experience Level')
plt.show()
_ = df.groupby(['work_year', 'job_title']).agg({'salary_in_usd': 'mean'})

_
                                      salary_in_usd
work_year job_title
2020      AI Scientist                 45896.000000
          Azure Data Engineer         100000.000000
          BI Data Analyst              98000.000000
          Big Data Engineer            97690.333333
          Business Data Analyst       110000.000000
...                                             ...
2024      Research Analyst            127518.121212
          Research Engineer           206586.567164
          Research Scientist          204206.865741
          Robotics Engineer           140416.666667
          Robotics Software Engineer  196625.000000

362 rows × 1 columns

reshaped_df = _.reset_index().pivot_table(index='job_title', columns='work_year', values='salary_in_usd')
reshaped_df = reshaped_df.sort_index(axis=1)
reshaped_df = reshaped_df.fillna(0)

reshaped_df
work_year                            2020      2021      2022           2023           2024
job_title
AI Architect                          0.0       0.0  180000.0  250328.000000  258753.125000
AI Developer                          0.0       0.0  275000.0  133266.823529   33333.000000
AI Engineer                           0.0       0.0  107093.0  161487.829787  164797.028571
AI Product Manager                    0.0       0.0       0.0  120000.000000  152650.000000
AI Programmer                         0.0       0.0   40000.0   72858.800000   30000.000000
...                                   ...       ...       ...            ...            ...
Sales Data Analyst                60000.0       0.0       0.0       0.000000       0.000000
Software Data Engineer                0.0       0.0       0.0  111627.666667       0.000000
Staff Data Analyst                29876.5       0.0       0.0  179998.000000       0.000000
Staff Data Scientist             164000.0  105000.0       0.0       0.000000       0.000000
Staff Machine Learning Engineer       0.0  185000.0       0.0       0.000000       0.000000

155 rows × 5 columns

Regional Disparities Analysis

average_salary_by_residence = df.groupby('employee_residence')['salary_in_usd'].mean().reset_index()
average_salary_by_residence.columns = ['Employee Residence', 'Average Salary']
print(average_salary_by_residence)
   Employee Residence  Average Salary
0                  AD    50745.000000
1                  AE    86000.000000
2                  AM    33500.000000
3                  AR    58461.538462
4                  AS    45555.000000
..                ...             ...
83                 UG    36000.000000
84                 US   157220.590351
85                 UZ    82000.000000
86                 VN    56733.333333
87                 ZA    53488.684211

[88 rows x 2 columns]
average_salary_by_location = df.groupby('company_location')['salary_in_usd'].mean().reset_index()
average_salary_by_location.columns = ['Company Location', 'Average Salary']
print(average_salary_by_location)
   Company Location  Average Salary
0                AD    50745.000000
1                AE    86000.000000
2                AM    50000.000000
3                AR    62444.444444
4                AS    31684.333333
..              ...             ...
72               TR    23094.666667
73               UA   105600.000000
74               US   156954.893355
75               VN    63000.000000
76               ZA    53488.684211

[77 rows x 2 columns]
import matplotlib.pyplot as plt

comparison = pd.merge(average_salary_by_residence, average_salary_by_location, left_on='Employee Residence', right_on='Company Location', how='outer', suffixes=('_Residence', '_Company'))

comparison_long = pd.melt(comparison, id_vars=['Employee Residence'], value_vars=['Average Salary_Residence', 'Average Salary_Company'],
                          var_name='Category', value_name='Average Salary')

plt.figure(figsize=(16, 32))
ax = sns.barplot(data=comparison_long, y='Employee Residence', x='Average Salary', hue='Category')

# Annotate each bar with its average salary (horizontal bars, so annotate at the bar width)
for p in ax.patches:
    ax.annotate(format(p.get_width(), '.0f'),
                (p.get_width(), p.get_y() + p.get_height() / 2.),
                ha='left', va='center',
                xytext=(3, 0),
                textcoords='offset points')

plt.title('Comparative Average Salary: Residence vs Company Location')
plt.xlabel('Average Salary in USD')
plt.ylabel('Employee Residence')
plt.legend(title='Category')
plt.show()
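
Another way to read the same comparison is to compute, per country, the difference between the residence-based and company-location-based averages. A minimal sketch built on the comparison DataFrame created above:

# Per-country gap between residence-based and company-location-based averages
# (positive values: residents of a country out-earn the average paid by companies located there)
comparison['Salary Gap'] = comparison['Average Salary_Residence'] - comparison['Average Salary_Company']
print(comparison[['Employee Residence', 'Salary Gap']].sort_values('Salary Gap', ascending=False).head(10))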

Clustering

import warnings

warnings.filterwarnings("ignore")
def elbow_method(data):
    inertia = []
    K = range(1, 11)
    for k in K:
        model = KMeans(n_clusters=k, random_state=1)
        model.fit(data)
        inertia.append(model.inertia_)

    # Plot the elbow
    sns.set(style='whitegrid')
    plt.figure(figsize=(12, 6))
    sns.lineplot(x=K, y=inertia, marker='o', color='red')
    plt.title('Elbow plot for KMeans clustering')
    plt.xlabel('Number of clusters (k)')
    plt.ylabel('Inertia')
    plt.xticks(K)
    plt.show()
def silhouette_scores(data):
    K = range(2, 11)
    
    for k in K:
        model = KMeans(n_clusters=k, random_state=1)
        model.fit(data)
        s_score = silhouette_score(data, model.labels_)
        print("For k={}, silhouette score is {:.3f}".format(k, s_score))
def scale(data, scaler):
    scaled_data = scaler.fit_transform(data)
    return scaled_data, scaler
def visualize_centroid(data, model, x_col, y_col, x_center, y_center, title, x_label, y_label):
    plt.clf()
    fig, ax = plt.subplots(figsize=(10,7))
    sns.scatterplot(x=x_col, y=y_col, data=data, hue='Label', ax=ax, palette='deep')

    centers = model.cluster_centers_
    sns.scatterplot(x=centers[:,x_center], y=centers[:,y_center], s=500, alpha=0.8, marker='o',
                    ax=ax, legend=False, palette='deep')

    ax.set_title(title, fontsize=16)
    ax.set_xlabel(x_label, fontsize=12)
    ax.set_ylabel(y_label, fontsize=12)
    plt.show()
def train_kmeans(K, data, is_PCA=False, isScaled=False):
    columns = data.columns
    if isScaled:
        scaler = RobustScaler()
        data = scaler.fit_transform(data)
    else:
        scaler = None

    model = KMeans(n_clusters=K, random_state=1)
    model.fit(data)
    
    labels = model.labels_
    _data = pd.DataFrame(data, columns=columns).copy()
    _data['Label'] = labels

    if is_PCA:
        pca, _data_pca = pca_transform(_data.drop('Label', axis=1))
        _data_pca = pd.DataFrame(_data_pca)
        _data_pca['Label'] = labels
        return pca, model, _data_pca, scaler
    else:
        return model, _data, scaler
def pca_transform(dataset):
    pca = PCA(n_components=2)
    pca_df = pca.fit_transform(dataset)
    return pca, pca_df
elbow_method(df_encoded)
silhouette_scores(df_encoded)
For k=2, silhouette score is 0.983
For k=3, silhouette score is 0.975
For k=4, silhouette score is 0.961
For k=5, silhouette score is 0.561
For k=6, silhouette score is 0.564
For k=7, silhouette score is 0.537
For k=8, silhouette score is 0.538
For k=9, silhouette score is 0.538
For k=10, silhouette score is 0.537
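
The very high silhouette scores at small k are likely driven by the unscaled salary columns, which dwarf the label-encoded categorical features. It may be worth repeating the diagnostics on scaled data; a quick sketch using the scale helper and the RobustScaler imported above:

# Re-run the elbow and silhouette diagnostics on robust-scaled features
scaled_features, _ = scale(df_encoded, RobustScaler())
elbow_method(scaled_features)
silhouette_scores(scaled_features)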
df_encoded.columns
Index(['work_year', 'salary', 'salary_in_usd', 'remote_ratio',
       'experience_level_encoded', 'employment_type_encoded',
       'job_title_encoded', 'salary_currency_encoded',
       'employee_residence_encoded', 'company_location_encoded',
       'company_size_encoded'],
      dtype='object')
pca, model, clustered_data, scaler = train_kmeans(3, 
          df_encoded[['work_year', 'salary_in_usd', 'remote_ratio',
       'experience_level_encoded', 'employment_type_encoded',
       'job_title_encoded', 'salary_currency_encoded',
       'employee_residence_encoded', 'company_location_encoded',
       'company_size_encoded']], True)
visualize_centroid(clustered_data, model, 0, 1, 
                   0, 1, 
                   'K-Means Clusters of Salary Records (PCA projection)', 
                   'PCA Component 1', 'PCA Component 2')
df["Label"] = clustered_data['Label']
df.loc[(df['Label']) == 0].describe()
          work_year         salary  salary_in_usd  remote_ratio   Label
count   7704.000000    7704.000000    7704.000000   7704.000000  7704.0
mean    2023.228193  164109.469107  164190.948858     32.995846     0.0
std        0.661514   25136.066236   24868.952860     46.933011     0.0
min     2020.000000  104000.000000  126225.000000      0.000000     0.0
25%     2023.000000  142000.000000  142000.000000      0.000000     0.0
50%     2023.000000  160000.000000  160000.000000      0.000000     0.0
75%     2024.000000  185000.000000  185000.000000    100.000000     0.0
max     2024.000000  274965.000000  215300.000000    100.000000     0.0
df.loc[(df['Label']) == 1].describe()
          work_year        salary  salary_in_usd  remote_ratio   Label
count   6384.000000  6.384000e+03    6384.000000   6384.000000  6384.0
mean    2023.191416  1.245268e+05   88145.769893     33.505639     1.0
std        0.801073  5.400781e+05   25453.615700     46.316866     0.0
min     2020.000000  1.400000e+04   15000.000000      0.000000     1.0
25%     2023.000000  7.000000e+04   70000.000000      0.000000     1.0
50%     2023.000000  9.200000e+04   92000.000000      0.000000     1.0
75%     2024.000000  1.100000e+05  110000.000000    100.000000     1.0
max     2024.000000  3.040000e+07  126100.000000    100.000000     1.0
df.loc[(df['Label']) == 2].describe()
          work_year        salary  salary_in_usd  remote_ratio   Label
count   2406.000000  2.406000e+03    2406.000000   2406.000000  2406.0
mean    2023.303824  2.669317e+05  266719.057772     25.124688     2.0
std        0.613400  6.829962e+04   63748.333616     43.250047     0.0
min     2020.000000  2.000000e+05  215500.000000      0.000000     2.0
25%     2023.000000  2.300000e+05  230000.000000      0.000000     2.0
50%     2023.000000  2.500000e+05  250000.000000      0.000000     2.0
75%     2024.000000  2.810000e+05  281000.000000     50.000000     2.0
max     2024.000000  1.500000e+06  800000.000000    100.000000     2.0
df.loc[(df['Label']) == 0].describe(include='object')
       experience_level employment_type       job_title salary_currency employee_residence company_location company_size
count              7704            7704            7704            7704               7704             7704         7704
unique                4               3             115               6                 27               21            3
top                  SE              FT  Data Scientist             USD                 US               US            M
freq               5783            7694            1841            7633               7325             7333         7292
df.loc[(df['Label']) == 1].describe(include='object')
       experience_level employment_type     job_title salary_currency employee_residence company_location company_size
count              6384            6384          6384            6384               6384             6384         6384
unique                4               4           143              23                 85               75            3
top                  SE              FT  Data Analyst             USD                 US               US            M
freq               2959            6318          1708            5230               4810             4852         5769
df.loc[(df['Label']) == 2].describe(include='object')
       experience_level employment_type                  job_title salary_currency employee_residence company_location company_size
count              2406            2406                       2406            2406               2406             2406         2406
unique                4               3                         67               6                 16               15            3
top                  SE              FT  Machine Learning Engineer             USD                 US               US            M
freq               1910            2402                        519            2391               2292             2293         2207
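
To summarize the three clusters at a glance, the labelled DataFrame can be aggregated directly. A brief sketch (column names follow the DataFrame built above):

# Cluster sizes, median salary, and most common job title per cluster
cluster_summary = df.groupby('Label').agg(
    n_rows=('salary_in_usd', 'size'),
    median_salary_usd=('salary_in_usd', 'median'),
    most_common_title=('job_title', lambda s: s.mode().iloc[0]),
)
print(cluster_summary)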

AIML salaries 2022-2024 AutoViz+CatBoost+SHAP

Importing libraries and loading data

# Install Python packages using pip.

# The "!pip" command allows you to run shell commands in Jupyter Notebook or Colab cells.
# It is used here to install Python packages.
# The "-q" flag stands for "quiet," which means it will suppress output during installation.
# "feature_engine are the packages being installed.
# The "2>/dev/null" part redirects any error messages (stderr) to the null device, effectively silencing them.
# This is often used when you want to hide installation messages.
!pip install -q feature_engine "autoviz>=0.1.803" dataprep 2>/dev/null
# Import necessary libraries
import numpy as np  # Import NumPy for handling numerical operations
import pandas as pd  # Import Pandas for data manipulation and analysis
import warnings  # Import Warnings to suppress unnecessary warnings

# Suppress warning messages
warnings.filterwarnings("ignore")

# Import AutoViz from the autoviz library for automated visualization of data
from autoviz import AutoViz_Class

# Import load_dataset and create_report from the dataprep library for data loading and EDA
from dataprep.datasets import load_dataset
from dataprep.eda import create_report

# Import SHAP for interpreting model predictions
import shap

# Import matplotlib for data visualization
import matplotlib.pyplot as plt

# Import CatBoostRegressor for building a regression model
from catboost import Pool, CatBoostRegressor

# Import mean_squared_error for evaluating model performance
from sklearn.metrics import mean_squared_error

# Import train_test_split for splitting the data into training and testing sets
from sklearn.model_selection import train_test_split

# Import RareLabelEncoder from feature_engine.encoding for encoding categorical features
from feature_engine.encoding import RareLabelEncoder

# Import CountVectorizer from sklearn.feature_extraction.text for text feature extraction
from sklearn.feature_extraction.text import CountVectorizer

# Import ast and re for working with text and regular expressions
import ast
import re

# Set Pandas options to display a maximum of 1000 rows
pd.set_option('display.max_rows', 1000)
Imported v0.1.901. Please call AutoViz in this sequence:
    AV = AutoViz_Class()
    %matplotlib inline
    dfte = AV.AutoViz(filename, sep=',', depVar='', dfte=None, header=0, verbose=1, lowess=False,
               chart_format='svg',max_rows_analyzed=150000,max_cols_analyzed=30, save_plot_dir=None)
# Read the CSV file containing job salaries data into a DataFrame and remove duplicate rows
df0 = pd.read_csv("/kaggle/input/data-jobs-salaries/salaries.csv").drop_duplicates()

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df0 = df0[df0['work_year'] >= 2022]

# Print the shape of the DataFrame to display the number of rows and columns
print(df0.shape)

# Display a random sample of 5 rows from the DataFrame, transposed for better visibility
df0.sample(5).T
(11152, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2024, 2024, 2024, 2023
experience_level      SE, MI, EN, MI, MI
employment_type       FT, FT, FT, FT, FT
job_title             Applied Scientist, Research Scientist, Data Analyst, Machine Learning Engineer, Business Intelligence Analyst
salary                72000, 178200, 137100, 125100, 119000
salary_currency       USD, USD, USD, USD, USD
salary_in_usd         72000, 178200, 137100, 125100, 119000
employee_residence    US, US, US, US, US
remote_ratio          0, 0, 0, 0, 0
company_location      US, US, US, US, US
company_size          L, M, M, M, M
# Read the CSV file containing data on data science salaries for 2023 into a DataFrame
df1 = pd.read_csv("/kaggle/input/data-science-salaries-2023/ds_salaries.csv").drop_duplicates()

# Filter the DataFrame to include only entries from the year 2022 and later
df1 = df1[df1['work_year'] >= 2022]

# Print the shape of the filtered DataFrame to show the number of rows and columns
print(df1.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df1.sample(5).T
(2281, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2022, 2023, 2022, 2022
experience_level      SE, SE, EX, MI, SE
employment_type       FT, FT, FT, FT, FT
job_title             Data Engineer, Data Architect, Head of Data Science, Data Scientist, Data Scientist
salary                241000, 230400, 195800, 150000, 220000
salary_currency       USD, USD, USD, USD, USD
salary_in_usd         241000, 230400, 195800, 150000, 220000
employee_residence    US, US, US, US, US
remote_ratio          0, 0, 0, 100, 0
company_location      US, US, US, US, US
company_size          M, M, M, M, M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df2 = pd.read_csv("/kaggle/input/data-jobs-salaries/salaries.csv").drop_duplicates()

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df2 = df2[df2['work_year'] >= 2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df2.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df2.sample(5).T
(11152, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2024, 2024, 2024, 2024, 2024
experience_level      EN, MI, EN, EN, EN
employment_type       FT, FT, FT, FT, FT
job_title             Data Analyst, Cloud Database Engineer, Machine Learning Engineer, Machine Learning Engineer, Data Analyst
salary                106875, 106000, 157900, 99500, 223500
salary_currency       USD, USD, USD, USD, USD
salary_in_usd         106875, 106000, 157900, 99500, 223500
employee_residence    US, US, CA, US, US
remote_ratio          0, 0, 100, 0, 0
company_location      US, US, CA, US, US
company_size          M, M, M, M, M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df3 = pd.read_csv("/kaggle/input/d/willianoliveiragibin/data-jobs-salaries/salaries.csv").drop_duplicates()

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df3 = df3[df3['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df3.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df3.sample(5).T
(4402, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2023, 2023, 2022, 2023
experience_level      SE, SE, EN, MI, SE
employment_type       FT, FT, FT, FT, FT
job_title             Data Engineer, Data Scientist, Data Engineer, Data Engineer, Data Engineer
salary                60000, 216100, 35000, 60000, 163625
salary_currency       GBP, USD, GBP, GBP, USD
salary_in_usd         73824, 216100, 43064, 73880, 163625
employee_residence    GB, US, GB, GB, US
remote_ratio          0, 0, 100, 0, 100
company_location      GB, US, GB, GB, US
company_size          M, M, M, M, M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df4 = pd.read_csv("/kaggle/input/global-ai-ml-data-science-salary/salaries.csv").drop_duplicates()

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df4 = df4[df4['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df4.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df4.sample(5).T
(4838, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2023, 2023, 2022, 2023
experience_level      SE, EN, SE, SE, SE
employment_type       FT, FT, FT, FT, FT
job_title             Data Scientist, Data Scientist, Data Science Manager, Data Analyst, Data Science Consultant
salary                169000, 18000, 134236, 120600, 116000
salary_currency       USD, EUR, USD, USD, USD
salary_in_usd         169000, 19434, 134236, 120600, 116000
employee_residence    US, GR, US, US, US
remote_ratio          0, 100, 0, 100, 0
company_location      US, GR, US, US, US
company_size          M, L, M, M, M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df5 = pd.read_csv("/kaggle/input/2023-data-scientists-salary/ds_salaries.csv").drop_duplicates()

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df5 = df5[df5['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df5.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df5.sample(5).T
(2281, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2022, 2022, 2022, 2023
experience_level      MI, SE, SE, MI, SE
employment_type       FT, FT, FT, FT, FT
job_title             Data Analyst, Research Engineer, Data Analyst, Data Scientist, Data Analyst
salary                35000, 249500, 150000, 1100000, 121600
salary_currency       GBP, USD, USD, INR, USD
salary_in_usd         42533, 249500, 150000, 13989, 121600
employee_residence    GB, US, US, IN, US
remote_ratio          0, 0, 0, 100, 100
company_location      GB, US, US, IN, US
company_size          M, M, M, L, M
#  Read the CSV file containing job salaries data into a Pandas DataFrame
df6 = pd.read_csv('/kaggle/input/salary-data-analist/ds_salaries new.csv')

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df6 = df6[df6['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df6.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df6.sample(5).T
(3449, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2022, 2022, 2022, 2022
experience_level      SE, SE, SE, SE, SE
employment_type       FT, FT, FT, FT, FT
job_title             Data Analyst, Data Engineer, Data Science Manager, Data Engineer, Data Analyst
salary                122000, 78000, 249260, 130000, 175000
salary_currency       USD, USD, USD, USD, USD
salary_in_usd         122000, 78000, 249260, 130000, 175000
employee_residence    US, US, US, US, US
remote_ratio          100, 0, 0, 0, 100
company_location      US, US, US, US, US
company_size          M, M, M, M, M
#  Read the CSV file containing job salaries data into a Pandas DataFrame
df7 = pd.read_csv('/kaggle/input/data-science-job-salaries-2024/salaries.csv')

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df7 = df7[df7['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df7.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df7.sample(5).T
(13679, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2024, 2023, 2024, 2022
experience_level      SE, SE, SE, MI, SE
employment_type       FT, FT, FT, FT, FT
job_title             Data Architect, BI Developer, Applied Scientist, Data Science, Data Scientist
salary                150000, 80000, 136000, 86000, 243900
salary_currency       USD, USD, USD, USD, USD
salary_in_usd         150000, 80000, 136000, 86000, 243900
employee_residence    US, IN, US, US, US
remote_ratio          0, 100, 0, 0, 100
company_location      US, IN, US, US, US
company_size          M, M, L, M, M
#  Read the CSV file containing job salaries data into a Pandas DataFrame
df8 = pd.read_csv('/kaggle/input/latest-data-science-job-salaries-2024/DataScience_salaries_2024.csv')

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df8 = df8[df8['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df8.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df8.sample(5).T
(14545, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2024, 2023, 2023, 2023
experience_level      EN, SE, MI, MI, SE
employment_type       FT, FT, FT, FT, FT
job_title             Data Scientist, Data Manager, Data Specialist, Data Engineer, Research Scientist
salary                50000, 131200, 60000, 147100, 185000
salary_currency       USD, USD, USD, USD, USD
salary_in_usd         50000, 131200, 60000, 147100, 185000
employee_residence    IN, US, US, US, US
remote_ratio          100, 100, 0, 0, 0
company_location      US, US, US, US, US
company_size          M, M, M, M, M
#  Read the CSV file containing job salaries data into a Pandas DataFrame
df9 = pd.read_csv("/kaggle/input/machine-learning-engineer-salary-in-2024/salaries.csv")

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df9 = df9[df9['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df9.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df9.sample(5).T
(16201, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2024, 2023, 2023, 2023
experience_level      SE, SE, MI, SE, EX
employment_type       FT, FT, FT, FT, FT
job_title             Data Analyst, Business Intelligence Engineer, Data Scientist, Data Engineer, Data Engineer
salary                150000, 31200, 45000, 310000, 204500
salary_currency       USD, EUR, EUR, USD, USD
salary_in_usd         150000, 34666, 48585, 310000, 204500
employee_residence    US, LV, ES, US, US
remote_ratio          0, 0, 0, 0, 0
company_location      US, LV, ES, US, US
company_size          M, M, M, M, M
#  Read the CSV file containing job salaries data into a Pandas DataFrame
df10 = pd.read_csv("/kaggle/input/data-engineer-salary-in-2024/salaries (2).csv")

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df10 = df10[df10['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df10.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df10.sample(5).T
(16241, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2023, 2023, 2023, 2024
experience_level      SE, SE, SE, SE, MI
employment_type       FT, FT, FT, FT, FT
job_title             Applied Scientist, Machine Learning Engineer, Machine Learning Engineer, Data Scientist, Applied Scientist
salary                159100, 280000, 204500, 140100, 222200
salary_currency       USD, USD, USD, USD, USD
salary_in_usd         159100, 280000, 204500, 140100, 222200
employee_residence    US, US, US, US, US
remote_ratio          0, 0, 0, 0, 0
company_location      US, US, US, US, US
company_size          L, M, M, M, L
#  Read the CSV file containing job salaries data into a Pandas DataFrame
df11 = pd.read_csv("/kaggle/input/ai-ml-salaries/salaries.csv")

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df11 = df11[df11['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df11.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df11.sample(5).T
(17763, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2023, 2024, 2024, 2024
experience_level      SE, SE, SE, MI, MI
employment_type       FT, FT, FT, FT, FT
job_title             Machine Learning Engineer, Data Engineer, Data Analyst, Prompt Engineer, Research Scientist
salary                144400, 385000, 111200, 500000, 96200
salary_currency       USD, USD, USD, USD, USD
salary_in_usd         144400, 385000, 111200, 500000, 96200
employee_residence    CA, US, US, US, US
remote_ratio          0, 0, 0, 0, 100
company_location      CA, US, US, US, US
company_size          M, M, M, M, M
#  Read the CSV file containing job salaries data into a Pandas DataFrame
df12 = pd.read_csv("/kaggle/input/data-science-salary-data/salaries.csv")

# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df12 = df12[df12['work_year']>=2022]

# Print the shape of the DataFrame to show the number of rows and columns
print(df12.shape)

# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df12.sample(5).T
(13437, 11)
(five randomly sampled rows, transposed; row indices omitted)
work_year             2023, 2024, 2023, 2024, 2023
experience_level      SE, SE, SE, SE, SE
employment_type       FT, FT, FT, FT, FT
job_title             Machine Learning Engineer, Data Analyst, Machine Learning Engineer, Data Architect, Analytics Engineer
salary                142200, 110000, 280000, 90000, 155000
salary_currency       USD, USD, USD, GBP, USD
salary_in_usd         142200, 110000, 280000, 112500, 155000
employee_residence    US, US, US, GB, US
remote_ratio          0, 0, 0, 0, 0
company_location      US, US, US, GB, US
company_size          M, M, M, M, M
# Concatenating DataFrames vertically
# This combines the rows of the DataFrames to create a new DataFrame
df = pd.concat([df0, df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12])

# Removing duplicate rows from the concatenated DataFrame
# This helps ensure that each row is unique in the final DataFrame
df = df.drop_duplicates()

# Printing the shape of the resulting DataFrame
# This provides information about the number of rows and columns in the DataFrame
print(df.shape)
(11938, 11)
df[df['salary_in_usd']==115573]
       work_year experience_level employment_type                  job_title  salary salary_currency  salary_in_usd employee_residence  remote_ratio company_location company_size
18105       2022               SE              FT  Machine Learning Engineer  110000             EUR         115573                 FR           100               FR            M
18886       2022               MI              FT             Data Scientist  110000             EUR         115573                 NL             0               NL            M

Data visualisation

# An update taken from the nice work https://www.kaggle.com/code/anshtanwar/auto-eda-missing-migrants-interactive-charts 
# made by @anshtanwar

# Import the AutoViz_Class
# This class is used for automated exploratory data analysis and visualization.
AV = AutoViz_Class()

# Initialize variables
filename = ""  # Specify the filename of the dataset (empty in this case)
target_variable = 'salary_in_usd'  # Specify the target variable for analysis
custom_plot_dir = "custom_plot_directory"  # Specify the directory to save custom plots

# Perform automated EDA using the AutoViz library
# The following parameters are used:
# - filename: Empty in this case as the data is provided directly as 'df'
# - sep: Delimiter used in the data (comma in this case)
# - depVar: Target variable for analysis ('salary_in_usd' in this case)
# - dfte: DataFrame to be analyzed ('df' is assumed to be defined earlier)
# - header: Indicates that the first row contains column names (0 for True)
# - verbose: Verbosity level (1 for verbose output)
# - lowess: Smoothing using Lowess algorithm (False for no smoothing)
# - chart_format: Format in which charts will be generated (HTML format in this case)
# - max_rows_analyzed: Maximum number of rows to analyze (up to 10,000 rows)
# - max_cols_analyzed: Maximum number of columns to analyze (up to 50 columns)
# - save_plot_dir: Directory to save the generated plots ('custom_plot_directory' in this case)
try:
    dft = AV.AutoViz(
        filename,
        sep=",",
        depVar=target_variable,
        dfte=df,
        header=0,
        verbose=1,
        lowess=False,
        chart_format="html",
        max_rows_analyzed=min([df.shape[0], 10**4]),
        max_cols_analyzed=min([df.shape[1], 50]),
        save_plot_dir=custom_plot_dir
    )
    
    # Import the necessary library for displaying HTML content
    from IPython.core.display import display, HTML

    # Import the pathlib library to work with file paths
    from pathlib import Path
    
    # Initialize an empty list to store file names
    file_names = []

    # Use pathlib to iterate through HTML files in a specific directory
    for file in Path(f'/kaggle/working/{custom_plot_dir}/{target_variable}/').glob('*.html'):

        # Extract the filename from the full path and add it to the list
        filename = str(file).split('/')[-1]
        file_names.append(filename)

    # Iterate through the list of file names and display each HTML file
    for file_name in file_names:

        # Construct the full file path for each HTML file
        file_path = f'/kaggle/working/{custom_plot_dir}/{target_variable}/{file_name}'

        # Open the HTML file for reading
        with open(file_path, 'r') as file:

            # Read the content of the HTML file
            html_content = file.read()

            # Display the HTML content using IPython
            display(HTML(html_content))
except Exception as e:
    print(f"Exception: {e}")
    Since nrows is smaller than dataset, loading random sample of 10000 rows into pandas...
Shape of your Data Set loaded: (10000, 11)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
    Number of Numeric Columns =  0
    Number of Integer-Categorical Columns =  2
    Number of String-Categorical Columns =  6
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  0
    Number of Numeric-Boolean Columns =  0
    Number of Discrete String Columns =  1
    Number of NLP String Columns =  0
    Number of Date Time Columns =  1
    Number of ID Columns =  0
    Number of Columns to Delete =  0
    10 Predictors classified...
        No variables removed since no ID or low-information variables found in data set
Since Number of Rows in data 10000 exceeds maximum, randomly sampling 10000 rows for EDA...

################ Regression problem #####################
Saving scatterplots in HTML format
Saving pair_scatters in HTML format
Saving distplots_cats in HTML format
Saving distplots_nums in HTML format
Saving kde_plots in HTML format
Saving violinplots in HTML format
Saving heatmaps in HTML format
Saving timeseries_plots in HTML format
Saving cat_var_plots in HTML format
Time to run AutoViz (in seconds) = 13


create_report(df.sample(10**3))

DataPrep Report

Overview

Dataset Statistics

Number of Variables: 11
Number of Rows: 1000
Missing Cells: 0
Missing Cells (%): 0.0%
Duplicate Rows: 0
Duplicate Rows (%): 0.0%
Total Size in Memory: 456.8 KB
Average Row Size in Memory: 467.8 B
Variable Types: Categorical: 7, Numerical: 2, GeoGraphy: 2

Dataset Insights

  • salary and salary_in_usd have similar distributions (Similar Distribution)
  • salary is skewed (Skewed)
  • job_title has a high cardinality: 88 distinct values (High Cardinality)
  • work_year has constant length 4 (Constant Length)
  • experience_level has constant length 2 (Constant Length)
  • employment_type has constant length 2 (Constant Length)
  • salary_currency has constant length 3 (Constant Length)
  • employee_residence has constant length 2 (Constant Length)
  • company_location has constant length 2 (Constant Length)
  • company_size has constant length 1 (Constant Length)

Variables


work_year

Approximate Distinct Count: 3
Approximate Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory Size: 67.4 KB

experience_level

Approximate Distinct Count: 4
Approximate Unique (%): 0.4%
Missing: 0
Missing (%): 0.0%
Memory Size: 65.4 KB
  • The largest value (SE) is over 1.96 times larger than the second largest value (MI)

employment_type

Approximate Distinct Count: 4
Approximate Unique (%): 0.4%
Missing: 0
Missing (%): 0.0%
Memory Size: 65.4 KB
  • The largest value (FT) is over 247.75 times larger than the second largest value (CT)

job_title

Approximate Distinct Count: 88
Approximate Unique (%): 8.8%
Missing: 0
Missing (%): 0.0%
Memory Size: 79.8 KB

salary

Approximate Distinct Count: 559
Approximate Unique (%): 55.9%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Memory Size: 15.6 KB
Mean: 166471.106
Minimum: 21000
Maximum: 5000000
Zeros: 0
Zeros (%): 0.0%
Negatives: 0
Negatives (%): 0.0%
  • salary is skewed right (γ1 = 14.2514)

salary_currency

Approximate Distinct Count: 10
Approximate Unique (%): 1.0%
Missing: 0
Missing (%): 0.0%
Memory Size: 66.4 KB
  • The largest value (USD) is over 19.57 times larger than the second largest value (GBP)

salary_in_usd

Approximate Distinct Count: 608
Approximate Unique (%): 60.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Memory Size: 15.6 KB
Mean: 150069.385
Minimum: 17511
Maximum: 720000
Zeros: 0
Zeros (%): 0.0%
Negatives: 0
Negatives (%): 0.0%
  • salary_in_usd is skewed right (γ1 = 1.3168)

employee_residence

Approximate Distinct Count: 35
Approximate Unique (%): 3.5%
Missing: 0
Missing (%): 0.0%
Memory Size: 65.4 KB
  • The largest value (US) is over 16.31 times larger than the second largest value (GB)

remote_ratio

Approximate Distinct Count: 3
Approximate Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory Size: 65.1 KB
  • The largest value (0) is over 2.1 times larger than the second largest value (100)

company_location

Approximate Distinct Count: 34
Approximate Unique (%): 3.4%
Missing: 0
Missing (%): 0.0%
Memory Size: 65.4 KB
  • The largest value (US) is over 16.08 times larger than the second largest value (GB)

company_size

Approximate Distinct Count: 3
Approximate Unique (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory Size: 64.5 KB
  • The largest value (M) is over 18.43 times larger than the second largest value (L)

The report also includes interactive Interactions, Correlations (Pearson, Spearman, Kendall Tau) and Missing Values (Bar Chart, Spectrum, Heat Map, Dendrogram) views.

Report generated with DataPrep

Data transformation

# Convert 'salary_in_usd' column to thousands of dollars per year
label = 'salary_in_usd'
df[label] = df[label] * 1e-3

# Exclude 1% of smallest and 1% of highest salaries to remove outliers
P = np.percentile(df[label], [1, 99])
df = df[(df[label] > P[0]) & (df[label] < P[1])]

# Replace 'ML Engineer' with 'Machine Learning Engineer' in the 'job_title' column
df['job_title'].replace('ML Engineer', 'Machine Learning Engineer', inplace=True)

# Rename 'experience_level' based on a dictionary mapping
exp_dict = {'EN': 'Entry-level / Junior', 'MI': 'Mid-level / Intermediate', 'SE': 'Senior-level / Expert', 'EX': 'Executive-level / Director'}
df['experience_level'] = df['experience_level'].replace(exp_dict)

# Rename 'employment_type' based on a dictionary mapping
empl_dict = {'PT': 'Part-time', 'FT': 'Full-time', 'CT': 'Contract', 'FL': 'Freelance'}
df['employment_type'] = df['employment_type'].replace(empl_dict)

# Rename 'remote_ratio' based on a dictionary mapping
remote_dict = {0: 'No remote work (less than 20%)', 50: 'Partially remote', 100: 'Fully remote (more than 80%)'}
df['remote_ratio'] = df['remote_ratio'].replace(remote_dict)

# Rename 'company_size' based on a dictionary mapping
company_dict = {'S': 'Small', 'M': 'Medium', 'L': 'Large'}
df['company_size'] = df['company_size'].replace(company_dict)

# Combine 'employee_residence' and 'company_location' into a new 'residence_location' column
df['residence_location'] = df['employee_residence'] + '/' + df['company_location']

# Convert 'work_year' column to strings
df['work_year'] = df['work_year'].astype(str)

# Set up the rare label encoder for selected columns, limiting the number of categories
# and replacing rare categories with 'Other'
for col in ['job_title', 'residence_location', 'experience_level', 'employment_type']:
    encoder = RareLabelEncoder(n_categories=1, max_n_categories=50, replace_with='Other', tol=20/df.shape[0])
    df[col] = encoder.fit_transform(df[[col]])

# Drop unused columns
cols2drop = ['salary', 'employee_residence', 'company_location', 'salary_currency']
df = df.drop(cols2drop, axis=1)

# Display the shape of the resulting DataFrame
print(df.shape)
(11698, 8)
df.sample(10).T
(ten randomly sampled rows, transposed; row indices omitted)
work_year             2024, 2024, 2024, 2023, 2022, 2023, 2024, 2024, 2023, 2022
experience_level      Entry-level / Junior, Senior-level / Expert, Mid-level / Intermediate, Senior-level / Expert, Senior-level / Expert, Senior-level / Expert, Executive-level / Director, Mid-level / Intermediate, Mid-level / Intermediate, Senior-level / Expert
employment_type       Full-time, Full-time, Full-time, Full-time, Full-time, Full-time, Full-time, Full-time, Full-time, Full-time
job_title             Business Intelligence Analyst, Business Intelligence Engineer, Data Engineer, Business Intelligence Engineer, Data Engineer, Data Engineer, Data Engineer, Data Analyst, Data Engineer, Machine Learning Engineer
salary_in_usd         111.4, 204.5, 149.9, 180.0, 172.2, 143.0, 190.0, 59.469, 83.9, 131.3
remote_ratio          Fully remote (more than 80%), No remote work (less than 20%), No remote work (less than 20%), Fully remote (more than 80%), No remote work (less than 20%), No remote work (less than 20%), No remote work (less than 20%), No remote work (less than 20%), No remote work (less than 20%), Fully remote (more than 80%)
company_size          Medium, Medium, Medium, Medium, Medium, Medium, Medium, Medium, Medium, Large
residence_location    US/US, US/US, US/US, US/US, US/US, US/US, US/US, US/US, US/US, US/US
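
To confirm that the RareLabelEncoder grouped infrequent categories into 'Other' as intended, here is a quick sanity-check sketch of the resulting category counts:

# How many categories remain per encoded column, and how many rows fell into 'Other'
for col in ['job_title', 'residence_location', 'experience_level', 'employment_type']:
    n_other = (df[col] == 'Other').sum()
    print(f"{col}: {df[col].nunique()} categories, {n_other} rows labelled 'Other'")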

Machine learning

# Extract the target variable 'label' and features 'X' from the DataFrame
y = df[label].values.reshape(-1,)
X = df.drop([label], axis=1)

# Identify categorical columns in the feature set
cat_cols = df.select_dtypes(include=['object']).columns

# Get the indices of categorical columns in the feature set
cat_cols_idx = [list(X.columns).index(c) for c in cat_cols]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=df[['residence_location']])

# Print the shapes of the training and testing sets to verify the split
print("Training set shapes - X_train: {}, y_train: {}".format(X_train.shape, y_train.shape))
print("Testing set shapes - X_test: {}, y_test: {}".format(X_test.shape, y_test.shape))
Training set shapes - X_train: (5849, 7), y_train: (5849,)
Testing set shapes - X_test: (5849, 7), y_test: (5849,)
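
Because the split is stratified on residence_location, each location should show up in roughly the same proportion in both halves. A quick sanity check like the one below (a small sketch, not part of the original notebook, reusing X_train and X_test from above) makes that easy to verify:

# Compare residence_location proportions in the training and test sets (sanity check for the stratified split)
split_check = pd.concat(
    {
        'train': X_train['residence_location'].value_counts(normalize=True),
        'test': X_test['residence_location'].value_counts(normalize=True),
    },
    axis=1,
)
print(split_check.head(10).round(3))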
# Initialize Pool: Creating CatBoost Pools for training and testing data, specifying categorical features.
train_pool = Pool(X_train, 
                  y_train, 
                  cat_features=cat_cols_idx)
test_pool = Pool(X_test,
                 y_test,
                 cat_features=cat_cols_idx)

# Specify Training Parameters: Configuring the CatBoostRegressor with specific hyperparameters.
model = CatBoostRegressor(iterations=1800, 
                          depth=6,
                          verbose=0,
                          early_stopping_rounds=100,
                          learning_rate=0.008, 
                          loss_function='RMSE')

# Train the Model: Fitting the model to the training data and using the test data for early stopping.
model.fit(train_pool, eval_set=test_pool)

# Make Predictions: Generating predictions on both the training and test sets.
y_train_pred = model.predict(train_pool)
y_test_pred = model.predict(test_pool)

# Evaluate Performance on Training and Test Sets: Calculating RMSE scores for both sets.
rmse_train = mean_squared_error(y_train, y_train_pred, squared=False)
rmse_test = mean_squared_error(y_test, y_test_pred, squared=False)

# Print Results: Displaying the RMSE scores for the training and test sets.
print(f"RMSE score for train {round(rmse_train, 1)} kUSD/year, and for test {round(rmse_test, 1)} kUSD/year")
RMSE score for train 51.4 kUSD/year, and for test 52.0 kUSD/year
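
Besides RMSE, CatBoost can report how much each feature contributed to the model's predictions. The short sketch below is not part of the original notebook; it simply pulls the built-in feature importances from the trained model, assuming the model and train_pool objects defined above.

# Built-in CatBoost feature importances for the trained model
importances = model.get_feature_importance(train_pool)
for feature, importance in sorted(zip(X_train.columns, importances), key=lambda x: -x[1]):
    print(f"{feature:<25} {importance:6.2f}")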
# Baseline scores (assuming the same prediction for all data samples)

# Calculating the root mean squared error (RMSE) for the training set based on the mean prediction
rmse_bs_train = mean_squared_error(y_train, [np.mean(y_train)] * len(y_train), squared=False)

# Calculating the root mean squared error (RMSE) for the test set based on the mean prediction
rmse_bs_test = mean_squared_error(y_test, [np.mean(y_train)] * len(y_test), squared=False)

# Printing the RMSE baseline scores for both the training and test sets
print(f"RMSE baseline score for train: {round(rmse_bs_train, 1)} kUSD/year, and for test: {round(rmse_bs_test, 1)} kUSD/year")
RMSE baseline score for train: 64.7 kUSD/year, and for test: 64.1 kUSD/year
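
Comparing the two numbers, the model cuts the baseline test error from about 64.1 to 52.0 kUSD/year. If you want that as a single figure, the relative improvement can be computed directly:

# Relative improvement of the model over the constant-mean baseline on the test set
improvement = 100 * (1 - rmse_test / rmse_bs_test)
print(f"Model reduces test RMSE by about {improvement:.1f}% versus the baseline")
# With the values above: 100 * (1 - 52.0 / 64.1) is roughly 18.9%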

Explanations with SHAP values

%matplotlib inline
# Initialize the SHAP JavaScript visualization library
shap.initjs()

# Create a SHAP TreeExplainer for the given 'model'
ex = shap.TreeExplainer(model)

# Compute SHAP values for the test dataset 'X_test' using the TreeExplainer
shap_values = ex.shap_values(X_test)

# Generate a summary plot of SHAP values to visualize feature contributions
shap.summary_plot(shap_values, X_test)
# The explainer's expected value is the SHAP base value, i.e. the model's average prediction
expected_values = ex.expected_value

# Printing the average predicted salary rounded to one decimal place in kilo USD per year.
print(f"Average predicted salary is {round(expected_values, 1)} kUSD/year")

# Calculating and printing the average actual salary from the 'y_test' array, rounded to one decimal place in kilo USD per year.
print(f"Average actual salary is {round(np.mean(y_test), 1)} kUSD/year")
Average predicted salary is 147.9 kUSD/year
Average actual salary is 146.5 kUSD/year
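
The summary plot shows feature contributions across the whole test set; SHAP can also explain a single prediction. Here is a small sketch (not part of the original notebook) for one test row, reusing the ex, shap_values, X_test and y_test_pred objects from above:

# Explain a single prediction: contributions that push row 0 away from the average prediction
i = 0
shap.force_plot(ex.expected_value, shap_values[i, :], X_test.iloc[i, :], matplotlib=True)

# The same information as text: base value plus the row's SHAP contributions equals its prediction
print(f"Base value: {ex.expected_value:.1f} kUSD/year")
print(f"Sum of SHAP contributions for row {i}: {shap_values[i, :].sum():.1f} kUSD/year")
print(f"Model prediction for row {i}: {y_test_pred[i]:.1f} kUSD/year")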
# Function to visualize SHAP values for a specific feature
def show_shap(col, shap_values=shap_values):
    # Create a copy of the test dataset to avoid modifying the original data
    df_infl = X_test.copy()
    
    # Add a new column for SHAP values corresponding to the specified feature
    df_infl['shap_'] = shap_values[:, df_infl.columns.tolist().index(col)]
    
    # Calculate the mean and standard deviation of SHAP values grouped by the specified feature
    gain = round(df_infl.groupby(col)['shap_'].mean(), 5)
    gain_std = round(df_infl.groupby(col)['shap_'].std(), 5)
    
    # Count the number of instances for each category of the specified feature
    cnt = df_infl.groupby(col)['shap_'].count()
    
    # Create a dictionary to store the results
    dd_dict = {'col': list(gain.index), 'gain': list(gain.values), 'gain_std': list(gain_std.values), 'count': cnt}
    
    # Create a DataFrame from the dictionary and sort it by 'gain' in descending order
    df_res = pd.DataFrame.from_dict(dd_dict).sort_values('gain', ascending=False).set_index('col')
    
    # Plotting SHAP values with error bars
    plt.figure(figsize=(9, 6))
    plt.errorbar(df_res.index, df_res['gain'], yerr=df_res['gain_std'], fmt="o", color="r")
    plt.title(f'SHAP values for {col}')
    plt.ylabel('kUSD/year')
    plt.tick_params(axis="x", rotation=90)
    plt.show()
    
    # Display the results DataFrame
    print(df_res)
    
    return

# Iterate through all columns in the test dataset
for col in X_test.columns:
    print()
    print(col)
    print()
    
    # Call the show_shap function for each feature
    show_shap(col, shap_values)
work_year

       gain    gain_std  count
col                           
2024  0.86390   1.69680  2874 
2023  0.29864   1.82858  2418 
2022 -6.28405   1.38722   557 

experience_level

                              gain    gain_std  count
col                                                  
Executive-level / Director  32.96379   4.27869   212 
Senior-level / Expert       10.19685   1.17599  3392 
Mid-level / Intermediate   -17.35785   1.30479  1655 
Entry-level / Junior       -27.45064   3.77944   590 

employment_type

             gain    gain_std  count
col                                 
Full-time  -0.02771   1.19619  5814 
Other      -7.58667   1.18367     5 
Contract   -9.80807   3.58763    11 
Part-time -11.75318   2.88548    19 

job_title

                                            gain    gain_std  count
col                                                                
Machine Learning Engineer                 27.57581   2.25142   657 
Computer Vision Engineer                  26.98830   3.90854    21 
Research Scientist                        26.01017   1.09076   209 
Data Science Engineer                     25.75089   2.08551    19 
Data Infrastructure Engineer              25.53922   1.28684    10 
Head of Data                              25.16594   1.96862    26 
Machine Learning Scientist                24.53864   0.86595    55 
Machine Learning Infrastructure Engineer  24.30945   1.31455    21 
Machine Learning Researcher               24.18054   3.23081    16 
Applied Scientist                         23.94772   2.22138    84 
Data Science Manager                      23.54227   2.51141    44 
Director of Data Science                  23.17807   4.01040    11 
AI Architect                              22.90402   2.52727    12 
Data Analytics Lead                       21.32742   1.94926    13 
Research Engineer                         19.02157   1.54630   132 
Data Science Lead                         11.94530   1.74500    14 
Data Scientist                             4.81905   1.68475  1173 
Data Science                               3.57735   1.41696    94 
Data Analytics Manager                     3.32220   0.85416    28 
Data Architect                             3.16300   0.83773   159 
AI Engineer                                2.55602   1.33826    73 
Data Engineer                              0.20027   1.68056  1005 
Data Product Manager                      -0.97918   0.67761    21 
Analytics Engineer                        -1.07788   0.80328   162 
MLOps Engineer                            -2.09056   0.72661    16 
ETL Developer                             -2.46553   0.44679    15 
AI Scientist                              -4.87655   1.39078    12 
Data Lead                                 -5.54521   1.49889    13 
Business Intelligence                     -6.58333   1.69320    56 
Data Modeler                             -10.56544   0.35704    17 
Other                                    -13.42975   1.41266   302 
Business Intelligence Manager            -13.67929   0.40962     7 
Business Intelligence Engineer           -13.74015   1.51243    63 
AI Developer                             -15.72698   2.82001    12 
Research Analyst                         -19.46592   2.14006    65 
Business Intelligence Analyst            -22.83291   2.80047   116 
Data Strategist                          -24.35387   0.70741    15 
Data Quality Analyst                     -24.93376   7.06758    10 
Data Science Consultant                  -26.41770   3.58909    24 
Insight Analyst                          -28.67048   4.51092    10 
Data Management Analyst                  -29.89674   3.70526    11 
Data Analyst                             -30.54220   4.40807   790 
BI Developer                             -31.72639   4.00586    38 
Data Management Specialist               -32.50803   3.76743    10 
Data Specialist                          -33.05490   4.40766    44 
Data Operations Analyst                  -33.05554   3.22688    13 
Data Manager                             -33.71244   3.41207    59 
BI Analyst                               -33.97450   4.80845    24 
Business Intelligence Developer          -34.48365   3.10161    31 
Data Developer                           -34.68734   1.58439    17 

remote_ratio

                                 gain    gain_std  count
col                                                     
No remote work (less than 20%)  0.52673   1.15238  3922 
Fully remote (more than 80%)   -1.42224   1.44277  1845 
Partially remote               -6.36967   1.72677    82 

company_size

         gain    gain_std  count
col                             
Medium -0.11769   0.98566  5532 
Large  -1.06947   2.54676   253 
Small  -5.99881   2.27519    64 

residence_location

         gain    gain_std  count
col                             
US/US   6.44578   1.08267  4908 
CA/CA  -3.74517   1.52402   236 
AU/AU -17.00825   3.28928    27 
DE/DE -35.03124   3.85496    45 
FR/FR -41.77947   3.72310    29 
IN/IN -41.83046   4.90604    18 
GB/GB -42.95816   5.21321   313 
Other -43.79319   5.34311   175 
NL/NL -48.00334   4.79647    15 
LT/LT -48.76303   5.22976    11 
ZA/ZA -49.16019   3.26999    10 
PT/PT -49.54464   5.33928    12 
ES/ES -50.48376   3.74628    40 
LV/LV -50.59173   4.38976    10 
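
Since SHAP values are additive, the base value plus a category's mean SHAP contribution gives a rough average predicted salary for that category: US-based employees of US companies come out around 154 kUSD/year (147.9 + 6.4), while GB/GB sits near 105 kUSD/year (147.9 - 43.0). The short sketch below (not part of the original notebook) computes this for every residence_location category using the expected_values, shap_values and X_test objects from above; it only accounts for the location contribution, so individual predictions will still differ.

# Approximate average predicted salary per residence_location: base value + mean SHAP contribution
loc_idx = list(X_test.columns).index('residence_location')
loc_shap = pd.Series(shap_values[:, loc_idx], index=X_test.index)
approx_pay = (expected_values + loc_shap.groupby(X_test['residence_location']).mean()).sort_values(ascending=False)
print(approx_pay.round(1))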

Additional analysis: Machine Learning Engineer vs Data Scientist gap analysis with SHAP values

def plot_gap(col, main_col="job_title", value1="Machine Learning Engineer", value2="Data Scientist"):
    # Work on a copy of the test set and attach the SHAP values of the main column (e.g. job_title)
    df_infl = X_test.copy()
    df_infl['shap_gd'] = shap_values[:, list(X_test.columns).index(main_col)]

    # Mean and standard deviation of those SHAP values for every (col, main_col) combination
    df1_mean = pd.pivot_table(df_infl, values=['shap_gd'], index=[col, main_col], aggfunc=np.mean)
    df1_std = pd.pivot_table(df_infl, values=['shap_gd'], index=[col, main_col], aggfunc=np.std)

    # Gap between the two job titles (value1 minus value2) within each category of 'col'
    df2_mean = pd.pivot(df1_mean.reset_index(), index=col, columns=main_col, values='shap_gd')[[value1, value2]].dropna(axis=0)
    df2_mean['gap'] = df2_mean[value1] - df2_mean[value2]

    # Combined uncertainty of the gap (standard deviations added in quadrature)
    df2_std = pd.pivot(df1_std.reset_index(), index=col, columns=main_col, values='shap_gd')[[value1, value2]]
    df2_std['std'] = np.sqrt(df2_std[value1]**2 + df2_std[value2]**2)

    df2 = df2_mean[['gap']].join(df2_std[['std']], how='inner')
    df2 = df2.dropna(axis=0).sort_values('gap', ascending=False)

    # Plot the absolute gap in kUSD/year
    plt.figure(figsize=(12, 8))
    plt.bar(x=df2.index, height=df2['gap'])
    plt.errorbar(df2.index, df2['gap'], yerr=df2['std'], fmt="o", color="r")
    plt.title(f'SHAP value of gap per {col}, yearly compensation')
    plt.ylabel('kUSD/year')
    plt.tick_params(axis="x", rotation=90)
    plt.show()

    # Plot the gap relative to the average pay of each category of 'col'
    df_infl['shap_'] = shap_values[:, list(X_test.columns).index(col)]
    df2['avg_pay'] = expected_values + df_infl.groupby(col)['shap_'].mean()
    df2['avg_pp'] = 100 * df2['gap'] / df2['avg_pay']
    df2 = df2.sort_values('avg_pp', ascending=False)

    plt.figure(figsize=(12, 8))
    plt.bar(x=df2.index, height=df2['avg_pp'])
    plt.errorbar(df2.index, df2['avg_pp'], yerr=100 * df2['std'] / df2['avg_pay'], fmt="o", color="r")
    plt.title(f'Gap per {col} relative to average pay')
    plt.ylabel('Percentage points')
    plt.tick_params(axis="x", rotation=90)
    plt.show()
    return

for col in X_test.columns:
    if col != 'job_title':
        print(col)
        plot_gap(col)
work_year
experience_level
employment_type
remote_ratio
company_size
residence_location
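
The bar charts produced by this loop don't carry over to this write-up, but the underlying numbers are easy to print. For instance, the Machine Learning Engineer vs Data Scientist gap per experience level can be tabulated directly from the job_title SHAP values. This is a small sketch reusing shap_values and X_test, not part of the original notebook:

# Tabulate the ML Engineer vs Data Scientist SHAP gap per experience level (text version of the plot above)
jt_idx = list(X_test.columns).index('job_title')
tmp = X_test.copy()
tmp['shap_jt'] = shap_values[:, jt_idx]
pivot = tmp.pivot_table(values='shap_jt', index='experience_level', columns='job_title', aggfunc='mean')
gap = (pivot['Machine Learning Engineer'] - pivot['Data Scientist']).dropna()
print(gap.round(1).sort_values(ascending=False))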

Additional analysis: 2024 vs 2023 year analysis with SHAP values

for col in X_test.columns:
    if col != 'work_year':
        print(col)
        plot_gap(col, main_col="work_year", value1="2024", value2="2023")
experience_level
employment_type
job_title
remote_ratio
company_size
residence_location
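
Again the plots themselves don't render here, but the headline number is easy to recover from the work_year SHAP values shown earlier: the average contribution is about +0.86 kUSD/year for 2024 rows versus +0.30 for 2023, a modest year-over-year shift of roughly 0.6 kUSD/year once the other features are accounted for. A minimal sketch of that calculation, reusing shap_values and X_test:

# Average work_year SHAP contribution per year: a quick year-over-year comparison
year_idx = list(X_test.columns).index('work_year')
year_shap = pd.Series(shap_values[:, year_idx], index=X_test.index)
mean_by_year = year_shap.groupby(X_test['work_year']).mean()
print(mean_by_year.round(2))
print(f"2024 vs 2023 shift: {mean_by_year['2024'] - mean_by_year['2023']:.2f} kUSD/year")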

Conclusion

Alright everyone, we’ve come to the conclusion of our in-depth analysis on Machine Learning Engineer salaries for 2024. Here’s what you need to know:

Key Points:

  • Salary Range: the model's average predicted salary lands around 148 kUSD/year, and the Machine Learning Engineer title alone contributes roughly +28 kUSD/year versus the average role, with experience level swinging pay from about -27 kUSD/year at entry level to +33 kUSD/year at executive level.
  • Industry Demand: AI-heavy titles (Machine Learning Engineer, Computer Vision Engineer, Research Scientist) sit at the top of the SHAP ranking, reflecting how much employers are paying for these skills.
  • Geographical Impact: where you live matters. US-based roles add about +6 kUSD/year relative to the model's average, while other locations in the data fall anywhere from a few kUSD/year (Canada) to roughly 50 kUSD/year (much of Europe) below it.

Machine Learning Engineers are set for success in 2024, with attractive salaries and abundant opportunities. Whether you’re just starting out or a seasoned expert, the future looks promising in the AI field.


Thank you for joining us on this journey. Stay tuned for more insights on the ever-changing tech industry!

