Hello there! Welcome to our blog where we’re delving into the captivating realm of Machine Learning Engineer salaries for 2024. Interested in knowing how much these tech wizards are earning? You’ve come to the perfect place.
Machine Learning is currently one of the most sought-after fields in the tech industry. Whether you’re considering a career in this field or simply curious about the pay scale, we’ve got all the exciting details.
Here’s what we’ll cover:
Salary Ranges: We’ll provide a breakdown of the average salaries you can expect.
Industry Demand: Discover which sectors are offering top dollar for AI talent.
Location, Location, Location: Learn how geography can impact your paycheck.
So grab a cup of coffee and get cozy, because we’re about to unravel everything you need to know about Machine Learning Engineer salaries in 2024. Let’s dive in!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
from sklearn.preprocessing import RobustScaler
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_samples, silhouette_score
from scipy.cluster import hierarchy
Bivariate Analysis
plt.figure(figsize=(16, 8))
sns.boxplot(data=df, y='salary_in_usd', x='experience_level', palette='Pastel1')
plt.title('Distribution of Salary in USD Across Different Experience Levels')
plt.grid(True)
plt.show()
plt.figure(figsize=(16, 8))
sns.boxplot(data=df, y='salary_in_usd', x='employment_type', palette='Pastel1')
plt.title('Distribution of Salary in USD Across Different Employment Types')
plt.grid(True)
plt.show()
plt.figure(figsize=(16, 8))
sns.boxplot(data=df, y='salary_in_usd', x='company_size', palette='Pastel1')
plt.title('Distribution of Salary in USD Across Different Company Sizes')
plt.grid(True)
plt.show()
plt.figure(figsize=(20, 40))
sns.boxplot(data=df, x='salary_in_usd', y='job_title', palette='Pastel1')
plt.title('Distribution of Salary in USD Across Different Job Titles')
plt.grid(True)
plt.show()
plt.figure(figsize=(20, 40))
# `hue` (not `color`) is the seaborn argument that splits each box by a column
sns.boxplot(data=df, x='salary_in_usd', y='company_location', hue='experience_level', palette='Pastel1', order=df['company_location'].value_counts().index)
plt.title('Distribution of Salary in USD Across Different Locations')
plt.grid(True)
plt.show()
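Box plots are easy to read at a glance, but it can help to see the same quartiles as numbers. Here is a minimal sketch of that, assuming a DataFrame with the `experience_level` and `salary_in_usd` columns used above. The `toy` frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real salary DataFrame `df`.
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "EN", "SE", "SE", "SE", "SE"],
    "salary_in_usd": [50000, 60000, 70000, 120000, 150000, 180000, 200000],
})

# Quartiles per experience level: the same numbers a box plot draws
# as the lower edge, the middle line, and the upper edge of each box.
quartiles = (
    toy.groupby("experience_level")["salary_in_usd"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
print(quartiles)
```

Swapping `experience_level` for `employment_type`, `company_size`, or `job_title` gives the numeric counterpart of the other box plots.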
stacked_data = df[~(df['company_location'] == 'US')].groupby(['company_location', 'employment_type']).size().unstack()
# Plot stacked bar graph
stacked_data.plot(kind='bar', stacked=True, figsize=(20, 10), colormap='Pastel1')
plt.title('Distribution of Employment Type Across Different Company Locations (Excluding US)')
plt.xlabel('Company Location')
plt.ylabel('Count')
plt.grid(True)
plt.legend(title='Employment Type')
plt.show()
stacked_data = df[(df['company_location'] == 'US')].groupby(['company_location', 'employment_type']).size().unstack()
# Plot stacked bar graph
stacked_data.plot(kind='bar', stacked=True, figsize=(20, 10), colormap='Pastel1')
plt.title('Distribution of Employment Type Across Different Company Locations (US Only)')
plt.xlabel('Company Location')
plt.ylabel('Count')
plt.grid(True)
plt.legend(title='Employment Type')
plt.show()
stacked_data = df[(df['employment_type']=='FT')].groupby(['experience_level', 'employment_type']).size().unstack()
# Plot stacked bar graph
stacked_data.plot(kind='bar', stacked=True, figsize=(20, 10), colormap='Pastel1')
plt.title('Distribution of Employment Type Across Experience Levels (Full Time Only)')
plt.xlabel('Experience Level')
plt.ylabel('Count')
plt.grid(True)
plt.legend(title='Employment Type')
plt.show()
stacked_data = df[~(df['employment_type']=='FT')].groupby(['experience_level', 'employment_type']).size().unstack()
# Plot stacked bar graph
stacked_data.plot(kind='bar', stacked=True, figsize=(20, 10), colormap='Pastel1')
plt.title('Distribution of Employment Type Across Experience Levels (Excluding Full Time)')
plt.xlabel('Experience Level')
plt.ylabel('Count')
plt.grid(True)
plt.legend(title='Employment Type')
plt.show()
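Every stacked bar chart above is drawn from a table of counts built with `groupby(...).size().unstack()`. The sketch below shows what such a table looks like on a few made-up rows (the `toy` frame is hypothetical). One detail worth knowing: combinations that never occur come out as `NaN` unless you pass `fill_value=0`, and `pd.crosstab` produces the same counts in one call.

```python
import pandas as pd

# Hypothetical toy rows; the notebook uses the full salary DataFrame `df`.
toy = pd.DataFrame({
    "experience_level": ["EN", "EN", "SE", "SE", "SE"],
    "employment_type": ["PT", "CT", "PT", "PT", "FL"],
})

# Rows: experience level. Columns: employment type. Cells: number of records.
counts = (
    toy.groupby(["experience_level", "employment_type"])
    .size()
    .unstack(fill_value=0)  # missing combinations become 0 instead of NaN
)

# pd.crosstab builds the same contingency table directly.
crosstab = pd.crosstab(toy["experience_level"], toy["employment_type"])

print(counts)
```

Calling `counts.plot(kind='bar', stacked=True)` on a table like this gives the kind of chart shown above.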
    Employee Residence  Average Salary
0                   AD    50745.000000
1                   AE    86000.000000
2                   AM    33500.000000
3                   AR    58461.538462
4                   AS    45555.000000
..                 ...             ...
83                  UG    36000.000000
84                  US   157220.590351
85                  UZ    82000.000000
86                  VN    56733.333333
87                  ZA    53488.684211
[88 rows x 2 columns]
    Company Location  Average Salary
0                 AD    50745.000000
1                 AE    86000.000000
2                 AM    50000.000000
3                 AR    62444.444444
4                 AS    31684.333333
..               ...             ...
72                TR    23094.666667
73                UA   105600.000000
74                US   156954.893355
75                VN    63000.000000
76                ZA    53488.684211
[77 rows x 2 columns]
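The two tables above list an average salary per country code, once by where the employee lives and once by where the company is based. The code that produced them is not shown, so here is one plausible way to build tables of that shape with a pandas `groupby`. The helper name `average_salary_by` and the toy rows are assumptions made for this sketch.

```python
import pandas as pd

# Hypothetical toy rows; the real tables come from the full salary DataFrame.
toy = pd.DataFrame({
    "employee_residence": ["US", "US", "GB", "GB", "IN"],
    "company_location": ["US", "US", "GB", "US", "IN"],
    "salary_in_usd": [150000, 170000, 80000, 120000, 30000],
})

def average_salary_by(frame: pd.DataFrame, column: str, label: str) -> pd.DataFrame:
    """Mean salary_in_usd per category of `column`, as a two-column table."""
    out = frame.groupby(column)["salary_in_usd"].mean().reset_index()
    out.columns = [label, "Average Salary"]
    return out

by_residence = average_salary_by(toy, "employee_residence", "Employee Residence")
by_location = average_salary_by(toy, "company_location", "Company Location")
print(by_residence)
print(by_location)
```

Keep in mind that a mean over a country with only a handful of records can swing a lot, so it is worth checking the row count per country before reading too much into any single figure.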
For k=2, silhouette score is 0.983
For k=3, silhouette score is 0.975
For k=4, silhouette score is 0.961
For k=5, silhouette score is 0.561
For k=6, silhouette score is 0.564
For k=7, silhouette score is 0.537
For k=8, silhouette score is 0.538
For k=9, silhouette score is 0.538
For k=10, silhouette score is 0.537
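The silhouette score ranges from -1 to 1. Values close to 1 mean that points sit much closer to their own cluster than to the nearest other cluster, which is why the sharp drop after k=4 in the output above stands out. The loop that printed those scores is not shown, so below is a minimal sketch of how such a scan is usually written with `KMeans` and `silhouette_score`. The synthetic `features` array is an assumption; the notebook would pass its own encoded and scaled columns instead.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Hypothetical stand-in for the scaled feature matrix: three well-separated blobs.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = {}
for k in range(2, 7):
    # Fit k-means with k clusters and score how well separated the result is.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)
    print(f"For k={k}, silhouette score is {scores[k]:.3f}")

# The k with the highest score is the usual first candidate.
best_k = max(scores, key=scores.get)
print(f"Best k by silhouette score: {best_k}")
```

A high score alone does not make a clustering useful. With k=2 a very high score can simply mean that a few extreme points were split off from everything else, so it is worth looking at cluster sizes as well.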
visualize_centroid(clustered_data, model, 0, 1,
0, 1,
'K-Means Clustering of Professions by Job Title and Experience Level',
'Job Title', 'Experience Level')
AIML salaries 2022-2024 AutoViz+CatBoost+SHAP
Importing libraries and loading data
# Install Python packages using pip.
# The "!pip" command allows you to run shell commands in Jupyter Notebook or Colab cells.
# It is used here to install Python packages.
# The "-q" flag stands for "quiet," which means it will suppress output during installation.
# "feature_engine", "autoviz" and "dataprep" are the packages being installed.
# The "2>/dev/null" part redirects any error messages (stderr) to the null device, effectively silencing them.
# This is often used when you want to hide installation messages.
# The version specifier is quoted so the shell does not treat ">" as an output redirect.
!pip install -q feature_engine "autoviz>=0.1.803" dataprep 2>/dev/null
# Import necessary libraries
import numpy as np # Import NumPy for handling numerical operations
import pandas as pd # Import Pandas for data manipulation and analysis
import warnings # Import Warnings to suppress unnecessary warnings
# Suppress warning messages
warnings.filterwarnings("ignore")
# Import AutoViz from the autoviz library for automated visualization of data
from autoviz import AutoViz_Class
# Import load_dataset and create_report from the dataprep library for data loading and EDA
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
# Import SHAP for interpreting model predictions
import shap
# Import matplotlib for data visualization
import matplotlib.pyplot as plt
# Import CatBoostRegressor for building a regression model
from catboost import Pool, CatBoostRegressor
# Import mean_squared_error for evaluating model performance
from sklearn.metrics import mean_squared_error
# Import train_test_split for splitting the data into training and testing sets
from sklearn.model_selection import train_test_split
# Import RareLabelEncoder from feature_engine.encoding for encoding categorical features
from feature_engine.encoding import RareLabelEncoder
# Import CountVectorizer from sklearn.feature_extraction.text for text feature extraction
from sklearn.feature_extraction.text import CountVectorizer
# Import ast and re for working with text and regular expressions
import ast
import re
# Set Pandas options to display a maximum of 1000 rows
pd.set_option('display.max_rows', 1000)
# Read the CSV file containing job salaries data into a DataFrame and remove duplicate rows
df0 = pd.read_csv("/kaggle/input/data-jobs-salaries/salaries.csv").drop_duplicates()
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df0 = df0[df0['work_year'] >= 2022]
# Print the shape of the DataFrame to display the number of rows and columns
print(df0.shape)
# Display a random sample of 5 rows from the DataFrame, transposed for better visibility
df0.sample(5).T
(11152, 11)
                    17616              1355                5601          8480                       9264
work_year           2023               2024                2024          2024                       2023
experience_level    SE                 MI                  EN            MI                         MI
employment_type     FT                 FT                  FT            FT                         FT
job_title           Applied Scientist  Research Scientist  Data Analyst  Machine Learning Engineer  Business Intelligence Analyst
salary              72000              178200              137100        125100                     119000
salary_currency     USD                USD                 USD           USD                        USD
salary_in_usd       72000              178200              137100        125100                     119000
employee_residence  US                 US                  US            US                         US
remote_ratio        0                  0                   0             0                          0
company_location    US                 US                  US            US                         US
company_size        L                  M                   M             M                          M
# Read the CSV file containing data on data science salaries for 2023 into a DataFrame
df1 = pd.read_csv("/kaggle/input/data-science-salaries-2023/ds_salaries.csv").drop_duplicates()
# Filter the DataFrame to include only entries from the year 2022 and later
df1 = df1[df1['work_year'] >= 2022]
# Print the shape of the filtered DataFrame to show the number of rows and columns
print(df1.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df1.sample(5).T
(2281, 11)
                    177            2556            1397                  1843            1859
work_year           2023           2022            2023                  2022            2022
experience_level    SE             SE              EX                    MI              SE
employment_type     FT             FT              FT                    FT              FT
job_title           Data Engineer  Data Architect  Head of Data Science  Data Scientist  Data Scientist
salary              241000         230400          195800                150000          220000
salary_currency     USD            USD             USD                   USD             USD
salary_in_usd       241000         230400          195800                150000          220000
employee_residence  US             US              US                    US              US
remote_ratio        0              0               0                     100             0
company_location    US             US              US                    US              US
company_size        M              M               M                     M               M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df2 = pd.read_csv("/kaggle/input/data-jobs-salaries/salaries.csv").drop_duplicates()
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df2 = df2[df2['work_year'] >= 2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df2.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df2.sample(5).T
(11152, 11)
                    7758          5241                     4957                       5797                       2103
work_year           2024          2024                     2024                       2024                       2024
experience_level    EN            MI                       EN                         EN                         EN
employment_type     FT            FT                       FT                         FT                         FT
job_title           Data Analyst  Cloud Database Engineer  Machine Learning Engineer  Machine Learning Engineer  Data Analyst
salary              106875        106000                   157900                     99500                      223500
salary_currency     USD           USD                      USD                        USD                        USD
salary_in_usd       106875        106000                   157900                     99500                      223500
employee_residence  US            US                       CA                         US                         US
remote_ratio        0             0                        100                        0                          0
company_location    US            US                       CA                         US                         US
company_size        M             M                        M                          M                          M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df3 = pd.read_csv("/kaggle/input/d/willianoliveiragibin/data-jobs-salaries/salaries.csv").drop_duplicates()
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df3 = df3[df3['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df3.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df3.sample(5).T
(4402, 11)
                    657            5581            2044           7248           5521
work_year           2023           2023            2023           2022           2023
experience_level    SE             SE              EN             MI             SE
employment_type     FT             FT              FT             FT             FT
job_title           Data Engineer  Data Scientist  Data Engineer  Data Engineer  Data Engineer
salary              60000          216100          35000          60000          163625
salary_currency     GBP            USD             GBP            GBP            USD
salary_in_usd       73824          216100          43064          73880          163625
employee_residence  GB             US              GB             GB             US
remote_ratio        0              0               100            0              100
company_location    GB             US              GB             GB             US
company_size        M              M               M              M              M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df4 = pd.read_csv("/kaggle/input/global-ai-ml-data-science-salary/salaries.csv").drop_duplicates()
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df4 = df4[df4['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df4.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df4.sample(5).T
(4838, 11)
                    4060            851             5345                  8327          310
work_year           2023            2023            2023                  2022          2023
experience_level    SE              EN              SE                    SE            SE
employment_type     FT              FT              FT                    FT            FT
job_title           Data Scientist  Data Scientist  Data Science Manager  Data Analyst  Data Science Consultant
salary              169000          18000           134236                120600        116000
salary_currency     USD             EUR             USD                   USD           USD
salary_in_usd       169000          19434           134236                120600        116000
employee_residence  US              GR              US                    US            US
remote_ratio        0               100             0                     100           0
company_location    US              GR              US                    US            US
company_size        M               L               M                     M             M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df5 = pd.read_csv("/kaggle/input/2023-data-scientists-salary/ds_salaries.csv").drop_duplicates()
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df5 = df5[df5['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df5.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df5.sample(5).T
(2281, 11)
                    158           1960               2475          1816            654
work_year           2023          2022               2022          2022            2023
experience_level    MI            SE                 SE            MI              SE
employment_type     FT            FT                 FT            FT              FT
job_title           Data Analyst  Research Engineer  Data Analyst  Data Scientist  Data Analyst
salary              35000         249500             150000        1100000         121600
salary_currency     GBP           USD                USD           INR             USD
salary_in_usd       42533         249500             150000        13989           121600
employee_residence  GB            US                 US            IN              US
remote_ratio        0             0                  0             100             100
company_location    GB            US                 US            IN              US
company_size        M             M                  M             L               M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df6 = pd.read_csv('/kaggle/input/salary-data-analist/ds_salaries new.csv')
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df6 = df6[df6['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df6.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df6.sample(5).T
(3449, 11)
                    977           1939           3098                  2503           3017
work_year           2023          2022           2022                  2022           2022
experience_level    SE            SE             SE                    SE             SE
employment_type     FT            FT             FT                    FT             FT
job_title           Data Analyst  Data Engineer  Data Science Manager  Data Engineer  Data Analyst
salary              122000        78000          249260                130000         175000
salary_currency     USD           USD            USD                   USD            USD
salary_in_usd       122000        78000          249260                130000         175000
employee_residence  US            US             US                    US             US
remote_ratio        100           0              0                     0              100
company_location    US            US             US                    US             US
company_size        M             M              M                     M              M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df7 = pd.read_csv('/kaggle/input/data-science-job-salaries-2024/salaries.csv')
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df7 = df7[df7['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df7.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df7.sample(5).T
(13679, 11)
                    4865            3207          8944               1769          12657
work_year           2023            2024          2023               2024          2022
experience_level    SE              SE            SE                 MI            SE
employment_type     FT              FT            FT                 FT            FT
job_title           Data Architect  BI Developer  Applied Scientist  Data Science  Data Scientist
salary              150000          80000         136000             86000         243900
salary_currency     USD             USD           USD                USD           USD
salary_in_usd       150000          80000         136000             86000         243900
employee_residence  US              IN            US                 US            US
remote_ratio        0               100           0                  0             100
company_location    US              IN            US                 US            US
company_size        M               M             L                  M             M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df8 = pd.read_csv('/kaggle/input/latest-data-science-job-salaries-2024/DataScience_salaries_2024.csv')
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df8 = df8[df8['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df8.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df8.sample(5).T
(14545, 11)
                    14262           8570          13830            6927           3911
work_year           2023            2024          2023             2023           2023
experience_level    EN              SE            MI               MI             SE
employment_type     FT              FT            FT               FT             FT
job_title           Data Scientist  Data Manager  Data Specialist  Data Engineer  Research Scientist
salary              50000           131200        60000            147100         185000
salary_currency     USD             USD           USD              USD            USD
salary_in_usd       50000           131200        60000            147100         185000
employee_residence  IN              US            US               US             US
remote_ratio        100             100           0                0              0
company_location    US              US            US               US             US
company_size        M               M             M                M              M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df9 = pd.read_csv("/kaggle/input/machine-learning-engineer-salary-in-2024/salaries.csv")
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df9 = df9[df9['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df9.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df9.sample(5).T
(16201, 11)
                    7410          5387                            13553           14502          11737
work_year           2023          2024                            2023            2023           2023
experience_level    SE            SE                              MI              SE             EX
employment_type     FT            FT                              FT              FT             FT
job_title           Data Analyst  Business Intelligence Engineer  Data Scientist  Data Engineer  Data Engineer
salary              150000        31200                           45000           310000         204500
salary_currency     USD           EUR                             EUR             USD            USD
salary_in_usd       150000        34666                           48585           310000         204500
employee_residence  US            LV                              ES              US             US
remote_ratio        0             0                               0               0              0
company_location    US            LV                              ES              US             US
company_size        M             M                               M               M              M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df10 = pd.read_csv("/kaggle/input/data-engineer-salary-in-2024/salaries (2).csv")
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df10 = df10[df10['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df10.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df10.sample(5).T
(16241, 11)
                    9777               9339                       9462                       10574           3207
work_year           2023               2023                       2023                       2023            2024
experience_level    SE                 SE                         SE                         SE              MI
employment_type     FT                 FT                         FT                         FT              FT
job_title           Applied Scientist  Machine Learning Engineer  Machine Learning Engineer  Data Scientist  Applied Scientist
salary              159100             280000                     204500                     140100          222200
salary_currency     USD                USD                        USD                        USD             USD
salary_in_usd       159100             280000                     204500                     140100          222200
employee_residence  US                 US                         US                         US              US
remote_ratio        0                  0                          0                          0               0
company_location    US                 US                         US                         US              US
company_size        L                  M                          M                          M               L
# Read the CSV file containing job salaries data into a Pandas DataFrame
df11 = pd.read_csv("/kaggle/input/ai-ml-salaries/salaries.csv")
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df11 = df11[df11['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df11.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df11.sample(5).T
(17763, 11)
                    9139                       10496          5055          127              6567
work_year           2023                       2023           2024          2024             2024
experience_level    SE                         SE             SE            MI               MI
employment_type     FT                         FT             FT            FT               FT
job_title           Machine Learning Engineer  Data Engineer  Data Analyst  Prompt Engineer  Research Scientist
salary              144400                     385000         111200        500000           96200
salary_currency     USD                        USD            USD           USD              USD
salary_in_usd       144400                     385000         111200        500000           96200
employee_residence  CA                         US             US            US               US
remote_ratio        0                          0              0             0                100
company_location    CA                         US             US            US               US
company_size        M                          M              M             M                M
# Read the CSV file containing job salaries data into a Pandas DataFrame
df12 = pd.read_csv("/kaggle/input/data-science-salary-data/salaries.csv")
# Filter the DataFrame to include only rows where the 'work_year' is greater than or equal to 2022
df12 = df12[df12['work_year']>=2022]
# Print the shape of the DataFrame to show the number of rows and columns
print(df12.shape)
# Display a random sample of 5 rows from the filtered DataFrame, transposing for better readability
df12.sample(5).T
(13437, 11)
                    11374                      1399          3457                       870             3566
work_year           2023                       2024          2023                       2024            2023
experience_level    SE                         SE            SE                         SE              SE
employment_type     FT                         FT            FT                         FT              FT
job_title           Machine Learning Engineer  Data Analyst  Machine Learning Engineer  Data Architect  Analytics Engineer
salary              142200                     110000        280000                     90000           155000
salary_currency     USD                        USD           USD                        GBP             USD
salary_in_usd       142200                     110000        280000                     112500          155000
employee_residence  US                         US            US                         GB              US
remote_ratio        0                          0             0                          0               0
company_location    US                         US            US                         GB              US
company_size        M                          M             M                          M               M
# Concatenating DataFrames vertically
# This combines the rows of the DataFrames to create a new DataFrame
df = pd.concat([df0, df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12])
# Removing duplicate rows from the concatenated DataFrame
# This helps ensure that each row is unique in the final DataFrame
df = df.drop_duplicates()
# Printing the shape of the resulting DataFrame
# This provides information about the number of rows and columns in the DataFrame
print(df.shape)
(11938, 11)
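Several of the files loaded above look like overlapping snapshots of the same salary survey, which is why the combined frame ends up with 11,938 rows rather than the sum of the individual shapes. Here is a small sketch of the concatenate-then-deduplicate step on two made-up snapshots. The `older` and `newer` frames are hypothetical, and `ignore_index=True` is an optional extra that the notebook does not use.

```python
import pandas as pd

# Two hypothetical snapshots of the same survey: the second one repeats
# the first snapshot's rows and adds one newer record.
older = pd.DataFrame({
    "work_year": [2022, 2023],
    "job_title": ["Data Scientist", "Data Engineer"],
    "salary_in_usd": [150000, 130000],
})
newer = pd.DataFrame({
    "work_year": [2022, 2023, 2024],
    "job_title": ["Data Scientist", "Data Engineer", "Machine Learning Engineer"],
    "salary_in_usd": [150000, 130000, 160000],
})

# ignore_index=True gives the combined frame a fresh 0..n-1 index, so row
# labels stay unique (without it, each source keeps its own labels).
combined = pd.concat([older, newer], ignore_index=True)

# Rows that match in every column are collapsed into one.
deduplicated = combined.drop_duplicates().reset_index(drop=True)

print(combined.shape)      # rows before de-duplication
print(deduplicated.shape)  # rows after de-duplication
```

One caveat: `drop_duplicates()` cannot tell a repeated snapshot row from two different people who happen to report identical details, so some genuine records are merged as well.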
df[df['salary_in_usd']==115573]
       work_year  experience_level  employment_type  job_title                  salary  salary_currency  salary_in_usd  employee_residence  remote_ratio  company_location  company_size
18105  2022       SE                FT               Machine Learning Engineer  110000  EUR              115573         FR                  100           FR                M
18886  2022       MI                FT               Data Scientist             110000  EUR              115573         NL                  0             NL                M
Data visualisation
# An update taken from the nice work https://www.kaggle.com/code/anshtanwar/auto-eda-missing-migrants-interactive-charts
# made by @anshtanwar
# Import the AutoViz_Class
# This class is used for automated exploratory data analysis and visualization.
AV = AutoViz_Class()
# Initialize variables
filename = "" # Specify the filename of the dataset (empty in this case)
target_variable = 'salary_in_usd' # Specify the target variable for analysis
custom_plot_dir = "custom_plot_directory" # Specify the directory to save custom plots
# Perform automated EDA using the AutoViz library
# The following parameters are used:
# - filename: Empty in this case as the data is provided directly as 'df'
# - sep: Delimiter used in the data (comma in this case)
# - depVar: Target variable for analysis ('salary_in_usd' in this case)
# - dfte: DataFrame to be analyzed ('df' is assumed to be defined earlier)
# - header: Indicates that the first row contains column names (0 for True)
# - verbose: Verbosity level (1 for verbose output)
# - lowess: Smoothing using Lowess algorithm (False for no smoothing)
# - chart_format: Format in which charts will be generated (HTML format in this case)
# - max_rows_analyzed: Maximum number of rows to analyze (up to 10,000 rows)
# - max_cols_analyzed: Maximum number of columns to analyze (up to 50 columns)
# - save_plot_dir: Directory to save the generated plots ('custom_plot_directory' in this case)
try:
    dft = AV.AutoViz(
        filename,
        sep=",",
        depVar=target_variable,
        dfte=df,
        header=0,
        verbose=1,
        lowess=False,
        chart_format="html",
        max_rows_analyzed=min([df.shape[0], 10**4]),
        max_cols_analyzed=min([df.shape[1], 50]),
        save_plot_dir=custom_plot_dir
    )
    # Import the necessary library for displaying HTML content
    from IPython.display import display, HTML
    # Import the pathlib library to work with file paths
    from pathlib import Path
    # Initialize an empty list to store file names
    file_names = []
    # Use pathlib to iterate through HTML files in a specific directory
    for file in Path(f'/kaggle/working/{custom_plot_dir}/{target_variable}/').glob('*.html'):
        # Extract the filename from the full path and add it to the list
        file_names.append(file.name)
    # Iterate through the list of file names and display each HTML file
    for file_name in file_names:
        # Construct the full file path for each HTML file
        file_path = f'/kaggle/working/{custom_plot_dir}/{target_variable}/{file_name}'
        # Open the HTML file for reading
        with open(file_path, 'r') as file:
            # Read the content of the HTML file
            html_content = file.read()
            # Display the HTML content using IPython
            display(HTML(html_content))
except Exception as e:
    print(f"Exception: {e}")
Since nrows is smaller than dataset, loading random sample of 10000 rows into pandas...
Shape of your Data Set loaded: (10000, 11)
#######################################################################################
######################## C L A S S I F Y I N G V A R I A B L E S ####################
#######################################################################################
Classifying variables in data set...
Number of Numeric Columns = 0
Number of Integer-Categorical Columns = 2
Number of String-Categorical Columns = 6
Number of Factor-Categorical Columns = 0
Number of String-Boolean Columns = 0
Number of Numeric-Boolean Columns = 0
Number of Discrete String Columns = 1
Number of NLP String Columns = 0
Number of Date Time Columns = 1
Number of ID Columns = 0
Number of Columns to Delete = 0
10 Predictors classified...
No variables removed since no ID or low-information variables found in data set
Since Number of Rows in data 10000 exceeds maximum, randomly sampling 10000 rows for EDA...
################ Regression problem #####################
Saving scatterplots in HTML format
Saving pair_scatters in HTML format
Saving distplots_cats in HTML format
Saving distplots_nums in HTML format
Saving kde_plots in HTML format
Saving violinplots in HTML format
Saving heatmaps in HTML format
Saving timeseries_plots in HTML format
Saving cat_var_plots in HTML format
Time to run AutoViz (in seconds) = 13
# Convert 'salary_in_usd' column to thousands of dollars per year
label = 'salary_in_usd'
df[label] = df[label] * 1e-3
# Exclude 1% of smallest and 1% of highest salaries to remove outliers
P = np.percentile(df[label], [1, 99])
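# .copy() makes the filtered frame independent, so the column assignments below do not act on a slice
df = df[(df[label] > P[0]) & (df[label] < P[1])].copy()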
# Replace 'ML Engineer' with 'Machine Learning Engineer' in the 'job_title' column
df['job_title'] = df['job_title'].replace('ML Engineer', 'Machine Learning Engineer')
# Rename 'experience_level' based on a dictionary mapping
exp_dict = {'EN': 'Entry-level / Junior', 'MI': 'Mid-level / Intermediate', 'SE': 'Senior-level / Expert', 'EX': 'Executive-level / Director'}
df['experience_level'] = df['experience_level'].replace(exp_dict)
# Rename 'employment_type' based on a dictionary mapping
empl_dict = {'PT': 'Part-time', 'FT': 'Full-time', 'CT': 'Contract', 'FL': 'Freelance'}
df['employment_type'] = df['employment_type'].replace(empl_dict)
# Rename 'remote_ratio' based on a dictionary mapping
remote_dict = {0: 'No remote work (less than 20%)', 50: 'Partially remote', 100: 'Fully remote (more than 80%)'}
df['remote_ratio'] = df['remote_ratio'].replace(remote_dict)
# Rename 'company_size' based on a dictionary mapping
company_dict = {'S': 'Small', 'M': 'Medium', 'L': 'Large'}
df['company_size'] = df['company_size'].replace(company_dict)
# Combine 'employee_residence' and 'company_location' into a new 'residence_location' column
df['residence_location'] = df['employee_residence'] + '/' + df['company_location']
# Convert 'work_year' column to strings
df['work_year'] = df['work_year'].astype(str)
# Set up the rare label encoder for selected columns, limiting the number of categories
# and replacing rare categories with 'Other'
for col in ['job_title', 'residence_location', 'experience_level', 'employment_type']:
    encoder = RareLabelEncoder(n_categories=1, max_n_categories=50, replace_with='Other', tol=20/df.shape[0])
    df[col] = encoder.fit_transform(df[[col]])
# Drop unused columns
cols2drop = ['salary', 'employee_residence', 'company_location', 'salary_currency']
df = df.drop(cols2drop, axis=1)
# Display the shape of the resulting DataFrame
print(df.shape)
(11698, 8)
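The rare-label step above is worth a closer look. With `tol=20/df.shape[0]`, any category that appears in fewer than 20 rows is folded into a single `'Other'` bucket. The sketch below shows the same idea in plain pandas. It is a rough illustration, not feature_engine's implementation: the helper `group_rare_labels` is hypothetical and it ignores the `n_categories` and `max_n_categories` options.

```python
import pandas as pd

def group_rare_labels(series: pd.Series, min_share: float, other: str = "Other") -> pd.Series:
    """Replace categories whose relative frequency is below `min_share`."""
    shares = series.value_counts(normalize=True)
    rare = shares[shares < min_share].index
    # Keep frequent labels as they are and overwrite the rare ones.
    return series.where(~series.isin(rare), other)

# Hypothetical job titles: one of them appears only once in ten rows.
titles = pd.Series(
    ["Data Engineer"] * 6 + ["Data Scientist"] * 3 + ["Prompt Engineer"]
)
grouped = group_rare_labels(titles, min_share=0.2)
print(grouped.value_counts())
```

Grouping rare labels keeps the model from memorising one-off job titles or country pairs, and it also guarantees that every remaining category has enough rows to be split between train and test.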
df.sample(10).T
                    4057                           5001                            3356                            11098                           18342
work_year           2024                           2024                            2024                            2023                            2022
experience_level    Entry-level / Junior           Senior-level / Expert           Mid-level / Intermediate        Senior-level / Expert           Senior-level / Expert
employment_type     Full-time                      Full-time                       Full-time                       Full-time                       Full-time
job_title           Business Intelligence Analyst  Business Intelligence Engineer  Data Engineer                   Business Intelligence Engineer  Data Engineer
salary_in_usd       111.4                          204.5                           149.9                           180.0                           172.2
remote_ratio        Fully remote (more than 80%)   No remote work (less than 20%)  No remote work (less than 20%)  Fully remote (more than 80%)    No remote work (less than 20%)
company_size        Medium                         Medium                          Medium                          Medium                          Medium
residence_location  US/US                          US/US                           US/US                           US/US                           US/US

                    9928                            5642                            3186                            11403                           2746
work_year           2023                            2024                            2024                            2023                            2022
experience_level    Senior-level / Expert           Executive-level / Director      Mid-level / Intermediate        Mid-level / Intermediate        Senior-level / Expert
employment_type     Full-time                       Full-time                       Full-time                       Full-time                       Full-time
job_title           Data Engineer                   Data Engineer                   Data Analyst                    Data Engineer                   Machine Learning Engineer
salary_in_usd       143.0                           190.0                           59.469                          83.9                            131.3
remote_ratio        No remote work (less than 20%)  No remote work (less than 20%)  No remote work (less than 20%)  No remote work (less than 20%)  Fully remote (more than 80%)
company_size        Medium                          Medium                          Medium                          Medium                          Large
residence_location  US/US                           US/US                           US/US                           US/US                           US/US
Machine learning
# Extract the target variable 'label' and features 'X' from the DataFrame
y = df[label].values.reshape(-1,)
X = df.drop([label], axis=1)
# Identify categorical columns in the feature set
cat_cols = X.select_dtypes(include=['object']).columns
# Get the indices of categorical columns in the feature set
cat_cols_idx = [list(X.columns).index(c) for c in cat_cols]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0, stratify=df[['residence_location']])
# Print the shapes of the training and testing sets to verify the split
print("Training set shapes - X_train: {}, y_train: {}".format(X_train.shape, y_train.shape))
print("Testing set shapes - X_test: {}, y_test: {}".format(X_test.shape, y_test.shape))
Training set shapes - X_train: (5849, 7), y_train: (5849,)
Testing set shapes - X_test: (5849, 7), y_test: (5849,)
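The `stratify` argument is what keeps the mix of `residence_location` values the same on both sides of the 50/50 split. Here is a small sketch of that behaviour on made-up rows (the `toy` frame is hypothetical). Stratifying needs at least two rows per category, which is one reason the rare labels were grouped into `'Other'` earlier.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy rows: eight US/US records and four GB/GB records.
toy = pd.DataFrame({
    "residence_location": ["US/US"] * 8 + ["GB/GB"] * 4,
    "salary_in_usd": [150, 160, 140, 155, 165, 145, 170, 135, 80, 90, 85, 95],
})

# stratify keeps the US/US : GB/GB ratio identical in both halves.
train, test = train_test_split(
    toy, test_size=0.5, random_state=0, stratify=toy["residence_location"]
)

train_counts = train["residence_location"].value_counts().to_dict()
test_counts = test["residence_location"].value_counts().to_dict()
print(train_counts)
print(test_counts)
```

Without `stratify`, a small country group could land entirely in the test half, and the model would then be scored on a category it never saw during training.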
# Initialize Pool: Creating CatBoost Pools for training and testing data, specifying categorical features.
train_pool = Pool(X_train,
y_train,
cat_features=cat_cols_idx)
test_pool = Pool(X_test,
y_test,
cat_features=cat_cols_idx)
# Specify Training Parameters: Configuring the CatBoostRegressor with specific hyperparameters.
model = CatBoostRegressor(iterations=1800,
depth=6,
verbose=0,
early_stopping_rounds=100,
learning_rate=0.008,
loss_function='RMSE')
# Train the Model: Fitting the model to the training data and using the test data for early stopping.
model.fit(train_pool, eval_set=test_pool)
# Make Predictions: Generating predictions on both the training and test sets.
y_train_pred = model.predict(train_pool)
y_test_pred = model.predict(test_pool)
# Evaluate Performance on Training and Test Sets: Calculating RMSE scores for both sets.
# RMSE is the square root of the MSE; this form also works on scikit-learn releases without the `squared` argument.
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
# Print Results: Displaying the RMSE scores for the training and test sets.
print(f"RMSE score for train {round(rmse_train, 1)} kUSD/year, and for test {round(rmse_test, 1)} kUSD/year")
RMSE score for train 51.4 kUSD/year, and for test 52.0 kUSD/year
# Baseline scores (predicting the training-set mean salary for every sample)
# RMSE for the training set when every prediction is the training mean
rmse_bs_train = np.sqrt(mean_squared_error(y_train, np.full(len(y_train), np.mean(y_train))))
# RMSE for the test set, still using the training mean so no test information leaks into the baseline
rmse_bs_test = np.sqrt(mean_squared_error(y_test, np.full(len(y_test), np.mean(y_train))))
# Printing the RMSE baseline scores for both the training and test sets
print(f"RMSE baseline score for train: {round(rmse_bs_train, 1)} kUSD/year, and for test: {round(rmse_bs_test, 1)} kUSD/year")
RMSE baseline score for train: 64.7 kUSD/year, and for test: 64.1 kUSD/year
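To put those two sets of scores side by side, here is a minimal sketch that turns them into a relative improvement over the baseline. The numbers are simply copied from the printed outputs above, and the helper name `relative_improvement` is our own, not part of any library.

```python
# RMSE values copied from the printed outputs above (kUSD/year)
rmse_model = {"train": 51.4, "test": 52.0}
rmse_baseline = {"train": 64.7, "test": 64.1}


def relative_improvement(model_rmse, baseline_rmse):
    """Fraction by which the model's RMSE is lower than the baseline's (hypothetical helper)."""
    return 1.0 - model_rmse / baseline_rmse


improvements = {
    split: relative_improvement(rmse_model[split], rmse_baseline[split])
    for split in rmse_model
}

for split, gain in improvements.items():
    print(f"{split}: RMSE is {gain:.1%} lower than the mean-prediction baseline")
```

On these figures the model cuts the error by roughly a fifth on both splits, and the small gap between train and test suggests it is not badly overfitting.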
Explanations with SHAP values
%matplotlib inline
# Initialize the SHAP JavaScript visualization library
shap.initjs()
# Create a SHAP TreeExplainer for the given 'model'
ex = shap.TreeExplainer(model)
# Compute SHAP values for the test dataset 'X_test' using the TreeExplainer
shap_values = ex.shap_values(X_test)
# Generate a summary plot of SHAP values to visualize feature contributions
shap.summary_plot(shap_values, X_test)
# The explainer's expected value is the baseline (average) prediction that all SHAP values are measured against
expected_values = ex.expected_value
# Print the average predicted salary, rounded to one decimal place, in kUSD/year
print(f"Average predicted salary is {round(expected_values, 1)} kUSD/year")
# Print the average actual salary in the test set for comparison, rounded to one decimal place
print(f"Average actual salary is {round(np.mean(y_test), 1)} kUSD/year")
Average predicted salary is 147.9 kUSD/year
Average actual salary is 146.5 kUSD/year
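SHAP values are additive: for any single row, the model's prediction equals the expected value plus the sum of that row's SHAP values. As a rough illustration, the sketch below adds the per-category mean SHAP values (taken from the result tables later in this section) to the 147.9 kUSD/year baseline for one hypothetical profile. This is only an approximation, since real SHAP values vary from row to row, and residence_location is left out because its table is not reproduced here.

```python
# Baseline prediction printed above (ex.expected_value), in kUSD/year
base_value = 147.9

# Mean SHAP value per category, copied from the result tables in this section.
# The profile itself is a hypothetical example chosen for illustration.
mean_shap = {
    "work_year = 2024": 0.86390,
    "experience_level = Senior-level / Expert": 10.19685,
    "employment_type = Full-time": -0.02771,
    "job_title = Machine Learning Engineer": 27.57581,
    "remote_ratio = No remote work (less than 20%)": 0.52673,
    "company_size = Medium": -0.11769,
}

# Additivity: prediction is approximately baseline + sum of feature contributions
estimate = base_value + sum(mean_shap.values())

for feature, contribution in mean_shap.items():
    print(f"{feature:50s} {contribution:+8.2f}")
print(f"Rough estimate: {estimate:.1f} kUSD/year")
```

Most of the lift over the baseline comes from just two features, job title and experience level, which is a pattern you will see again in the per-feature results.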
# Function to visualize SHAP values for a specific feature
def show_shap(col, shap_values=shap_values):
    # Create a copy of the test dataset to avoid modifying the original data
    df_infl = X_test.copy()
    # Add a new column for SHAP values corresponding to the specified feature
    df_infl['shap_'] = shap_values[:, df_infl.columns.tolist().index(col)]
    # Calculate the mean and standard deviation of SHAP values grouped by the specified feature
    gain = round(df_infl.groupby(col)['shap_'].mean(), 5)
    gain_std = round(df_infl.groupby(col)['shap_'].std(), 5)
    # Count the number of instances for each category of the specified feature
    cnt = df_infl.groupby(col)['shap_'].count()
    # Create a dictionary to store the results
    dd_dict = {'col': list(gain.index), 'gain': list(gain.values), 'gain_std': list(gain_std.values), 'count': cnt}
    # Create a DataFrame from the dictionary and sort it by 'gain' in descending order
    df_res = pd.DataFrame.from_dict(dd_dict).sort_values('gain', ascending=False).set_index('col')
    # Plotting SHAP values with error bars
    plt.figure(figsize=(9, 6))
    plt.errorbar(df_res.index, df_res['gain'], yerr=df_res['gain_std'], fmt="o", color="r")
    plt.title(f'SHAP values for {col}')
    plt.ylabel('kUSD/year')
    plt.tick_params(axis="x", rotation=90)
    plt.show()
    # Display the results DataFrame
    print(df_res)
# Iterate through all columns in the test dataset
for col in X_test.columns:
    print()
    print(col)
    print()
    # Call the show_shap function for each feature
    show_shap(col, shap_values)
work_year
gain gain_std count
col
2024 0.86390 1.69680 2874
2023 0.29864 1.82858 2418
2022 -6.28405 1.38722 557
experience_level
gain gain_std count
col
Executive-level / Director 32.96379 4.27869 212
Senior-level / Expert 10.19685 1.17599 3392
Mid-level / Intermediate -17.35785 1.30479 1655
Entry-level / Junior -27.45064 3.77944 590
employment_type
gain gain_std count
col
Full-time -0.02771 1.19619 5814
Other -7.58667 1.18367 5
Contract -9.80807 3.58763 11
Part-time -11.75318 2.88548 19
job_title
gain gain_std count
col
Machine Learning Engineer 27.57581 2.25142 657
Computer Vision Engineer 26.98830 3.90854 21
Research Scientist 26.01017 1.09076 209
Data Science Engineer 25.75089 2.08551 19
Data Infrastructure Engineer 25.53922 1.28684 10
Head of Data 25.16594 1.96862 26
Machine Learning Scientist 24.53864 0.86595 55
Machine Learning Infrastructure Engineer 24.30945 1.31455 21
Machine Learning Researcher 24.18054 3.23081 16
Applied Scientist 23.94772 2.22138 84
Data Science Manager 23.54227 2.51141 44
Director of Data Science 23.17807 4.01040 11
AI Architect 22.90402 2.52727 12
Data Analytics Lead 21.32742 1.94926 13
Research Engineer 19.02157 1.54630 132
Data Science Lead 11.94530 1.74500 14
Data Scientist 4.81905 1.68475 1173
Data Science 3.57735 1.41696 94
Data Analytics Manager 3.32220 0.85416 28
Data Architect 3.16300 0.83773 159
AI Engineer 2.55602 1.33826 73
Data Engineer 0.20027 1.68056 1005
Data Product Manager -0.97918 0.67761 21
Analytics Engineer -1.07788 0.80328 162
MLOps Engineer -2.09056 0.72661 16
ETL Developer -2.46553 0.44679 15
AI Scientist -4.87655 1.39078 12
Data Lead -5.54521 1.49889 13
Business Intelligence -6.58333 1.69320 56
Data Modeler -10.56544 0.35704 17
Other -13.42975 1.41266 302
Business Intelligence Manager -13.67929 0.40962 7
Business Intelligence Engineer -13.74015 1.51243 63
AI Developer -15.72698 2.82001 12
Research Analyst -19.46592 2.14006 65
Business Intelligence Analyst -22.83291 2.80047 116
Data Strategist -24.35387 0.70741 15
Data Quality Analyst -24.93376 7.06758 10
Data Science Consultant -26.41770 3.58909 24
Insight Analyst -28.67048 4.51092 10
Data Management Analyst -29.89674 3.70526 11
Data Analyst -30.54220 4.40807 790
BI Developer -31.72639 4.00586 38
Data Management Specialist -32.50803 3.76743 10
Data Specialist -33.05490 4.40766 44
Data Operations Analyst -33.05554 3.22688 13
Data Manager -33.71244 3.41207 59
BI Analyst -33.97450 4.80845 24
Business Intelligence Developer -34.48365 3.10161 31
Data Developer -34.68734 1.58439 17
remote_ratio
gain gain_std count
col
No remote work (less than 20%) 0.52673 1.15238 3922
Fully remote (more than 80%) -1.42224 1.44277 1845
Partially remote -6.36967 1.72677 82
company_size
gain gain_std count
col
Medium -0.11769 0.98566 5532
Large -1.06947 2.54676 253
Small -5.99881 2.27519 64
residence_location
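To compare features at a glance, one rough heuristic is the spread between the highest and lowest mean SHAP value within each feature. The sketch below hard-codes those extremes from the tables printed above. The `spread` measure and the variable names are our own illustration rather than anything built into SHAP, and residence_location is omitted because its table is not reproduced here.

```python
# (highest mean gain, lowest mean gain) per feature, copied from the tables above, in kUSD/year
gain_extremes = {
    "work_year": (0.86390, -6.28405),
    "experience_level": (32.96379, -27.45064),
    "employment_type": (-0.02771, -11.75318),
    "job_title": (27.57581, -34.68734),
    "remote_ratio": (0.52673, -6.36967),
    "company_size": (-0.11769, -5.99881),
}

# Spread = how far apart the best-paid and worst-paid categories sit for each feature
spread = {feature: high - low for feature, (high, low) in gain_extremes.items()}

# Rank features from largest to smallest spread
ranked = sorted(spread, key=spread.get, reverse=True)

for feature in ranked:
    print(f"{feature:18s} spread = {spread[feature]:5.1f} kUSD/year")
```

Job title and experience level dominate, each spanning roughly 60 kUSD/year, while the other features span about 12 kUSD/year or less. Treat the employment_type figure with caution, since its non-full-time categories contain only 5 to 19 test samples each.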
Additional analysis: 2024 vs 2023 year analysis with SHAP values
for col in X_test.columns:
    if col != 'work_year':
        print(col)
        plot_gap(col, main_col="work_year", value1="2024", value2="2023")
experience_level
employment_type
job_title
remote_ratio
company_size
residence_location
Conclusion
Alright everyone, we’ve reached the end of our in-depth analysis of Machine Learning Engineer salaries for 2024. Here’s what you need to know:
Key Points:
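Salary Range: Machine Learning Engineers earn competitive salaries. In our model, the Machine Learning Engineer title adds about 27.6 kUSD/year on top of the average predicted salary of 147.9 kUSD/year, the largest job-title effect in the test set.
Industry Demand: Demand for AI professionals is high across sectors, and pay has risen with it. The model puts 2022 salaries about 6.3 kUSD/year below the average prediction, while 2023 and 2024 sit slightly above it.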
Geographical Impact: Where you live matters! Salaries can vary significantly based on location.
Machine Learning Engineers are set for success in 2024, with attractive salaries and abundant opportunities. Whether you’re just starting out or a seasoned expert, the future looks promising in the AI field.
Thank you for joining us on this journey. Stay tuned for more insights on the ever-changing tech industry!