Machine Learning Project 6: Obesity type Best EDA and classification

Introduction

Hey there! Welcome to the inside scoop on ML Project 6: Obesity Type – Best EDA and Classification! We’re diving deep into the world of data science to tackle a real-life problem: obesity.

We’re not just scratching the surface; we’re going headfirst into understanding the different types of obesity and how we can combat them using some serious data skills.

Imagine this: we’ll be sifting through data like a detective on a case, searching for clues about what causes each type of obesity. And once we have that knowledge, we’ll work our magic with top-notch machine learning techniques to classify obesity types like nobody’s business.

Think of it as a mission to crack the code of obesity and discover ways to help people lead healthier lives. So get ready, because we’re about to embark on an exciting journey through the world of data! Let’s get started!

Dataset Information

This dataset provides information on obesity levels in individuals from Mexico, Peru, and Colombia. It includes data on their eating habits and physical condition. The dataset consists of 17 attributes and 2111 records.

Dataset Link: https://www.kaggle.com/code/diaakotb/obesity-type-eda-and-classification-99-boosting

Each record is labeled with the class variable NObeyesdad (Obesity Level), which takes one of seven values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III.

Approximately 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, while the remaining 23% was collected directly from users through a web platform.

  • Gender: Feature, Categorical, “Gender”
  • Age: Feature, Continuous, “Age”
  • Height: Feature, Continuous
  • Weight: Feature, Continuous
  • family_history_with_overweight: Feature, Binary, “Has a family member suffered or suffers from overweight?”
  • FAVC: Feature, Binary, “Do you eat high caloric food frequently?”
  • FCVC: Feature, Integer, “Do you usually eat vegetables in your meals?”
  • NCP: Feature, Continuous, “How many main meals do you have daily?”
  • CAEC: Feature, Categorical, “Do you eat any food between meals?”
  • SMOKE: Feature, Binary, “Do you smoke?”
  • CH2O: Feature, Continuous, “How much water do you drink daily?”
  • SCC: Feature, Binary, “Do you monitor the calories you eat daily?”
  • FAF: Feature, Continuous, “How often do you have physical activity?”
  • TUE: Feature, Integer, “How much time do you use technological devices such as cell phone, videogames, television, computer and others?”
  • CALC: Feature, Categorical, “How often do you drink alcohol?”
  • MTRANS: Feature, Categorical, “Which transportation do you usually use?”
  • NObeyesdad: Target, Categorical, “Obesity level”

Importing libraries and Reading data

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
from scipy import stats
import warnings
sns.set_style("darkgrid")
data = pd.read_csv("/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv")
data.head()
   Age   Gender  Height  Weight  CALC        FAVC  FCVC  NCP  SCC  SMOKE  CH2O  family_history_with_overweight  FAF  TUE  CAEC        MTRANS                 NObeyesdad
0  21.0  Female  1.62    64.0    no          no    2.0   3.0  no   no     2.0   yes                             0.0  1.0  Sometimes   Public_Transportation  Normal_Weight
1  21.0  Female  1.52    56.0    Sometimes   no    3.0   3.0  yes  yes    3.0   yes                             3.0  0.0  Sometimes   Public_Transportation  Normal_Weight
2  23.0  Male    1.80    77.0    Frequently  no    2.0   3.0  no   no     2.0   yes                             2.0  1.0  Sometimes   Public_Transportation  Normal_Weight
3  27.0  Male    1.80    87.0    Frequently  no    3.0   3.0  no   no     2.0   no                              2.0  0.0  Sometimes   Walking                Overweight_Level_I
4  22.0  Male    1.78    89.8    Sometimes   no    2.0   1.0  no   no     2.0   no                              0.0  0.0  Sometimes   Public_Transportation  Overweight_Level_II
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Age                             2111 non-null   float64
 1   Gender                          2111 non-null   object 
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   CALC                            2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   SCC                             2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  family_history_with_overweight  2111 non-null   object 
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CAEC                            2111 non-null   object 
 15  MTRANS                          2111 non-null   object 
 16  NObeyesdad                      2111 non-null   object 
dtypes: float64(8), object(9)
memory usage: 280.5+ KB

Data wrangling

data.duplicated().sum()
24
data.loc[data.duplicated(keep=False), :]
     Age   Gender  Height  Weight  CALC        FAVC  FCVC  NCP  SCC  SMOKE  CH2O  family_history_with_overweight  FAF  TUE  CAEC        MTRANS                 NObeyesdad
97   21.0  Female  1.52    42.0    Sometimes   no    3.0   1.0  no   no     1.0   no                              0.0  0.0  Frequently  Public_Transportation  Insufficient_Weight
98   21.0  Female  1.52    42.0    Sometimes   no    3.0   1.0  no   no     1.0   no                              0.0  0.0  Frequently  Public_Transportation  Insufficient_Weight
105  25.0  Female  1.57    55.0    Sometimes   yes   2.0   1.0  no   no     2.0   no                              2.0  0.0  Sometimes   Public_Transportation  Normal_Weight
106  25.0  Female  1.57    55.0    Sometimes   yes   2.0   1.0  no   no     2.0   no                              2.0  0.0  Sometimes   Public_Transportation  Normal_Weight
145  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
174  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
179  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
184  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
208  22.0  Female  1.69    65.0    Sometimes   yes   2.0   3.0  no   no     2.0   yes                             1.0  1.0  Sometimes   Public_Transportation  Normal_Weight
209  22.0  Female  1.69    65.0    Sometimes   yes   2.0   3.0  no   no     2.0   yes                             1.0  1.0  Sometimes   Public_Transportation  Normal_Weight
282  18.0  Female  1.62    55.0    no          yes   2.0   3.0  no   no     1.0   yes                             1.0  1.0  Frequently  Public_Transportation  Normal_Weight
295  16.0  Female  1.66    58.0    no          no    2.0   1.0  no   no     1.0   no                              0.0  1.0  Sometimes   Walking                Normal_Weight
309  16.0  Female  1.66    58.0    no          no    2.0   1.0  no   no     1.0   no                              0.0  1.0  Sometimes   Walking                Normal_Weight
443  18.0  Male    1.72    53.0    Sometimes   yes   2.0   3.0  no   no     2.0   yes                             0.0  2.0  Sometimes   Public_Transportation  Insufficient_Weight
460  18.0  Female  1.62    55.0    no          yes   2.0   3.0  no   no     1.0   yes                             1.0  1.0  Frequently  Public_Transportation  Normal_Weight
466  22.0  Male    1.74    75.0    no          yes   3.0   3.0  no   no     1.0   yes                             1.0  0.0  Frequently  Automobile             Normal_Weight
467  22.0  Male    1.74    75.0    no          yes   3.0   3.0  no   no     1.0   yes                             1.0  0.0  Frequently  Automobile             Normal_Weight
496  18.0  Male    1.72    53.0    Sometimes   yes   2.0   3.0  no   no     2.0   yes                             0.0  2.0  Sometimes   Public_Transportation  Insufficient_Weight
523  21.0  Female  1.52    42.0    Sometimes   yes   3.0   1.0  no   no     1.0   no                              0.0  0.0  Frequently  Public_Transportation  Insufficient_Weight
527  21.0  Female  1.52    42.0    Sometimes   yes   3.0   1.0  no   no     1.0   no                              0.0  0.0  Frequently  Public_Transportation  Insufficient_Weight
659  21.0  Female  1.52    42.0    Sometimes   yes   3.0   1.0  no   no     1.0   no                              0.0  0.0  Frequently  Public_Transportation  Insufficient_Weight
663  21.0  Female  1.52    42.0    Sometimes   yes   3.0   1.0  no   no     1.0   no                              0.0  0.0  Frequently  Public_Transportation  Insufficient_Weight
763  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
764  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
824  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
830  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
831  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
832  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
833  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
834  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
921  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
922  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
923  21.0  Male    1.62    70.0    Sometimes   yes   2.0   1.0  no   no     3.0   no                              1.0  0.0  no          Public_Transportation  Overweight_Level_I
  • There is no way to tell whether these rows are genuine duplicate entries or different people who gave identical answers, so we keep them
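Before deciding to keep repeated rows, it can help to see how many distinct row patterns repeat and how often. The sketch below uses a small made-up frame (the `toy` values are illustrative, not taken from the dataset) to show the difference between the count returned by `duplicated().sum()` and the size of each group of identical rows.

```python
import pandas as pd

# Toy frame standing in for the obesity data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [21, 21, 25, 21, 30],
    "Gender": ["Male", "Male", "Female", "Male", "Female"],
    "Weight": [70, 70, 55, 70, 62],
})

# duplicated() flags every repeat after the first occurrence of a row.
n_extra = toy.duplicated().sum()

# Size of each group of identical rows, keeping only patterns seen more than once.
group_sizes = toy.groupby(list(toy.columns)).size()
repeated = group_sizes[group_sizes > 1]

print(n_extra)   # rows beyond the first copy
print(repeated)  # one line per repeated pattern, with its count
```

On the real frame the same two lines would show whether the 24 flagged rows come from a handful of heavily repeated patterns or from many separate pairs.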
data = data.rename(columns={"CALC":"alcohol_drinking_frequency",
            "FAVC":"high_calorie_food_eat",
            "FCVC":"vegetable_eat_daily",
            "NCP":"number_of_meals_daily",
            "SCC":"calories_monitoring",
            "CH2O":"water_drinking_daily",
            "FAF":"physical_activity_daily",
            "TUE":"electronics_usage_daily",
            "CAEC":"food_between_meals",
            "MTRANS":"method_of_transportion"})
for col in ['Age', 'Weight', 'vegetable_eat_daily','number_of_meals_daily', 'water_drinking_daily','physical_activity_daily','electronics_usage_daily']:
    data[col] = data.loc[:,col].round().astype(int)
data.describe()
       Age          Height       Weight       vegetable_eat_daily  number_of_meals_daily  water_drinking_daily  physical_activity_daily  electronics_usage_daily
count  2111.000000  2111.000000  2111.000000  2111.000000          2111.000000            2111.000000           2111.000000              2111.000000
mean   24.315964    1.701677     86.586452    2.423496             2.687826               2.014685              1.006632                 0.664614
std    6.357078     0.093305     26.190136    0.583905             0.809680               0.688616              0.895462                 0.674009
min    14.000000    1.450000     39.000000    1.000000             1.000000               1.000000              0.000000                 0.000000
25%    20.000000    1.630000     65.500000    2.000000             3.000000               2.000000              0.000000                 0.000000
50%    23.000000    1.700499     83.000000    2.000000             3.000000               2.000000              1.000000                 1.000000
75%    26.000000    1.768464     107.000000   3.000000             3.000000               2.000000              2.000000                 1.000000
max    61.000000    1.980000     173.000000   3.000000             4.000000               3.000000              3.000000                 2.000000

Univariate analysis

plt.figure(figsize=(18,15))
for i,col in enumerate(data.select_dtypes(include="object").columns[:-1]):
    plt.subplot(4,2,i+1)
    sns.countplot(data=data,x=col,palette=sns.color_palette("Set2"))
data["NObeyesdad"].value_counts().sort_values(ascending=False).plot(kind="bar",color="red")
<Axes: xlabel='NObeyesdad'>
plt.figure(figsize=(18,15))
for i,col in enumerate(data.select_dtypes(include="number").columns[:3]):
    plt.subplot(4,2,i+1)
    sns.boxplot(data=data,x=col,palette=sns.color_palette("Set2"))
  • We can see there are a lot of outliers in the Age column; we can reduce them by dropping rows whose age is 2 or more standard deviations away from the mean
data=data[np.abs(stats.zscore(data["Age"])) < 2].reset_index(drop=True)
sns.boxplot(data=data,x="Age")
<Axes: xlabel='Age'>
data.shape
(1981, 17)
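The z-score filter above removed 130 rows (2111 down to 1981). As a rough illustration of what the threshold does, here is a minimal sketch on made-up ages (the `ages` values are an assumption for demonstration only). It also checks that `stats.zscore` matches the hand-written formula using the population standard deviation.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy ages (made-up values): most are in the twenties, one is much older.
ages = pd.Series([19, 20, 21, 21, 22, 23, 23, 24, 25, 26, 27, 61], name="Age")

# z-score = (value - mean) / std; scipy uses the population std (ddof=0) by default.
z = np.abs(stats.zscore(ages))
manual_z = np.abs((ages - ages.mean()) / ages.std(ddof=0))

# Keep only the rows that lie within 2 standard deviations of the mean.
kept = ages[z < 2]
print(len(ages), "->", len(kept))
```

A threshold of 2 is fairly strict; a cutoff of 3 is also common, so the choice is a judgment call rather than a rule.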
plt.figure(figsize=(18,15))
for i,col in enumerate(data.select_dtypes(include="number").columns[3:]):
    plt.subplot(4,2,i+1)
    sns.countplot(data=data,x=col)

Multivariate analysis

How is obesity type affected by eating high calorie food?

data.groupby(['NObeyesdad', 'high_calorie_food_eat'])["high_calorie_food_eat"].count()
NObeyesdad           high_calorie_food_eat
Insufficient_Weight  no                        51
                     yes                      220
Normal_Weight        no                        75
                     yes                      206
Obesity_Type_I       no                         9
                     yes                      284
Obesity_Type_II      no                         6
                     yes                      268
Obesity_Type_III     no                         1
                     yes                      323
Overweight_Level_I   no                        20
                     yes                      256
Overweight_Level_II  no                        71
                     yes                      191
Name: high_calorie_food_eat, dtype: int64
plt.figure(figsize=(10,7))
sns.countplot(data=data,x=data.NObeyesdad,hue=data.high_calorie_food_eat,palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()
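  • High-calorie food does not seem to separate the obesity types from each other, but it does relate to being above normal weight: about 27% of the Normal_Weight group answered "no" (75 of 281), against about 3% or less in each obesity type. Overweight_Level_II is an exception, with 71 of 262 answering "no"
  • In obesity type 3 almost no one avoids high-calorie food according to this data (1 of 324), and the share is also very low in type 2 (6 of 274)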
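Because the classes have different sizes, raw counts can be misleading; shares per class are easier to compare. The sketch below rebuilds the counts from the groupby output above and computes the share of "yes" answers per class. On the original frame, `pd.crosstab(data.NObeyesdad, data.high_calorie_food_eat, normalize="index")` should give the same proportions.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame(
    {"no": [51, 75, 9, 6, 1, 20, 71],
     "yes": [220, 206, 284, 268, 323, 256, 191]},
    index=["Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
           "Obesity_Type_II", "Obesity_Type_III", "Overweight_Level_I",
           "Overweight_Level_II"],
)

# Share of each class that reports eating high-calorie food frequently.
share_yes = (counts["yes"] / counts.sum(axis=1)).round(3)
print(share_yes.sort_values())
```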

Average age of each obesity type

data.groupby("NObeyesdad")["Age"].median()
NObeyesdad
Insufficient_Weight    19.0
Normal_Weight          21.0
Obesity_Type_I         23.0
Obesity_Type_II        27.0
Obesity_Type_III       25.0
Overweight_Level_I     21.0
Overweight_Level_II    23.0
Name: Age, dtype: float64
data.groupby("NObeyesdad")["Age"].median().sort_values(ascending=False).plot(kind="bar",color = sns.color_palette("Set2"))
plt.title("Average age of each obesity type")
Text(0.5, 1.0, 'Average age of each obesity type')
  • The median age is highest in obesity type 2 (27), followed by type 3 (25) and type 1 (23)
  • The median age is lowest in insufficient weight (19)
  • So weight class tends to rise with age in this data, although the trend is not strict: type 3 is younger than type 2

Average weight of each obesity type

data.groupby("NObeyesdad")["Weight"].mean()
NObeyesdad
Insufficient_Weight     49.926199
Normal_Weight           62.106762
Obesity_Type_I          94.819113
Obesity_Type_II        115.306569
Obesity_Type_III       120.972222
Overweight_Level_I      74.510870
Overweight_Level_II     82.045802
Name: Weight, dtype: float64
data.groupby("NObeyesdad")["Weight"].mean().sort_values(ascending=False).plot(kind="bar",color=sns.color_palette("Set2"))
<Axes: xlabel='NObeyesdad'>
  • As expected, obesity type 3 has the highest average weight, followed by type 2 and then type 1

Does gender affect obesity type?

plt.figure(figsize=(10,7))
sns.countplot(data=data,x="NObeyesdad",hue="Gender",palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()
  • Males outnumber females in almost all obesity types except obesity type 3
  • Females are more likely to have insufficient weight
  • Females are more likely to have severe obesity (type 3)

Does eating food between meals affect obesity type?

plt.figure(figsize=(10,7))
sns.countplot(data=data,x="NObeyesdad",hue="food_between_meals",palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()
  • Most people eat between meals sometimes
  • People with insufficient or normal weight are the ones who most often eat between meals frequently
  • This suggests that frequent small snacks are associated with lower weight, although the plot alone cannot show that snacking causes the lower weight

Does family history with overweight affect obesity type?

plt.figure(figsize=(10,7))
sns.countplot(data=data,x="NObeyesdad",hue="family_history_with_overweight",palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()
  • A family history of overweight seems to go with higher weight: people in obesity types 1, 2, and 3 seem to nearly all have a family history of overweight

Do people who drink also smoke?

sns.countplot(data=data,x=data.alcohol_drinking_frequency,hue=data.SMOKE)
<Axes: xlabel='alcohol_drinking_frequency', ylabel='count'>
  • No, most of the people who drink alcohol don’t smoke

Data preprocessing and splitting data

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from  xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MaxAbsScaler,RobustScaler
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split,cross_val_score
lgbm_settings = {'n_estimators': 137,
 'num_leaves': 16,
 'min_child_samples': 2,
 'learning_rate': 0.11333885880532285,
 'colsample_bytree': 0.7557376218643025,
 'reg_alpha': 0.0013323317789643257,
 'reg_lambda': 0.0018596588413880056,
 'n_jobs': -1,
 'max_bin': 511,
 'verbose': -1}
data.select_dtypes(include="object").columns
Index(['Gender', 'alcohol_drinking_frequency', 'high_calorie_food_eat',
       'calories_monitoring', 'SMOKE', 'family_history_with_overweight',
       'food_between_meals', 'method_of_transportion', 'NObeyesdad'],
      dtype='object')

Encoding ordinal features using label encoder

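encoder = LabelEncoder()
model_data = data.copy()
# Note: LabelEncoder assigns codes in alphabetical order of the labels,
# so the integer codes do not follow the frequency order of the answers
for col in ['alcohol_drinking_frequency','food_between_meals','NObeyesdad']:
    model_data[col] = encoder.fit_transform(model_data[col])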
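LabelEncoder sorts labels alphabetically, which is why "no" ends up as 3 and "Frequently" as 1 in the encoded frame. If the order of the levels should be preserved, one option is an explicit mapping. This is a minimal sketch, not the notebook's method: the ordering in `frequency_order` is an assumption about what the labels mean, and `toy` is a made-up frame.

```python
import pandas as pd

# The four frequency levels found in CALC/CAEC, listed from lowest to highest.
# This ordering is an assumption about the intended meaning of the labels.
frequency_order = ["no", "Sometimes", "Frequently", "Always"]
frequency_map = {level: rank for rank, level in enumerate(frequency_order)}

toy = pd.DataFrame({"food_between_meals": ["Sometimes", "no", "Always", "Frequently"]})
toy["food_between_meals_ord"] = toy["food_between_meals"].map(frequency_map)

# For comparison: alphabetical codes, which is what LabelEncoder would assign.
alphabetical = {level: i for i, level in enumerate(sorted(frequency_order))}
print(frequency_map)
print(alphabetical)
```

Tree-based models can often cope with either coding, so this mainly matters for interpretability and for models that treat the codes as magnitudes.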

Encoding nominal data using pd dummies

cols = model_data.select_dtypes(include="object").columns
dums = pd.get_dummies(model_data[cols],dtype=int)
model_data = pd.concat([model_data,dums],axis=1).drop(columns=cols)
model_data.head()
(one line per column; 20 of the 26 columns are shown)
column                                         0     1     2     3     4
Age                                            21    21    23    27    22
Height                                         1.62  1.52  1.80  1.80  1.78
Weight                                         64    56    77    87    90
alcohol_drinking_frequency                     3     2     1     1     2
vegetable_eat_daily                            2     3     2     3     2
number_of_meals_daily                          3     3     3     3     1
water_drinking_daily                           2     3     2     2     2
physical_activity_daily                        0     3     2     2     0
electronics_usage_daily                        1     0     1     0     0
food_between_meals                             2     2     2     2     2
calories_monitoring_yes                        0     1     0     0     0
SMOKE_no                                       1     0     1     1     1
SMOKE_yes                                      0     1     0     0     0
family_history_with_overweight_no              0     0     0     1     1
family_history_with_overweight_yes             1     1     1     0     0
method_of_transportion_Automobile              0     0     0     0     0
method_of_transportion_Bike                    0     0     0     0     0
method_of_transportion_Motorbike               0     0     0     0     0
method_of_transportion_Public_Transportation   1     1     1     0     1
method_of_transportion_Walking                 0     0     0     1     0

5 rows × 26 columns
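Each yes/no column above produces two indicator columns that carry the same information (for example SMOKE_no and SMOKE_yes). A possible variation, shown here on a made-up frame, is `drop_first=True`, which keeps one indicator per binary column. This is only a sketch of an alternative; the notebook keeps both columns, which is usually harmless for tree models.

```python
import pandas as pd

# Toy frame with two binary columns (values are illustrative only).
toy = pd.DataFrame({
    "SMOKE": ["no", "yes", "no"],
    "Gender": ["Female", "Male", "Male"],
})

# Default behaviour: one indicator column per category.
full = pd.get_dummies(toy, dtype=int)

# drop_first=True drops the first category of each column.
compact = pd.get_dummies(toy, dtype=int, drop_first=True)

print(list(full.columns))
print(list(compact.columns))
```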

Correlation between data attributes

corr_data =data.copy()
encoder  =LabelEncoder()
for col in corr_data.select_dtypes(include="object").columns:
    corr_data[col] =encoder.fit_transform(corr_data[col])
plt.figure(figsize=(16,13))
sns.heatmap(data=corr_data.corr(),annot=True)
<Axes: >

Normalizing data using max absolute scaler

x= model_data.drop(columns="NObeyesdad")
y=model_data["NObeyesdad"]
scaler_mas = MaxAbsScaler()
for col in x.columns:
    scaler_mas.fit(x[[col]])
    x[col] = scaler_mas.transform (x[[col]])
x.head()
(one line per column; 20 of the 25 columns are shown)
column                                         0         1         2         3         4
Age                                            0.567568  0.567568  0.621622  0.729730  0.594595
Height                                         0.818182  0.767677  0.909091  0.909091  0.898990
Weight                                         0.369942  0.323699  0.445087  0.502890  0.520231
alcohol_drinking_frequency                     1.000000  0.666667  0.333333  0.333333  0.666667
vegetable_eat_daily                            0.666667  1.000000  0.666667  1.000000  0.666667
number_of_meals_daily                          0.75      0.75      0.75      0.75      0.25
water_drinking_daily                           0.666667  1.000000  0.666667  0.666667  0.666667
physical_activity_daily                        0.000000  1.000000  0.666667  0.666667  0.000000
electronics_usage_daily                        0.5       0.0       0.5       0.0       0.0
food_between_meals                             0.666667  0.666667  0.666667  0.666667  0.666667
calories_monitoring_yes                        0.0       1.0       0.0       0.0       0.0
SMOKE_no                                       1.0       0.0       1.0       1.0       1.0
SMOKE_yes                                      0.0       1.0       0.0       0.0       0.0
family_history_with_overweight_no              0.0       0.0       0.0       1.0       1.0
family_history_with_overweight_yes             1.0       1.0       1.0       0.0       0.0
method_of_transportion_Automobile              0.0       0.0       0.0       0.0       0.0
method_of_transportion_Bike                    0.0       0.0       0.0       0.0       0.0
method_of_transportion_Motorbike               0.0       0.0       0.0       0.0       0.0
method_of_transportion_Public_Transportation   1.0       1.0       1.0       0.0       1.0
method_of_transportion_Walking                 0.0       0.0       0.0       1.0       0.0

5 rows × 25 columns
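The scaler above is fitted on the full dataset before the train/test split, so the test rows influence the scale. MaxAbsScaler already works column by column, so the per-column loop is not required either. A common alternative is to split first and fit on the training part only. The sketch below uses synthetic data (`X` and `y` are random stand-ins, not the obesity features) to show that pattern.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(50, 3))   # synthetic features, not the real data
y = rng.integers(0, 2, size=50)         # synthetic labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Fit on the training split only, then reuse the same scale for the test split.
scaler = MaxAbsScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.max(axis=0))  # largest value per training column after scaling
```

For tree-based models the effect on accuracy is likely small, since they are insensitive to monotonic rescaling, but the habit avoids leakage with other model types.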

Splitting data and training models

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2,random_state=7)
model_lgbm  = LGBMClassifier(**lgbm_settings)
model_xgb = XGBClassifier(objective="multi:softmax",num_class = 7)
model_gb = GradientBoostingClassifier(max_depth=9,min_samples_leaf=3,min_samples_split=13,subsample=0.751)
model_rfc = RandomForestClassifier()
models = [model_lgbm,model_xgb,model_gb,model_rfc]
for model in models:
    model.fit(x_train,y_train)

Evaluating models

for model in models:
    model_name = type(model).__name__
    print(f"score for {model_name} on train data: {model.score(x_train,y_train)}")
score for LGBMClassifier on train data: 1.0
score for XGBClassifier on train data: 1.0
score for GradientBoostingClassifier on train data: 1.0
score for RandomForestClassifier on train data: 1.0
for model in models:
    model_name = type(model).__name__
    print(f"score for {model_name} on test data: {model.score(x_test,y_test)}")
score for LGBMClassifier on test data: 0.9899244332493703
score for XGBClassifier on test data: 0.982367758186398
score for GradientBoostingClassifier on test data: 0.9773299748110831
score for RandomForestClassifier on test data: 0.947103274559194
print("scores of each model using kfold validation:-\n\n")
for model in models:
    score = cross_val_score(model,x,y,cv=10)
    avg = np.mean(score)
    model_name = type(model).__name__
    print(f"scores for {model_name}:{score}")
    print(f"average score for {model_name}:{avg}\n")
scores of each model using kfold validation:-


scores for LGBMClassifier:[0.93969849 0.93434343 0.98989899 0.96969697 0.98484848 0.99494949
 0.97474747 0.99494949 0.97474747 0.98484848]
average score for LGBMClassifier:0.9742728795492613

scores for XGBClassifier:[0.91457286 0.93434343 0.97474747 0.97979798 0.97474747 0.98989899
 0.97474747 0.98989899 0.97979798 0.97979798]
average score for XGBClassifier:0.9692350642099384

scores for GradientBoostingClassifier:[0.90452261 0.94949495 0.97979798 0.97474747 0.97979798 0.99494949
 0.98989899 0.98484848 0.97979798 0.97979798]
average score for GradientBoostingClassifier:0.971765392619664

scores for RandomForestClassifier:[0.73869347 0.82323232 0.96464646 0.95454545 0.97979798 0.97979798
 0.96969697 0.96969697 0.97979798 0.98484848]
average score for RandomForestClassifier:0.9344754073397288
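The first folds score noticeably lower than the rest (0.74 and 0.82 for random forest). One possible cause is row order: with an integer `cv`, `cross_val_score` uses stratified folds without shuffling, so each fold is a contiguous slice of each class. A minimal sketch of shuffled stratified folds follows; the data is synthetic and `DecisionTreeClassifier` is only a lightweight stand-in for the models above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: 3 classes, 60 rows each, stored in class order.
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(60, 4)) for c in range(3)])
y = np.repeat(np.arange(3), 60)

# Shuffle before splitting so every fold mixes rows from the whole file.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores.mean(), scores.std())
```

Passing the same `cv` object in place of `cv=10` in the loop above would show whether the weak early folds are an ordering effect.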

for model in models:
    y_predicted = model.predict(x_test)
    model_name = type(model).__name__
    print(f"Report:{model_name}")
    print(classification_report(y_test,y_predicted))
    
Report:LGBMClassifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        46
           1       0.96      1.00      0.98        65
           2       1.00      0.99      0.99        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       1.00      1.00      1.00        40

    accuracy                           0.99       397
   macro avg       0.99      0.99      0.99       397
weighted avg       0.99      0.99      0.99       397

Report:XGBClassifier
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        46
           1       0.95      0.95      0.95        65
           2       1.00      0.99      0.99        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       0.95      1.00      0.98        40

    accuracy                           0.98       397
   macro avg       0.98      0.98      0.98       397
weighted avg       0.98      0.98      0.98       397

Report:GradientBoostingClassifier
              precision    recall  f1-score   support

           0       0.96      0.98      0.97        46
           1       0.94      0.95      0.95        65
           2       0.99      0.99      0.99        67
           3       1.00      0.98      0.99        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       0.98      1.00      0.99        40

    accuracy                           0.98       397
   macro avg       0.98      0.98      0.98       397
weighted avg       0.98      0.98      0.98       397

Report:RandomForestClassifier
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        46
           1       0.86      0.91      0.88        65
           2       0.97      0.97      0.97        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.92      0.87      0.89        63
           6       0.95      0.88      0.91        40

    accuracy                           0.95       397
   macro avg       0.95      0.95      0.95       397
weighted avg       0.95      0.95      0.95       397

plt.figure(figsize=(14,10))
for i,model in enumerate(models):
    plt.subplot(2,2,i+1)
    y_predicted = model.predict(x_test)
    model_name = type(model).__name__
    cm = confusion_matrix(y_test, y_predicted)
    sns.heatmap(cm, annot=True,fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Truth')
    plt.title(f"{model_name} confusion matrix")
# Call show() once, after the loop, so all four matrices share one 2x2 figure
plt.tight_layout()
plt.show()

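  • LightGBM is the best model, achieving 99% test accuracy and an average accuracy of 97% in 10-fold cross-validation

  • XGBoost and gradient boosting are almost the same, with about 98% test accuracy and about 97% average accuracy in cross-validation

  • Random forest achieved the worst results, with about 95% test accuracy and about 93% average accuracy in cross-validation. All four models score 100% on the training data, so random forest has the largest gap between training and test accuracy, which suggests it overfits the most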

Obesity Levels-Multi Head Attention-Hyperparameter

import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv
train_df=pd.read_csv("/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv")
train_df
      Age        Gender  Height    Weight      CALC        FAVC  FCVC  NCP  SCC  SMOKE  CH2O      family_history_with_overweight  FAF       TUE       CAEC       MTRANS                 NObeyesdad
0     21.000000  Female  1.620000  64.000000   no          no    2.0   3.0  no   no     2.000000  yes                             0.000000  1.000000  Sometimes  Public_Transportation  Normal_Weight
1     21.000000  Female  1.520000  56.000000   Sometimes   no    3.0   3.0  yes  yes    3.000000  yes                             3.000000  0.000000  Sometimes  Public_Transportation  Normal_Weight
2     23.000000  Male    1.800000  77.000000   Frequently  no    2.0   3.0  no   no     2.000000  yes                             2.000000  1.000000  Sometimes  Public_Transportation  Normal_Weight
3     27.000000  Male    1.800000  87.000000   Frequently  no    3.0   3.0  no   no     2.000000  no                              2.000000  0.000000  Sometimes  Walking                Overweight_Level_I
4     22.000000  Male    1.780000  89.800000   Sometimes   no    2.0   1.0  no   no     2.000000  no                              0.000000  0.000000  Sometimes  Public_Transportation  Overweight_Level_II
...
2106  20.976842  Female  1.710730  131.408528  Sometimes   yes   3.0   3.0  no   no     1.728139  yes                             1.676269  0.906247  Sometimes  Public_Transportation  Obesity_Type_III
2107  21.982942  Female  1.748584  133.742943  Sometimes   yes   3.0   3.0  no   no     2.005130  yes                             1.341390  0.599270  Sometimes  Public_Transportation  Obesity_Type_III
2108  22.524036  Female  1.752206  133.689352  Sometimes   yes   3.0   3.0  no   no     2.054193  yes                             1.414209  0.646288  Sometimes  Public_Transportation  Obesity_Type_III
2109  24.361936  Female  1.739450  133.346641  Sometimes   yes   3.0   3.0  no   no     2.852339  yes                             1.139107  0.586035  Sometimes  Public_Transportation  Obesity_Type_III
2110  23.664709  Female  1.738836  133.472641  Sometimes   yes   3.0   3.0  no   no     2.863513  yes                             1.026452  0.714137  Sometimes  Public_Transportation  Obesity_Type_III

2111 rows × 17 columns

unique_values = {}
mythreshold=7
# Columns with at most `mythreshold` distinct values are treated as categorical
for i, column in enumerate(train_df.columns, 1):
    unique_values[column] = train_df[column].unique()
    if train_df[column].nunique() <= mythreshold:
        print(f"{i}. Column:\" {column} \" , Categorical, Data type:{train_df[column].dtype} , Unique:{unique_values[column]}, Unique count:{train_df[column].nunique()}")
    else:
        print(f"{i}. Column:\" {column} \" , Numeric, Data type:{train_df[column].dtype} , Min:{train_df[column].min()}, Max:{train_df[column].max()}, Unique count:{train_df[column].nunique()}")
1. Column:" Age " , Numeric, Data type:float64 , Min:14.0, Max:61.0, Unique count:1402
2. Column:" Gender " , Categorical, Data type:object , Unique:['Female' 'Male'], Unique count:2
3. Column:" Height " , Numeric, Data type:float64 , Min:1.45, Max:1.98, Unique count:1574
4. Column:" Weight " , Numeric, Data type:float64 , Min:39.0, Max:173.0, Unique count:1525
5. Column:" CALC " , Categorical, Data type:object , Unique:['no' 'Sometimes' 'Frequently' 'Always'], Unique count:4
6. Column:" FAVC " , Categorical, Data type:object , Unique:['no' 'yes'], Unique count:2
7. Column:" FCVC " , Numeric, Data type:float64 , Min:1.0, Max:3.0, Unique count:810
8. Column:" NCP " , Numeric, Data type:float64 , Min:1.0, Max:4.0, Unique count:635
9. Column:" SCC " , Categorical, Data type:object , Unique:['no' 'yes'], Unique count:2
10. Column:" SMOKE " , Categorical, Data type:object , Unique:['no' 'yes'], Unique count:2
11. Column:" CH2O " , Numeric, Data type:float64 , Min:1.0, Max:3.0, Unique count:1268
12. Column:" family_history_with_overweight " , Categorical, Data type:object , Unique:['yes' 'no'], Unique count:2
13. Column:" FAF " , Numeric, Data type:float64 , Min:0.0, Max:3.0, Unique count:1190
14. Column:" TUE " , Numeric, Data type:float64 , Min:0.0, Max:2.0, Unique count:1129
15. Column:" CAEC " , Categorical, Data type:object , Unique:['Sometimes' 'Frequently' 'Always' 'no'], Unique count:4
16. Column:" MTRANS " , Categorical, Data type:object , Unique:['Public_Transportation' 'Walking' 'Automobile' 'Motorbike' 'Bike'], Unique count:5
17. Column:" NObeyesdad " , Categorical, Data type:object , Unique:['Normal_Weight' 'Overweight_Level_I' 'Overweight_Level_II'
 'Obesity_Type_I' 'Insufficient_Weight' 'Obesity_Type_II'
 'Obesity_Type_III'], Unique count:7
numeric_features = []
categorical_features = []
for column in train_df.columns:
    if train_df[column].nunique() <= mythreshold:
        categorical_features.append(column)
    else:
        numeric_features.append(column)
print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)
Numeric features: ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
Categorical features: ['Gender', 'CALC', 'FAVC', 'SCC', 'SMOKE', 'family_history_with_overweight', 'CAEC', 'MTRANS', 'NObeyesdad']
numerik_df=train_df.drop(categorical_features,axis=1)
numerik_df
      Age        Height    Weight      FCVC  NCP  CH2O      FAF       TUE
0     21.000000  1.620000  64.000000   2.0   3.0  2.000000  0.000000  1.000000
1     21.000000  1.520000  56.000000   3.0   3.0  3.000000  3.000000  0.000000
2     23.000000  1.800000  77.000000   2.0   3.0  2.000000  2.000000  1.000000
3     27.000000  1.800000  87.000000   3.0   3.0  2.000000  2.000000  0.000000
4     22.000000  1.780000  89.800000   2.0   1.0  2.000000  0.000000  0.000000
...
2106  20.976842  1.710730  131.408528  3.0   3.0  1.728139  1.676269  0.906247
2107  21.982942  1.748584  133.742943  3.0   3.0  2.005130  1.341390  0.599270
2108  22.524036  1.752206  133.689352  3.0   3.0  2.054193  1.414209  0.646288
2109  24.361936  1.739450  133.346641  3.0   3.0  2.852339  1.139107  0.586035
2110  23.664709  1.738836  133.472641  3.0   3.0  2.863513  1.026452  0.714137

2111 rows × 8 columns

numerik_df.columns
Index(['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE'], dtype='object')
import matplotlib.pyplot as plt
num_cols = 2
num_rows = (len(numerik_df.columns) + num_cols - 1) // num_cols
default_subplot_size = (8, 6)
fig_width = default_subplot_size[0] * num_cols
fig_height = default_subplot_size[1] * num_rows
fig, axes = plt.subplots(num_rows, num_cols, figsize=(fig_width, fig_height))
for i, column in enumerate(numerik_df.columns):
    row_index = i // num_cols
    col_index = i % num_cols
    axes[row_index, col_index].hist(numerik_df[column], bins=10, color='skyblue', edgecolor='black')
    axes[row_index, col_index].set_title(f'Distribution of {column} before scaling')
    axes[row_index, col_index].set_xlabel(column)
    axes[row_index, col_index].set_ylabel('Frequency')
    axes[row_index, col_index].grid(True)
if len(numerik_df.columns) % num_cols != 0:
    fig.delaxes(axes[num_rows-1, num_cols-1])
plt.tight_layout()
plt.show()
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerik_scaled = scaler.fit_transform(numerik_df)
numerik_scaled.shape
(2111, 8)
numerik_scaled
array([[-0.52212439, -0.87558934, -0.86255819, ..., -0.01307326,
        -1.18803911,  0.56199675],
       [-0.52212439, -1.94759928, -1.16807699, ...,  1.61875854,
         2.33975012, -1.08062463],
       [-0.20688898,  1.05402854, -0.36609013, ..., -0.01307326,
         1.16382038,  0.56199675],
       ...,
       [-0.28190933,  0.54167211,  1.79886776, ...,  0.0753606 ,
         0.47497132, -0.01901815],
       [ 0.00777624,  0.40492652,  1.78577968, ...,  1.37780063,
         0.15147069, -0.11799101],
       [-0.10211908,  0.39834438,  1.7905916 , ...,  1.39603472,
         0.01899633,  0.09243207]])
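As a rough check on what `StandardScaler` computes, each value is shifted by the column mean and divided by the population standard deviation. The sketch below applies that formula by hand to the five `Age` values visible at the top of `numerik_df`. It uses only those five rows, so the resulting z-scores differ from the notebook's output, which is fitted on all 2111 rows.

```python
from statistics import fmean, pstdev

# Age values from the first five rows shown in numerik_df (illustration only).
ages = [21.0, 21.0, 23.0, 27.0, 22.0]

mean = fmean(ages)
std = pstdev(ages)  # population standard deviation (ddof=0), as StandardScaler uses

# z = (x - mean) / std for every value in the column
z_scores = [(a - mean) / std for a in ages]
print([round(z, 3) for z in z_scores])
```

After scaling, the column has mean 0 and standard deviation 1, which is why the histograms keep their shape and only the axis changes.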
kolom_numerik=numerik_df.columns
numerik_scaled_df=pd.DataFrame(data=numerik_scaled,columns=kolom_numerik)
numerik_scaled_df
            Age     Height     Weight       FCVC        NCP       CH2O        FAF        TUE
0     -0.522124  -0.875589  -0.862558  -0.785019   0.404153  -0.013073  -1.188039   0.561997
1     -0.522124  -1.947599  -1.168077   1.088342   0.404153   1.618759   2.339750  -1.080625
2     -0.206889   1.054029  -0.366090  -0.785019   0.404153  -0.013073   1.163820   0.561997
3      0.423582   1.054029   0.015808   1.088342   0.404153  -0.013073   1.163820  -1.080625
4     -0.364507   0.839627   0.122740  -0.785019  -2.167023  -0.013073  -1.188039  -1.080625
...
2106  -0.525774   0.097045   1.711763   1.088342   0.404153  -0.456705   0.783135   0.407996
2107  -0.367195   0.502844   1.800914   1.088342   0.404153  -0.004702   0.389341  -0.096251
2108  -0.281909   0.541672   1.798868   1.088342   0.404153   0.075361   0.474971  -0.019018
2109   0.007776   0.404927   1.785780   1.088342   0.404153   1.377801   0.151471  -0.117991
2110  -0.102119   0.398344   1.790592   1.088342   0.404153   1.396035   0.018996   0.092432

2111 rows × 8 columns

numerik_scaled_df.columns
Index(['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE'], dtype='object')
num_cols = 2
num_rows = (len(numerik_scaled_df.columns) + num_cols - 1) // num_cols
default_subplot_size = (8, 6)
fig_width = default_subplot_size[0] * num_cols
fig_height = default_subplot_size[1] * num_rows
fig, axes = plt.subplots(num_rows, num_cols, figsize=(fig_width, fig_height))
for i, column in enumerate(numerik_scaled_df.columns):
    row_index = i // num_cols
    col_index = i % num_cols
    axes[row_index, col_index].hist(numerik_scaled_df[column], bins=10, color='skyblue', edgecolor='black')
    axes[row_index, col_index].set_title(f'Distribution of {column} after scaling')
    axes[row_index, col_index].set_xlabel(column)
    axes[row_index, col_index].set_ylabel('Frequency')
    axes[row_index, col_index].grid(True)
if len(numerik_scaled_df.columns) % num_cols != 0:
    fig.delaxes(axes[num_rows-1, num_cols-1])
plt.tight_layout()
plt.show()
kategori_d=train_df.drop(numeric_features,axis=1)
kategori_d
      Gender  CALC        FAVC  SCC  SMOKE  family_history_with_overweight  CAEC       MTRANS                 NObeyesdad
0     Female  no          no    no   no     yes                             Sometimes  Public_Transportation  Normal_Weight
1     Female  Sometimes   no    yes  yes    yes                             Sometimes  Public_Transportation  Normal_Weight
2     Male    Frequently  no    no   no     yes                             Sometimes  Public_Transportation  Normal_Weight
3     Male    Frequently  no    no   no     no                              Sometimes  Walking                Overweight_Level_I
4     Male    Sometimes   no    no   no     no                              Sometimes  Public_Transportation  Overweight_Level_II
...
2106  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation  Obesity_Type_III
2107  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation  Obesity_Type_III
2108  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation  Obesity_Type_III
2109  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation  Obesity_Type_III
2110  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation  Obesity_Type_III

2111 rows × 9 columns

kategori_df=kategori_d.drop(["NObeyesdad"],axis=1)
kategori_df
      Gender  CALC        FAVC  SCC  SMOKE  family_history_with_overweight  CAEC       MTRANS
0     Female  no          no    no   no     yes                             Sometimes  Public_Transportation
1     Female  Sometimes   no    yes  yes    yes                             Sometimes  Public_Transportation
2     Male    Frequently  no    no   no     yes                             Sometimes  Public_Transportation
3     Male    Frequently  no    no   no     no                              Sometimes  Walking
4     Male    Sometimes   no    no   no     no                              Sometimes  Public_Transportation
...
2106  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation
2107  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation
2108  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation
2109  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation
2110  Female  Sometimes   yes   no   no     yes                             Sometimes  Public_Transportation

2111 rows × 8 columns

kategori_df.columns
Index(['Gender', 'CALC', 'FAVC', 'SCC', 'SMOKE',
       'family_history_with_overweight', 'CAEC', 'MTRANS'],
      dtype='object')
kategori_encode_df = pd.get_dummies(kategori_df, columns=kategori_df.columns)
kategori_encode_df 
Gender_FemaleGender_MaleCALC_AlwaysCALC_FrequentlyCALC_SometimesCALC_noFAVC_noFAVC_yesSCC_noSCC_yesfamily_history_with_overweight_yesCAEC_AlwaysCAEC_FrequentlyCAEC_SometimesCAEC_noMTRANS_AutomobileMTRANS_BikeMTRANS_MotorbikeMTRANS_Public_TransportationMTRANS_Walking
0TrueFalseFalseFalseFalseTrueTrueFalseTrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
1TrueFalseFalseFalseTrueFalseTrueFalseFalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2FalseTrueFalseTrueFalseFalseTrueFalseTrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
3FalseTrueFalseTrueFalseFalseTrueFalseTrueFalseFalseFalseFalseTrueFalseFalseFalseFalseFalseTrue
4FalseTrueFalseFalseTrueFalseTrueFalseTrueFalseFalseFalseFalseTrueFalseFalseFalseFalseTrueFalse
2106TrueFalseFalseFalseTrueFalseFalseTrueTrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2107TrueFalseFalseFalseTrueFalseFalseTrueTrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2108TrueFalseFalseFalseTrueFalseFalseTrueTrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2109TrueFalseFalseFalseTrueFalseFalseTrueTrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2110TrueFalseFalseFalseTrueFalseFalseTrueTrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse

2111 rows × 23 columns

kategori_encode_df.columns
Index(['Gender_Female', 'Gender_Male', 'CALC_Always', 'CALC_Frequently',
       'CALC_Sometimes', 'CALC_no', 'FAVC_no', 'FAVC_yes', 'SCC_no', 'SCC_yes',
       'SMOKE_no', 'SMOKE_yes', 'family_history_with_overweight_no',
       'family_history_with_overweight_yes', 'CAEC_Always', 'CAEC_Frequently',
       'CAEC_Sometimes', 'CAEC_no', 'MTRANS_Automobile', 'MTRANS_Bike',
       'MTRANS_Motorbike', 'MTRANS_Public_Transportation', 'MTRANS_Walking'],
      dtype='object')
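One-hot encoding every column yields two complementary columns for each yes/no feature (for example `SMOKE_no` and `SMOKE_yes`). A possible variation, not used in the original notebook, is `drop_first=True`, which drops one level per feature. The sketch below compares both on a hypothetical toy frame.

```python
import pandas as pd

# Hypothetical toy frame with one binary and one multi-level feature.
toy = pd.DataFrame({
    "SMOKE": ["no", "yes", "no"],
    "MTRANS": ["Walking", "Bike", "Automobile"],
})

# Full one-hot encoding, as in the notebook.
full = pd.get_dummies(toy, columns=list(toy.columns))

# Variation: drop the first level of each feature to remove the redundancy.
reduced = pd.get_dummies(toy, columns=list(toy.columns), drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

For a neural network the redundant columns are usually harmless. Dropping them matters more for linear models, where perfectly collinear inputs can make coefficients hard to interpret.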
kolom_numerik = list(numerik_df.columns)
kolom_kategorikal = list(kategori_df.columns)
train_df_filtered = train_df.drop(kolom_numerik + kolom_kategorikal, axis=1)
data_preprocessed_df = pd.concat([train_df_filtered, numerik_scaled_df, kategori_encode_df ], axis=1)
data_preprocessed_df
NObeyesdadAgeHeightWeightFCVCNCPCH2OFAFTUEGender_Femalefamily_history_with_overweight_yesCAEC_AlwaysCAEC_FrequentlyCAEC_SometimesCAEC_noMTRANS_AutomobileMTRANS_BikeMTRANS_MotorbikeMTRANS_Public_TransportationMTRANS_Walking
0Normal_Weight-0.522124-0.875589-0.862558-0.7850190.404153-0.013073-1.1880390.561997TrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
1Normal_Weight-0.522124-1.947599-1.1680771.0883420.4041531.6187592.339750-1.080625TrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2Normal_Weight-0.2068891.054029-0.366090-0.7850190.404153-0.0130731.1638200.561997FalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
3Overweight_Level_I0.4235821.0540290.0158081.0883420.404153-0.0130731.163820-1.080625FalseFalseFalseFalseTrueFalseFalseFalseFalseFalseTrue
4Overweight_Level_II-0.3645070.8396270.122740-0.785019-2.167023-0.013073-1.188039-1.080625FalseFalseFalseFalseTrueFalseFalseFalseFalseTrueFalse
2106Obesity_Type_III-0.5257740.0970451.7117631.0883420.404153-0.4567050.7831350.407996TrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2107Obesity_Type_III-0.3671950.5028441.8009141.0883420.404153-0.0047020.389341-0.096251TrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2108Obesity_Type_III-0.2819090.5416721.7988681.0883420.4041530.0753610.474971-0.019018TrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2109Obesity_Type_III0.0077760.4049271.7857801.0883420.4041531.3778010.151471-0.117991TrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2110Obesity_Type_III-0.1021190.3983441.7905921.0883420.4041531.3960350.0189960.092432TrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse

2111 rows × 32 columns

data_preprocessed_df.columns
Index(['NObeyesdad', 'Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF',
       'TUE', 'Gender_Female', 'Gender_Male', 'CALC_Always', 'CALC_Frequently',
       'CALC_Sometimes', 'CALC_no', 'FAVC_no', 'FAVC_yes', 'SCC_no', 'SCC_yes',
       'SMOKE_no', 'SMOKE_yes', 'family_history_with_overweight_no',
       'family_history_with_overweight_yes', 'CAEC_Always', 'CAEC_Frequently',
       'CAEC_Sometimes', 'CAEC_no', 'MTRANS_Automobile', 'MTRANS_Bike',
       'MTRANS_Motorbike', 'MTRANS_Public_Transportation', 'MTRANS_Walking'],
      dtype='object')
X_df = data_preprocessed_df.drop(columns=['NObeyesdad'],axis=1)
y_df = data_preprocessed_df['NObeyesdad']
X_df
AgeHeightWeightFCVCNCPCH2OFAFTUEGender_FemaleGender_Malefamily_history_with_overweight_yesCAEC_AlwaysCAEC_FrequentlyCAEC_SometimesCAEC_noMTRANS_AutomobileMTRANS_BikeMTRANS_MotorbikeMTRANS_Public_TransportationMTRANS_Walking
0-0.522124-0.875589-0.862558-0.7850190.404153-0.013073-1.1880390.561997TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
1-0.522124-1.947599-1.1680771.0883420.4041531.6187592.339750-1.080625TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2-0.2068891.054029-0.366090-0.7850190.404153-0.0130731.1638200.561997FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
30.4235821.0540290.0158081.0883420.404153-0.0130731.163820-1.080625FalseTrueFalseFalseFalseTrueFalseFalseFalseFalseFalseTrue
4-0.3645070.8396270.122740-0.785019-2.167023-0.013073-1.188039-1.080625FalseTrueFalseFalseFalseTrueFalseFalseFalseFalseTrueFalse
2106-0.5257740.0970451.7117631.0883420.404153-0.4567050.7831350.407996TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2107-0.3671950.5028441.8009141.0883420.404153-0.0047020.389341-0.096251TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2108-0.2819090.5416721.7988681.0883420.4041530.0753610.474971-0.019018TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
21090.0077760.4049271.7857801.0883420.4041531.3778010.151471-0.117991TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
2110-0.1021190.3983441.7905921.0883420.4041531.3960350.0189960.092432TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse

2111 rows × 31 columns

y_df
0             Normal_Weight
1             Normal_Weight
2             Normal_Weight
3        Overweight_Level_I
4       Overweight_Level_II
               ...         
2106       Obesity_Type_III
2107       Obesity_Type_III
2108       Obesity_Type_III
2109       Obesity_Type_III
2110       Obesity_Type_III
Name: NObeyesdad, Length: 2111, dtype: object
y_df.unique()
array(['Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II',
       'Obesity_Type_I', 'Insufficient_Weight', 'Obesity_Type_II',
       'Obesity_Type_III'], dtype=object)
jumlah_kelas=y_df.nunique()
jumlah_kelas
7
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y=label_encoder.fit_transform(y_df)
y.shape
(2111,)
np.unique(y)
array([0, 1, 2, 3, 4, 5, 6])
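The integer codes follow the sorted class names, not the order in which the classes first appear in `y_df`. The sketch below mimics that mapping with plain `sorted()` instead of calling `LabelEncoder`, so it is an illustration of the ordering rather than the encoder itself.

```python
# Class names in the order y_df.unique() reports them.
seen_order = [
    "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II",
    "Obesity_Type_I", "Insufficient_Weight", "Obesity_Type_II",
    "Obesity_Type_III",
]

# LabelEncoder assigns integer codes by sorted class name, not by first appearance.
classes = sorted(seen_order)
code_of = {name: code for code, name in enumerate(classes)}

for name, code in code_of.items():
    print(code, name)
```

Note that this ordering is alphabetical, so code 2 (`Obesity_Type_I`) sits between `Normal_Weight` and `Obesity_Type_II` instead of following the severity scale. With the fitted encoder, `label_encoder.inverse_transform` turns predicted codes back into names.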
from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val=train_test_split(X_df ,y,
                    test_size=0.3,random_state=42,stratify=y)
X_train.shape
(1477, 31)
X_train
AgeHeightWeightFCVCNCPCH2OFAFTUEGender_FemaleGender_Malefamily_history_with_overweight_yesCAEC_AlwaysCAEC_FrequentlyCAEC_SometimesCAEC_noMTRANS_AutomobileMTRANS_BikeMTRANS_MotorbikeMTRANS_Public_TransportationMTRANS_Walking
900.108346-0.7683880.2449471.0883421.689740-1.6449051.163820-1.080625TrueFalseFalseTrueFalseFalseFalseFalseFalseFalseTrueFalse
513-0.483801-1.111228-1.5940601.088342-1.2333520.7116640.362036-1.080625TrueFalseFalseFalseTrueFalseFalseFalseFalseFalseTrueFalse
1100-0.813763-0.019932-0.327900-2.477684-0.2498060.721364-1.1880390.112112FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
339-0.837360-1.840398-1.702735-0.7850190.404153-1.6449051.163820-1.080625TrueFalseFalseFalseFalseTrueFalseFalseFalseFalseTrueFalse
612-0.203982-1.253098-1.611971-0.401141-0.7171410.183223-0.017125-1.080625TrueFalseFalseFalseTrueFalseFalseFalseFalseFalseTrueFalse
15670.9785310.8830321.3006750.1499900.4041530.053754-0.201741-0.275980FalseTrueTrueFalseFalseTrueFalseTrueFalseFalseFalseFalse
1336-0.5203711.6577311.206713-0.7850190.4041531.618759-0.2607190.923421FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
609-0.6829240.554043-1.206367-0.7850191.0403251.5806911.1039302.204618FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
1659-0.1846021.5826041.3394201.088342-0.225612-0.513455-0.282915-1.080625FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
237-0.837360-0.661187-1.2826471.0883420.404153-1.644905-0.0121090.561997TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse

1477 rows × 31 columns

X_train.shape[0]
1477
X_train.shape[1]
31
X_val.shape
(634, 31)
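The row counts of the two splits follow directly from `test_size=0.3`. A short arithmetic sketch, assuming scikit-learn rounds the validation count up and assigns the remaining rows to training:

```python
import math

n_samples = 2111
test_size = 0.3

# The validation count is rounded up; the remainder goes to training.
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)
```

These values match the shapes reported for `X_train` and `X_val`.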
X_val
AgeHeightWeightFCVCNCPCH2OFAFTUEGender_FemaleGender_Malefamily_history_with_overweight_yesCAEC_AlwaysCAEC_FrequentlyCAEC_SometimesCAEC_noMTRANS_AutomobileMTRANS_BikeMTRANS_MotorbikeMTRANS_Public_TransportationMTRANS_Walking
3320.4235821.590034-0.442470-0.785019-2.167023-0.013073-0.012109-1.080625FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseFalseTrue
1235-0.1492560.6004720.335144-0.7850190.4041531.6187592.3397502.204618FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
160.4235822.4476410.588656-0.785019-2.167023-1.644905-0.012109-1.080625FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
12140.219669-1.244929-0.238106-0.785019-2.167023-0.013073-1.188039-1.080625TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
521-0.837360-1.473782-1.6990661.088342-1.0172140.7319900.6894220.557726TrueFalseFalseFalseFalseTrueFalseFalseFalseFalseTrueFalse
445-0.837360-2.054800-1.588165-0.7850191.689740-1.6449052.339750-1.080625TrueFalseFalseFalseFalseTrueFalseFalseFalseFalseTrueFalse
15760.4169060.8474631.007138-0.2918660.4041530.137886-1.1880391.936629FalseTrueTrueFalseFalseTrueFalseTrueFalseFalseFalseFalse
1219-0.0367370.3404880.432531-0.7850190.4041531.3636620.3516101.118279FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
17710.2134271.0388061.197146-0.7156260.4041530.7000210.007004-1.047700FalseTrueTrueFalseFalseTrueFalseFalseFalseFalseTrueFalse
50-0.522124-0.982790-1.2253621.0883420.4041531.618759-1.1880390.561997TrueFalseTrueFalseFalseTrueFalseFalseFalseFalseFalseTrue

634 rows × 31 columns

y_train.shape
(1477,)
y_train
array([3, 0, 6, ..., 0, 3, 1])
y_val.shape
(634,)
y_val

Model

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, GlobalMaxPooling1D, Dropout, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers
from sklearn.model_selection import RandomizedSearchCV
from keras.optimizers import Adam

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0, "d_model must be divisible by num_heads"
        self.depth = d_model // self.num_heads
        self.wq = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.wk = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.wv = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.dense = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        
    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
        
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)
        return output, attention_weights
    
    def scaled_dot_product_attention(self, q, k, v, mask):
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)
        return output, attention_weights
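To make the attention step concrete, here is a minimal NumPy sketch of scaled dot-product attention without masking or learned projections. It is an illustration of the formula, not the Keras layer above, and the input arrays are random placeholders. The second call uses a sequence of length 1, which is the shape a flat feature vector takes after `split_heads` reshapes it.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for arrays shaped (..., seq_len, depth)."""
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)

# Three tokens, depth 4: each row of weights is a distribution over the tokens.
x = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(x, x, x)
print(w.sum(axis=-1))

# A single token (seq_len = 1): the only weight is 1, so the output equals v.
x1 = rng.normal(size=(1, 4))
out1, w1 = scaled_dot_product_attention(x1, x1, x1)
print(w1, np.allclose(out1, x1))
```

With a single position there is nothing to attend over, so in that case the attention weights carry no information and the layer's effect comes from its dense projections.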

def create_model(units=128, dropout_rate=0.2, d_model=128, num_heads=8):
    input_layer = Input(shape=(X_train.shape[1],))
    dense_layer = Dense(units, activation='relu', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(input_layer)
    dropout_layer = Dropout(dropout_rate)(dense_layer)
    mha_layer = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
    mha_output, _ = mha_layer(dropout_layer, dropout_layer, dropout_layer, mask=None)
    pooling_layer = GlobalMaxPooling1D()(mha_output)
    batchnorm_layer = BatchNormalization()(pooling_layer)
    dense_layer2 = Dense(jumlah_kelas * 2, activation='relu', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(batchnorm_layer)
    dropout_layer2 = Dropout(0.1)(dense_layer2)
    output_layer = Dense(jumlah_kelas, activation='softmax', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(dropout_layer2)

    model = Model(inputs=input_layer, outputs=output_layer)
    optimizer = Adam(learning_rate=0.001)
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
from scipy.stats import randint
param_dist = {
    'units': randint(64, 512),
    'dropout_rate': [0.1, 0.2, 0.3, 0.4],
    'd_model': [32, 64, 128, 256],
    'num_heads': [8, 16, 32], 
    'batch_size': [64, 128, 256]
}
from sklearn.metrics import accuracy_score
class CustomEstimator:
    def __init__(self, units=128, dropout_rate=0.2, d_model=128, num_heads=8, batch_size=32):
        self.units = units
        self.dropout_rate = dropout_rate
        self.d_model = d_model
        self.num_heads = num_heads
        self.batch_size = batch_size
        self.model = None
    def fit(self, X, y, **kwargs):
        # Skip invalid head configurations before building the model
        if self.d_model % self.num_heads != 0:
            print("Warning: d_model is not divisible by num_heads. Skipping model initialization.")
            return self
        if self.model is None:
            self.model = self._init_model()
        self.model.fit(X, y, batch_size=self.batch_size, **kwargs)
        return self
    def predict(self, X):
        if self.model is not None:
            y_pred_prob = self.model.predict(X)
            y_pred = np.argmax(y_pred_prob, axis=1)
            return y_pred
        else:
            print("No model available for prediction. Continuing to the next step of the algorithm.")
            return None
    def score(self, X, y):
        if self.model is not None:
            y_pred = self.predict(X)
            return accuracy_score(y, y_pred)
        else:
            print("No model available for scoring. Continuing to the next step of the algorithm.")
            return 0.0
    def _init_model(self):
        if self.d_model % self.num_heads != 0:
            print("Warning: d_model is not divisible by num_heads. Skipping model initialization.")
            return None
        return create_model(units=self.units, dropout_rate=self.dropout_rate, d_model=self.d_model, num_heads=self.num_heads)
    def get_params(self, deep=True):
        return {
            'units': self.units,
            'dropout_rate': self.dropout_rate,
            'd_model': self.d_model,
            'num_heads': self.num_heads,
            'batch_size': self.batch_size
        }
    def set_params(self, **params):
        for param, value in params.items():
            setattr(self, param, value)
        return self
custom_estimator = CustomEstimator()
random_search = RandomizedSearchCV(estimator=custom_estimator, param_distributions=param_dist, n_iter=10, cv=3)
random_search.fit(X_train, y_train, validation_data=(X_val, y_val))
best_model = random_search.best_estimator_
best_model_params = random_search.best_estimator_.get_params()
print(best_model_params)
{'units': 407, 'dropout_rate': 0.1, 'd_model': 32, 'num_heads': 32, 'batch_size': 64}
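The estimator guards against `d_model` values that `num_heads` does not divide. A quick enumeration of the options listed in `param_dist` shows how often that guard can actually trigger. This is a standalone sketch that copies the two lists by hand.

```python
from itertools import product

# Candidate values copied from param_dist.
d_model_options = [32, 64, 128, 256]
num_heads_options = [8, 16, 32]

# Keep only the pairs where num_heads divides d_model evenly.
valid_pairs = [
    (d, h)
    for d, h in product(d_model_options, num_heads_options)
    if d % h == 0
]

total = len(d_model_options) * len(num_heads_options)
print(len(valid_pairs), "of", total, "combinations are valid")
```

Every combination passes, so with this search space the warning branch is never taken. The selected pair, `d_model=32` with `num_heads=32`, gives each head a depth of 1.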
from keras.callbacks import EarlyStopping,ReduceLROnPlateau
early_stopping = EarlyStopping(monitor='val_loss', patience=5, 
                                     restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, 
                                    patience=5, min_lr=0.0001)
history = best_model.model.fit(X_train, y_train, epochs=1000,
                    validation_data=(X_val, y_val), callbacks=[early_stopping,reduce_lr])
loss, accuracy = best_model.model.evaluate(X_val, y_val, verbose=0)
print(f'Loss: {loss:.2f}')
print(f'Accuracy: {accuracy * 100:.2f}%')
Loss: 0.56
Accuracy: 93.22%
loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
plt.plot(accuracy, label='Training Accuracy')
plt.plot(val_accuracy, label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
best_model.model.summary()
Model: "functional_61"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_30      │ (None, 31)        │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_210 (Dense)   │ (None, 407)       │     13,024 │ input_layer_30[0… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dropout_60          │ (None, 407)       │          0 │ dense_210[0][0]   │
│ (Dropout)           │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ multi_head_attenti… │ [(None, None,     │     40,224 │ dropout_60[0][0], │
│ (MultiHeadAttentio… │ 32), (None, 32,   │            │ dropout_60[0][0], │
│                     │ None, None)]      │            │ dropout_60[0][0]  │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ global_max_pooling… │ (None, 32)        │          0 │ multi_head_atten… │
│ (GlobalMaxPooling1… │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ batch_normalizatio… │ (None, 32)        │        128 │ global_max_pooli… │
│ (BatchNormalizatio… │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_215 (Dense)   │ (None, 14)        │        462 │ batch_normalizat… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dropout_61          │ (None, 14)        │          0 │ dense_215[0][0]   │
│ (Dropout)           │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_216 (Dense)   │ (None, 7)         │        105 │ dropout_61[0][0]  │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘
 Total params: 161,703 (631.66 KB)
 Trainable params: 53,879 (210.46 KB)
 Non-trainable params: 64 (256.00 B)
 Optimizer params: 107,760 (420.94 KB)
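The parameter counts in the summary can be reproduced by hand from the layer sizes. The sketch below does the arithmetic for the selected configuration (31 input features, `units=407`, `d_model=32`, 7 classes). The helper `dense_params` is a hypothetical name introduced only for this illustration.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_features, units, d_model, n_classes = 31, 407, 32, 7

first_dense = dense_params(n_features, units)
# Attention layer: three projections (wq, wk, wv) plus the output dense layer.
attention = 3 * dense_params(units, d_model) + dense_params(d_model, d_model)
# Batch normalization: gamma and beta are trainable, moving mean/variance are not.
batchnorm_trainable = 2 * d_model
non_trainable = 2 * d_model
hidden = dense_params(d_model, 2 * n_classes)
output = dense_params(2 * n_classes, n_classes)

trainable = first_dense + attention + batchnorm_trainable + hidden + output
print(first_dense, attention, hidden, output)
print(trainable, non_trainable)
```

The results agree with the summary: 13,024, 40,224, 462 and 105 per layer, 53,879 trainable and 64 non-trainable parameters. The reported optimizer count, 107,760, is two more than 2 × 53,879. That is consistent with Adam keeping two moment estimates per trainable weight plus a couple of scalar variables, though the summary itself does not break this down.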
from keras.utils import plot_model
file_name = 'arsitektur_model.png'
plot_model(best_model.model, to_file=file_name, show_shapes=True, show_layer_names=True)
plt.figure(figsize=(15,15))
img = plt.imread(file_name)
plt.imshow(img)
plt.title('Model Architecture', fontsize=18)
plt.axis('off') 
plt.savefig(file_name)
plt.show()
label_encoder.classes_
array(['Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I',
       'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I',
       'Overweight_Level_II'], dtype=object)
from sklearn.metrics import classification_report
y_pred_prob = best_model.model.predict(X_val)
y_pred = np.argmax(y_pred_prob, axis=1)
target_names =[str(cls) for cls in label_encoder.classes_]
report = classification_report(y_val, y_pred,target_names=target_names,zero_division=1)
print("Classification Report:\n", report)
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.96      0.98      0.97        82
      Normal_Weight       0.82      0.90      0.86        86
     Obesity_Type_I       0.95      0.96      0.96       106
    Obesity_Type_II       0.99      0.99      0.99        89
   Obesity_Type_III       1.00      0.99      0.99        97
 Overweight_Level_I       0.87      0.83      0.85        87
Overweight_Level_II       0.93      0.87      0.90        87

           accuracy                           0.93       634
          macro avg       0.93      0.93      0.93       634
       weighted avg       0.93      0.93      0.93       634
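The F1 column is the harmonic mean of precision and recall, which can be checked by hand for any row. The sketch below recomputes it for `Normal_Weight` and confirms that the per-class support adds up to the validation set size. Because the report shows rounded values, recomputing other rows this way may differ in the last digit.

```python
# Precision and recall for Normal_Weight, copied from the report above.
precision, recall = 0.82, 0.90

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))

# Per-class support should add up to the size of the validation split.
support = [82, 86, 106, 89, 97, 87, 87]
print(sum(support))
```

The weakest rows are `Normal_Weight` and `Overweight_Level_I`, which are adjacent categories on the weight scale.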

from sklearn.metrics import confusion_matrix
import seaborn as sns
conf_matrix = confusion_matrix(y_val, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
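Raw counts in a confusion matrix are hard to compare when classes have different support. One common follow-up, sketched here on a hypothetical 3-class matrix and not on the notebook's results, is to divide each row by its total so that the diagonal shows per-class recall.

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
cm = np.array([
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
])

# Divide each row by its total so the diagonal holds per-class recall.
row_normalized = cm / cm.sum(axis=1, keepdims=True)
per_class_recall = np.diag(row_normalized)

print(np.round(row_normalized, 2))
print(per_class_recall)
```

The normalized array can be passed to `sns.heatmap` in place of `conf_matrix`, with `fmt='.2f'` instead of `fmt='d'`.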
train_df.to_csv("train_data.csv", index=False)

Obesity Levels Analysis with ML

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.read_csv('/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv')
df.head()
   Age   Gender  Height  Weight  CALC        FAVC  FCVC  NCP  SCC  SMOKE  CH2O  family_history_with_overweight  FAF  TUE  CAEC       MTRANS                 NObeyesdad
0  21.0  Female  1.62    64.0    no          no    2.0   3.0  no   no     2.0   yes                             0.0  1.0  Sometimes  Public_Transportation  Normal_Weight
1  21.0  Female  1.52    56.0    Sometimes   no    3.0   3.0  yes  yes    3.0   yes                             3.0  0.0  Sometimes  Public_Transportation  Normal_Weight
2  23.0  Male    1.80    77.0    Frequently  no    2.0   3.0  no   no     2.0   yes                             2.0  1.0  Sometimes  Public_Transportation  Normal_Weight
3  27.0  Male    1.80    87.0    Frequently  no    3.0   3.0  no   no     2.0   no                              2.0  0.0  Sometimes  Walking                Overweight_Level_I
4  22.0  Male    1.78    89.8    Sometimes   no    2.0   1.0  no   no     2.0   no                              0.0  0.0  Sometimes  Public_Transportation  Overweight_Level_II
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Age                             2111 non-null   float64
 1   Gender                          2111 non-null   object 
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   CALC                            2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   SCC                             2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  family_history_with_overweight  2111 non-null   object 
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CAEC                            2111 non-null   object 
 15  MTRANS                          2111 non-null   object 
 16  NObeyesdad                      2111 non-null   object 
dtypes: float64(8), object(9)
memory usage: 280.5+ KB
df.isnull().sum()
Age                               0
Gender                            0
Height                            0
Weight                            0
CALC                              0
FAVC                              0
FCVC                              0
NCP                               0
SCC                               0
SMOKE                             0
CH2O                              0
family_history_with_overweight    0
FAF                               0
TUE                               0
CAEC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64
#df = df.drop_duplicates()
# Define the display order of categories and their corresponding colors
order_colors = {"Male": "blue", "Female": "pink"}
gender_order = list(order_colors)

plt.figure(figsize=(6, 6))
sns.countplot(x="Gender", data=df, order=gender_order, palette=list(order_colors.values()))
plt.title("Gender Distribution", fontsize=14, fontweight="bold")
plt.xticks(rotation=45)

# Annotate each bar with its count, using the same order as the bars
# (value_counts() alone sorts by frequency, which may not match the plot order)
gender_counts = df["Gender"].value_counts().reindex(gender_order)
for i, count in enumerate(gender_counts):
    plt.text(i, count, str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()
# Group the data by gender
grouped = df.groupby('Gender')

# Create a figure with multiple subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(16, 12))
fig.suptitle('Measures by Gender', fontsize=16)

# Visualize CALC
calc_counts = grouped['CALC'].value_counts().unstack()
calc_counts.plot(kind='bar', ax=axes[0, 0])

# Set title, labels, and annotations
axes[0, 0].set_title('How often do you drink alcohol?')
axes[0, 0].set_xlabel('CALC Values')
axes[0, 0].set_ylabel('Count')
for p in axes[0, 0].patches:
    axes[0, 0].annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height() + 0.5),
                        ha='center', va='bottom')

# Visualize FAVC
favc_counts = grouped['FAVC'].value_counts().unstack()
favc_counts.plot(kind='bar', ax=axes[0, 1])

# Set title, labels, and annotations
axes[0, 1].set_title('Do you eat high caloric food frequently?')
axes[0, 1].set_xlabel('FAVC Values')
axes[0, 1].set_ylabel('Count')
for p in axes[0, 1].patches:
    axes[0, 1].annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height() + 0.5),
                        ha='center', va='bottom')

# Visualize FCVC
fcvc_means = grouped['FCVC'].mean().reset_index()
fcvc_means.columns = ['Gender', 'FCVC Mean']
fcvc_means.set_index('Gender', inplace=True)
fcvc_means.plot(kind='bar', ax=axes[1, 0])

# Set title, labels, and annotations
axes[1, 0].set_title('Do you usually eat vegetables in your meals?')
axes[1, 0].set_xlabel('Gender')
axes[1, 0].set_ylabel('FCVC Mean')
for p in axes[1, 0].patches:
    bar_width = p.get_width()
    bar_height = p.get_height()
    bar_x = p.get_x()
    bar_middle = bar_x + bar_width / 2
    axes[1, 0].annotate(str(round(bar_height, 2)), (bar_middle, bar_height), ha='center', va='bottom')

# Visualize NCP
ncp_means = grouped['NCP'].mean().reset_index()
ncp_means.columns = ['Gender', 'NCP Mean']
ncp_means.set_index('Gender', inplace=True)
ncp_means.plot(kind='bar', ax=axes[1, 1])

# Set title, labels, and annotations
axes[1, 1].set_title('How many main meals do you have daily?')
axes[1, 1].set_xlabel('Gender')
axes[1, 1].set_ylabel('NCP Mean')
for p in axes[1, 1].patches:
    bar_width = p.get_width()
    bar_height = p.get_height()
    bar_x = p.get_x()
    bar_middle = bar_x + bar_width / 2
    axes[1, 1].annotate(str(round(bar_height, 2)), (bar_middle, bar_height), ha='center', va='bottom')

# Visualize SCC
scc_counts = grouped['SCC'].value_counts().unstack()
scc_counts.plot(kind='bar', ax=axes[2, 0])

# Set title, labels, and annotations
axes[2, 0].set_title('Do you monitor the calories you eat daily? ')
axes[2, 0].set_xlabel('SCC Values')
axes[2, 0].set_ylabel('Count')
for p in axes[2, 0].patches:
    axes[2, 0].annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom')

# Visualize SMOKE
smoke_counts = grouped['SMOKE'].value_counts().unstack()
smoke_counts.plot(kind='bar', ax=axes[2, 1])

# Set title, labels, and annotations
axes[2, 1].set_title('Do you smoke?')
axes[2, 1].set_xlabel('SMOKE Values')
axes[2, 1].set_ylabel('Count')
for p in axes[2, 1].patches:
    axes[2, 1].annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom')

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5, wspace=0.5)

# Display the plot
plt.show()
# Order obesity levels by frequency: value_counts() sorts descending, and the plot below reverses it to ascending
sorted_obesity_levels = df['NObeyesdad'].value_counts().index

plt.figure(figsize=(6, 6))
sns.countplot(x="NObeyesdad", data=df, order=sorted_obesity_levels[::-1], palette="Greens")
plt.title("Obesity Level Distribution", fontsize=14, fontweight="bold")
plt.xticks(rotation=45)

# Annotate each bar with its count
for i, count in enumerate(df['NObeyesdad'].value_counts()[::-1]):
    plt.text(i, count, str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()
plt.figure(figsize=(16, 6))
plt.subplot(1, 3, 1)
sns.histplot(df["Age"].dropna(), kde=True, color="Red")
plt.title("Age Distribution", fontsize=14, fontweight="bold")

plt.subplot(1, 3, 2)
sns.histplot(df["Height"].dropna(), kde=True, color="Orange")
plt.title("Height Distribution", fontsize=14, fontweight="bold")

plt.subplot(1, 3, 3)
sns.histplot(df["Weight"].dropna(), kde=True, color="Purple")
plt.title("Weight Distribution", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
# Correlation heatmap
plt.figure(figsize=(12, 8))

# Select only numerical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Calculate correlation matrix
corr_matrix = df[numeric_cols].corr()

# Plot correlation heatmap
sns.heatmap(corr_matrix, annot=True, cmap="YlGnBu")
plt.title("Feature Correlation Heatmap", fontsize=16, fontweight="bold")
plt.show()
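To make the numbers in the heatmap less of a black box, here is a minimal sketch of the Pearson coefficient, which is what df.corr() computes by default. The helper name pearson is ours, not part of the notebook, and the inputs are only the five Height and Weight values from the df.head() preview, so the result says nothing about the full dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Height (m) and Weight (kg) from the five rows shown by df.head()
heights = [1.62, 1.52, 1.80, 1.80, 1.78]
weights = [64.0, 56.0, 77.0, 87.0, 89.8]
print(round(pearson(heights, weights), 3))
```

Each cell of the heatmap is this same calculation applied to one pair of numeric columns over all 2111 rows.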
# Define BMI categories and corresponding colors
bmi_colors = {
    "Normal": "green",
    "Overweight": "red",
    "Underweight": "blue"
}

# Calculate BMI for each person in the dataset
df['BMI'] = df['Weight'] / (df['Height'] ** 2)

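# Create a new column to categorize BMI using left-closed bins: [0, 18.5), [18.5, 25), [25, inf)
# Note: "Overweight" here covers every BMI of 25 or more, including the obesity classes
df['BMI Category'] = pd.cut(df['BMI'], bins=[0, 18.5, 25, np.inf], labels=["Underweight", "Normal", "Overweight"], right=False)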

# Plot the scatterplot with colors based on BMI categories
plt.figure(figsize=(8, 8))
for category, color in bmi_colors.items():
    subset = df[df['BMI Category'] == category]
    plt.scatter(subset['Height'], subset['Weight'], color=color, label=category)

plt.title("Height vs Weight", fontsize=14, fontweight="bold")
plt.xlabel("Height (m)")
plt.ylabel("Weight (kg)")
plt.legend()
plt.tight_layout()
plt.show()
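To see what the BMI step does on individual records, here is a small worked sketch in plain Python. It applies the same formula (weight in kg divided by height in m squared) to rows 0 and 4 of the df.head() preview. The function names bmi and bmi_category are our own illustration, and the cut-offs of 18.5 and 25.0 are assumed here as the conventional thresholds, with left-closed bins in the style of pd.cut(..., right=False).

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def bmi_category(value, lower=18.5, upper=25.0):
    """Left-closed bins: [0, lower) Underweight, [lower, upper) Normal, [upper, inf) Overweight.

    The default cut-offs are assumptions for this sketch.
    """
    if value < lower:
        return "Underweight"
    if value < upper:
        return "Normal"
    return "Overweight"


# (weight in kg, height in m) for rows 0 and 4 of the df.head() preview
samples = [(64.0, 1.62), (89.8, 1.78)]
for weight, height in samples:
    value = bmi(weight, height)
    print(f"{weight} kg, {height} m -> BMI {value:.2f} ({bmi_category(value)})")
```

Because the bins are closed on the left, a BMI of exactly 18.5 counts as Normal and a BMI of exactly 25.0 counts as Overweight.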

Conclusion

Alright, everyone, we’ve reached the end of our journey through Machine Learning Project 6: Obesity Type – Best EDA and Classification! We’ve had an exciting adventure exploring the world of obesity types and harnessing the potential of machine learning to expertly classify them.

But this is more than just crunching numbers and making predictions. It’s about making a genuine difference in people’s lives.

By understanding the intricacies of obesity types, we’re paving the way for personalized interventions and treatments that can truly transform lives for the better.

As we conclude our journey, let’s maintain the momentum. Let’s continue pushing the boundaries of data science to address real-world issues and have a positive impact on society.

Remember, we hold the power of data in our hands – let’s use it wisely to create a healthier, happier world for everyone. Take care, and keep coding!
