ML Project 6: Obesity Type Best EDA And Classification

Introduction

Hey there! Welcome to the inside scoop on ML Project 6: Obesity Type – Best EDA and Classification! We’re diving deep into the world of data science to tackle a real-life problem: obesity.

We’re not just scratching the surface, we’re diving headfirst into understanding the different types of obesity and how we can combat it using some serious data skills.

Imagine this: we’ll be sifting through data like a detective on a case, searching for clues about what causes each type of obesity. And once we have that knowledge, we’ll work our magic with top-notch machine learning techniques to classify obesity types like nobody’s business.

Think of it as a mission to crack the code of obesity and discover ways to help people lead healthier lives. So get ready, because we’re about to embark on an exciting journey through the world of data! Let’s get started!

Dataset Information

This dataset provides information on obesity levels in individuals from Mexico, Peru, and Colombia. It includes data on their eating habits and physical condition. The dataset consists of 17 attributes and 2111 records.

Dataset Link: https://www.kaggle.com/code/diaakotb/obesity-type-eda-and-classification-99-boosting

Each record is labeled with the class variable NObeyesdad (obesity level), which takes one of seven values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III.

Approximately 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, while the remaining 23% was collected directly from users through a web platform.
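
To make the SMOTE step concrete: SMOTE builds a synthetic record by interpolating between a real record and one of its nearest neighbours from the same class, which would explain fractional values such as an Age of 20.976842 in the raw file. The NumPy sketch below only illustrates that idea. It is not the Weka filter used for this dataset, and `smote_like_sample` and the toy height/weight rows are hypothetical.

```python
import numpy as np

def smote_like_sample(X, k=2, rng=None):
    """Return one synthetic point interpolated between a random row of X
    and one of its k nearest neighbours (Euclidean distance)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    i = rng.integers(len(X))                      # pick a real record
    dists = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]       # position 0 is the record itself
    j = rng.choice(neighbours)                    # pick one neighbour
    gap = rng.random()                            # interpolation factor in [0, 1)
    return X[i] + gap * (X[j] - X[i])

# Toy rows (height in m, weight in kg); made up for illustration only.
X_toy = np.array([[1.60, 50.0], [1.62, 52.0], [1.70, 70.0]])
synthetic = smote_like_sample(X_toy, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

The synthetic point always lies on the segment between two real records, so it stays inside the range of the original data.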

Gender: Feature, Categorical, “Gender”
Age: Feature, Continuous, “Age”
Height: Feature, Continuous
Weight: Feature, Continuous
family_history_with_overweight: Feature, Binary, “Has a family member suffered or suffers from overweight?”
FAVC: Feature, Binary, “Do you eat high caloric food frequently?”
FCVC: Feature, Integer, “Do you usually eat vegetables in your meals?”
NCP: Feature, Continuous, “How many main meals do you have daily?”
CAEC: Feature, Categorical, “Do you eat any food between meals?”
SMOKE: Feature, Binary, “Do you smoke?”
CH2O: Feature, Continuous, “How much water do you drink daily?”
SCC: Feature, Binary, “Do you monitor the calories you eat daily?”
FAF: Feature, Continuous, “How often do you have physical activity?”
TUE: Feature, Integer, “How much time do you use technological devices such as cell phone, videogames, television, computer and others?”
CALC: Feature, Categorical, “How often do you drink alcohol?”
MTRANS: Feature, Categorical, “Which transportation do you usually use?”
NObeyesdad: Target, Categorical, “Obesity level”

Importing libraries and Reading data

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
from scipy import stats
import warnings
sns.set_style("darkgrid")

data = pd.read_csv("/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv")

data.head()

	Age	Gender	Height	Weight	CALC	FAVC	FCVC	NCP	SCC	SMOKE	CH2O	family_history_with_overweight	FAF	TUE	CAEC	MTRANS	NObeyesdad
0	21.0	Female	1.62	64.0	no	no	2.0	3.0	no	no	2.0	yes	0.0	1.0	Sometimes	Public_Transportation	Normal_Weight
1	21.0	Female	1.52	56.0	Sometimes	no	3.0	3.0	yes	yes	3.0	yes	3.0	0.0	Sometimes	Public_Transportation	Normal_Weight
2	23.0	Male	1.80	77.0	Frequently	no	2.0	3.0	no	no	2.0	yes	2.0	1.0	Sometimes	Public_Transportation	Normal_Weight
3	27.0	Male	1.80	87.0	Frequently	no	3.0	3.0	no	no	2.0	no	2.0	0.0	Sometimes	Walking	Overweight_Level_I
4	22.0	Male	1.78	89.8	Sometimes	no	2.0	1.0	no	no	2.0	no	0.0	0.0	Sometimes	Public_Transportation	Overweight_Level_II

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Age                             2111 non-null   float64
 1   Gender                          2111 non-null   object 
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   CALC                            2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   SCC                             2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  family_history_with_overweight  2111 non-null   object 
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CAEC                            2111 non-null   object 
 15  MTRANS                          2111 non-null   object 
 16  NObeyesdad                      2111 non-null   object 
dtypes: float64(8), object(9)
memory usage: 280.5+ KB

Data wrangling

data.duplicated().sum()

data.loc[data.duplicated(keep=False), :]

	Age	Gender	Height	Weight	CALC	FAVC	FCVC	NCP	SCC	SMOKE	CH2O	family_history_with_overweight	FAF	TUE	CAEC	MTRANS	NObeyesdad
97	21.0	Female	1.52	42.0	Sometimes	no	3.0	1.0	no	no	1.0	no	0.0	0.0	Frequently	Public_Transportation	Insufficient_Weight
98	21.0	Female	1.52	42.0	Sometimes	no	3.0	1.0	no	no	1.0	no	0.0	0.0	Frequently	Public_Transportation	Insufficient_Weight
105	25.0	Female	1.57	55.0	Sometimes	yes	2.0	1.0	no	no	2.0	no	2.0	0.0	Sometimes	Public_Transportation	Normal_Weight
106	25.0	Female	1.57	55.0	Sometimes	yes	2.0	1.0	no	no	2.0	no	2.0	0.0	Sometimes	Public_Transportation	Normal_Weight
145	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
174	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
179	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
184	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
208	22.0	Female	1.69	65.0	Sometimes	yes	2.0	3.0	no	no	2.0	yes	1.0	1.0	Sometimes	Public_Transportation	Normal_Weight
209	22.0	Female	1.69	65.0	Sometimes	yes	2.0	3.0	no	no	2.0	yes	1.0	1.0	Sometimes	Public_Transportation	Normal_Weight
282	18.0	Female	1.62	55.0	no	yes	2.0	3.0	no	no	1.0	yes	1.0	1.0	Frequently	Public_Transportation	Normal_Weight
295	16.0	Female	1.66	58.0	no	no	2.0	1.0	no	no	1.0	no	0.0	1.0	Sometimes	Walking	Normal_Weight
309	16.0	Female	1.66	58.0	no	no	2.0	1.0	no	no	1.0	no	0.0	1.0	Sometimes	Walking	Normal_Weight
443	18.0	Male	1.72	53.0	Sometimes	yes	2.0	3.0	no	no	2.0	yes	0.0	2.0	Sometimes	Public_Transportation	Insufficient_Weight
460	18.0	Female	1.62	55.0	no	yes	2.0	3.0	no	no	1.0	yes	1.0	1.0	Frequently	Public_Transportation	Normal_Weight
466	22.0	Male	1.74	75.0	no	yes	3.0	3.0	no	no	1.0	yes	1.0	0.0	Frequently	Automobile	Normal_Weight
467	22.0	Male	1.74	75.0	no	yes	3.0	3.0	no	no	1.0	yes	1.0	0.0	Frequently	Automobile	Normal_Weight
496	18.0	Male	1.72	53.0	Sometimes	yes	2.0	3.0	no	no	2.0	yes	0.0	2.0	Sometimes	Public_Transportation	Insufficient_Weight
523	21.0	Female	1.52	42.0	Sometimes	yes	3.0	1.0	no	no	1.0	no	0.0	0.0	Frequently	Public_Transportation	Insufficient_Weight
527	21.0	Female	1.52	42.0	Sometimes	yes	3.0	1.0	no	no	1.0	no	0.0	0.0	Frequently	Public_Transportation	Insufficient_Weight
659	21.0	Female	1.52	42.0	Sometimes	yes	3.0	1.0	no	no	1.0	no	0.0	0.0	Frequently	Public_Transportation	Insufficient_Weight
663	21.0	Female	1.52	42.0	Sometimes	yes	3.0	1.0	no	no	1.0	no	0.0	0.0	Frequently	Public_Transportation	Insufficient_Weight
763	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
764	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
824	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
830	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
831	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
832	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
833	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
834	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
921	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
922	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I
923	21.0	Male	1.62	70.0	Sometimes	yes	2.0	1.0	no	no	3.0	no	1.0	0.0	no	Public_Transportation	Overweight_Level_I

There is no way to tell whether these rows are accidental duplicates or different people who gave identical answers, so we keep them.
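
Before deciding to keep the rows, it can help to see how many copies each repeated record has. The sketch below is one way to summarise that. `duplicate_groups` is a hypothetical helper and the toy frame is made up. On the real data you would pass `data` instead.

```python
import pandas as pd

def duplicate_groups(df):
    """Count how many times each fully duplicated row occurs."""
    dup = df[df.duplicated(keep=False)]           # every row that has at least one twin
    return (dup.groupby(list(df.columns), sort=False)
               .size()
               .reset_index(name="copies"))

# Toy frame for illustration only.
toy = pd.DataFrame({"Age": [21, 21, 25, 30, 21],
                    "Gender": ["Male", "Male", "Female", "Male", "Male"]})
groups = duplicate_groups(toy)
print(groups)
```

A record that repeats many times (like the 21-year-old male at 1.62 m and 70 kg in the listing above) is then visible as a single line with a high `copies` value.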

data = data.rename(columns={"CALC":"alcohol_drinking_frequency",
            "FAVC":"high_calorie_food_eat",
            "FCVC":"vegetable_eat_daily",
            "NCP":"number_of_meals_daily",
            "SCC":"calories_monitoring",
            "CH2O":"water_drinking_daily",
            "FAF":"physical_activity_daily",
            "TUE":"electronics_usage_daily",
            "CAEC":"food_between_meals",
            "MTRANS":"method_of_transportion"})

for col in ['Age', 'Weight', 'vegetable_eat_daily','number_of_meals_daily', 'water_drinking_daily','physical_activity_daily','electronics_usage_daily']:
    data[col] = data.loc[:,col].round().astype(int)

data.describe()

	Age	Height	Weight	vegetable_eat_daily	number_of_meals_daily	water_drinking_daily	physical_activity_daily	electronics_usage_daily
count	2111.000000	2111.000000	2111.000000	2111.000000	2111.000000	2111.000000	2111.000000	2111.000000
mean	24.315964	1.701677	86.586452	2.423496	2.687826	2.014685	1.006632	0.664614
std	6.357078	0.093305	26.190136	0.583905	0.809680	0.688616	0.895462	0.674009
min	14.000000	1.450000	39.000000	1.000000	1.000000	1.000000	0.000000	0.000000
25%	20.000000	1.630000	65.500000	2.000000	3.000000	2.000000	0.000000	0.000000
50%	23.000000	1.700499	83.000000	2.000000	3.000000	2.000000	1.000000	1.000000
75%	26.000000	1.768464	107.000000	3.000000	3.000000	2.000000	2.000000	1.000000
max	61.000000	1.980000	173.000000	3.000000	4.000000	3.000000	3.000000	2.000000

Univariate analysis

plt.figure(figsize=(18,15))
for i,col in enumerate(data.select_dtypes(include="object").columns[:-1]):
    plt.subplot(4,2,i+1)
    sns.countplot(data=data,x=col,palette=sns.color_palette("Set2"))

data["NObeyesdad"].value_counts().sort_values(ascending=False).plot(kind="bar",color="red")

<Axes: xlabel='NObeyesdad'>

plt.figure(figsize=(18,15))
for i,col in enumerate(data.select_dtypes(include="number").columns[:3]):
    plt.subplot(4,2,i+1)
    sns.boxplot(data=data,x=col,palette=sns.color_palette("Set2"))

We can see there are a lot of outliers in the Age column. We can reduce them by keeping only the rows whose Age z-score is below 2 in absolute value.

data=data[np.abs(stats.zscore(data["Age"])) < 2].reset_index(drop=True)

sns.boxplot(data=data,x="Age")

<Axes: xlabel='Age'>

data.shape

(1981, 17)
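
The filter above dropped 130 rows (2111 to 1981). With the mean and standard deviation reported by data.describe() (about 24.3 and 6.4), a threshold of 2 puts the upper cutoff at roughly 37 years. The sketch below reproduces the same rule in plain NumPy on made-up ages. `zscore_mask` is a hypothetical helper name.

```python
import numpy as np

def zscore_mask(values, threshold=2.0):
    """True for values whose absolute z-score is below the threshold."""
    values = np.asarray(values, dtype=float)
    # Population standard deviation (ddof=0), the default of scipy.stats.zscore.
    z = (values - values.mean()) / values.std()
    return np.abs(z) < threshold

# Made-up ages for illustration; 61 is the only extreme value.
ages = np.array([18, 19, 20, 21, 22, 23, 24, 25, 26, 61])
mask = zscore_mask(ages)
print(ages[mask])
```

A threshold of 2 is fairly aggressive; 3 is a common, more conservative choice, so it is worth checking how many rows each setting removes.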

plt.figure(figsize=(18,15))
for i,col in enumerate(data.select_dtypes(include="number").columns[3:]):
    plt.subplot(4,2,i+1)
    sns.countplot(data=data,x=col)

Multivariate analysis

How is obesity type affected by eating high calorie food?

data.groupby(['NObeyesdad', 'high_calorie_food_eat'])["high_calorie_food_eat"].count()

NObeyesdad           high_calorie_food_eat
Insufficient_Weight  no                        51
                     yes                      220
Normal_Weight        no                        75
                     yes                      206
Obesity_Type_I       no                         9
                     yes                      284
Obesity_Type_II      no                         6
                     yes                      268
Obesity_Type_III     no                         1
                     yes                      323
Overweight_Level_I   no                        20
                     yes                      256
Overweight_Level_II  no                        71
                     yes                      191
Name: high_calorie_food_eat, dtype: int64

plt.figure(figsize=(10,7))
sns.countplot(data=data,x=data.NObeyesdad,hue=data.high_calorie_food_eat,palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()

High-calorie food does not separate the classes sharply: most respondents in every class answer “yes”. The “no” answers are concentrated in Insufficient Weight, Normal Weight, and Overweight Level II.
The obesity classes almost never answer “no”: according to this data, Obesity Type III has a single “no” and Obesity Type II only six.
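
Because the classes have different sizes, raw counts are hard to compare. The sketch below turns the counts printed above into the share of “yes” answers per class. The numbers are copied from that output, while the variable names are only illustrative.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame(
    {"no": [51, 75, 9, 6, 1, 20, 71],
     "yes": [220, 206, 284, 268, 323, 256, 191]},
    index=["Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
           "Obesity_Type_II", "Obesity_Type_III",
           "Overweight_Level_I", "Overweight_Level_II"],
)

# Share of respondents in each class who eat high-calorie food frequently.
share_yes = (counts["yes"] / counts.sum(axis=1)).round(3)
print(share_yes.sort_values())
```

On the full DataFrame the same table can be obtained with pd.crosstab(data["NObeyesdad"], data["high_calorie_food_eat"], normalize="index").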

Average age of each obesity type

data.groupby("NObeyesdad")["Age"].median()

NObeyesdad
Insufficient_Weight    19.0
Normal_Weight          21.0
Obesity_Type_I         23.0
Obesity_Type_II        27.0
Obesity_Type_III       25.0
Overweight_Level_I     21.0
Overweight_Level_II    23.0
Name: Age, dtype: float64

data.groupby("NObeyesdad")["Age"].median().sort_values(ascending=False).plot(kind="bar",color = sns.color_palette("Set2"))
plt.title("Average age of each obesity type")

Text(0.5, 1.0, 'Average age of each obesity type')

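The median age (the code computes the median, although the plot title says “Average”) is highest in Obesity Type II (27), followed by Type III (25) and Type I (23).
The median age is lowest in Insufficient Weight (19).
So the heavier classes tend to be somewhat older in this sample, although the differences are only a few years.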

Average weight of each obesity type

data.groupby("NObeyesdad")["Weight"].mean()

NObeyesdad
Insufficient_Weight     49.926199
Normal_Weight           62.106762
Obesity_Type_I          94.819113
Obesity_Type_II        115.306569
Obesity_Type_III       120.972222
Overweight_Level_I      74.510870
Overweight_Level_II     82.045802
Name: Weight, dtype: float64

data.groupby("NObeyesdad")["Weight"].mean().sort_values(ascending=False).plot(kind="bar",color=sns.color_palette("Set2"))

<Axes: xlabel='NObeyesdad'>

As expected, Obesity Type III has the highest average weight, followed by Type II and then Type I.

Does gender affect obesity type?

plt.figure(figsize=(10,7))
sns.countplot(data=data,x="NObeyesdad",hue="Gender",palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()

Males outnumber females in almost every class except Obesity Type III.
Females are more common in the Insufficient Weight class.
Females are also more common in the most severe class (Obesity Type III).

Does eating food between meals affect obesity type?

plt.figure(figsize=(10,7))
sns.countplot(data=data,x="NObeyesdad",hue="food_between_meals",palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()

Most people eat between meals only sometimes.
The “Frequently” answer is most common among people with insufficient or normal weight.
This could suggest that frequent small snacks go along with lower weight, but the plot only shows an association, not a cause.

Does family history with overweight affect obesity type?

plt.figure(figsize=(10,7))
sns.countplot(data=data,x="NObeyesdad",hue="family_history_with_overweight",palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()

A family history of overweight appears to go along with higher weight: almost everyone in Obesity Types I, II, and III reports a family history of overweight.

Do people who drink also smoke?

sns.countplot(data=data,x=data.alcohol_drinking_frequency,hue=data.SMOKE)

<Axes: xlabel='alcohol_drinking_frequency', ylabel='count'>

No, most of the people who drink alcohol don’t smoke.

Data preprocessing and splitting data

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MaxAbsScaler,RobustScaler
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split,cross_val_score

lgbm_settings = {'n_estimators': 137,
 'num_leaves': 16,
 'min_child_samples': 2,
 'learning_rate': 0.11333885880532285,
 'colsample_bytree': 0.7557376218643025,
 'reg_alpha': 0.0013323317789643257,
 'reg_lambda': 0.0018596588413880056,
 'n_jobs': -1,
 'max_bin': 511,
 'verbose': -1}

data.select_dtypes(include="object").columns

Index(['Gender', 'alcohol_drinking_frequency', 'high_calorie_food_eat',
       'calories_monitoring', 'SMOKE', 'family_history_with_overweight',
       'food_between_meals', 'method_of_transportion', 'NObeyesdad'],
      dtype='object')

Encoding ordinal features using label encoder

encoder  =LabelEncoder()
model_data = data.copy()
for col in ['alcohol_drinking_frequency','food_between_meals','NObeyesdad']:
    model_data[col] =encoder.fit_transform(model_data[col])
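
One caveat: LabelEncoder assigns codes in sorted (alphabetical) order of the labels, not in order of frequency, which is why “no” becomes 3 and “Frequently” becomes 1 in the table below. If the ordinal meaning matters, an explicit mapping is an alternative. The sketch below compares the two on a toy Series. The order in `levels` is an assumption about what the answers mean, not something taken from the source.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed frequency order, from least to most frequent.
levels = ["no", "Sometimes", "Frequently", "Always"]
s = pd.Series(["no", "Sometimes", "Frequently", "Always", "Sometimes"])

# LabelEncoder sorts the labels: Always=0, Frequently=1, Sometimes=2, no=3.
alphabetical = LabelEncoder().fit_transform(s)

# Explicit mapping keeps the intended order: no=0 ... Always=3.
ordered = s.map({level: code for code, level in enumerate(levels)})

print(list(alphabetical))
print(list(ordered))
```

Tree-based models can split on either coding, so the accuracy difference may be small, but the explicit mapping makes correlations and plots easier to read.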

Encoding nominal data using pd dummies

cols = model_data.select_dtypes(include="object").columns
dums = pd.get_dummies(model_data[cols],dtype=int)
model_data = pd.concat([model_data,dums],axis=1).drop(columns=cols)

model_data.head()

	Age	Height	Weight	alcohol_drinking_frequency	vegetable_eat_daily	number_of_meals_daily	water_drinking_daily	physical_activity_daily	electronics_usage_daily	food_between_meals	…	calories_monitoring_yes	SMOKE_no	SMOKE_yes	family_history_with_overweight_no	family_history_with_overweight_yes	method_of_transportion_Public_Transportation	method_of_transportion_Walking
0	21	1.62	64	3	2	3	2	0	1	2	…	0	1	0	0	1	1	0
1	21	1.52	56	2	3	3	3	3	0	2	…	1	0	1	0	1	1	0
2	23	1.80	77	1	2	3	2	2	1	2	…	0	1	0	0	1	1	0
3	27	1.80	87	1	3	3	2	2	0	2	…	0	1	0	1	0	0	1
4	22	1.78	90	2	2	1	2	0	0	2	…	0	1	0	1	0	1	0

5 rows × 26 columns

Correlation between data attributes

corr_data =data.copy()
encoder  =LabelEncoder()
for col in corr_data.select_dtypes(include="object").columns:
    corr_data[col] =encoder.fit_transform(corr_data[col])

plt.figure(figsize=(16,13))
sns.heatmap(data=corr_data.corr(),annot=True)

<Axes: >

Normalizing data using max absolute scaler

x= model_data.drop(columns="NObeyesdad")
y=model_data["NObeyesdad"]
scaler_mas = MaxAbsScaler()
for col in x.columns:
    scaler_mas.fit(x[[col]])
    x[col] = scaler_mas.transform (x[[col]])

x.head()

	Age	Height	Weight	alcohol_drinking_frequency	vegetable_eat_daily	number_of_meals_daily	water_drinking_daily	physical_activity_daily	electronics_usage_daily	food_between_meals	…	calories_monitoring_yes	SMOKE_no	SMOKE_yes	family_history_with_overweight_no	family_history_with_overweight_yes	method_of_transportion_Public_Transportation	method_of_transportion_Walking
0	0.567568	0.818182	0.369942	1.000000	0.666667	0.75	0.666667	0.000000	0.5	0.666667	…	0.0	1.0	0.0	0.0	1.0	1.0	0.0
1	0.567568	0.767677	0.323699	0.666667	1.000000	0.75	1.000000	1.000000	0.0	0.666667	…	1.0	0.0	1.0	0.0	1.0	1.0	0.0
2	0.621622	0.909091	0.445087	0.333333	0.666667	0.75	0.666667	0.666667	0.5	0.666667	…	0.0	1.0	0.0	0.0	1.0	1.0	0.0
3	0.729730	0.909091	0.502890	0.333333	1.000000	0.75	0.666667	0.666667	0.0	0.666667	…	0.0	1.0	0.0	1.0	0.0	0.0	1.0
4	0.594595	0.898990	0.520231	0.666667	0.666667	0.25	0.666667	0.000000	0.0	0.666667	…	0.0	1.0	0.0	1.0	0.0	1.0	0.0

5 rows × 25 columns
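
The loop above fits the scaler on all rows before the train/test split, so the test rows influence the scaling constants. For tree-based models the effect is probably small, since they are largely insensitive to rescaling a single feature, but a pipeline avoids the issue altogether. The sketch below shows the pattern on synthetic data. `X_demo` and `y_demo` are made up, and the model settings are arbitrary.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in for the feature matrix and target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=7)

# The scaler is fitted inside the pipeline, on the training rows only.
pipe = make_pipeline(MaxAbsScaler(),
                     RandomForestClassifier(n_estimators=50, random_state=0))
pipe.fit(X_train, y_train)
test_acc = pipe.score(X_test, y_test)

# In cross-validation the scaler is refitted on each training fold.
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(test_acc, cv_scores.mean())
```

MaxAbsScaler also scales every column independently when given the whole matrix, so the per-column loop is not needed.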

Splitting data and training models

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2,random_state=7)

model_lgbm  = LGBMClassifier(**lgbm_settings)
model_xgb = XGBClassifier(objective="multi:softmax",num_class = 7)
model_gb = GradientBoostingClassifier(max_depth=9,min_samples_leaf=3,min_samples_split=13,subsample=0.751)
model_rfc = RandomForestClassifier()
models = [model_lgbm,model_xgb,model_gb,model_rfc]

for model in models:
    model.fit(x_train,y_train)

Evaluating models

for model in models:
    model_name = type(model).__name__
    print(f"score for {model_name} on train data: {model.score(x_train,y_train)}")

score for LGBMClassifier on train data: 1.0
score for XGBClassifier on train data: 1.0
score for GradientBoostingClassifier on train data: 1.0
score for RandomForestClassifier on train data: 1.0

for model in models:
    model_name = type(model).__name__
    print(f"score for {model_name} on test data: {model.score(x_test,y_test)}")

score for LGBMClassifier on test data: 0.9899244332493703
score for XGBClassifier on test data: 0.982367758186398
score for GradientBoostingClassifier on test data: 0.9773299748110831
score for RandomForestClassifier on test data: 0.947103274559194

print("scores of each model using kfold validation:-\n\n")
for model in models:
    score = cross_val_score(model,x,y,cv=10)
    avg = np.mean(score)
    model_name = type(model).__name__
    print(f"scores for {model_name}:{score}")
    print(f"average score for {model_name}:{avg}\n")

scores of each model using kfold validation:-


scores for LGBMClassifier:[0.93969849 0.93434343 0.98989899 0.96969697 0.98484848 0.99494949
 0.97474747 0.99494949 0.97474747 0.98484848]
average score for LGBMClassifier:0.9742728795492613

scores for XGBClassifier:[0.91457286 0.93434343 0.97474747 0.97979798 0.97474747 0.98989899
 0.97474747 0.98989899 0.97979798 0.97979798]
average score for XGBClassifier:0.9692350642099384

scores for GradientBoostingClassifier:[0.90452261 0.94949495 0.97979798 0.97474747 0.97979798 0.99494949
 0.98989899 0.98484848 0.97979798 0.97979798]
average score for GradientBoostingClassifier:0.971765392619664

scores for RandomForestClassifier:[0.73869347 0.82323232 0.96464646 0.95454545 0.97979798 0.97979798
 0.96969697 0.96969697 0.97979798 0.98484848]
average score for RandomForestClassifier:0.9344754073397288

for model in models:
    y_predicted = model.predict(x_test)
    model_name = type(model).__name__
    print(f"Report:{model_name}")
    print(classification_report(y_test,y_predicted))

Report:LGBMClassifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        46
           1       0.96      1.00      0.98        65
           2       1.00      0.99      0.99        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       1.00      1.00      1.00        40

    accuracy                           0.99       397
   macro avg       0.99      0.99      0.99       397
weighted avg       0.99      0.99      0.99       397

Report:XGBClassifier
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        46
           1       0.95      0.95      0.95        65
           2       1.00      0.99      0.99        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       0.95      1.00      0.98        40

    accuracy                           0.98       397
   macro avg       0.98      0.98      0.98       397
weighted avg       0.98      0.98      0.98       397

Report:GradientBoostingClassifier
              precision    recall  f1-score   support

           0       0.96      0.98      0.97        46
           1       0.94      0.95      0.95        65
           2       0.99      0.99      0.99        67
           3       1.00      0.98      0.99        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       0.98      1.00      0.99        40

    accuracy                           0.98       397
   macro avg       0.98      0.98      0.98       397
weighted avg       0.98      0.98      0.98       397

Report:RandomForestClassifier
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        46
           1       0.86      0.91      0.88        65
           2       0.97      0.97      0.97        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.92      0.87      0.89        63
           6       0.95      0.88      0.91        40

    accuracy                           0.95       397
   macro avg       0.95      0.95      0.95       397
weighted avg       0.95      0.95      0.95       397
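
The reports label the classes 0 to 6. Because LabelEncoder sorts the labels, the codes should correspond to 0 = Insufficient_Weight, 1 = Normal_Weight, 2 = Obesity_Type_I, 3 = Obesity_Type_II, 4 = Obesity_Type_III, 5 = Overweight_Level_I, and 6 = Overweight_Level_II. The sketch below shows how to print the names in the report, using a dedicated encoder for the target and a toy label list.

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only.
raw = ["Normal_Weight", "Obesity_Type_I", "Insufficient_Weight", "Normal_Weight"]

target_encoder = LabelEncoder()          # kept separate so classes_ refers to the target
y_true = target_encoder.fit_transform(raw)
y_pred = y_true.copy()                   # stand-in for model predictions

class_names = list(target_encoder.classes_)
print(classification_report(y_true, y_pred, target_names=class_names))
```

The same `class_names` list can be passed as xticklabels and yticklabels to sns.heatmap when drawing the confusion matrices.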

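plt.figure(figsize=(14,11))
for i,model in enumerate(models):
    plt.subplot(2,2,i+1)
    y_predicted = model.predict(x_test)
    model_name = type(model).__name__
    cm = confusion_matrix(y_test, y_predicted)
    sns.heatmap(cm, annot=True,fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Truth')
    plt.title(f"{model_name} confusion matrix")
# show once, after all four subplots are drawn;
# calling plt.show() inside the loop splits them across separate figures
plt.tight_layout()
plt.show()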

– LightGBM is the best model, achieving 99% test accuracy and an average accuracy of 97% in 10-fold cross-validation

– XGBoost and gradient boosting are almost the same in test accuracy (98%) and in cross-validation (97%)

– Random forest achieved the worst results, with 95% test accuracy and an average accuracy of 93% in cross-validation; its gap between training accuracy (100%) and test accuracy is the largest of the four models, which suggests it overfits the most
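
The overfitting remark can be checked against the scores printed earlier. The sketch below computes the gap between training and test accuracy for each model, using the reported numbers rounded to four decimals. A perfect training score on its own is common for tree ensembles, so the size of the gap is the more useful signal.

```python
# Scores copied from the outputs above, rounded to four decimals.
train_acc = {"LGBMClassifier": 1.0, "XGBClassifier": 1.0,
             "GradientBoostingClassifier": 1.0, "RandomForestClassifier": 1.0}
test_acc = {"LGBMClassifier": 0.9899, "XGBClassifier": 0.9824,
            "GradientBoostingClassifier": 0.9773, "RandomForestClassifier": 0.9471}

# Larger gap = bigger drop from training data to unseen data.
gaps = {name: round(train_acc[name] - test_acc[name], 4) for name in test_acc}
largest = max(gaps, key=gaps.get)

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap}")
print("largest gap:", largest)
```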

Obesity Levels-Multi Head Attention-Hyperparameter

import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv

train_df=pd.read_csv("/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv")

train_df

	Age	Gender	Height	Weight	CALC	FAVC	FCVC	NCP	SCC	SMOKE	CH2O	family_history_with_overweight	FAF	TUE	CAEC	MTRANS	NObeyesdad
0	21.000000	Female	1.620000	64.000000	no	no	2.0	3.0	no	no	2.000000	yes	0.000000	1.000000	Sometimes	Public_Transportation	Normal_Weight
1	21.000000	Female	1.520000	56.000000	Sometimes	no	3.0	3.0	yes	yes	3.000000	yes	3.000000	0.000000	Sometimes	Public_Transportation	Normal_Weight
2	23.000000	Male	1.800000	77.000000	Frequently	no	2.0	3.0	no	no	2.000000	yes	2.000000	1.000000	Sometimes	Public_Transportation	Normal_Weight
3	27.000000	Male	1.800000	87.000000	Frequently	no	3.0	3.0	no	no	2.000000	no	2.000000	0.000000	Sometimes	Walking	Overweight_Level_I
4	22.000000	Male	1.780000	89.800000	Sometimes	no	2.0	1.0	no	no	2.000000	no	0.000000	0.000000	Sometimes	Public_Transportation	Overweight_Level_II
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2106	20.976842	Female	1.710730	131.408528	Sometimes	yes	3.0	3.0	no	no	1.728139	yes	1.676269	0.906247	Sometimes	Public_Transportation	Obesity_Type_III
2107	21.982942	Female	1.748584	133.742943	Sometimes	yes	3.0	3.0	no	no	2.005130	yes	1.341390	0.599270	Sometimes	Public_Transportation	Obesity_Type_III
2108	22.524036	Female	1.752206	133.689352	Sometimes	yes	3.0	3.0	no	no	2.054193	yes	1.414209	0.646288	Sometimes	Public_Transportation	Obesity_Type_III
2109	24.361936	Female	1.739450	133.346641	Sometimes	yes	3.0	3.0	no	no	2.852339	yes	1.139107	0.586035	Sometimes	Public_Transportation	Obesity_Type_III
2110	23.664709	Female	1.738836	133.472641	Sometimes	yes	3.0	3.0	no	no	2.863513	yes	1.026452	0.714137	Sometimes	Public_Transportation	Obesity_Type_III

2111 rows × 17 columns

unique_values = {}
mythreshold=7
for i, column in enumerate(train_df.columns, 1):
    unique_values[column] = train_df[column].unique()
    if train_df[column].nunique() <= mythreshold:
        print(f"{i}. Column:\" {column} \" , Categorical, Data Type:{train_df[column].dtype} , Unique:{unique_values[column]}, Unique Count:{train_df[column].nunique()}")
    else:
        print(f"{i}. Column:\" {column} \" , Numeric, Data Type:{train_df[column].dtype} , Min:{train_df[column].min()}, Max:{train_df[column].max()}, Unique Count:{train_df[column].nunique()}")

1. Column:" Age " , Numeric, Data Type:float64 , Min:14.0, Max:61.0, Unique Count:1402
2. Column:" Gender " , Categorical, Data Type:object , Unique:['Female' 'Male'], Unique Count:2
3. Column:" Height " , Numeric, Data Type:float64 , Min:1.45, Max:1.98, Unique Count:1574
4. Column:" Weight " , Numeric, Data Type:float64 , Min:39.0, Max:173.0, Unique Count:1525
5. Column:" CALC " , Categorical, Data Type:object , Unique:['no' 'Sometimes' 'Frequently' 'Always'], Unique Count:4
6. Column:" FAVC " , Categorical, Data Type:object , Unique:['no' 'yes'], Unique Count:2
7. Column:" FCVC " , Numeric, Data Type:float64 , Min:1.0, Max:3.0, Unique Count:810
8. Column:" NCP " , Numeric, Data Type:float64 , Min:1.0, Max:4.0, Unique Count:635
9. Column:" SCC " , Categorical, Data Type:object , Unique:['no' 'yes'], Unique Count:2
10. Column:" SMOKE " , Categorical, Data Type:object , Unique:['no' 'yes'], Unique Count:2
11. Column:" CH2O " , Numeric, Data Type:float64 , Min:1.0, Max:3.0, Unique Count:1268
12. Column:" family_history_with_overweight " , Categorical, Data Type:object , Unique:['yes' 'no'], Unique Count:2
13. Column:" FAF " , Numeric, Data Type:float64 , Min:0.0, Max:3.0, Unique Count:1190
14. Column:" TUE " , Numeric, Data Type:float64 , Min:0.0, Max:2.0, Unique Count:1129
15. Column:" CAEC " , Categorical, Data Type:object , Unique:['Sometimes' 'Frequently' 'Always' 'no'], Unique Count:4
16. Column:" MTRANS " , Categorical, Data Type:object , Unique:['Public_Transportation' 'Walking' 'Automobile' 'Motorbike' 'Bike'], Unique Count:5
17. Column:" NObeyesdad " , Categorical, Data Type:object , Unique:['Normal_Weight' 'Overweight_Level_I' 'Overweight_Level_II'
 'Obesity_Type_I' 'Insufficient_Weight' 'Obesity_Type_II'
 'Obesity_Type_III'], Unique Count:7

numeric_features = []
categorical_features = []
for column in train_df.columns:
    if train_df[column].nunique() <= mythreshold:
        categorical_features.append(column)
    else:
        numeric_features.append(column)
print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)

Numeric features: ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
Categorical features: ['Gender', 'CALC', 'FAVC', 'SCC', 'SMOKE', 'family_history_with_overweight', 'CAEC', 'MTRANS', 'NObeyesdad']
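
The unique-count rule works here because the raw numeric columns have hundreds of distinct values. It would classify columns differently after rounding: a rounded FCVC, for example, has only three values and would be treated as categorical. A dtype-based split is an alternative that does not depend on a threshold. The sketch below shows it on a made-up frame.

```python
import pandas as pd

# Toy frame for illustration; FCVC has few distinct values but is numeric.
toy = pd.DataFrame({"Age": [21.0, 23.5, 27.0],
                    "Gender": ["Female", "Male", "Male"],
                    "FCVC": [2.0, 3.0, 2.0]})

# Split by dtype instead of by number of unique values.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numeric features:", numeric_cols)
print("Categorical features:", categorical_cols)
```

Which rule is better depends on whether low-cardinality numeric columns should be modelled as categories, so this is a judgment call rather than a fix.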

numerik_df=train_df.drop(categorical_features,axis=1)

numerik_df

	Age	Height	Weight	FCVC	NCP	CH2O	FAF	TUE
0	21.000000	1.620000	64.000000	2.0	3.0	2.000000	0.000000	1.000000
1	21.000000	1.520000	56.000000	3.0	3.0	3.000000	3.000000	0.000000
2	23.000000	1.800000	77.000000	2.0	3.0	2.000000	2.000000	1.000000
3	27.000000	1.800000	87.000000	3.0	3.0	2.000000	2.000000	0.000000
4	22.000000	1.780000	89.800000	2.0	1.0	2.000000	0.000000	0.000000
…	…	…	…	…	…	…	…	…
2106	20.976842	1.710730	131.408528	3.0	3.0	1.728139	1.676269	0.906247
2107	21.982942	1.748584	133.742943	3.0	3.0	2.005130	1.341390	0.599270
2108	22.524036	1.752206	133.689352	3.0	3.0	2.054193	1.414209	0.646288
2109	24.361936	1.739450	133.346641	3.0	3.0	2.852339	1.139107	0.586035
2110	23.664709	1.738836	133.472641	3.0	3.0	2.863513	1.026452	0.714137

2111 rows × 8 columns

numerik_df.columns

Index(['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE'], dtype='object')

import matplotlib.pyplot as plt
num_cols = 2
num_rows = (len(numerik_df.columns) + num_cols - 1) // num_cols
default_subplot_size = (8, 6)
fig_width = default_subplot_size[0] * num_cols
fig_height = default_subplot_size[1] * num_rows
fig, axes = plt.subplots(num_rows, num_cols, figsize=(fig_width, fig_height))
for i, column in enumerate(numerik_df.columns):
    row_index = i // num_cols
    col_index = i % num_cols
    axes[row_index, col_index].hist(numerik_df[column], bins=10, color='skyblue', edgecolor='black')
    axes[row_index, col_index].set_title(f'Distribution of {column} before scaling')
    axes[row_index, col_index].set_xlabel(column)
    axes[row_index, col_index].set_ylabel('Frequency')
    axes[row_index, col_index].grid(True)
if len(numerik_df.columns) % num_cols != 0:
    fig.delaxes(axes[num_rows-1, num_cols-1])
plt.tight_layout()
plt.show()

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
numerik_scaled = scaler.fit_transform(numerik_df)

numerik_scaled.shape

(2111, 8)

numerik_scaled

array([[-0.52212439, -0.87558934, -0.86255819, ..., -0.01307326,
        -1.18803911,  0.56199675],
       [-0.52212439, -1.94759928, -1.16807699, ...,  1.61875854,
         2.33975012, -1.08062463],
       [-0.20688898,  1.05402854, -0.36609013, ..., -0.01307326,
         1.16382038,  0.56199675],
       ...,
       [-0.28190933,  0.54167211,  1.79886776, ...,  0.0753606 ,
         0.47497132, -0.01901815],
       [ 0.00777624,  0.40492652,  1.78577968, ...,  1.37780063,
         0.15147069, -0.11799101],
       [-0.10211908,  0.39834438,  1.7905916 , ...,  1.39603472,
         0.01899633,  0.09243207]])
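
For reference, StandardScaler transforms each column as z = (x - mean) / std, where std is the population standard deviation (ddof=0). The sketch below checks that by hand on the first three Age and Height values from the table above. Because it uses only three rows, its z-scores will not match the full-dataset values printed above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First three (Age, Height) rows copied from the numeric table above.
sample = np.array([[21.0, 1.62],
                   [21.0, 1.52],
                   [23.0, 1.80]])

scaled = StandardScaler().fit_transform(sample)

# Manual standardization; np.std uses ddof=0 by default, like StandardScaler.
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print(np.allclose(scaled, manual))
```

One design point to keep in mind: the notebook fits the scaler on all 2111 rows before splitting, so validation rows contribute to the mean and standard deviation. Fitting on the training split only and then calling `transform` on the validation split is the usual way to avoid that.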

kolom_numerik=numerik_df.columns
numerik_scaled_df=pd.DataFrame(data=numerik_scaled,columns=kolom_numerik)

numerik_scaled_df

	Age	Height	Weight	FCVC	NCP	CH2O	FAF	TUE
0	-0.522124	-0.875589	-0.862558	-0.785019	0.404153	-0.013073	-1.188039	0.561997
1	-0.522124	-1.947599	-1.168077	1.088342	0.404153	1.618759	2.339750	-1.080625
2	-0.206889	1.054029	-0.366090	-0.785019	0.404153	-0.013073	1.163820	0.561997
3	0.423582	1.054029	0.015808	1.088342	0.404153	-0.013073	1.163820	-1.080625
4	-0.364507	0.839627	0.122740	-0.785019	-2.167023	-0.013073	-1.188039	-1.080625
…	…	…	…	…	…	…	…	…
2106	-0.525774	0.097045	1.711763	1.088342	0.404153	-0.456705	0.783135	0.407996
2107	-0.367195	0.502844	1.800914	1.088342	0.404153	-0.004702	0.389341	-0.096251
2108	-0.281909	0.541672	1.798868	1.088342	0.404153	0.075361	0.474971	-0.019018
2109	0.007776	0.404927	1.785780	1.088342	0.404153	1.377801	0.151471	-0.117991
2110	-0.102119	0.398344	1.790592	1.088342	0.404153	1.396035	0.018996	0.092432

2111 rows × 8 columns

numerik_scaled_df.columns

Index(['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE'], dtype='object')

num_cols = 2
num_rows = (len(numerik_scaled_df.columns) + num_cols - 1) // num_cols
default_subplot_size = (8, 6)
fig_width = default_subplot_size[0] * num_cols
fig_height = default_subplot_size[1] * num_rows
fig, axes = plt.subplots(num_rows, num_cols, figsize=(fig_width, fig_height))
for i, column in enumerate(numerik_scaled_df.columns):
    row_index = i // num_cols
    col_index = i % num_cols
    axes[row_index, col_index].hist(numerik_scaled_df[column], bins=10, color='skyblue', edgecolor='black')
    axes[row_index, col_index].set_title(f'Distribution of {column} after scaling')
    axes[row_index, col_index].set_xlabel(column)
    axes[row_index, col_index].set_ylabel('Frequency')
    axes[row_index, col_index].grid(True)
if len(numerik_scaled_df.columns) % num_cols != 0:
    fig.delaxes(axes[num_rows-1, num_cols-1])
plt.tight_layout()
plt.show()

kategori_d=train_df.drop(numeric_features,axis=1)

kategori_d

	Gender	CALC	FAVC	SCC	SMOKE	family_history_with_overweight	CAEC	MTRANS	NObeyesdad
0	Female	no	no	no	no	yes	Sometimes	Public_Transportation	Normal_Weight
1	Female	Sometimes	no	yes	yes	yes	Sometimes	Public_Transportation	Normal_Weight
2	Male	Frequently	no	no	no	yes	Sometimes	Public_Transportation	Normal_Weight
3	Male	Frequently	no	no	no	no	Sometimes	Walking	Overweight_Level_I
4	Male	Sometimes	no	no	no	no	Sometimes	Public_Transportation	Overweight_Level_II
…	…	…	…	…	…	…	…	…	…
2106	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation	Obesity_Type_III
2107	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation	Obesity_Type_III
2108	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation	Obesity_Type_III
2109	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation	Obesity_Type_III
2110	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation	Obesity_Type_III

2111 rows × 9 columns

kategori_df=kategori_d.drop(["NObeyesdad"],axis=1)

kategori_df

	Gender	CALC	FAVC	SCC	SMOKE	family_history_with_overweight	CAEC	MTRANS
0	Female	no	no	no	no	yes	Sometimes	Public_Transportation
1	Female	Sometimes	no	yes	yes	yes	Sometimes	Public_Transportation
2	Male	Frequently	no	no	no	yes	Sometimes	Public_Transportation
3	Male	Frequently	no	no	no	no	Sometimes	Walking
4	Male	Sometimes	no	no	no	no	Sometimes	Public_Transportation
…	…	…	…	…	…	…	…	…
2106	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation
2107	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation
2108	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation
2109	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation
2110	Female	Sometimes	yes	no	no	yes	Sometimes	Public_Transportation

2111 rows × 8 columns

kategori_df.columns

Index(['Gender', 'CALC', 'FAVC', 'SCC', 'SMOKE',
       'family_history_with_overweight', 'CAEC', 'MTRANS'],
      dtype='object')

kategori_encode_df = pd.get_dummies(kategori_df, columns=kategori_df.columns)

kategori_encode_df

	Gender_Female	Gender_Male	CALC_Always	CALC_Frequently	CALC_Sometimes	CALC_no	FAVC_no	FAVC_yes	SCC_no	SCC_yes	…	family_history_with_overweight_yes	CAEC_Always	CAEC_Frequently	CAEC_Sometimes	CAEC_no	MTRANS_Automobile	MTRANS_Bike	MTRANS_Motorbike	MTRANS_Public_Transportation	MTRANS_Walking
0	True	False	False	False	False	True	True	False	True	False	…	True	False	False	True	False	False	False	False	True	False
1	True	False	False	False	True	False	True	False	False	True	…	True	False	False	True	False	False	False	False	True	False
2	False	True	False	True	False	False	True	False	True	False	…	True	False	False	True	False	False	False	False	True	False
3	False	True	False	True	False	False	True	False	True	False	…	False	False	False	True	False	False	False	False	False	True
4	False	True	False	False	True	False	True	False	True	False	…	False	False	False	True	False	False	False	False	True	False
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2106	True	False	False	False	True	False	False	True	True	False	…	True	False	False	True	False	False	False	False	True	False
2107	True	False	False	False	True	False	False	True	True	False	…	True	False	False	True	False	False	False	False	True	False
2108	True	False	False	False	True	False	False	True	True	False	…	True	False	False	True	False	False	False	False	True	False
2109	True	False	False	False	True	False	False	True	True	False	…	True	False	False	True	False	False	False	False	True	False
2110	True	False	False	False	True	False	False	True	True	False	…	True	False	False	True	False	False	False	False	True	False

2111 rows × 23 columns

kategori_encode_df.columns

Index(['Gender_Female', 'Gender_Male', 'CALC_Always', 'CALC_Frequently',
       'CALC_Sometimes', 'CALC_no', 'FAVC_no', 'FAVC_yes', 'SCC_no', 'SCC_yes',
       'SMOKE_no', 'SMOKE_yes', 'family_history_with_overweight_no',
       'family_history_with_overweight_yes', 'CAEC_Always', 'CAEC_Frequently',
       'CAEC_Sometimes', 'CAEC_no', 'MTRANS_Automobile', 'MTRANS_Bike',
       'MTRANS_Motorbike', 'MTRANS_Public_Transportation', 'MTRANS_Walking'],
      dtype='object')
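
`pd.get_dummies` creates one indicator column per distinct value, so the encoded width is the sum of the unique counts of the input columns (here 2 + 4 + 2 + 2 + 2 + 2 + 4 + 5 = 23). A small sketch on a hypothetical two-column frame shows the same rule. The `dtype=int` argument is an optional choice that yields 0/1 instead of True/False:

```python
import pandas as pd

# Hypothetical sample with two of the categorical columns.
toy_cat = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Female"],
    "CALC": ["no", "Sometimes", "Frequently", "Always"],
})

encoded = pd.get_dummies(toy_cat, columns=list(toy_cat.columns), dtype=int)

# One indicator column per distinct value in each source column.
expected_width = sum(toy_cat[col].nunique() for col in toy_cat.columns)

print(list(encoded.columns))
print(encoded.shape[1], expected_width)
```

Each row has exactly one active indicator per source column, so binary columns such as `Gender_Female` and `Gender_Male` carry the same information twice. Passing `drop_first=True` would remove that redundancy if it matters for the model.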

kolom_numerik = list(numerik_df.columns)
kolom_kategorikal = list(kategori_df.columns)
train_df_filtered = train_df.drop(kolom_numerik + kolom_kategorikal, axis=1)
data_preprocessed_df = pd.concat([train_df_filtered, numerik_scaled_df, kategori_encode_df ], axis=1)

data_preprocessed_df

	NObeyesdad	Age	Height	Weight	FCVC	NCP	CH2O	FAF	TUE	Gender_Female	…	family_history_with_overweight_yes	CAEC_Always	CAEC_Frequently	CAEC_Sometimes	CAEC_no	MTRANS_Automobile	MTRANS_Bike	MTRANS_Motorbike	MTRANS_Public_Transportation	MTRANS_Walking
0	Normal_Weight	-0.522124	-0.875589	-0.862558	-0.785019	0.404153	-0.013073	-1.188039	0.561997	True	…	True	False	False	True	False	False	False	False	True	False
1	Normal_Weight	-0.522124	-1.947599	-1.168077	1.088342	0.404153	1.618759	2.339750	-1.080625	True	…	True	False	False	True	False	False	False	False	True	False
2	Normal_Weight	-0.206889	1.054029	-0.366090	-0.785019	0.404153	-0.013073	1.163820	0.561997	False	…	True	False	False	True	False	False	False	False	True	False
3	Overweight_Level_I	0.423582	1.054029	0.015808	1.088342	0.404153	-0.013073	1.163820	-1.080625	False	…	False	False	False	True	False	False	False	False	False	True
4	Overweight_Level_II	-0.364507	0.839627	0.122740	-0.785019	-2.167023	-0.013073	-1.188039	-1.080625	False	…	False	False	False	True	False	False	False	False	True	False
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2106	Obesity_Type_III	-0.525774	0.097045	1.711763	1.088342	0.404153	-0.456705	0.783135	0.407996	True	…	True	False	False	True	False	False	False	False	True	False
2107	Obesity_Type_III	-0.367195	0.502844	1.800914	1.088342	0.404153	-0.004702	0.389341	-0.096251	True	…	True	False	False	True	False	False	False	False	True	False
2108	Obesity_Type_III	-0.281909	0.541672	1.798868	1.088342	0.404153	0.075361	0.474971	-0.019018	True	…	True	False	False	True	False	False	False	False	True	False
2109	Obesity_Type_III	0.007776	0.404927	1.785780	1.088342	0.404153	1.377801	0.151471	-0.117991	True	…	True	False	False	True	False	False	False	False	True	False
2110	Obesity_Type_III	-0.102119	0.398344	1.790592	1.088342	0.404153	1.396035	0.018996	0.092432	True	…	True	False	False	True	False	False	False	False	True	False

2111 rows × 32 columns

data_preprocessed_df.columns

Index(['NObeyesdad', 'Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF',
       'TUE', 'Gender_Female', 'Gender_Male', 'CALC_Always', 'CALC_Frequently',
       'CALC_Sometimes', 'CALC_no', 'FAVC_no', 'FAVC_yes', 'SCC_no', 'SCC_yes',
       'SMOKE_no', 'SMOKE_yes', 'family_history_with_overweight_no',
       'family_history_with_overweight_yes', 'CAEC_Always', 'CAEC_Frequently',
       'CAEC_Sometimes', 'CAEC_no', 'MTRANS_Automobile', 'MTRANS_Bike',
       'MTRANS_Motorbike', 'MTRANS_Public_Transportation', 'MTRANS_Walking'],
      dtype='object')
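
A caveat on the concatenation step: `pd.concat(..., axis=1)` aligns rows by index label, not by position. It works here because all three frames appear to share the default 0 to 2110 index (an inference from the outputs shown). If the source frame had been filtered or shuffled first, the freshly built scaled frame would no longer line up. A sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical frames: `left` keeps a non-default index, `right` has the default 0..2 index.
left = pd.DataFrame({"label": ["a", "b", "c"]}, index=[10, 11, 12])
right = pd.DataFrame({"x": [0.1, 0.2, 0.3]})

# Label-based alignment produces the union of both indexes, padded with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index first restores positional alignment.
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```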

X_df = data_preprocessed_df.drop(columns=['NObeyesdad'])
y_df = data_preprocessed_df['NObeyesdad']

X_df

	Age	Height	Weight	FCVC	NCP	CH2O	FAF	TUE	Gender_Female	Gender_Male	…	family_history_with_overweight_yes	CAEC_Always	CAEC_Frequently	CAEC_Sometimes	CAEC_no	MTRANS_Automobile	MTRANS_Bike	MTRANS_Motorbike	MTRANS_Public_Transportation	MTRANS_Walking
0	-0.522124	-0.875589	-0.862558	-0.785019	0.404153	-0.013073	-1.188039	0.561997	True	False	…	True	False	False	True	False	False	False	False	True	False
1	-0.522124	-1.947599	-1.168077	1.088342	0.404153	1.618759	2.339750	-1.080625	True	False	…	True	False	False	True	False	False	False	False	True	False
2	-0.206889	1.054029	-0.366090	-0.785019	0.404153	-0.013073	1.163820	0.561997	False	True	…	True	False	False	True	False	False	False	False	True	False
3	0.423582	1.054029	0.015808	1.088342	0.404153	-0.013073	1.163820	-1.080625	False	True	…	False	False	False	True	False	False	False	False	False	True
4	-0.364507	0.839627	0.122740	-0.785019	-2.167023	-0.013073	-1.188039	-1.080625	False	True	…	False	False	False	True	False	False	False	False	True	False
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
2106	-0.525774	0.097045	1.711763	1.088342	0.404153	-0.456705	0.783135	0.407996	True	False	…	True	False	False	True	False	False	False	False	True	False
2107	-0.367195	0.502844	1.800914	1.088342	0.404153	-0.004702	0.389341	-0.096251	True	False	…	True	False	False	True	False	False	False	False	True	False
2108	-0.281909	0.541672	1.798868	1.088342	0.404153	0.075361	0.474971	-0.019018	True	False	…	True	False	False	True	False	False	False	False	True	False
2109	0.007776	0.404927	1.785780	1.088342	0.404153	1.377801	0.151471	-0.117991	True	False	…	True	False	False	True	False	False	False	False	True	False
2110	-0.102119	0.398344	1.790592	1.088342	0.404153	1.396035	0.018996	0.092432	True	False	…	True	False	False	True	False	False	False	False	True	False

2111 rows × 31 columns

y_df

0             Normal_Weight
1             Normal_Weight
2             Normal_Weight
3        Overweight_Level_I
4       Overweight_Level_II
               ...         
2106       Obesity_Type_III
2107       Obesity_Type_III
2108       Obesity_Type_III
2109       Obesity_Type_III
2110       Obesity_Type_III
Name: NObeyesdad, Length: 2111, dtype: object

y_df.unique()

array(['Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II',
       'Obesity_Type_I', 'Insufficient_Weight', 'Obesity_Type_II',
       'Obesity_Type_III'], dtype=object)

jumlah_kelas=y_df.nunique()

jumlah_kelas

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y=label_encoder.fit_transform(y_df)

y.shape

(2111,)

np.unique(y)

array([0, 1, 2, 3, 4, 5, 6])
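
Note that LabelEncoder assigns the integer codes in sorted order of the class names, not in order of appearance and not by severity. The sketch below uses the seven class names from the output above to show which code maps to which label, and how to map predictions back:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Class names in the order returned by y_df.unique() above.
labels = np.array(['Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II',
                   'Obesity_Type_I', 'Insufficient_Weight', 'Obesity_Type_II',
                   'Obesity_Type_III'])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ is sorted, and the position in classes_ is the integer code.
for code, name in enumerate(encoder.classes_):
    print(code, name)

# inverse_transform turns integer predictions back into readable labels.
decoded = encoder.inverse_transform(codes)
```

Since the codes are not ordered by severity, they should be treated as plain class identifiers. That is consistent with the `sparse_categorical_crossentropy` loss used later.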

from sklearn.model_selection import train_test_split
X_train,X_val,y_train,y_val=train_test_split(X_df ,y,
                    test_size=0.3,random_state=42,stratify=y)

X_train.shape

(1477, 31)
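
With `test_size=0.3` on 2111 rows, the validation size is rounded up to 634 and the remaining 1477 rows go to training. The `stratify=y` argument keeps the class proportions roughly equal in both parts. A sketch with placeholder features and hypothetical, nearly balanced labels (not the real data) reproduces the sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 2111
X_demo = np.zeros((n_samples, 1))      # placeholder features
y_demo = np.arange(n_samples) % 7      # hypothetical 7-class labels, nearly balanced

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

print(X_tr.shape[0], X_te.shape[0])

# Class shares in each split; stratification keeps them close to each other.
train_share = np.bincount(y_tr) / len(y_tr)
val_share = np.bincount(y_te) / len(y_te)
print(np.round(train_share, 3))
print(np.round(val_share, 3))
```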

X_train

	Age	Height	Weight	FCVC	NCP	CH2O	FAF	TUE	Gender_Female	Gender_Male	…	family_history_with_overweight_yes	CAEC_Always	CAEC_Frequently	CAEC_Sometimes	CAEC_no	MTRANS_Automobile	MTRANS_Bike	MTRANS_Motorbike	MTRANS_Public_Transportation	MTRANS_Walking
90	0.108346	-0.768388	0.244947	1.088342	1.689740	-1.644905	1.163820	-1.080625	True	False	…	False	True	False	False	False	False	False	False	True	False
513	-0.483801	-1.111228	-1.594060	1.088342	-1.233352	0.711664	0.362036	-1.080625	True	False	…	False	False	True	False	False	False	False	False	True	False
1100	-0.813763	-0.019932	-0.327900	-2.477684	-0.249806	0.721364	-1.188039	0.112112	False	True	…	True	False	False	True	False	False	False	False	True	False
339	-0.837360	-1.840398	-1.702735	-0.785019	0.404153	-1.644905	1.163820	-1.080625	True	False	…	False	False	False	True	False	False	False	False	True	False
612	-0.203982	-1.253098	-1.611971	-0.401141	-0.717141	0.183223	-0.017125	-1.080625	True	False	…	False	False	True	False	False	False	False	False	True	False
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
1567	0.978531	0.883032	1.300675	0.149990	0.404153	0.053754	-0.201741	-0.275980	False	True	…	True	False	False	True	False	True	False	False	False	False
1336	-0.520371	1.657731	1.206713	-0.785019	0.404153	1.618759	-0.260719	0.923421	False	True	…	True	False	False	True	False	False	False	False	True	False
609	-0.682924	0.554043	-1.206367	-0.785019	1.040325	1.580691	1.103930	2.204618	False	True	…	True	False	False	True	False	False	False	False	True	False
1659	-0.184602	1.582604	1.339420	1.088342	-0.225612	-0.513455	-0.282915	-1.080625	False	True	…	True	False	False	True	False	False	False	False	True	False
237	-0.837360	-0.661187	-1.282647	1.088342	0.404153	-1.644905	-0.012109	0.561997	True	False	…	True	False	False	True	False	False	False	False	True	False

1477 rows × 31 columns

X_train.shape[0]

X_train.shape[1]

X_val.shape

(634, 31)

X_val

	Age	Height	Weight	FCVC	NCP	CH2O	FAF	TUE	Gender_Female	Gender_Male	…	family_history_with_overweight_yes	CAEC_Always	CAEC_Frequently	CAEC_Sometimes	CAEC_no	MTRANS_Automobile	MTRANS_Bike	MTRANS_Motorbike	MTRANS_Public_Transportation	MTRANS_Walking
332	0.423582	1.590034	-0.442470	-0.785019	-2.167023	-0.013073	-0.012109	-1.080625	False	True	…	True	False	False	True	False	False	False	False	False	True
1235	-0.149256	0.600472	0.335144	-0.785019	0.404153	1.618759	2.339750	2.204618	False	True	…	True	False	False	True	False	False	False	False	True	False
16	0.423582	2.447641	0.588656	-0.785019	-2.167023	-1.644905	-0.012109	-1.080625	False	True	…	True	False	False	True	False	False	False	False	True	False
1214	0.219669	-1.244929	-0.238106	-0.785019	-2.167023	-0.013073	-1.188039	-1.080625	True	False	…	True	False	False	True	False	False	False	False	True	False
521	-0.837360	-1.473782	-1.699066	1.088342	-1.017214	0.731990	0.689422	0.557726	True	False	…	False	False	False	True	False	False	False	False	True	False
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
445	-0.837360	-2.054800	-1.588165	-0.785019	1.689740	-1.644905	2.339750	-1.080625	True	False	…	False	False	False	True	False	False	False	False	True	False
1576	0.416906	0.847463	1.007138	-0.291866	0.404153	0.137886	-1.188039	1.936629	False	True	…	True	False	False	True	False	True	False	False	False	False
1219	-0.036737	0.340488	0.432531	-0.785019	0.404153	1.363662	0.351610	1.118279	False	True	…	True	False	False	True	False	False	False	False	True	False
1771	0.213427	1.038806	1.197146	-0.715626	0.404153	0.700021	0.007004	-1.047700	False	True	…	True	False	False	True	False	False	False	False	True	False
50	-0.522124	-0.982790	-1.225362	1.088342	0.404153	1.618759	-1.188039	0.561997	True	False	…	True	False	False	True	False	False	False	False	False	True

634 rows × 31 columns

y_train.shape

(1477,)

y_train

array([3, 0, 6, ..., 0, 3, 1])

y_val.shape

(634,)

y_val

Model

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, GlobalMaxPooling1D, Dropout, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers
from sklearn.model_selection import RandomizedSearchCV
from tensorflow.keras.optimizers import Adam  # same namespace as the other Keras imports

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0, "d_model must be divisible by num_heads"
        self.depth = d_model // self.num_heads
        self.wq = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.wk = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.wv = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.dense = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        
    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])
        
    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)
        return output, attention_weights
    
    def scaled_dot_product_attention(self, q, k, v, mask):
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)
        return output, attention_weights

def create_model(units=128, dropout_rate=0.2, d_model=128, num_heads=8):
    input_layer = Input(shape=(X_train.shape[1],))
    dense_layer = Dense(units, activation='relu', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(input_layer)
    dropout_layer = Dropout(dropout_rate)(dense_layer)
    mha_layer = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
    mha_output, _ = mha_layer(dropout_layer, dropout_layer, dropout_layer, mask=None)
    pooling_layer = GlobalMaxPooling1D()(mha_output)
    batchnorm_layer = BatchNormalization()(pooling_layer)
    dense_layer2 = Dense(jumlah_kelas * 2, activation='relu', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(batchnorm_layer)
    dropout_layer2 = Dropout(0.1)(dense_layer2)
    output_layer = Dense(jumlah_kelas, activation='softmax', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(dropout_layer2)

    model = Model(inputs=input_layer, outputs=output_layer)
    optimizer = Adam(learning_rate=0.001)
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
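
One detail worth checking in this architecture: `create_model` passes a 2-D tensor of shape (batch, units) into the attention layer. After the Dense projection, `split_heads` reshapes it to (batch, -1, num_heads, depth), which leaves a sequence length of 1. With a single position, the softmax sees one logit per head, so every attention weight equals 1 and the attention output is just the projected value vector. The NumPy sketch below reproduces the scaled dot-product step for that case and, for contrast, for a sequence of 5 positions. The shapes and random inputs are illustrative assumptions, not the trained model's weights.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """q, k, v have shape (batch, heads, seq_len, depth)."""
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(k.shape[-1])
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
batch, heads, depth = 4, 8, 16

# Sequence length 1: what results when a (batch, d_model) tensor is split into heads.
q1, k1, v1 = (rng.normal(size=(batch, heads, 1, depth)) for _ in range(3))
out1, w1 = scaled_dot_product_attention(q1, k1, v1)

# Sequence length 5: the weights now vary across positions.
q5, k5, v5 = (rng.normal(size=(batch, heads, 5, depth)) for _ in range(3))
out5, w5 = scaled_dot_product_attention(q5, k5, v5)

print(w1.shape, np.allclose(w1, 1.0), np.allclose(out1, v1))
print(w5.shape, np.allclose(w5.sum(axis=-1), 1.0))
```

Under this reading, the layer behaves like two stacked linear projections (`wv` followed by `dense`) on this tabular input. Treating each feature as its own token would be one way to get attention across features, but that is a different design from the one shown here.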

from scipy.stats import randint
param_dist = {
    'units': randint(64, 512),
    'dropout_rate': [0.1, 0.2, 0.3, 0.4],
    'd_model': [32, 64, 128, 256],
    'num_heads': [8, 16, 32], 
    'batch_size': [64, 128, 256]
}
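
The attention layer asserts that `d_model` is divisible by `num_heads`. A quick enumeration of the two lists above suggests that every pair in this search space already satisfies that, so the divisibility guards should never fire for these settings. The sketch also reports the smallest per-head depth the space allows:

```python
from itertools import product

# Option lists copied from param_dist above.
d_model_options = [32, 64, 128, 256]
num_heads_options = [8, 16, 32]

pairs = list(product(d_model_options, num_heads_options))

# Pairs that would violate the d_model % num_heads == 0 requirement.
invalid = [(d, h) for d, h in pairs if d % h != 0]

# Per-head depth for every valid pair.
depths = {(d, h): d // h for d, h in pairs if d % h == 0}

print("Invalid combinations:", invalid)
print("Smallest per-head depth:", min(depths.values()))
```

The smallest depth comes from `d_model=32` with `num_heads=32`, where each head works with a single value.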

from sklearn.metrics import accuracy_score
class CustomEstimator:
    def __init__(self, units=128, dropout_rate=0.2, d_model=128, num_heads=8, batch_size=32):
        self.units = units
        self.dropout_rate = dropout_rate
        self.d_model = d_model
        self.num_heads = num_heads
        self.batch_size = batch_size
        self.model = None
    def fit(self, X, y, **kwargs):
        # Skip training when d_model cannot be split evenly across the heads
        if self.d_model % self.num_heads != 0:
            print("Warning: d_model is not divisible by num_heads. Skipping model initialization.")
            return self
        if self.model is None:
            self.model = self._init_model()
        self.model.fit(X, y, batch_size=self.batch_size, **kwargs)
        return self  # scikit-learn convention: fit returns the estimator
    def predict(self, X):
        if self.model is not None:
            y_pred_prob = self.model.predict(X)
            y_pred = np.argmax(y_pred_prob, axis=1)
            return y_pred
        else:
            print("No model available for prediction. Continuing to the next step of the algorithm.")
            return None
    def score(self, X, y):
        if self.model is not None:
            y_pred = self.predict(X)
            return accuracy_score(y, y_pred)
        else:
            print("No model available for scoring. Continuing to the next step of the algorithm.")
            return 0.0
    def _init_model(self):
        if self.d_model % self.num_heads != 0:
            print("Warning: d_model is not divisible by num_heads. Skipping model initialization.")
            return None
        return create_model(units=self.units, dropout_rate=self.dropout_rate, d_model=self.d_model, num_heads=self.num_heads)
    def get_params(self, deep=True):
        return {
            'units': self.units,
            'dropout_rate': self.dropout_rate,
            'd_model': self.d_model,
            'num_heads': self.num_heads,
            'batch_size': self.batch_size
        }
    def set_params(self, **params):
        for param, value in params.items():
            setattr(self, param, value)
        return self

custom_estimator = CustomEstimator()
random_search = RandomizedSearchCV(estimator=custom_estimator, param_distributions=param_dist, n_iter=10, cv=3)
random_search.fit(X_train, y_train, validation_data=(X_val, y_val))

best_model = random_search.best_estimator_

best_model_params = random_search.best_estimator_.get_params()
print(best_model_params)

{'units': 407, 'dropout_rate': 0.1, 'd_model': 32, 'num_heads': 32, 'batch_size': 64}

from keras.callbacks import EarlyStopping,ReduceLROnPlateau
early_stopping = EarlyStopping(monitor='val_loss', patience=5, 
                                     restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, 
                                    patience=5, min_lr=0.0001)

history = best_model.model.fit(X_train, y_train, epochs=1000,
                    validation_data=(X_val, y_val), callbacks=[early_stopping,reduce_lr])

loss, accuracy = best_model.model.evaluate(X_val, y_val, verbose=0)
print(f'Loss: {loss:.2f}')
print(f'Accuracy: {accuracy * 100:.2f}%')

Loss: 0.56
Accuracy: 93.22%

loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
plt.plot(accuracy, label='Training Accuracy')
plt.plot(val_accuracy, label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

best_model.model.summary()

Model: "functional_61"

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Layer (type)        ┃ Output Shape      ┃    Param # ┃ Connected to      ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━┩
│ input_layer_30      │ (None, 31)        │          0 │ -                 │
│ (InputLayer)        │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_210 (Dense)   │ (None, 407)       │     13,024 │ input_layer_30[0… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dropout_60          │ (None, 407)       │          0 │ dense_210[0][0]   │
│ (Dropout)           │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ multi_head_attenti… │ [(None, None,     │     40,224 │ dropout_60[0][0], │
│ (MultiHeadAttentio… │ 32), (None, 32,   │            │ dropout_60[0][0], │
│                     │ None, None)]      │            │ dropout_60[0][0]  │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ global_max_pooling… │ (None, 32)        │          0 │ multi_head_atten… │
│ (GlobalMaxPooling1… │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ batch_normalizatio… │ (None, 32)        │        128 │ global_max_pooli… │
│ (BatchNormalizatio… │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_215 (Dense)   │ (None, 14)        │        462 │ batch_normalizat… │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dropout_61          │ (None, 14)        │          0 │ dense_215[0][0]   │
│ (Dropout)           │                   │            │                   │
├─────────────────────┼───────────────────┼────────────┼───────────────────┤
│ dense_216 (Dense)   │ (None, 7)         │        105 │ dropout_61[0][0]  │
└─────────────────────┴───────────────────┴────────────┴───────────────────┘

 Total params: 161,703 (631.66 KB)

 Trainable params: 53,879 (210.46 KB)

 Non-trainable params: 64 (256.00 B)

 Optimizer params: 107,760 (420.94 KB)
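
The parameter counts in the summary can be reproduced by hand, which is a handy sanity check on the layer shapes. A Dense layer has n_in * n_out weights plus n_out biases. The sketch below applies that to each layer. The optimizer line is my reading of the reported number (two Adam slots per trainable weight plus two scalar variables) and should be treated as an assumption.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

n_features, units, d_model, n_classes = 31, 407, 32, 7

first_dense = dense_params(n_features, units)
# Attention block: wq, wk, wv project units -> d_model, then one d_model -> d_model dense.
attention = 3 * dense_params(units, d_model) + dense_params(d_model, d_model)
batchnorm_trainable = 2 * d_model   # gamma and beta
batchnorm_frozen = 2 * d_model      # moving mean and moving variance
hidden = dense_params(d_model, 2 * n_classes)
output = dense_params(2 * n_classes, n_classes)

trainable = first_dense + attention + batchnorm_trainable + hidden + output
# Assumption: Adam stores two slots per trainable weight plus two scalar variables.
optimizer_state = 2 * trainable + 2
total = trainable + batchnorm_frozen + optimizer_state

print(first_dense, attention, hidden, output)
print(trainable, batchnorm_frozen, optimizer_state, total)
```

These figures match the 13,024 / 40,224 / 462 / 105 per-layer counts and the 53,879 trainable, 64 non-trainable, and 161,703 total parameters reported above.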

from keras.utils import plot_model
file_name = 'arsitektur_model.png'
plot_model(best_model.model, to_file=file_name, show_shapes=True, show_layer_names=True)
plt.figure(figsize=(15,15))
img = plt.imread(file_name)
plt.imshow(img)
plt.title('Model Architecture', fontsize=18)
plt.axis('off') 
plt.savefig(file_name)
plt.show()

label_encoder.classes_

array(['Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I',
       'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I',
       'Overweight_Level_II'], dtype=object)

from sklearn.metrics import classification_report
y_pred_prob = best_model.model.predict(X_val)
y_pred = np.argmax(y_pred_prob, axis=1)
target_names =[str(cls) for cls in label_encoder.classes_]
report = classification_report(y_val, y_pred,target_names=target_names,zero_division=1)
print("Classification Report:\n", report)

20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step
Classification Report:
                      precision    recall  f1-score   support

Insufficient_Weight       0.96      0.98      0.97        82
      Normal_Weight       0.82      0.90      0.86        86
     Obesity_Type_I       0.95      0.96      0.96       106
    Obesity_Type_II       0.99      0.99      0.99        89
   Obesity_Type_III       1.00      0.99      0.99        97
 Overweight_Level_I       0.87      0.83      0.85        87
Overweight_Level_II       0.93      0.87      0.90        87

           accuracy                           0.93       634
          macro avg       0.93      0.93      0.93       634
       weighted avg       0.93      0.93      0.93       634
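
The f1-score column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it for the three weakest classes from the rounded values in the report, so small differences from the true unrounded scores are possible. It also confirms that the supports add up to the 634 validation rows.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded (precision, recall) pairs copied from the report above.
reported = {
    "Normal_Weight": (0.82, 0.90),
    "Overweight_Level_I": (0.87, 0.83),
    "Overweight_Level_II": (0.93, 0.87),
}
for name, (p, r) in reported.items():
    print(f"{name}: F1 ~ {f1(p, r):.2f}")

# Per-class support values from the report.
supports = [82, 86, 106, 89, 97, 87, 87]
print("Total support:", sum(supports))
```

The lower scores cluster around Normal_Weight and the two Overweight levels, which are neighbouring categories and are therefore the most likely to be confused with each other.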

from sklearn.metrics import confusion_matrix
import seaborn as sns
conf_matrix = confusion_matrix(y_val, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
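
Raw counts are harder to compare when class supports differ (82 to 106 rows here). Passing `normalize="true"` to `confusion_matrix` divides each row by its support, so the diagonal reads directly as per-class recall. A sketch with hypothetical labels for a 3-class problem:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 0, 2, 2, 2])

counts = confusion_matrix(y_true_demo, y_pred_demo)
row_normalized = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

# With row normalization, the diagonal holds the recall of each class.
per_class_recall = np.diag(row_normalized)

print(counts)
print(np.round(per_class_recall, 2))
```

The normalized matrix can be passed to `sns.heatmap` in the same way as the raw one, using `fmt='.2f'` instead of `fmt='d'`.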

train_df.to_csv("train_data.csv", index=False)

Obesity Levels Analysis with ML

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

df = pd.read_csv('/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv')

df.head()

	Age	Gender	Height	Weight	CALC	FAVC	FCVC	NCP	SCC	SMOKE	CH2O	family_history_with_overweight	FAF	TUE	CAEC	MTRANS	NObeyesdad
0	21.0	Female	1.62	64.0	no	no	2.0	3.0	no	no	2.0	yes	0.0	1.0	Sometimes	Public_Transportation	Normal_Weight
1	21.0	Female	1.52	56.0	Sometimes	no	3.0	3.0	yes	yes	3.0	yes	3.0	0.0	Sometimes	Public_Transportation	Normal_Weight
2	23.0	Male	1.80	77.0	Frequently	no	2.0	3.0	no	no	2.0	yes	2.0	1.0	Sometimes	Public_Transportation	Normal_Weight
3	27.0	Male	1.80	87.0	Frequently	no	3.0	3.0	no	no	2.0	no	2.0	0.0	Sometimes	Walking	Overweight_Level_I
4	22.0	Male	1.78	89.8	Sometimes	no	2.0	1.0	no	no	2.0	no	0.0	0.0	Sometimes	Public_Transportation	Overweight_Level_II

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Age                             2111 non-null   float64
 1   Gender                          2111 non-null   object 
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   CALC                            2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   SCC                             2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  family_history_with_overweight  2111 non-null   object 
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CAEC                            2111 non-null   object 
 15  MTRANS                          2111 non-null   object 
 16  NObeyesdad                      2111 non-null   object 
dtypes: float64(8), object(9)
memory usage: 280.5+ KB

df.isnull().sum()

Age                               0
Gender                            0
Height                            0
Weight                            0
CALC                              0
FAVC                              0
FCVC                              0
NCP                               0
SCC                               0
SMOKE                             0
CH2O                              0
family_history_with_overweight    0
FAF                               0
TUE                               0
CAEC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64

#df = df.drop_duplicates()
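The drop_duplicates call is commented out here. Since a large share of this dataset is synthetic, it may be worth counting exact duplicate rows before deciding either way. A minimal sketch on a toy frame (the rows below are invented, not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for df; values are invented for illustration
toy = pd.DataFrame({
    "Age": [21.0, 21.0, 23.0, 21.0],
    "Gender": ["Female", "Female", "Male", "Female"],
    "Weight": [64.0, 56.0, 77.0, 64.0],
})

# duplicated() flags each row that repeats an earlier row;
# the first occurrence is not flagged
n_duplicates = int(toy.duplicated().sum())
print(f"Exact duplicate rows: {n_duplicates}")

# Drop the repeats and reset the index so row labels stay contiguous
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```

The same two calls would apply to df unchanged; whether to drop the repeats depends on whether they are treated as noise or as genuine repeated observations.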

# Define the order of categories and corresponding colors
order_colors = {"Male": "blue", "Female": "pink"}
gender_order = list(order_colors)

plt.figure(figsize=(6, 6))
sns.countplot(x="Gender", data=df, order=gender_order, palette=list(order_colors.values()))
plt.title("Gender Distribution", fontsize=14, fontweight="bold")
plt.xticks(rotation=45)

# Annotate each bar with its count, in the same order as the bars are drawn
gender_counts = df["Gender"].value_counts().reindex(gender_order)
for i, count in enumerate(gender_counts):
    plt.text(i, count, str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Group the data by gender
grouped = df.groupby('Gender')

# Create a figure with multiple subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(16, 12))
fig.suptitle('Measures by Gender', fontsize=16)

# Visualize CALC
calc_counts = grouped['CALC'].value_counts().unstack()
calc_counts.plot(kind='bar', ax=axes[0, 0])

# Set title, labels, and annotations
axes[0, 0].set_title('How often do you drink alcohol?')
axes[0, 0].set_xlabel('Gender')  # x-axis is Gender; CALC values are in the legend
axes[0, 0].set_ylabel('Count')
for p in axes[0, 0].patches:
    axes[0, 0].annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height() + 0.5),
                        ha='center', va='bottom')

# Visualize FAVC
favc_counts = grouped['FAVC'].value_counts().unstack()
favc_counts.plot(kind='bar', ax=axes[0, 1])

# Set title, labels, and annotations
axes[0, 1].set_title('Do you eat high caloric food frequently?')
axes[0, 1].set_xlabel('Gender')  # x-axis is Gender; FAVC values are in the legend
axes[0, 1].set_ylabel('Count')
for p in axes[0, 1].patches:
    axes[0, 1].annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height() + 0.5),
                        ha='center', va='bottom')

# Visualize FCVC
fcvc_means = grouped['FCVC'].mean().reset_index()
fcvc_means.columns = ['Gender', 'FCVC Mean']
fcvc_means.set_index('Gender', inplace=True)
fcvc_means.plot(kind='bar', ax=axes[1, 0])

# Set title, labels, and annotations
axes[1, 0].set_title('Do you usually eat vegetables in your meals?')
axes[1, 0].set_xlabel('Gender')
axes[1, 0].set_ylabel('FCVC Mean')
for p in axes[1, 0].patches:
    bar_width = p.get_width()
    bar_height = p.get_height()
    bar_x = p.get_x()
    bar_middle = bar_x + bar_width / 2
    axes[1, 0].annotate(str(round(bar_height, 2)), (bar_middle, bar_height), ha='center', va='bottom')

# Visualize NCP
ncp_means = grouped['NCP'].mean().reset_index()
ncp_means.columns = ['Gender', 'NCP Mean']
ncp_means.set_index('Gender', inplace=True)
ncp_means.plot(kind='bar', ax=axes[1, 1])

# Set title, labels, and annotations
axes[1, 1].set_title('How many main meals do you have daily?')
axes[1, 1].set_xlabel('Gender')
axes[1, 1].set_ylabel('NCP Mean')
for p in axes[1, 1].patches:
    bar_width = p.get_width()
    bar_height = p.get_height()
    bar_x = p.get_x()
    bar_middle = bar_x + bar_width / 2
    axes[1, 1].annotate(str(round(bar_height, 2)), (bar_middle, bar_height), ha='center', va='bottom')

# Visualize SCC
scc_counts = grouped['SCC'].value_counts().unstack()
scc_counts.plot(kind='bar', ax=axes[2, 0])

# Set title, labels, and annotations
axes[2, 0].set_title('Do you monitor the calories you eat daily? ')
axes[2, 0].set_xlabel('Gender')  # x-axis is Gender; SCC values are in the legend
axes[2, 0].set_ylabel('Count')
for p in axes[2, 0].patches:
    axes[2, 0].annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom')

# Visualize SMOKE
smoke_counts = grouped['SMOKE'].value_counts().unstack()
smoke_counts.plot(kind='bar', ax=axes[2, 1])

# Set title, labels, and annotations
axes[2, 1].set_title('Do you smoke?')
axes[2, 1].set_xlabel('Gender')  # x-axis is Gender; SMOKE values are in the legend
axes[2, 1].set_ylabel('Count')
for p in axes[2, 1].patches:
    axes[2, 1].annotate(str(p.get_height()), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom')

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5, wspace=0.5)

# Display the plot
plt.show()


# Order obesity levels by frequency (value_counts sorts descending; the plot reverses it, so bars run from least to most frequent)
sorted_obesity_levels = df['NObeyesdad'].value_counts().index

plt.figure(figsize=(6, 6))
sns.countplot(x="NObeyesdad", data=df, order=sorted_obesity_levels[::-1], palette="Greens")
plt.title("Obesity Level Distribution", fontsize=14, fontweight="bold")
plt.xticks(rotation=45)

# Annotate each bar with its count
for i, count in enumerate(df['NObeyesdad'].value_counts()[::-1]):
    plt.text(i, count, str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()

plt.figure(figsize=(16, 6))
plt.subplot(1, 3, 1)
sns.histplot(df["Age"].dropna(), kde=True, color="Red")
plt.title("Age Distribution", fontsize=14, fontweight="bold")

plt.subplot(1, 3, 2)
sns.histplot(df["Height"].dropna(), kde=True, color="Orange")
plt.title("Height Distribution", fontsize=14, fontweight="bold")

plt.subplot(1, 3, 3)
sns.histplot(df["Weight"].dropna(), kde=True, color="Purple")
plt.title("Weight Distribution", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.show()

/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):

# Correlation heatmap
plt.figure(figsize=(12, 8))

# Select only numerical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Calculate correlation matrix
corr_matrix = df[numeric_cols].corr()

# Plot correlation heatmap
sns.heatmap(corr_matrix, annot=True, cmap="YlGnBu")
plt.title("Feature Correlation Heatmap", fontsize=16, fontweight="bold")
plt.show()
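A heatmap is good for a quick look, but it can help to list the strongest pairs explicitly. The sketch below ranks feature pairs by absolute correlation. It runs on a small toy frame with invented values; swapping in df[numeric_cols] should work the same way.

```python
import numpy as np
import pandas as pd

# Toy numeric frame standing in for df[numeric_cols]; values are invented
toy = pd.DataFrame({
    "Height": [1.60, 1.70, 1.80, 1.75, 1.65],
    "Weight": [55.0, 70.0, 90.0, 80.0, 60.0],
    "FAF":    [3.0, 1.0, 0.0, 2.0, 1.0],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```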


# Define BMI categories and corresponding colors
bmi_colors = {
    "Normal": "green",
    "Overweight": "red",
    "Underweight": "blue"
}

# Calculate BMI for each person in the dataset
df['BMI'] = df['Weight'] / (df['Height'] ** 2)

# Create a new column to categorize BMI (below 18.5 is underweight, 18.5 up to 25 is normal, 25 and above is overweight)
df['BMI Category'] = pd.cut(df['BMI'], bins=[0, 18.5, 25, np.inf], labels=["Underweight", "Normal", "Overweight"], right=False)

# Plot the scatterplot with colors based on BMI categories
plt.figure(figsize=(8, 8))
for category, color in bmi_colors.items():
    subset = df[df['BMI Category'] == category]
    plt.scatter(subset['Height'], subset['Weight'], color=color, label=category)

plt.title("Height vs Weight", fontsize=14, fontweight="bold")
plt.xlabel("Height (m)")
plt.ylabel("Weight (kg)")
plt.legend()
plt.tight_layout()
plt.show()
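The three-way split above places everyone with a high BMI in a single "Overweight" group. As a rough sketch, a fourth bin at the commonly used cutoff of 30 separates overweight from obese. The rows below are invented, and the "Obese" label is an addition for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy height (m) and weight (kg) rows; invented for illustration
toy = pd.DataFrame({
    "Height": [1.70, 1.70, 1.70, 1.70],
    "Weight": [50.0, 65.0, 80.0, 100.0],
})
toy["BMI"] = toy["Weight"] / toy["Height"] ** 2

# right=False makes each bin include its lower edge: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bins = [0, 18.5, 25, 30, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obese"]
toy["BMI Category"] = pd.cut(toy["BMI"], bins=bins, labels=labels, right=False)

print(toy)
```

Adding a matching color for the new label to the color dictionary would be enough to carry it through to the scatterplot.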

Conclusion

Alright, everyone, we’ve reached the end of our journey through Machine Learning Project 6: Obesity Type – Best EDA and Classification! We’ve had an exciting adventure exploring the world of obesity types and harnessing the potential of machine learning to expertly classify them.

But this is more than just crunching numbers and making predictions. It’s about making a genuine difference in people’s lives.


By understanding the intricacies of obesity types, we’re paving the way for personalized interventions and treatments that can truly transform lives for the better.

As we conclude our journey, let’s maintain the momentum. Let’s continue pushing the boundaries of data science to address real-world issues and have a positive impact on society.

Remember, we hold the power of data in our hands – let’s use it wisely to create a healthier, happier world for everyone. Take care, and keep coding!


Learn more

More info about us

Facebook: Click

Telegram group of exercises: Click

YouTube: Click


Published by Darek Dari on May 27, 2024

Table of Contents

Introduction

Dataset Information

Importing libraries and Reading data

Data wrangling

Univariate analysis

Multivariate analysis

How is obesity type affected by eating high calorie food?

Average age of each obesity type

Average weight of each obesity type

Does gender affect obesity type?

Does eating food between meals affect obesity type?

Does family history with overweight affect obesity type?

Do people who drink also smoke?

Data preprocessing and splitting data

Encoding ordinal features using label encoder

Encoding nominal data using pd dummies

Correlation between data attributes

Normalizing data using max absolute scaler

Splitting data and training models

Evaluating models

– Light gradient boosting is the best model, achieving 99% accuracy and an average accuracy of 97% in k-fold cross-validation

– XGBoost and gradient boosting are almost the same in accuracy (98%) and k-fold cross-validation (97%)

– Random forest achieved the worst results, with 94% accuracy and an average accuracy of 92% in k-fold cross-validation; it also seemed to overfit

Obesity Levels-Multi Head Attention-Hyperparameter

Obesity Levels Analysis with ML

Conclusion

Learn more

More info about us
