Introduction
Hey there! Welcome to the inside scoop on ML Project 6: Obesity Type – Best EDA and Classification! We’re diving deep into the world of data science to tackle a real-life problem: obesity.
Also, check Machine Learning projects:
- Machine Learning Project 1: Honda Motor Stocks best Prices analysis
- Machine Learning Project 2: Diversity in Tech Companies Best EDA Analysis
- Machine Learning Project 3: Exploring Indian Cuisine Best Analysis
- Machine Learning Project 4: Exploring Video Game Data
- Machine Learning Project 5: Best Students Performance EDA
- Machine Learning Project 6: Obesity type Best EDA and classification
We’re not just scratching the surface, we’re diving headfirst into understanding the different types of obesity and how we can combat it using some serious data skills.
Imagine this: we’ll be sifting through data like a detective on a case, searching for clues about what causes each type of obesity. And once we have that knowledge, we’ll work our magic with top-notch machine learning techniques to classify obesity types like nobody’s business.
Think of it as a mission to crack the code of obesity and discover ways to help people lead healthier lives. So get ready, because we’re about to embark on an exciting journey through the world of data! Let’s get started!
Dataset Information
This dataset provides information on obesity levels in individuals from Mexico, Peru, and Colombia. It includes data on their eating habits and physical condition. The dataset consists of 17 attributes and 2111 records.
Dataset Link: https://www.kaggle.com/code/diaakotb/obesity-type-eda-and-classification-99-boosting
Each record is labeled with the class variable NObeyesdad (obesity level), which takes one of seven values: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III.
Approximately 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, while the remaining 23% was collected directly from users through a web platform.
- Gender: Feature, Categorical, “Gender”
- Age: Feature, Continuous, “Age”
- Height: Feature, Continuous
- Weight: Feature, Continuous
- family_history_with_overweight: Feature, Binary, “Has a family member suffered or suffers from overweight?”
- FAVC: Feature, Binary, “Do you eat high caloric food frequently?”
- FCVC: Feature, Integer, “Do you usually eat vegetables in your meals?”
- NCP: Feature, Continuous, “How many main meals do you have daily?”
- CAEC: Feature, Categorical, “Do you eat any food between meals?”
- SMOKE: Feature, Binary, “Do you smoke?”
- CH2O: Feature, Continuous, “How much water do you drink daily?”
- SCC: Feature, Binary, “Do you monitor the calories you eat daily?”
- FAF: Feature, Continuous, “How often do you have physical activity?”
- TUE: Feature, Integer, “How much time do you use technological devices such as cell phone, videogames, television, computer and others?”
- CALC: Feature, Categorical, “How often do you drink alcohol?”
- MTRANS: Feature, Categorical, “Which transportation do you usually use?”
- NObeyesdad: Target, Categorical, “Obesity level”
Importing libraries and Reading data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import warnings

sns.set_style("darkgrid")
data = pd.read_csv("/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv")
data.head()
| Age | Gender | Height | Weight | CALC | FAVC | FCVC | NCP | SCC | SMOKE | CH2O | family_history_with_overweight | FAF | TUE | CAEC | MTRANS | NObeyesdad |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.0 | Female | 1.62 | 64.0 | no | no | 2.0 | 3.0 | no | no | 2.0 | yes | 0.0 | 1.0 | Sometimes | Public_Transportation | Normal_Weight |
1 | 21.0 | Female | 1.52 | 56.0 | Sometimes | no | 3.0 | 3.0 | yes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
2 | 23.0 | Male | 1.80 | 77.0 | Frequently | no | 2.0 | 3.0 | no | no | 2.0 | yes | 2.0 | 1.0 | Sometimes | Public_Transportation | Normal_Weight |
3 | 27.0 | Male | 1.80 | 87.0 | Frequently | no | 3.0 | 3.0 | no | no | 2.0 | no | 2.0 | 0.0 | Sometimes | Walking | Overweight_Level_I |
4 | 22.0 | Male | 1.78 | 89.8 | Sometimes | no | 2.0 | 1.0 | no | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation | Overweight_Level_II |
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Age                             2111 non-null   float64
 1   Gender                          2111 non-null   object
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   CALC                            2111 non-null   object
 5   FAVC                            2111 non-null   object
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   SCC                             2111 non-null   object
 9   SMOKE                           2111 non-null   object
 10  CH2O                            2111 non-null   float64
 11  family_history_with_overweight  2111 non-null   object
 12  FAF                             2111 non-null   float64
 13  TUE                             2111 non-null   float64
 14  CAEC                            2111 non-null   object
 15  MTRANS                          2111 non-null   object
 16  NObeyesdad                      2111 non-null   object
dtypes: float64(8), object(9)
memory usage: 280.5+ KB
Data wrangling
data.duplicated().sum()
24
data.loc[data.duplicated(keep=False), :]
| Age | Gender | Height | Weight | CALC | FAVC | FCVC | NCP | SCC | SMOKE | CH2O | family_history_with_overweight | FAF | TUE | CAEC | MTRANS | NObeyesdad |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
97 | 21.0 | Female | 1.52 | 42.0 | Sometimes | no | 3.0 | 1.0 | no | no | 1.0 | no | 0.0 | 0.0 | Frequently | Public_Transportation | Insufficient_Weight |
98 | 21.0 | Female | 1.52 | 42.0 | Sometimes | no | 3.0 | 1.0 | no | no | 1.0 | no | 0.0 | 0.0 | Frequently | Public_Transportation | Insufficient_Weight |
105 | 25.0 | Female | 1.57 | 55.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 2.0 | no | 2.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
106 | 25.0 | Female | 1.57 | 55.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 2.0 | no | 2.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
145 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
174 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
179 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
184 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
208 | 22.0 | Female | 1.69 | 65.0 | Sometimes | yes | 2.0 | 3.0 | no | no | 2.0 | yes | 1.0 | 1.0 | Sometimes | Public_Transportation | Normal_Weight |
209 | 22.0 | Female | 1.69 | 65.0 | Sometimes | yes | 2.0 | 3.0 | no | no | 2.0 | yes | 1.0 | 1.0 | Sometimes | Public_Transportation | Normal_Weight |
282 | 18.0 | Female | 1.62 | 55.0 | no | yes | 2.0 | 3.0 | no | no | 1.0 | yes | 1.0 | 1.0 | Frequently | Public_Transportation | Normal_Weight |
295 | 16.0 | Female | 1.66 | 58.0 | no | no | 2.0 | 1.0 | no | no | 1.0 | no | 0.0 | 1.0 | Sometimes | Walking | Normal_Weight |
309 | 16.0 | Female | 1.66 | 58.0 | no | no | 2.0 | 1.0 | no | no | 1.0 | no | 0.0 | 1.0 | Sometimes | Walking | Normal_Weight |
443 | 18.0 | Male | 1.72 | 53.0 | Sometimes | yes | 2.0 | 3.0 | no | no | 2.0 | yes | 0.0 | 2.0 | Sometimes | Public_Transportation | Insufficient_Weight |
460 | 18.0 | Female | 1.62 | 55.0 | no | yes | 2.0 | 3.0 | no | no | 1.0 | yes | 1.0 | 1.0 | Frequently | Public_Transportation | Normal_Weight |
466 | 22.0 | Male | 1.74 | 75.0 | no | yes | 3.0 | 3.0 | no | no | 1.0 | yes | 1.0 | 0.0 | Frequently | Automobile | Normal_Weight |
467 | 22.0 | Male | 1.74 | 75.0 | no | yes | 3.0 | 3.0 | no | no | 1.0 | yes | 1.0 | 0.0 | Frequently | Automobile | Normal_Weight |
496 | 18.0 | Male | 1.72 | 53.0 | Sometimes | yes | 2.0 | 3.0 | no | no | 2.0 | yes | 0.0 | 2.0 | Sometimes | Public_Transportation | Insufficient_Weight |
523 | 21.0 | Female | 1.52 | 42.0 | Sometimes | yes | 3.0 | 1.0 | no | no | 1.0 | no | 0.0 | 0.0 | Frequently | Public_Transportation | Insufficient_Weight |
527 | 21.0 | Female | 1.52 | 42.0 | Sometimes | yes | 3.0 | 1.0 | no | no | 1.0 | no | 0.0 | 0.0 | Frequently | Public_Transportation | Insufficient_Weight |
659 | 21.0 | Female | 1.52 | 42.0 | Sometimes | yes | 3.0 | 1.0 | no | no | 1.0 | no | 0.0 | 0.0 | Frequently | Public_Transportation | Insufficient_Weight |
663 | 21.0 | Female | 1.52 | 42.0 | Sometimes | yes | 3.0 | 1.0 | no | no | 1.0 | no | 0.0 | 0.0 | Frequently | Public_Transportation | Insufficient_Weight |
763 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
764 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
824 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
830 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
831 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
832 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
833 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
834 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
921 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
922 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
923 | 21.0 | Male | 1.62 | 70.0 | Sometimes | yes | 2.0 | 1.0 | no | no | 3.0 | no | 1.0 | 0.0 | no | Public_Transportation | Overweight_Level_I |
- We cannot tell whether these rows are accidental duplicate entries or different people who gave identical answers, so we keep them
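The count of 24 is smaller than the number of rows listed above because the two calls mark duplicates differently. Here is a minimal sketch of that difference on a made-up toy frame (the values are hypothetical and not taken from the obesity data):

```python
import pandas as pd

# Toy frame (hypothetical values): record "a" appears three times, "b" twice.
toy = pd.DataFrame({"key": ["a", "a", "a", "b", "b", "c"],
                    "val": [1, 1, 1, 2, 2, 3]})

# Default keep="first": the first occurrence is not flagged, only the repeats.
n_repeats = int(toy.duplicated().sum())

# keep=False flags every member of a duplicated group, including the first.
n_in_groups = int(toy.duplicated(keep=False).sum())

# If you did decide to drop repeats, this keeps one row per distinct record.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, n_in_groups, len(deduped))
```

So `duplicated().sum()` counts the extra copies, while `duplicated(keep=False)` lists every row that has at least one twin.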
data = data.rename(columns={
    "CALC": "alcohol_drinking_frequency",
    "FAVC": "high_calorie_food_eat",
    "FCVC": "vegetable_eat_daily",
    "NCP": "number_of_meals_daily",
    "SCC": "calories_monitoring",
    "CH2O": "water_drinking_daily",
    "FAF": "physical_activity_daily",
    "TUE": "electronics_usage_daily",
    "CAEC": "food_between_meals",
    "MTRANS": "method_of_transportion",
})
# Round these columns to whole numbers and store them as integers.
for col in ['Age', 'Weight', 'vegetable_eat_daily', 'number_of_meals_daily',
            'water_drinking_daily', 'physical_activity_daily', 'electronics_usage_daily']:
    data[col] = data.loc[:, col].round().astype(int)
data.describe()
| Age | Height | Weight | vegetable_eat_daily | number_of_meals_daily | water_drinking_daily | physical_activity_daily | electronics_usage_daily |
---|---|---|---|---|---|---|---|---|
count | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 |
mean | 24.315964 | 1.701677 | 86.586452 | 2.423496 | 2.687826 | 2.014685 | 1.006632 | 0.664614 |
std | 6.357078 | 0.093305 | 26.190136 | 0.583905 | 0.809680 | 0.688616 | 0.895462 | 0.674009 |
min | 14.000000 | 1.450000 | 39.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 20.000000 | 1.630000 | 65.500000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 |
50% | 23.000000 | 1.700499 | 83.000000 | 2.000000 | 3.000000 | 2.000000 | 1.000000 | 1.000000 |
75% | 26.000000 | 1.768464 | 107.000000 | 3.000000 | 3.000000 | 2.000000 | 2.000000 | 1.000000 |
max | 61.000000 | 1.980000 | 173.000000 | 3.000000 | 4.000000 | 3.000000 | 3.000000 | 2.000000 |
Univariate analysis
plt.figure(figsize=(18,15))
# One count plot per categorical feature (the last object column is the target).
for i, col in enumerate(data.select_dtypes(include="object").columns[:-1]):
    plt.subplot(4, 2, i+1)
    sns.countplot(data=data, x=col, palette=sns.color_palette("Set2"))
data["NObeyesdad"].value_counts().sort_values(ascending=False).plot(kind="bar",color="red")
<Axes: xlabel='NObeyesdad'>
plt.figure(figsize=(18,15))
# Box plots for the first three numeric columns: Age, Height, Weight.
for i, col in enumerate(data.select_dtypes(include="number").columns[:3]):
    plt.subplot(4, 2, i+1)
    sns.boxplot(data=data, x=col, palette=sns.color_palette("Set2"))
- The Age column has a lot of outliers, so we filter out the most extreme values
data=data[np.abs(stats.zscore(data["Age"])) < 2].reset_index(drop=True)
sns.boxplot(data=data,x="Age")
<Axes: xlabel='Age'>
data.shape
(1981, 17)
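The filter above keeps only rows whose Age lies within 2 standard deviations of the mean, which takes the frame from 2111 rows down to 1981. As a rough illustration of the mechanics, here is a small sketch on made-up ages (the numbers are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy ages (hypothetical): most values are in the twenties, one is far out.
ages = pd.Series([20, 21, 22, 23, 24, 25, 26, 27, 28, 61], name="Age")

# z-score = (value - mean) / std; stats.zscore uses the population std (ddof=0).
z = np.abs(stats.zscore(ages.to_numpy()))

# Keep values within 2 standard deviations of the mean, as in the filter above.
kept = ages[z < 2].reset_index(drop=True)

print(len(ages), len(kept), kept.max())
```

A threshold of 2 is fairly strict (3 is a common alternative), so it is worth checking how many rows a given cut-off removes before committing to it.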
plt.figure(figsize=(18,15))
# Count plots for the remaining (rounded) numeric columns.
for i, col in enumerate(data.select_dtypes(include="number").columns[3:]):
    plt.subplot(4, 2, i+1)
    sns.countplot(data=data, x=col)
Multivariate analysis
How is obesity type affected by eating high calorie food?
data.groupby(['NObeyesdad', 'high_calorie_food_eat'])["high_calorie_food_eat"].count()
NObeyesdad           high_calorie_food_eat
Insufficient_Weight  no                        51
                     yes                      220
Normal_Weight        no                        75
                     yes                      206
Obesity_Type_I       no                         9
                     yes                      284
Obesity_Type_II      no                         6
                     yes                      268
Obesity_Type_III     no                         1
                     yes                      323
Overweight_Level_I   no                        20
                     yes                      256
Overweight_Level_II  no                        71
                     yes                      191
Name: high_calorie_food_eat, dtype: int64
plt.figure(figsize=(10,7))
sns.countplot(data=data, x=data.NObeyesdad, hue=data.high_calorie_food_eat, palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()
- High-calorie food does not distinguish the obesity types from each other very much, but it does appear to be associated with being above normal weight
- In Obesity Type III only one person reports not eating high-calorie food, and the share of "yes" answers is almost as high in Obesity Type II (268 of 274)
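Raw counts are hard to compare because the classes have different sizes. From the counts above, for example, 206 of 281 Normal_Weight rows (about 73%) answer "yes", against 323 of 324 Obesity_Type_III rows (about 99.7%). One way to get such shares directly is a row-normalized cross-tabulation. The sketch below uses a made-up toy frame, not the real data:

```python
import pandas as pd

# Toy data (hypothetical counts, not the real dataset).
toy = pd.DataFrame({
    "NObeyesdad": ["Normal_Weight"] * 4 + ["Obesity_Type_I"] * 4,
    "high_calorie_food_eat": ["no", "no", "yes", "yes", "no", "yes", "yes", "yes"],
})

# normalize="index" divides each row by its row total, giving per-class shares.
shares = pd.crosstab(toy["NObeyesdad"], toy["high_calorie_food_eat"], normalize="index")

print(shares)
```

On the real frame, the same call with `data["NObeyesdad"]` and `data["high_calorie_food_eat"]` should give one row of shares per obesity class.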
Average age of each obesity type
data.groupby("NObeyesdad")["Age"].median()
NObeyesdad
Insufficient_Weight    19.0
Normal_Weight          21.0
Obesity_Type_I         23.0
Obesity_Type_II        27.0
Obesity_Type_III       25.0
Overweight_Level_I     21.0
Overweight_Level_II    23.0
Name: Age, dtype: float64
data.groupby("NObeyesdad")["Age"].median().sort_values(ascending=False).plot(kind="bar", color=sns.color_palette("Set2"))
plt.title("Average age of each obesity type")
Text(0.5, 1.0, 'Average age of each obesity type')
- The "average" here is the median. It is highest for Obesity Type II (27), followed by Type III (25), then Type I and Overweight Level II (both 23)
- The median age is lowest for Insufficient Weight (19)
- So in this data the heavier classes tend to be somewhat older
Average weight of each obesity type
data.groupby("NObeyesdad")["Weight"].mean()
NObeyesdad
Insufficient_Weight     49.926199
Normal_Weight           62.106762
Obesity_Type_I          94.819113
Obesity_Type_II        115.306569
Obesity_Type_III       120.972222
Overweight_Level_I      74.510870
Overweight_Level_II     82.045802
Name: Weight, dtype: float64
data.groupby("NObeyesdad")["Weight"].mean().sort_values(ascending=False).plot(kind="bar", color=sns.color_palette("Set2"))
<Axes: xlabel='NObeyesdad'>
- As expected, Obesity Type III has the highest average weight (about 121), followed by Type II (about 115) and Type I (about 95)
Does gender affect obesity type?
plt.figure(figsize=(10,7))
sns.countplot(data=data, x="NObeyesdad", hue="Gender", palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()
- Males outnumber females in almost all obesity types; Obesity Type III is the exception
- Females are more likely to have insufficient weight
- Females are also more likely to have severe obesity (Type III)
Does eating food between meals affect obesity type?
plt.figure(figsize=(10,7))
sns.countplot(data=data, x="NObeyesdad", hue="food_between_meals", palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()
- Most people answer "Sometimes" when asked whether they eat between meals
- People with insufficient or normal weight are the ones who most often answer "Frequently"
- This may suggest that frequent small snacks go together with lower weight, but the plot shows only an association, not a cause
Does family history with overweight affect obesity type?
plt.figure(figsize=(10,7))
sns.countplot(data=data, x="NObeyesdad", hue="family_history_with_overweight", palette=sns.color_palette("Dark2"))
plt.xticks(rotation=-20)
plt.show()
- A family history of overweight seems to go together with higher weight: nearly everyone in Obesity Types I, II, and III reports one
Do people who drink also smoke?
sns.countplot(data=data,x=data.alcohol_drinking_frequency,hue=data.SMOKE)
<Axes: xlabel='alcohol_drinking_frequency', ylabel='count'>
- No: most of the people who drink alcohol don’t smoke
Data preprocessing and splitting data
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MaxAbsScaler, RobustScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score
lgbm_settings = {
    'n_estimators': 137,
    'num_leaves': 16,
    'min_child_samples': 2,
    'learning_rate': 0.11333885880532285,
    'colsample_bytree': 0.7557376218643025,
    'reg_alpha': 0.0013323317789643257,
    'reg_lambda': 0.0018596588413880056,
    'n_jobs': -1,
    'max_bin': 511,
    'verbose': -1,
}
data.select_dtypes(include="object").columns
Index(['Gender', 'alcohol_drinking_frequency', 'high_calorie_food_eat', 'calories_monitoring', 'SMOKE', 'family_history_with_overweight', 'food_between_meals', 'method_of_transportion', 'NObeyesdad'], dtype='object')
Encoding ordinal features using label encoder
encoder = LabelEncoder()
model_data = data.copy()
# LabelEncoder assigns integer codes in sorted (alphabetical) label order.
for col in ['alcohol_drinking_frequency', 'food_between_meals', 'NObeyesdad']:
    model_data[col] = encoder.fit_transform(model_data[col])
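One thing to keep in mind is that `LabelEncoder` numbers the labels in sorted order, not in order of frequency. For the answers used here that happens to give the exact reverse of the natural order ("Always" gets 0 and "no" gets 3), so the ranking survives, but only by coincidence. If you would rather make the order explicit, a plain dictionary mapping is one option. The sketch below assumes the ordering no < Sometimes < Frequently < Always, which is my reading of the labels and is not stated in the dataset description:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

freq = pd.Series(["no", "Sometimes", "Frequently", "Always"])

# LabelEncoder sorts the labels, so codes follow alphabetical order.
# Uppercase letters sort before lowercase ones, so "no" comes last.
alphabetical = LabelEncoder().fit_transform(freq)

# Assumed ordering (not stated in the source): no < Sometimes < Frequently < Always.
order = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}
ordinal = freq.map(order)

print(list(alphabetical), list(ordinal))
```

Tree-based models like the ones trained later only care about the ranking of the codes, so for them either encoding should behave much the same.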
Encoding nominal data using pd dummies
cols = model_data.select_dtypes(include="object").columns
# One 0/1 column per category, then drop the original text columns.
dums = pd.get_dummies(model_data[cols], dtype=int)
model_data = pd.concat([model_data, dums], axis=1).drop(columns=cols)
model_data.head()
| Age | Height | Weight | alcohol_drinking_frequency | vegetable_eat_daily | number_of_meals_daily | water_drinking_daily | physical_activity_daily | electronics_usage_daily | food_between_meals | … | calories_monitoring_yes | SMOKE_no | SMOKE_yes | family_history_with_overweight_no | family_history_with_overweight_yes | method_of_transportion_Automobile | method_of_transportion_Bike | method_of_transportion_Motorbike | method_of_transportion_Public_Transportation | method_of_transportion_Walking |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21 | 1.62 | 64 | 3 | 2 | 3 | 2 | 0 | 1 | 2 | … | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1 | 21 | 1.52 | 56 | 2 | 3 | 3 | 3 | 3 | 0 | 2 | … | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 23 | 1.80 | 77 | 1 | 2 | 3 | 2 | 2 | 1 | 2 | … | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
3 | 27 | 1.80 | 87 | 1 | 3 | 3 | 2 | 2 | 0 | 2 | … | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 22 | 1.78 | 90 | 2 | 2 | 1 | 2 | 0 | 0 | 2 | … | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 26 columns
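Notice that every yes/no feature ends up as two columns (for example `SMOKE_no` and `SMOKE_yes`) that always sum to 1, so one of them carries no extra information. That is harmless for the tree models used here, but if you want a leaner table, `drop_first=True` is one way to trim it. A small sketch on a made-up frame:

```python
import pandas as pd

# Toy frame (hypothetical rows) with one binary and one multi-valued feature.
toy = pd.DataFrame({"SMOKE": ["no", "yes", "no"],
                    "MTRANS": ["Walking", "Bike", "Walking"]})

# One column per category.
full = pd.get_dummies(toy, dtype=int)

# drop_first=True removes the first (sorted) category of each feature.
reduced = pd.get_dummies(toy, dtype=int, drop_first=True)

print(list(full.columns), list(reduced.columns))
```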
Correlation between data attributes
corr_data = data.copy()
encoder = LabelEncoder()
# Encode every text column as integers so that corr() can include it.
for col in corr_data.select_dtypes(include="object").columns:
    corr_data[col] = encoder.fit_transform(corr_data[col])
plt.figure(figsize=(16,13))
sns.heatmap(data=corr_data.corr(), annot=True)
<Axes: >
Normalizing data using max absolute scaler
x = model_data.drop(columns="NObeyesdad")
y = model_data["NObeyesdad"]
scaler_mas = MaxAbsScaler()
# Scale each feature column by its maximum absolute value.
for col in x.columns:
    scaler_mas.fit(x[[col]])
    x[col] = scaler_mas.transform(x[[col]])
x.head()
| Age | Height | Weight | alcohol_drinking_frequency | vegetable_eat_daily | number_of_meals_daily | water_drinking_daily | physical_activity_daily | electronics_usage_daily | food_between_meals | … | calories_monitoring_yes | SMOKE_no | SMOKE_yes | family_history_with_overweight_no | family_history_with_overweight_yes | method_of_transportion_Automobile | method_of_transportion_Bike | method_of_transportion_Motorbike | method_of_transportion_Public_Transportation | method_of_transportion_Walking |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.567568 | 0.818182 | 0.369942 | 1.000000 | 0.666667 | 0.75 | 0.666667 | 0.000000 | 0.5 | 0.666667 | … | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
1 | 0.567568 | 0.767677 | 0.323699 | 0.666667 | 1.000000 | 0.75 | 1.000000 | 1.000000 | 0.0 | 0.666667 | … | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.621622 | 0.909091 | 0.445087 | 0.333333 | 0.666667 | 0.75 | 0.666667 | 0.666667 | 0.5 | 0.666667 | … | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
3 | 0.729730 | 0.909091 | 0.502890 | 0.333333 | 1.000000 | 0.75 | 0.666667 | 0.666667 | 0.0 | 0.666667 | … | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | 0.594595 | 0.898990 | 0.520231 | 0.666667 | 0.666667 | 0.25 | 0.666667 | 0.000000 | 0.0 | 0.666667 | … | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
5 rows × 25 columns
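Two side notes on the scaling step. `MaxAbsScaler` already works column by column, so it can be fitted on the whole frame in one call. And because the scaler here is fitted before the train/test split, the test rows influence the scale. A common alternative is to fit on the training part only. Here is a minimal sketch of that variant on made-up numbers (the frame and its values are hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

# Toy feature matrix and target (hypothetical values).
X = pd.DataFrame({"Age": [20, 25, 30, 40, 50, 60],
                  "Weight": [50, 60, 70, 80, 90, 100]})
y = pd.Series([0, 0, 1, 1, 2, 2])

# Split first, so the test rows play no part in fitting the scaler.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2, random_state=7)

# Fit on the training rows only, then reuse the same scale for the test rows.
scaler = MaxAbsScaler().fit(X_train)
X_train_s = pd.DataFrame(scaler.transform(X_train), columns=X.columns, index=X_train.index)
X_test_s = pd.DataFrame(scaler.transform(X_test), columns=X.columns, index=X_test.index)

print(X_train_s.max().to_dict())
```

For tree ensembles the effect of scaling is usually small either way, so treat this as a habit worth having rather than a fix for the scores reported here.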
Splitting data and training models
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2,random_state=7)
model_lgbm = LGBMClassifier(**lgbm_settings)
model_xgb = XGBClassifier(objective="multi:softmax", num_class=7)
model_gb = GradientBoostingClassifier(max_depth=9, min_samples_leaf=3, min_samples_split=13, subsample=0.751)
model_rfc = RandomForestClassifier()
models = [model_lgbm, model_xgb, model_gb, model_rfc]
# Train every model on the same training split.
for model in models:
    model.fit(x_train, y_train)
Evaluating models
for model in models:
    model_name = type(model).__name__
    print(f"score for {model_name} on train data: {model.score(x_train,y_train)}")
score for LGBMClassifier on train data: 1.0
score for XGBClassifier on train data: 1.0
score for GradientBoostingClassifier on train data: 1.0
score for RandomForestClassifier on train data: 1.0
for model in models:
    model_name = type(model).__name__
    print(f"score for {model_name} on test data: {model.score(x_test,y_test)}")
score for LGBMClassifier on test data: 0.9899244332493703
score for XGBClassifier on test data: 0.982367758186398
score for GradientBoostingClassifier on test data: 0.9773299748110831
score for RandomForestClassifier on test data: 0.947103274559194
print("scores of each model using kfold validation:-\n\n")
for model in models:
    score = cross_val_score(model, x, y, cv=10)
    avg = np.mean(score)
    model_name = type(model).__name__
    print(f"scores for {model_name}:{score}")
    print(f"average score for {model_name}:{avg}\n")
scores of each model using kfold validation:-

scores for LGBMClassifier:[0.93969849 0.93434343 0.98989899 0.96969697 0.98484848 0.99494949 0.97474747 0.99494949 0.97474747 0.98484848]
average score for LGBMClassifier:0.9742728795492613

scores for XGBClassifier:[0.91457286 0.93434343 0.97474747 0.97979798 0.97474747 0.98989899 0.97474747 0.98989899 0.97979798 0.97979798]
average score for XGBClassifier:0.9692350642099384

scores for GradientBoostingClassifier:[0.90452261 0.94949495 0.97979798 0.97474747 0.97979798 0.99494949 0.98989899 0.98484848 0.97979798 0.97979798]
average score for GradientBoostingClassifier:0.971765392619664

scores for RandomForestClassifier:[0.73869347 0.82323232 0.96464646 0.95454545 0.97979798 0.97979798 0.96969697 0.96969697 0.97979798 0.98484848]
average score for RandomForestClassifier:0.9344754073397288
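The first one or two folds score noticeably lower than the rest for every model (random forest gets about 0.74 on the first fold and about 0.98 on the last). One possible explanation, and this is only a guess, is that the rows in the file are ordered in some way: with an integer `cv`, scikit-learn uses stratified folds for classifiers but does not shuffle the rows. Passing a splitter with `shuffle=True` is one way to check. The sketch below uses synthetic stand-in data from `make_classification`, not the obesity data, and a smaller forest to keep it quick:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the obesity dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# With a classifier, cv=10 means StratifiedKFold without shuffling.
# A splitter with shuffle=True randomises which rows land in each fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

print(np.round(scores, 3), round(float(scores.mean()), 3))
```

If the fold scores on the real data even out after shuffling, row order was probably the cause of the weak first folds.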
for model in models:
    y_predicted = model.predict(x_test)
    model_name = type(model).__name__
    print(f"Report:{model_name}")
    print(classification_report(y_test, y_predicted))
Report:LGBMClassifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        46
           1       0.96      1.00      0.98        65
           2       1.00      0.99      0.99        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       1.00      1.00      1.00        40

    accuracy                           0.99       397
   macro avg       0.99      0.99      0.99       397
weighted avg       0.99      0.99      0.99       397

Report:XGBClassifier
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        46
           1       0.95      0.95      0.95        65
           2       1.00      0.99      0.99        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       0.95      1.00      0.98        40

    accuracy                           0.98       397
   macro avg       0.98      0.98      0.98       397
weighted avg       0.98      0.98      0.98       397

Report:GradientBoostingClassifier
              precision    recall  f1-score   support

           0       0.96      0.98      0.97        46
           1       0.94      0.95      0.95        65
           2       0.99      0.99      0.99        67
           3       1.00      0.98      0.99        53
           4       1.00      1.00      1.00        63
           5       0.98      0.95      0.97        63
           6       0.98      1.00      0.99        40

    accuracy                           0.98       397
   macro avg       0.98      0.98      0.98       397
weighted avg       0.98      0.98      0.98       397

Report:RandomForestClassifier
              precision    recall  f1-score   support

           0       0.96      1.00      0.98        46
           1       0.86      0.91      0.88        65
           2       0.97      0.97      0.97        67
           3       1.00      1.00      1.00        53
           4       1.00      1.00      1.00        63
           5       0.92      0.87      0.89        63
           6       0.95      0.88      0.91        40

    accuracy                           0.95       397
   macro avg       0.95      0.95      0.95       397
weighted avg       0.95      0.95      0.95       397
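The reports label the classes 0 to 6, which is hard to read. Since `LabelEncoder` numbers labels in sorted order, the codes should correspond to Insufficient_Weight (0), Normal_Weight (1), Obesity_Type_I (2), Obesity_Type_II (3), Obesity_Type_III (4), Overweight_Level_I (5), and Overweight_Level_II (6). Here is a small sketch of how to recover that mapping and pass readable names to the report. The predictions in it are made up purely to produce a report:

```python
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# The seven target labels, in the order they first appear in the data.
labels = ["Normal_Weight", "Overweight_Level_I", "Overweight_Level_II",
          "Obesity_Type_I", "Insufficient_Weight", "Obesity_Type_II",
          "Obesity_Type_III"]

encoder = LabelEncoder().fit(labels)

# classes_[i] is the original label for integer code i (sorted order).
code_to_label = {i: str(name) for i, name in enumerate(encoder.classes_)}

# Hypothetical "predictions", only to show target_names in the report.
y_true = encoder.transform(labels)
y_pred = y_true.copy()
report = classification_report(y_true, y_pred, target_names=list(encoder.classes_))

print(code_to_label)
print(report)
```

In the notebook above, the shared `encoder` is fitted on `NObeyesdad` last, so its `classes_` attribute should hold the target names at that point.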
# Draw the four confusion matrices in a 2x2 grid, then show the figure once.
for i, model in enumerate(models):
    plt.subplot(2, 2, i+1)
    y_predicted = model.predict(x_test)
    model_name = type(model).__name__
    cm = confusion_matrix(y_test, y_predicted)
    sns.heatmap(cm, annot=True, fmt='d')
    plt.xlabel('Predicted')
    plt.ylabel('Truth')
    plt.title(f"{model_name} confusion matrix")
plt.show()
- LightGBM is the best model, achieving 99% accuracy on the test set and an average of 97% in 10-fold cross-validation
- XGBoost and gradient boosting are close behind, with about 98% test accuracy and about 97% average accuracy in cross-validation
- Random forest achieved the worst results, with about 95% test accuracy and a 93% cross-validation average; since every model scores 1.0 on the training data, it also has the largest gap between train and test scores, which suggests overfitting
Obesity Levels-Multi Head Attention-Hyperparameter
import numpy as np
import pandas as pd
import os

# List every file available under the Kaggle input directory.
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Silence TensorFlow's C++ log output.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv
train_df=pd.read_csv("/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv")
train_df
| Age | Gender | Height | Weight | CALC | FAVC | FCVC | NCP | SCC | SMOKE | CH2O | family_history_with_overweight | FAF | TUE | CAEC | MTRANS | NObeyesdad |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.000000 | Female | 1.620000 | 64.000000 | no | no | 2.0 | 3.0 | no | no | 2.000000 | yes | 0.000000 | 1.000000 | Sometimes | Public_Transportation | Normal_Weight |
1 | 21.000000 | Female | 1.520000 | 56.000000 | Sometimes | no | 3.0 | 3.0 | yes | yes | 3.000000 | yes | 3.000000 | 0.000000 | Sometimes | Public_Transportation | Normal_Weight |
2 | 23.000000 | Male | 1.800000 | 77.000000 | Frequently | no | 2.0 | 3.0 | no | no | 2.000000 | yes | 2.000000 | 1.000000 | Sometimes | Public_Transportation | Normal_Weight |
3 | 27.000000 | Male | 1.800000 | 87.000000 | Frequently | no | 3.0 | 3.0 | no | no | 2.000000 | no | 2.000000 | 0.000000 | Sometimes | Walking | Overweight_Level_I |
4 | 22.000000 | Male | 1.780000 | 89.800000 | Sometimes | no | 2.0 | 1.0 | no | no | 2.000000 | no | 0.000000 | 0.000000 | Sometimes | Public_Transportation | Overweight_Level_II |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2106 | 20.976842 | Female | 1.710730 | 131.408528 | Sometimes | yes | 3.0 | 3.0 | no | no | 1.728139 | yes | 1.676269 | 0.906247 | Sometimes | Public_Transportation | Obesity_Type_III |
2107 | 21.982942 | Female | 1.748584 | 133.742943 | Sometimes | yes | 3.0 | 3.0 | no | no | 2.005130 | yes | 1.341390 | 0.599270 | Sometimes | Public_Transportation | Obesity_Type_III |
2108 | 22.524036 | Female | 1.752206 | 133.689352 | Sometimes | yes | 3.0 | 3.0 | no | no | 2.054193 | yes | 1.414209 | 0.646288 | Sometimes | Public_Transportation | Obesity_Type_III |
2109 | 24.361936 | Female | 1.739450 | 133.346641 | Sometimes | yes | 3.0 | 3.0 | no | no | 2.852339 | yes | 1.139107 | 0.586035 | Sometimes | Public_Transportation | Obesity_Type_III |
2110 | 23.664709 | Female | 1.738836 | 133.472641 | Sometimes | yes | 3.0 | 3.0 | no | no | 2.863513 | yes | 1.026452 | 0.714137 | Sometimes | Public_Transportation | Obesity_Type_III |
2111 rows × 17 columns
unique_values = {}
mythreshold = 7
# Columns with at most `mythreshold` distinct values are reported as categorical.
for i, column in enumerate(train_df.columns, 1):
    unique_values[column] = train_df[column].unique()
    if train_df[column].nunique() <= mythreshold:
        print(f"{i}. Column:\" {column} \" , Categorical, Data type:{train_df[column].dtype} , Unique:{unique_values[column]}, Unique count:{train_df[column].nunique()}")
    else:
        print(f"{i}. Column:\" {column} \" , Numeric, Data type:{train_df[column].dtype} , Min:{train_df[column].min()}, Max:{train_df[column].max()}, Unique count:{train_df[column].nunique()}")
1. Column:" Age " , Numeric, Data type:float64 , Min:14.0, Max:61.0, Unique count:1402
2. Column:" Gender " , Categorical, Data type:object , Unique:['Female' 'Male'], Unique count:2
3. Column:" Height " , Numeric, Data type:float64 , Min:1.45, Max:1.98, Unique count:1574
4. Column:" Weight " , Numeric, Data type:float64 , Min:39.0, Max:173.0, Unique count:1525
5. Column:" CALC " , Categorical, Data type:object , Unique:['no' 'Sometimes' 'Frequently' 'Always'], Unique count:4
6. Column:" FAVC " , Categorical, Data type:object , Unique:['no' 'yes'], Unique count:2
7. Column:" FCVC " , Numeric, Data type:float64 , Min:1.0, Max:3.0, Unique count:810
8. Column:" NCP " , Numeric, Data type:float64 , Min:1.0, Max:4.0, Unique count:635
9. Column:" SCC " , Categorical, Data type:object , Unique:['no' 'yes'], Unique count:2
10. Column:" SMOKE " , Categorical, Data type:object , Unique:['no' 'yes'], Unique count:2
11. Column:" CH2O " , Numeric, Data type:float64 , Min:1.0, Max:3.0, Unique count:1268
12. Column:" family_history_with_overweight " , Categorical, Data type:object , Unique:['yes' 'no'], Unique count:2
13. Column:" FAF " , Numeric, Data type:float64 , Min:0.0, Max:3.0, Unique count:1190
14. Column:" TUE " , Numeric, Data type:float64 , Min:0.0, Max:2.0, Unique count:1129
15. Column:" CAEC " , Categorical, Data type:object , Unique:['Sometimes' 'Frequently' 'Always' 'no'], Unique count:4
16. Column:" MTRANS " , Categorical, Data type:object , Unique:['Public_Transportation' 'Walking' 'Automobile' 'Motorbike' 'Bike'], Unique count:5
17. Column:" NObeyesdad " , Categorical, Data type:object , Unique:['Normal_Weight' 'Overweight_Level_I' 'Overweight_Level_II' 'Obesity_Type_I' 'Insufficient_Weight' 'Obesity_Type_II' 'Obesity_Type_III'], Unique count:7
numeric_features = []
categorical_features = []
for column in train_df.columns:
    # Same rule as above: at most `mythreshold` distinct values means categorical
    if train_df[column].nunique() <= mythreshold:
        categorical_features.append(column)
    else:
        numeric_features.append(column)
print("Numeric features:", numeric_features)
print("Categorical features:", categorical_features)
Numeric features: ['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE']
Categorical features: ['Gender', 'CALC', 'FAVC', 'SCC', 'SMOKE', 'family_history_with_overweight', 'CAEC', 'MTRANS', 'NObeyesdad']
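The rule used above labels a column as categorical whenever it has at most `mythreshold` distinct values, regardless of its dtype. Here is a minimal sketch of that rule on a toy table. The helper name `split_by_cardinality` and the toy values are illustrative assumptions, not part of the notebook:

```python
def split_by_cardinality(columns, threshold=7):
    """Split column names into (numeric, categorical) by distinct-value count."""
    numeric, categorical = [], []
    for name, values in columns.items():
        if len(set(values)) <= threshold:
            categorical.append(name)
        else:
            numeric.append(name)
    return numeric, categorical


# Toy data (made up for illustration)
toy = {
    "Age": [21.0, 23.5, 27.0, 22.0, 30.1, 19.4, 41.0, 35.2],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female", "Male", "Male"],
    "Rating": [1, 2, 3, 1, 2, 3, 1, 2],  # numeric dtype, but only 3 distinct values
}
numeric, categorical = split_by_cardinality(toy)
print("numeric:", numeric)
print("categorical:", categorical)
```

Note that a numeric column with few distinct values (like `Rating` here) ends up in the categorical group. In this dataset, FCVC and NCP have hundreds of distinct float values, so the same rule keeps them numeric.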
numerik_df=train_df.drop(categorical_features,axis=1)
numerik_df
| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE |
---|---|---|---|---|---|---|---|---|
0 | 21.000000 | 1.620000 | 64.000000 | 2.0 | 3.0 | 2.000000 | 0.000000 | 1.000000 |
1 | 21.000000 | 1.520000 | 56.000000 | 3.0 | 3.0 | 3.000000 | 3.000000 | 0.000000 |
2 | 23.000000 | 1.800000 | 77.000000 | 2.0 | 3.0 | 2.000000 | 2.000000 | 1.000000 |
3 | 27.000000 | 1.800000 | 87.000000 | 3.0 | 3.0 | 2.000000 | 2.000000 | 0.000000 |
4 | 22.000000 | 1.780000 | 89.800000 | 2.0 | 1.0 | 2.000000 | 0.000000 | 0.000000 |
… | … | … | … | … | … | … | … | … |
2106 | 20.976842 | 1.710730 | 131.408528 | 3.0 | 3.0 | 1.728139 | 1.676269 | 0.906247 |
2107 | 21.982942 | 1.748584 | 133.742943 | 3.0 | 3.0 | 2.005130 | 1.341390 | 0.599270 |
2108 | 22.524036 | 1.752206 | 133.689352 | 3.0 | 3.0 | 2.054193 | 1.414209 | 0.646288 |
2109 | 24.361936 | 1.739450 | 133.346641 | 3.0 | 3.0 | 2.852339 | 1.139107 | 0.586035 |
2110 | 23.664709 | 1.738836 | 133.472641 | 3.0 | 3.0 | 2.863513 | 1.026452 | 0.714137 |
2111 rows × 8 columns
numerik_df.columns
Index(['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE'], dtype='object')
import matplotlib.pyplot as plt

# Lay the histograms out on a grid with two columns
num_cols = 2
num_rows = (len(numerik_df.columns) + num_cols - 1) // num_cols
default_subplot_size = (8, 6)
fig_width = default_subplot_size[0] * num_cols
fig_height = default_subplot_size[1] * num_rows
fig, axes = plt.subplots(num_rows, num_cols, figsize=(fig_width, fig_height))
for i, column in enumerate(numerik_df.columns):
    row_index = i // num_cols
    col_index = i % num_cols
    axes[row_index, col_index].hist(numerik_df[column], bins=10, color='skyblue', edgecolor='black')
    axes[row_index, col_index].set_title(f'Distribution of {column} before scaling')
    axes[row_index, col_index].set_xlabel(column)
    axes[row_index, col_index].set_ylabel('Frequency')
    axes[row_index, col_index].grid(True)
# Remove the unused last subplot when the number of columns is odd
if len(numerik_df.columns) % num_cols != 0:
    fig.delaxes(axes[num_rows-1, num_cols-1])
plt.tight_layout()
plt.show()
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
numerik_scaled = scaler.fit_transform(numerik_df)
numerik_scaled.shape
(2111, 8)
numerik_scaled
array([[-0.52212439, -0.87558934, -0.86255819, ..., -0.01307326, -1.18803911,  0.56199675],
       [-0.52212439, -1.94759928, -1.16807699, ...,  1.61875854,  2.33975012, -1.08062463],
       [-0.20688898,  1.05402854, -0.36609013, ..., -0.01307326,  1.16382038,  0.56199675],
       ...,
       [-0.28190933,  0.54167211,  1.79886776, ...,  0.0753606 ,  0.47497132, -0.01901815],
       [ 0.00777624,  0.40492652,  1.78577968, ...,  1.37780063,  0.15147069, -0.11799101],
       [-0.10211908,  0.39834438,  1.7905916 , ...,  1.39603472,  0.01899633,  0.09243207]])
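`StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation (ddof = 0). Below is a small sketch of the same formula, z = (x - mean) / std, applied to the first five `Height` values shown earlier. The helper name `standardize` is an assumption for illustration. The resulting z-scores differ from the notebook's because the mean and standard deviation here come from five rows only:

```python
import statistics


def standardize(values):
    """Return z-scores using the population standard deviation (ddof=0)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


heights = [1.62, 1.52, 1.80, 1.80, 1.78]  # first five Height values from the table
z = standardize(heights)
print([round(v, 3) for v in z])
# After standardizing, the mean is ~0 and the population std is ~1
print(round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
```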
kolom_numerik = numerik_df.columns
numerik_scaled_df = pd.DataFrame(data=numerik_scaled, columns=kolom_numerik)
numerik_scaled_df
| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE |
---|---|---|---|---|---|---|---|---|
0 | -0.522124 | -0.875589 | -0.862558 | -0.785019 | 0.404153 | -0.013073 | -1.188039 | 0.561997 |
1 | -0.522124 | -1.947599 | -1.168077 | 1.088342 | 0.404153 | 1.618759 | 2.339750 | -1.080625 |
2 | -0.206889 | 1.054029 | -0.366090 | -0.785019 | 0.404153 | -0.013073 | 1.163820 | 0.561997 |
3 | 0.423582 | 1.054029 | 0.015808 | 1.088342 | 0.404153 | -0.013073 | 1.163820 | -1.080625 |
4 | -0.364507 | 0.839627 | 0.122740 | -0.785019 | -2.167023 | -0.013073 | -1.188039 | -1.080625 |
… | … | … | … | … | … | … | … | … |
2106 | -0.525774 | 0.097045 | 1.711763 | 1.088342 | 0.404153 | -0.456705 | 0.783135 | 0.407996 |
2107 | -0.367195 | 0.502844 | 1.800914 | 1.088342 | 0.404153 | -0.004702 | 0.389341 | -0.096251 |
2108 | -0.281909 | 0.541672 | 1.798868 | 1.088342 | 0.404153 | 0.075361 | 0.474971 | -0.019018 |
2109 | 0.007776 | 0.404927 | 1.785780 | 1.088342 | 0.404153 | 1.377801 | 0.151471 | -0.117991 |
2110 | -0.102119 | 0.398344 | 1.790592 | 1.088342 | 0.404153 | 1.396035 | 0.018996 | 0.092432 |
2111 rows × 8 columns
numerik_scaled_df.columns
Index(['Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE'], dtype='object')
# Same histogram grid as before, now for the scaled features
num_cols = 2
num_rows = (len(numerik_scaled_df.columns) + num_cols - 1) // num_cols
default_subplot_size = (8, 6)
fig_width = default_subplot_size[0] * num_cols
fig_height = default_subplot_size[1] * num_rows
fig, axes = plt.subplots(num_rows, num_cols, figsize=(fig_width, fig_height))
for i, column in enumerate(numerik_scaled_df.columns):
    row_index = i // num_cols
    col_index = i % num_cols
    axes[row_index, col_index].hist(numerik_scaled_df[column], bins=10, color='skyblue', edgecolor='black')
    axes[row_index, col_index].set_title(f'Distribution of {column} after scaling')
    axes[row_index, col_index].set_xlabel(column)
    axes[row_index, col_index].set_ylabel('Frequency')
    axes[row_index, col_index].grid(True)
if len(numerik_scaled_df.columns) % num_cols != 0:
    fig.delaxes(axes[num_rows-1, num_cols-1])
plt.tight_layout()
plt.show()
kategori_d=train_df.drop(numeric_features,axis=1)
kategori_d
| | Gender | CALC | FAVC | SCC | SMOKE | family_history_with_overweight | CAEC | MTRANS | NObeyesdad |
---|---|---|---|---|---|---|---|---|---|
0 | Female | no | no | no | no | yes | Sometimes | Public_Transportation | Normal_Weight |
1 | Female | Sometimes | no | yes | yes | yes | Sometimes | Public_Transportation | Normal_Weight |
2 | Male | Frequently | no | no | no | yes | Sometimes | Public_Transportation | Normal_Weight |
3 | Male | Frequently | no | no | no | no | Sometimes | Walking | Overweight_Level_I |
4 | Male | Sometimes | no | no | no | no | Sometimes | Public_Transportation | Overweight_Level_II |
… | … | … | … | … | … | … | … | … | … |
2106 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation | Obesity_Type_III |
2107 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation | Obesity_Type_III |
2108 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation | Obesity_Type_III |
2109 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation | Obesity_Type_III |
2110 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation | Obesity_Type_III |
2111 rows × 9 columns
kategori_df=kategori_d.drop(["NObeyesdad"],axis=1)
kategori_df
| | Gender | CALC | FAVC | SCC | SMOKE | family_history_with_overweight | CAEC | MTRANS |
---|---|---|---|---|---|---|---|---|
0 | Female | no | no | no | no | yes | Sometimes | Public_Transportation |
1 | Female | Sometimes | no | yes | yes | yes | Sometimes | Public_Transportation |
2 | Male | Frequently | no | no | no | yes | Sometimes | Public_Transportation |
3 | Male | Frequently | no | no | no | no | Sometimes | Walking |
4 | Male | Sometimes | no | no | no | no | Sometimes | Public_Transportation |
… | … | … | … | … | … | … | … | … |
2106 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation |
2107 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation |
2108 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation |
2109 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation |
2110 | Female | Sometimes | yes | no | no | yes | Sometimes | Public_Transportation |
2111 rows × 8 columns
kategori_df.columns
Index(['Gender', 'CALC', 'FAVC', 'SCC', 'SMOKE', 'family_history_with_overweight', 'CAEC', 'MTRANS'], dtype='object')
kategori_encode_df = pd.get_dummies(kategori_df, columns=kategori_df.columns)
kategori_encode_df
| | Gender_Female | Gender_Male | CALC_Always | CALC_Frequently | CALC_Sometimes | CALC_no | FAVC_no | FAVC_yes | SCC_no | SCC_yes | … | family_history_with_overweight_yes | CAEC_Always | CAEC_Frequently | CAEC_Sometimes | CAEC_no | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | False | False | False | False | True | True | False | True | False | … | True | False | False | True | False | False | False | False | True | False |
1 | True | False | False | False | True | False | True | False | False | True | … | True | False | False | True | False | False | False | False | True | False |
2 | False | True | False | True | False | False | True | False | True | False | … | True | False | False | True | False | False | False | False | True | False |
3 | False | True | False | True | False | False | True | False | True | False | … | False | False | False | True | False | False | False | False | False | True |
4 | False | True | False | False | True | False | True | False | True | False | … | False | False | False | True | False | False | False | False | True | False |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2106 | True | False | False | False | True | False | False | True | True | False | … | True | False | False | True | False | False | False | False | True | False |
2107 | True | False | False | False | True | False | False | True | True | False | … | True | False | False | True | False | False | False | False | True | False |
2108 | True | False | False | False | True | False | False | True | True | False | … | True | False | False | True | False | False | False | False | True | False |
2109 | True | False | False | False | True | False | False | True | True | False | … | True | False | False | True | False | False | False | False | True | False |
2110 | True | False | False | False | True | False | False | True | True | False | … | True | False | False | True | False | False | False | False | True | False |
2111 rows × 23 columns
kategori_encode_df.columns
Index(['Gender_Female', 'Gender_Male', 'CALC_Always', 'CALC_Frequently', 'CALC_Sometimes', 'CALC_no', 'FAVC_no', 'FAVC_yes', 'SCC_no', 'SCC_yes', 'SMOKE_no', 'SMOKE_yes', 'family_history_with_overweight_no', 'family_history_with_overweight_yes', 'CAEC_Always', 'CAEC_Frequently', 'CAEC_Sometimes', 'CAEC_no', 'MTRANS_Automobile', 'MTRANS_Bike', 'MTRANS_Motorbike', 'MTRANS_Public_Transportation', 'MTRANS_Walking'], dtype='object')
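`pd.get_dummies` creates one indicator column per category, so the 23 columns follow directly from the distinct-value counts printed earlier. A quick arithmetic check (the dictionary simply restates those counts):

```python
# Distinct-value counts per categorical feature, as printed earlier in this section
categories_per_feature = {
    "Gender": 2,
    "CALC": 4,
    "FAVC": 2,
    "SCC": 2,
    "SMOKE": 2,
    "family_history_with_overweight": 2,
    "CAEC": 4,
    "MTRANS": 5,
}
n_numeric = 8  # Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE

n_dummy = sum(categories_per_feature.values())
n_features = n_numeric + n_dummy
print("one-hot columns:", n_dummy)
print("total model inputs:", n_features)
```

The second number is the feature count the model receives once the scaled numeric columns are concatenated with the one-hot columns. If redundant indicators are a concern, `pd.get_dummies(..., drop_first=True)` would drop one column per feature; the notebook keeps all of them.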
kolom_numerik = list(numerik_df.columns)
kolom_kategorikal = list(kategori_df.columns)
# Keep only the target column, then attach the scaled and one-hot encoded features
train_df_filtered = train_df.drop(kolom_numerik + kolom_kategorikal, axis=1)
data_preprocessed_df = pd.concat([train_df_filtered, numerik_scaled_df, kategori_encode_df], axis=1)
data_preprocessed_df
| | NObeyesdad | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | Gender_Female | … | family_history_with_overweight_yes | CAEC_Always | CAEC_Frequently | CAEC_Sometimes | CAEC_no | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Normal_Weight | -0.522124 | -0.875589 | -0.862558 | -0.785019 | 0.404153 | -0.013073 | -1.188039 | 0.561997 | True | … | True | False | False | True | False | False | False | False | True | False |
1 | Normal_Weight | -0.522124 | -1.947599 | -1.168077 | 1.088342 | 0.404153 | 1.618759 | 2.339750 | -1.080625 | True | … | True | False | False | True | False | False | False | False | True | False |
2 | Normal_Weight | -0.206889 | 1.054029 | -0.366090 | -0.785019 | 0.404153 | -0.013073 | 1.163820 | 0.561997 | False | … | True | False | False | True | False | False | False | False | True | False |
3 | Overweight_Level_I | 0.423582 | 1.054029 | 0.015808 | 1.088342 | 0.404153 | -0.013073 | 1.163820 | -1.080625 | False | … | False | False | False | True | False | False | False | False | False | True |
4 | Overweight_Level_II | -0.364507 | 0.839627 | 0.122740 | -0.785019 | -2.167023 | -0.013073 | -1.188039 | -1.080625 | False | … | False | False | False | True | False | False | False | False | True | False |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2106 | Obesity_Type_III | -0.525774 | 0.097045 | 1.711763 | 1.088342 | 0.404153 | -0.456705 | 0.783135 | 0.407996 | True | … | True | False | False | True | False | False | False | False | True | False |
2107 | Obesity_Type_III | -0.367195 | 0.502844 | 1.800914 | 1.088342 | 0.404153 | -0.004702 | 0.389341 | -0.096251 | True | … | True | False | False | True | False | False | False | False | True | False |
2108 | Obesity_Type_III | -0.281909 | 0.541672 | 1.798868 | 1.088342 | 0.404153 | 0.075361 | 0.474971 | -0.019018 | True | … | True | False | False | True | False | False | False | False | True | False |
2109 | Obesity_Type_III | 0.007776 | 0.404927 | 1.785780 | 1.088342 | 0.404153 | 1.377801 | 0.151471 | -0.117991 | True | … | True | False | False | True | False | False | False | False | True | False |
2110 | Obesity_Type_III | -0.102119 | 0.398344 | 1.790592 | 1.088342 | 0.404153 | 1.396035 | 0.018996 | 0.092432 | True | … | True | False | False | True | False | False | False | False | True | False |
2111 rows × 32 columns
data_preprocessed_df.columns
Index(['NObeyesdad', 'Age', 'Height', 'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE', 'Gender_Female', 'Gender_Male', 'CALC_Always', 'CALC_Frequently', 'CALC_Sometimes', 'CALC_no', 'FAVC_no', 'FAVC_yes', 'SCC_no', 'SCC_yes', 'SMOKE_no', 'SMOKE_yes', 'family_history_with_overweight_no', 'family_history_with_overweight_yes', 'CAEC_Always', 'CAEC_Frequently', 'CAEC_Sometimes', 'CAEC_no', 'MTRANS_Automobile', 'MTRANS_Bike', 'MTRANS_Motorbike', 'MTRANS_Public_Transportation', 'MTRANS_Walking'], dtype='object')
# Features: everything except the target column
X_df = data_preprocessed_df.drop(columns=['NObeyesdad'])
# Target: the obesity level label
y_df = data_preprocessed_df['NObeyesdad']
X_df
| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | Gender_Female | Gender_Male | … | family_history_with_overweight_yes | CAEC_Always | CAEC_Frequently | CAEC_Sometimes | CAEC_no | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.522124 | -0.875589 | -0.862558 | -0.785019 | 0.404153 | -0.013073 | -1.188039 | 0.561997 | True | False | … | True | False | False | True | False | False | False | False | True | False |
1 | -0.522124 | -1.947599 | -1.168077 | 1.088342 | 0.404153 | 1.618759 | 2.339750 | -1.080625 | True | False | … | True | False | False | True | False | False | False | False | True | False |
2 | -0.206889 | 1.054029 | -0.366090 | -0.785019 | 0.404153 | -0.013073 | 1.163820 | 0.561997 | False | True | … | True | False | False | True | False | False | False | False | True | False |
3 | 0.423582 | 1.054029 | 0.015808 | 1.088342 | 0.404153 | -0.013073 | 1.163820 | -1.080625 | False | True | … | False | False | False | True | False | False | False | False | False | True |
4 | -0.364507 | 0.839627 | 0.122740 | -0.785019 | -2.167023 | -0.013073 | -1.188039 | -1.080625 | False | True | … | False | False | False | True | False | False | False | False | True | False |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
2106 | -0.525774 | 0.097045 | 1.711763 | 1.088342 | 0.404153 | -0.456705 | 0.783135 | 0.407996 | True | False | … | True | False | False | True | False | False | False | False | True | False |
2107 | -0.367195 | 0.502844 | 1.800914 | 1.088342 | 0.404153 | -0.004702 | 0.389341 | -0.096251 | True | False | … | True | False | False | True | False | False | False | False | True | False |
2108 | -0.281909 | 0.541672 | 1.798868 | 1.088342 | 0.404153 | 0.075361 | 0.474971 | -0.019018 | True | False | … | True | False | False | True | False | False | False | False | True | False |
2109 | 0.007776 | 0.404927 | 1.785780 | 1.088342 | 0.404153 | 1.377801 | 0.151471 | -0.117991 | True | False | … | True | False | False | True | False | False | False | False | True | False |
2110 | -0.102119 | 0.398344 | 1.790592 | 1.088342 | 0.404153 | 1.396035 | 0.018996 | 0.092432 | True | False | … | True | False | False | True | False | False | False | False | True | False |
2111 rows × 31 columns
y_df
0 Normal_Weight
1 Normal_Weight
2 Normal_Weight
3 Overweight_Level_I
4 Overweight_Level_II
...
2106 Obesity_Type_III
2107 Obesity_Type_III
2108 Obesity_Type_III
2109 Obesity_Type_III
2110 Obesity_Type_III
Name: NObeyesdad, Length: 2111, dtype: object
y_df.unique()
array(['Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Insufficient_Weight', 'Obesity_Type_II', 'Obesity_Type_III'], dtype=object)
jumlah_kelas=y_df.nunique()
jumlah_kelas
7
from sklearn.preprocessing import LabelEncoder

# Map the seven class names to integer codes 0..6
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_df)
y.shape
(2111,)
np.unique(y)
array([0, 1, 2, 3, 4, 5, 6])
from sklearn.model_selection import train_test_split

# 70/30 split, stratified so each class keeps its proportion in both parts
X_train, X_val, y_train, y_val = train_test_split(X_df, y, test_size=0.3, random_state=42, stratify=y)
X_train.shape
(1477, 31)
X_train
| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | Gender_Female | Gender_Male | … | family_history_with_overweight_yes | CAEC_Always | CAEC_Frequently | CAEC_Sometimes | CAEC_no | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
90 | 0.108346 | -0.768388 | 0.244947 | 1.088342 | 1.689740 | -1.644905 | 1.163820 | -1.080625 | True | False | … | False | True | False | False | False | False | False | False | True | False |
513 | -0.483801 | -1.111228 | -1.594060 | 1.088342 | -1.233352 | 0.711664 | 0.362036 | -1.080625 | True | False | … | False | False | True | False | False | False | False | False | True | False |
1100 | -0.813763 | -0.019932 | -0.327900 | -2.477684 | -0.249806 | 0.721364 | -1.188039 | 0.112112 | False | True | … | True | False | False | True | False | False | False | False | True | False |
339 | -0.837360 | -1.840398 | -1.702735 | -0.785019 | 0.404153 | -1.644905 | 1.163820 | -1.080625 | True | False | … | False | False | False | True | False | False | False | False | True | False |
612 | -0.203982 | -1.253098 | -1.611971 | -0.401141 | -0.717141 | 0.183223 | -0.017125 | -1.080625 | True | False | … | False | False | True | False | False | False | False | False | True | False |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
1567 | 0.978531 | 0.883032 | 1.300675 | 0.149990 | 0.404153 | 0.053754 | -0.201741 | -0.275980 | False | True | … | True | False | False | True | False | True | False | False | False | False |
1336 | -0.520371 | 1.657731 | 1.206713 | -0.785019 | 0.404153 | 1.618759 | -0.260719 | 0.923421 | False | True | … | True | False | False | True | False | False | False | False | True | False |
609 | -0.682924 | 0.554043 | -1.206367 | -0.785019 | 1.040325 | 1.580691 | 1.103930 | 2.204618 | False | True | … | True | False | False | True | False | False | False | False | True | False |
1659 | -0.184602 | 1.582604 | 1.339420 | 1.088342 | -0.225612 | -0.513455 | -0.282915 | -1.080625 | False | True | … | True | False | False | True | False | False | False | False | True | False |
237 | -0.837360 | -0.661187 | -1.282647 | 1.088342 | 0.404153 | -1.644905 | -0.012109 | 0.561997 | True | False | … | True | False | False | True | False | False | False | False | True | False |
1477 rows × 31 columns
X_train.shape[0]
1477
X_train.shape[1]
31
X_val.shape
(634, 31)
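The shapes above follow from `test_size=0.3` on 2111 records. As far as I know, scikit-learn rounds the test count up and the training count down, which the following arithmetic sketch reproduces (it does not call scikit-learn itself):

```python
import math

n_samples = 2111
test_size = 0.3

n_val = math.ceil(test_size * n_samples)           # validation rows, rounded up
n_train = math.floor((1 - test_size) * n_samples)  # training rows, rounded down
print(n_train, n_val, n_train + n_val)
```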
X_val
| | Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | Gender_Female | Gender_Male | … | family_history_with_overweight_yes | CAEC_Always | CAEC_Frequently | CAEC_Sometimes | CAEC_no | MTRANS_Automobile | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
332 | 0.423582 | 1.590034 | -0.442470 | -0.785019 | -2.167023 | -0.013073 | -0.012109 | -1.080625 | False | True | … | True | False | False | True | False | False | False | False | False | True |
1235 | -0.149256 | 0.600472 | 0.335144 | -0.785019 | 0.404153 | 1.618759 | 2.339750 | 2.204618 | False | True | … | True | False | False | True | False | False | False | False | True | False |
16 | 0.423582 | 2.447641 | 0.588656 | -0.785019 | -2.167023 | -1.644905 | -0.012109 | -1.080625 | False | True | … | True | False | False | True | False | False | False | False | True | False |
1214 | 0.219669 | -1.244929 | -0.238106 | -0.785019 | -2.167023 | -0.013073 | -1.188039 | -1.080625 | True | False | … | True | False | False | True | False | False | False | False | True | False |
521 | -0.837360 | -1.473782 | -1.699066 | 1.088342 | -1.017214 | 0.731990 | 0.689422 | 0.557726 | True | False | … | False | False | False | True | False | False | False | False | True | False |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
445 | -0.837360 | -2.054800 | -1.588165 | -0.785019 | 1.689740 | -1.644905 | 2.339750 | -1.080625 | True | False | … | False | False | False | True | False | False | False | False | True | False |
1576 | 0.416906 | 0.847463 | 1.007138 | -0.291866 | 0.404153 | 0.137886 | -1.188039 | 1.936629 | False | True | … | True | False | False | True | False | True | False | False | False | False |
1219 | -0.036737 | 0.340488 | 0.432531 | -0.785019 | 0.404153 | 1.363662 | 0.351610 | 1.118279 | False | True | … | True | False | False | True | False | False | False | False | True | False |
1771 | 0.213427 | 1.038806 | 1.197146 | -0.715626 | 0.404153 | 0.700021 | 0.007004 | -1.047700 | False | True | … | True | False | False | True | False | False | False | False | True | False |
50 | -0.522124 | -0.982790 | -1.225362 | 1.088342 | 0.404153 | 1.618759 | -1.188039 | 0.561997 | True | False | … | True | False | False | True | False | False | False | False | False | True |
634 rows × 31 columns
y_train.shape
(1477,)
y_train
array([3, 0, 6, ..., 0, 3, 1])
y_val.shape
(634,)
y_val
Model
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, GlobalMaxPooling1D, Dropout, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras import regularizers
from sklearn.model_selection import RandomizedSearchCV
from keras.optimizers import Adam


class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        assert d_model % self.num_heads == 0, "d_model must be divisible by num_heads"
        self.depth = d_model // self.num_heads
        # Query, key, and value projections, plus the final output projection
        self.wq = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.wk = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.wv = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))
        self.dense = tf.keras.layers.Dense(d_model, kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        # Merge the heads back into a single d_model-sized vector per position
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        output = self.dense(concat_attention)
        return output, attention_weights

    def scaled_dot_product_attention(self, q, k, v, mask):
        matmul_qk = tf.matmul(q, k, transpose_b=True)
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)
        output = tf.matmul(attention_weights, v)
        return output, attention_weights


def create_model(units=128, dropout_rate=0.2, d_model=128, num_heads=8):
    input_layer = Input(shape=(X_train.shape[1],))
    dense_layer = Dense(units, activation='relu', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(input_layer)
    dropout_layer = Dropout(dropout_rate)(dense_layer)
    # Self-attention: the same tensor serves as value, key, and query
    mha_layer = MultiHeadAttention(d_model=d_model, num_heads=num_heads)
    mha_output, _ = mha_layer(dropout_layer, dropout_layer, dropout_layer, mask=None)
    pooling_layer = GlobalMaxPooling1D()(mha_output)
    batchnorm_layer = BatchNormalization()(pooling_layer)
    dense_layer2 = Dense(jumlah_kelas * 2, activation='relu', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(batchnorm_layer)
    dropout_layer2 = Dropout(0.1)(dense_layer2)
    # One softmax output per obesity class
    output_layer = Dense(jumlah_kelas, activation='softmax', kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.01))(dropout_layer2)
    model = Model(inputs=input_layer, outputs=output_layer)
    optimizer = Adam(learning_rate=0.001)
    model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
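In `create_model`, the attention layer receives a 2-D tensor of shape (batch, units), so as written `split_heads` reshapes each sample into a sequence of length 1. The NumPy sketch below re-implements the same scaled dot-product formula as a standalone illustration (it is not the notebook's TensorFlow layer, and the shapes are made up). It suggests that with a single position the softmax weight is always 1, so the attention output equals the projected values:

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q k^T / sqrt(depth)) v over the last two axes."""
    depth = k.shape[-1]
    logits = q @ np.swapaxes(k, -1, -2) / np.sqrt(depth)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
batch, num_heads, seq_len, depth = 4, 8, 1, 16  # seq_len = 1, as in create_model
q = rng.normal(size=(batch, num_heads, seq_len, depth))
k = rng.normal(size=(batch, num_heads, seq_len, depth))
v = rng.normal(size=(batch, num_heads, seq_len, depth))

out, weights = scaled_dot_product_attention(q, k, v)
print("weights shape:", weights.shape)
print("all weights equal 1:", np.allclose(weights, 1.0))
print("output equals v:", np.allclose(out, v))
```

Under this reading, the attention block in the notebook behaves like two stacked linear projections (`wv` followed by `dense`). Feeding it a genuine sequence, for example one token per feature, would be needed for the heads to attend across positions.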
from scipy.stats import randint

# Search space sampled by RandomizedSearchCV
param_dist = {
    'units': randint(64, 512),
    'dropout_rate': [0.1, 0.2, 0.3, 0.4],
    'd_model': [32, 64, 128, 256],
    'num_heads': [8, 16, 32],
    'batch_size': [64, 128, 256]
}
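The attention layer asserts that `d_model` is divisible by `num_heads`, so it is worth checking which combinations in this search space satisfy that. A short sketch enumerating the grid (the variable names are illustrative):

```python
from itertools import product

d_models = [32, 64, 128, 256]
heads = [8, 16, 32]

# Per-head depth for every valid (d_model, num_heads) pair
depths = {(d, h): d // h for d, h in product(d_models, heads) if d % h == 0}
invalid = [(d, h) for d, h in product(d_models, heads) if d % h != 0]

print("invalid combinations:", invalid)
print("depth for d_model=32, num_heads=32:", depths[(32, 32)])
```

Every pair in this grid is valid, so the divisibility check never fires during the search. The smallest case, `d_model=32` with `num_heads=32`, leaves a per-head depth of 1.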
from sklearn.metrics import accuracy_score


class CustomEstimator:
    """Minimal scikit-learn-style wrapper around create_model for RandomizedSearchCV."""

    def __init__(self, units=128, dropout_rate=0.2, d_model=128, num_heads=8, batch_size=32):
        self.units = units
        self.dropout_rate = dropout_rate
        self.d_model = d_model
        self.num_heads = num_heads
        self.batch_size = batch_size
        self.model = None

    def fit(self, X, y, **kwargs):
        if self.model is None:
            self.model = self._init_model()
        if self.d_model % self.num_heads != 0:
            print("Warning: d_model is not divisible by num_heads. Skipping model initialization.")
            return
        self.model.fit(X, y, batch_size=self.batch_size, **kwargs)

    def predict(self, X):
        if self.model is not None:
            y_pred_prob = self.model.predict(X)
            # Pick the class with the highest predicted probability
            y_pred = np.argmax(y_pred_prob, axis=1)
            return y_pred
        else:
            print("No model available for prediction. Continuing to the next step of the algorithm.")
            return None

    def score(self, X, y):
        if self.model is not None:
            y_pred = self.predict(X)
            return accuracy_score(y, y_pred)
        else:
            print("No model available for scoring. Continuing to the next step of the algorithm.")
            return 0.0

    def _init_model(self):
        if self.d_model % self.num_heads != 0:
            print("Warning: d_model is not divisible by num_heads. Skipping model initialization.")
            return None
        return create_model(units=self.units, dropout_rate=self.dropout_rate, d_model=self.d_model, num_heads=self.num_heads)

    def get_params(self, deep=True):
        return {
            'units': self.units,
            'dropout_rate': self.dropout_rate,
            'd_model': self.d_model,
            'num_heads': self.num_heads,
            'batch_size': self.batch_size
        }

    def set_params(self, **params):
        for param, value in params.items():
            setattr(self, param, value)
        return self
custom_estimator = CustomEstimator()
# 10 random configurations, each evaluated with 3-fold cross-validation
random_search = RandomizedSearchCV(estimator=custom_estimator, param_distributions=param_dist, n_iter=10, cv=3)
random_search.fit(X_train, y_train, validation_data=(X_val, y_val))
best_model = random_search.best_estimator_
best_model_params = random_search.best_estimator_.get_params()
print(best_model_params)
{'units': 407, 'dropout_rate': 0.1, 'd_model': 32, 'num_heads': 32, 'batch_size': 64}
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Stop when validation loss stalls, and lower the learning rate on plateaus
early_stopping = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=5, min_lr=0.0001)
history = best_model.model.fit(X_train, y_train, epochs=1000, validation_data=(X_val, y_val), callbacks=[early_stopping,reduce_lr])
loss, accuracy = best_model.model.evaluate(X_val, y_val, verbose=0)
print(f'Loss: {loss:.2f}')
print(f'Accuracy: {accuracy * 100:.2f}%')
Loss: 0.56
Accuracy: 93.22%
loss = history.history['loss']
val_loss = history.history['val_loss']
plt.plot(loss, label='Training Loss')
plt.plot(val_loss, label='Validation Loss')
plt.title('Training and Validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
accuracy = history.history['accuracy']
val_accuracy = history.history['val_accuracy']
plt.plot(accuracy, label='Training Accuracy')
plt.plot(val_accuracy, label='Validation Accuracy')
plt.title('Training and Validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
best_model.model.summary()
Model: "functional_61"
Layer (type) | Output Shape | Param # | Connected to |
---|---|---|---|
input_layer_30 (InputLayer) | (None, 31) | 0 | - |
dense_210 (Dense) | (None, 407) | 13,024 | input_layer_30[0… |
dropout_60 (Dropout) | (None, 407) | 0 | dense_210[0][0] |
multi_head_attenti… (MultiHeadAttentio…) | [(None, None, 32), (None, 32, None, None)] | 40,224 | dropout_60[0][0], dropout_60[0][0], dropout_60[0][0] |
global_max_pooling… (GlobalMaxPooling1…) | (None, 32) | 0 | multi_head_atten… |
batch_normalizatio… (BatchNormalizatio…) | (None, 32) | 128 | global_max_pooli… |
dense_215 (Dense) | (None, 14) | 462 | batch_normalizat… |
dropout_61 (Dropout) | (None, 14) | 0 | dense_215[0][0] |
dense_216 (Dense) | (None, 7) | 105 | dropout_61[0][0] |
Total params: 161,703 (631.66 KB)
Trainable params: 53,879 (210.46 KB)
Non-trainable params: 64 (256.00 B)
Optimizer params: 107,760 (420.94 KB)
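The parameter counts in the summary can be reproduced by hand from the layer sizes (31 inputs, 407 units, `d_model` of 32, 7 classes). A small arithmetic sketch, with `dense_params` as an illustrative helper:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


n_features, units, d_model, n_classes = 31, 407, 32, 7

first_dense = dense_params(n_features, units)
# Three projections (wq, wk, wv) from units to d_model, plus the output projection
attention = 3 * dense_params(units, d_model) + dense_params(d_model, d_model)
batch_norm = 4 * d_model  # gamma, beta, moving mean, moving variance
hidden = dense_params(d_model, 2 * n_classes)
output = dense_params(2 * n_classes, n_classes)

model_total = first_dense + attention + batch_norm + hidden + output
non_trainable = 2 * d_model  # BatchNormalization moving statistics
trainable = model_total - non_trainable

print(first_dense, attention, batch_norm, hidden, output)
print(model_total, trainable, non_trainable)
```

The reported total of 161,703 is the model's 53,943 parameters plus the 107,760 optimizer parameters. The optimizer figure equals 2 × 53,879 + 2, which would be consistent with Adam keeping two values per trainable weight plus two scalars, though that breakdown is my assumption rather than something the summary states.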
from keras.utils import plot_model

# Render the architecture diagram to a PNG, then display it with matplotlib
file_name = 'arsitektur_model.png'
plot_model(best_model.model, to_file=file_name, show_shapes=True, show_layer_names=True)
plt.figure(figsize=(15, 15))
img = plt.imread(file_name)
plt.imshow(img)
plt.title('Model Architecture', fontsize=18)
plt.axis('off')
plt.savefig(file_name)
plt.show()
label_encoder.classes_
array(['Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III', 'Overweight_Level_I', 'Overweight_Level_II'], dtype=object)
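`LabelEncoder` stores its classes in sorted order, which is why the integer codes do not follow the order in which the labels first appear in the data. A pure-Python sketch of an equivalent mapping (it mimics the encoder rather than calling scikit-learn; `class_to_index` is an illustrative name):

```python
# Labels in order of first appearance, as returned by y_df.unique()
labels = [
    "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I",
    "Insufficient_Weight", "Obesity_Type_II", "Obesity_Type_III",
]

classes = sorted(set(labels))  # sorted, like label_encoder.classes_
class_to_index = {name: i for i, name in enumerate(classes)}

encoded = [class_to_index[name] for name in labels]
decoded = [classes[i] for i in encoded]
for name, code in zip(labels, encoded):
    print(code, name)
```

With the fitted encoder from the notebook, `label_encoder.inverse_transform(y_pred)` performs the same decoding from integer predictions back to class names.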
from sklearn.metrics import classification_report

y_pred_prob = best_model.model.predict(X_val)
y_pred = np.argmax(y_pred_prob, axis=1)

target_names = [str(cls) for cls in label_encoder.classes_]
report = classification_report(y_val, y_pred, target_names=target_names, zero_division=1)
print("Classification Report:\n", report)
20/20 ━━━━━━━━━━━━━━━━━━━━ 0s 10ms/step

Classification Report:

Class | precision | recall | f1-score | support |
---|---|---|---|---|
Insufficient_Weight | 0.96 | 0.98 | 0.97 | 82 |
Normal_Weight | 0.82 | 0.90 | 0.86 | 86 |
Obesity_Type_I | 0.95 | 0.96 | 0.96 | 106 |
Obesity_Type_II | 0.99 | 0.99 | 0.99 | 89 |
Obesity_Type_III | 1.00 | 0.99 | 0.99 | 97 |
Overweight_Level_I | 0.87 | 0.83 | 0.85 | 87 |
Overweight_Level_II | 0.93 | 0.87 | 0.90 | 87 |
accuracy | | | 0.93 | 634 |
macro avg | 0.93 | 0.93 | 0.93 | 634 |
weighted avg | 0.93 | 0.93 | 0.93 | 634 |
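To see how the summary rows relate to the per-class rows, here is a rough sketch that recomputes F1 from the printed precision and recall, then averages it two ways. The inputs are the rounded values from the report, so the results only match to about two decimals, and the names `f1_from` and `report_rows` are illustrative:

```python
# (precision, recall, support) per class, copied from the report above
report_rows = {
    "Insufficient_Weight": (0.96, 0.98, 82),
    "Normal_Weight": (0.82, 0.90, 86),
    "Obesity_Type_I": (0.95, 0.96, 106),
    "Obesity_Type_II": (0.99, 0.99, 89),
    "Obesity_Type_III": (1.00, 0.99, 97),
    "Overweight_Level_I": (0.87, 0.83, 87),
    "Overweight_Level_II": (0.93, 0.87, 87),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1_by_class = {name: f1_from(p, r) for name, (p, r, _) in report_rows.items()}
total_support = sum(s for _, _, s in report_rows.values())

# Macro average: every class counts equally
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(f1_by_class[n] * s for n, (_, _, s) in report_rows.items()) / total_support

for name, score in f1_by_class.items():
    print(f"{name:>20}: {score:.2f}")
print(f"macro F1: {macro_f1:.2f}, weighted F1: {weighted_f1:.2f}, support: {total_support}")
```

The two weakest classes, Normal_Weight and Overweight_Level_I, sit next to each other on the weight scale, which is where you would expect a classifier to hesitate.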
from sklearn.metrics import confusion_matrix
import seaborn as sns

conf_matrix = confusion_matrix(y_val, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
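A confusion matrix holds everything needed to recover per-class precision and recall. The sketch below uses a made-up 3x3 matrix (not the notebook's results) with rows as actual classes and columns as predicted classes, which is the layout `confusion_matrix` returns:

```python
# Hypothetical counts: rows = actual class, columns = predicted class
class_names = ["A", "B", "C"]
matrix = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]


def per_class_metrics(cm):
    """Return a (precision, recall) pair for each class of a square matrix."""
    results = []
    for k in range(len(cm)):
        true_positive = cm[k][k]
        actual_total = sum(cm[k])                      # row sum
        predicted_total = sum(row[k] for row in cm)    # column sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        results.append((precision, recall))
    return results


metrics = per_class_metrics(matrix)
correct = sum(matrix[k][k] for k in range(len(matrix)))
accuracy = correct / sum(sum(row) for row in matrix)

for name, (precision, recall) in zip(class_names, metrics):
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Reading the heatmap the same way, off-diagonal cells in a row are missed cases for that class (they lower recall), and off-diagonal cells in a column are false alarms (they lower precision).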
train_df.to_csv("train_data.csv", index=False)
Obesity Levels Analysis with ML
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
df = pd.read_csv('/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv')
df.head()
| | Age | Gender | Height | Weight | CALC | FAVC | FCVC | NCP | SCC | SMOKE | CH2O | family_history_with_overweight | FAF | TUE | CAEC | MTRANS | NObeyesdad |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21.0 | Female | 1.62 | 64.0 | no | no | 2.0 | 3.0 | no | no | 2.0 | yes | 0.0 | 1.0 | Sometimes | Public_Transportation | Normal_Weight |
1 | 21.0 | Female | 1.52 | 56.0 | Sometimes | no | 3.0 | 3.0 | yes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
2 | 23.0 | Male | 1.80 | 77.0 | Frequently | no | 2.0 | 3.0 | no | no | 2.0 | yes | 2.0 | 1.0 | Sometimes | Public_Transportation | Normal_Weight |
3 | 27.0 | Male | 1.80 | 87.0 | Frequently | no | 3.0 | 3.0 | no | no | 2.0 | no | 2.0 | 0.0 | Sometimes | Walking | Overweight_Level_I |
4 | 22.0 | Male | 1.78 | 89.8 | Sometimes | no | 2.0 | 1.0 | no | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation | Overweight_Level_II |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):

# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | Age | 2111 non-null | float64 |
1 | Gender | 2111 non-null | object |
2 | Height | 2111 non-null | float64 |
3 | Weight | 2111 non-null | float64 |
4 | CALC | 2111 non-null | object |
5 | FAVC | 2111 non-null | object |
6 | FCVC | 2111 non-null | float64 |
7 | NCP | 2111 non-null | float64 |
8 | SCC | 2111 non-null | object |
9 | SMOKE | 2111 non-null | object |
10 | CH2O | 2111 non-null | float64 |
11 | family_history_with_overweight | 2111 non-null | object |
12 | FAF | 2111 non-null | float64 |
13 | TUE | 2111 non-null | float64 |
14 | CAEC | 2111 non-null | object |
15 | MTRANS | 2111 non-null | object |
16 | NObeyesdad | 2111 non-null | object |

dtypes: float64(8), object(9)
memory usage: 280.5+ KB
df.isnull().sum()
Age                               0
Gender                            0
Height                            0
Weight                            0
CALC                              0
FAVC                              0
FCVC                              0
NCP                               0
SCC                               0
SMOKE                             0
CH2O                              0
family_history_with_overweight    0
FAF                               0
TUE                               0
CAEC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64
#df = df.drop_duplicates()
# Define the order of categories and corresponding colors
order_colors = {"Male": "blue", "Female": "pink"}
order = list(order_colors)

plt.figure(figsize=(6, 6))
sns.countplot(x="Gender", data=df, order=order, palette=list(order_colors.values()))
plt.title("Gender Distribution", fontsize=14, fontweight="bold")
plt.xticks(rotation=45)

# Annotate each bar with its count, using the same order as the bars
gender_counts = df["Gender"].value_counts().reindex(order)
for i, count in enumerate(gender_counts):
    plt.text(i, count, str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()
# Group the data by gender
grouped = df.groupby('Gender')

# Create a figure with a 3x2 grid of subplots
fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(16, 12))
fig.suptitle('Measures by Gender', fontsize=16)

def annotate_bars(ax, fmt):
    # Label each bar with its height, skipping gender/answer pairs that have no records
    for p in ax.patches:
        height = p.get_height()
        if np.isnan(height):
            continue
        ax.annotate(fmt.format(height), (p.get_x() + p.get_width() / 2., height), ha='center', va='bottom')

def plot_counts(column, ax, title):
    # Gender is on the x-axis; the legend lists the answer values of the column
    counts = grouped[column].value_counts().unstack()
    counts.plot(kind='bar', ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Gender')
    ax.set_ylabel('Count')
    annotate_bars(ax, '{:.0f}')

def plot_mean(column, ax, title):
    # One bar per gender showing the mean of a numeric column
    means = grouped[column].mean().to_frame(f'{column} Mean')
    means.plot(kind='bar', ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Gender')
    ax.set_ylabel(f'{column} Mean')
    annotate_bars(ax, '{:.2f}')

plot_counts('CALC', axes[0, 0], 'How often do you drink alcohol?')
plot_counts('FAVC', axes[0, 1], 'Do you eat high caloric food frequently?')
plot_mean('FCVC', axes[1, 0], 'Do you usually eat vegetables in your meals?')
plot_mean('NCP', axes[1, 1], 'How many main meals do you have daily?')
plot_counts('SCC', axes[2, 0], 'Do you monitor the calories you eat daily?')
plot_counts('SMOKE', axes[2, 1], 'Do you smoke?')

# Adjust spacing between subplots
plt.subplots_adjust(hspace=0.5, wspace=0.5)

# Display the plot
plt.show()
# value_counts() sorts by descending count; reversing it plots the rarest level first
sorted_obesity_levels = df['NObeyesdad'].value_counts().index

plt.figure(figsize=(6, 6))
sns.countplot(x="NObeyesdad", data=df, order=sorted_obesity_levels[::-1], palette="Greens")
plt.title("Obesity Level Distribution", fontsize=14, fontweight="bold")
plt.xticks(rotation=45)

# Annotate each bar with its count, in the same ascending order as the bars
for i, count in enumerate(df['NObeyesdad'].value_counts()[::-1]):
    plt.text(i, count, str(count), ha='center', va='bottom')

plt.tight_layout()
plt.show()
plt.figure(figsize=(16, 6))

plt.subplot(1, 3, 1)
sns.histplot(df["Age"].dropna(), kde=True, color="Red")
plt.title("Age Distribution", fontsize=14, fontweight="bold")

plt.subplot(1, 3, 2)
sns.histplot(df["Height"].dropna(), kde=True, color="Orange")
plt.title("Height Distribution", fontsize=14, fontweight="bold")

plt.subplot(1, 3, 3)
sns.histplot(df["Weight"].dropna(), kde=True, color="Purple")
plt.title("Weight Distribution", fontsize=14, fontweight="bold")

plt.tight_layout()
# Correlation heatmap
plt.figure(figsize=(12, 8))

# Select only numerical columns
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Calculate correlation matrix
corr_matrix = df[numeric_cols].corr()

# Plot correlation heatmap
sns.heatmap(corr_matrix, annot=True, cmap="YlGnBu")
plt.title("Feature Correlation Heatmap", fontsize=16, fontweight="bold")
plt.show()
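Each cell of that heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a rough illustration of the number behind a single cell, here is the formula written out in plain Python and applied to Height and Weight from the five rows shown by `df.head()`. Five rows are far too few to say anything about the full dataset, and the `pearson` helper is only a sketch:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Height (m) and Weight (kg) from the five preview rows
heights = [1.62, 1.52, 1.80, 1.80, 1.78]
weights = [64.0, 56.0, 77.0, 87.0, 89.8]

r = pearson(heights, weights)
print(f"Height vs Weight on the preview rows: r = {r:.2f}")
```

Values near 1 or -1 mean the two columns move together almost linearly, and values near 0 mean there is no linear relationship. Pearson says nothing about non-linear patterns.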
# Define BMI categories and corresponding colors
bmi_colors = {
    "Normal": "green",
    "Overweight": "red",
    "Underweight": "blue"
}

# Calculate BMI for each person in the dataset
df['BMI'] = df['Weight'] / (df['Height'] ** 2)

# Categorize BMI; with right=False the bins are [0, 18.5), [18.5, 25) and [25, inf),
# so values between 24.9 and 25 still count as "Normal"
df['BMI Category'] = pd.cut(df['BMI'], bins=[0, 18.5, 25, np.inf], labels=["Underweight", "Normal", "Overweight"], right=False)

# Plot the scatterplot with colors based on BMI categories
plt.figure(figsize=(8, 8))
for category, color in bmi_colors.items():
    subset = df[df['BMI Category'] == category]
    plt.scatter(subset['Height'], subset['Weight'], color=color, label=category)

plt.title("Height vs Weight", fontsize=14, fontweight="bold")
plt.xlabel("Height (m)")
plt.ylabel("Weight (kg)")
plt.legend()
plt.tight_layout()
plt.show()
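To make the BMI step concrete, here is a small worked example on the five rows shown by `df.head()`. It is a sketch that assumes the common cut-offs of 18.5 and 25 with the lower bound inclusive. The function names are illustrative and not part of the notebook:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Three coarse bands, assuming cut-offs at 18.5 and 25."""
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Normal"
    return "Overweight"


# (height in m, weight in kg, dataset label) from the five preview rows
rows = [
    (1.62, 64.0, "Normal_Weight"),
    (1.52, 56.0, "Normal_Weight"),
    (1.80, 77.0, "Normal_Weight"),
    (1.80, 87.0, "Overweight_Level_I"),
    (1.78, 89.8, "Overweight_Level_II"),
]

categories = []
for height, weight, label in rows:
    value = bmi(weight, height)
    categories.append(bmi_category(value))
    print(f"BMI {value:.2f} -> {categories[-1]:<11} (dataset label: {label})")
```

On these rows the coarse BMI band agrees with the dataset label every time. That hints at how much of the target is already explained by Height and Weight alone.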
Conclusion
Alright, everyone, we’ve reached the end of our journey through Machine Learning Project 6: Obesity Type – Best EDA and Classification! We’ve had an exciting adventure exploring the world of obesity types and harnessing the potential of machine learning to expertly classify them.
But this is more than just crunching numbers and making predictions. It’s about making a genuine difference in people’s lives.
By understanding the intricacies of obesity types, we’re paving the way for personalized interventions and treatments that can truly transform lives for the better.
As we conclude our journey, let’s maintain the momentum. Let’s continue pushing the boundaries of data science to address real-world issues and have a positive impact on society.
Remember, we hold the power of data in our hands, so let’s use it wisely to create a healthier, happier world for everyone. Take care, and keep coding!