Machine Learning Project 9: Best Anemia Types Classification

Introduction

Hey everyone, data enthusiasts! Welcome to another thrilling exploration into the realm of machine learning.

Today, we’re delving into a critical health issue that impacts millions globally – anemia. And not just any anemia, but the various types of it.

In this project, we’ll be rolling up our sleeves and diving into top-notch Exploratory Data Analysis (EDA) and classification techniques.

This project is ideal for those seeking machine learning projects for their final year or for students aiming to make a tangible impact in the real world.

Additionally, we’ll be sharing some fantastic resources and code snippets related to machine learning projects on GitHub, so you can follow along and perhaps even contribute your own enhancements.

Anemia Types EDA

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display, HTML

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from tensorflow import keras

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from collections import Counter

Exploring data

df = pd.read_csv("/kaggle/input/anemia-types-classification/diagnosed_cbc_data_v4.csv")

df.shape

(1281, 15)

df.head()

	WBC	LYMp	NEUTp	LYMn	NEUTn	RBC	HGB	HCT	MCV	MCH	MCHC	PLT	PDW	PCT	Diagnosis
0	10.0	43.2	50.1	4.3	5.0	2.77	7.3	24.2	87.7	26.3	30.1	189.0	12.5	0.17	Normocytic hypochromic anemia
1	10.0	42.4	52.3	4.2	5.3	2.84	7.3	25.0	88.2	25.7	20.2	180.0	12.5	0.16	Normocytic hypochromic anemia
2	7.2	30.7	60.7	2.2	4.4	3.97	9.0	30.5	77.0	22.6	29.5	148.0	14.3	0.14	Iron deficiency anemia
3	6.0	30.2	63.5	1.8	3.8	4.22	3.8	32.8	77.9	23.2	29.8	143.0	11.3	0.12	Iron deficiency anemia
4	4.2	39.1	53.7	1.6	2.3	3.93	0.4	316.0	80.6	23.9	29.7	236.0	12.8	0.22	Normocytic hypochromic anemia

df.isna().sum()

WBC          0
LYMp         0
NEUTp        0
LYMn         0
NEUTn        0
RBC          0
HGB          0
HCT          0
MCV          0
MCH          0
MCHC         0
PLT          0
PDW          0
PCT          0
Diagnosis    0
dtype: int64

df.describe()

	WBC	LYMp	NEUTp	LYMn	NEUTn	RBC	HGB	HCT	MCV	MCH	MCHC	PLT	PDW	PCT
count	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000	1281.0000	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000
mean	7.862717	25.845000	77.511000	1.880760	5.140940	4.708267	12.184551	46.1526	85.793919	32.084840	31.739149	229.981421	14.312512	0.260280
std	3.564466	7.038728	147.746273	1.335689	2.872294	2.817200	3.812897	104.8861	27.177663	111.170756	3.300352	93.019336	3.005079	0.685351
min	0.800000	6.200000	0.700000	0.200000	0.500000	1.360000	-10.000000	2.0000	-79.300000	10.900000	11.500000	10.000000	8.400000	0.010000
25%	6.000000	25.845000	71.100000	1.880760	5.100000	4.190000	10.800000	39.2000	81.200000	25.500000	30.600000	157.000000	13.300000	0.170000
50%	7.400000	25.845000	77.511000	1.880760	5.140940	4.600000	12.300000	46.1526	86.600000	27.800000	32.000000	213.000000	14.312512	0.260280
75%	8.680000	25.845000	77.511000	1.880760	5.140940	5.100000	13.500000	46.1526	90.200000	29.600000	32.900000	293.000000	14.700000	0.260280
max	45.700000	91.400000	5317.000000	41.800000	79.000000	90.800000	87.100000	3715.0000	990.000000	3117.000000	92.800000	660.000000	97.000000	13.600000
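
The summary above contains values that cannot be real measurements, such as a minimum HGB of -10 and a minimum MCV of -79.3. Here is a minimal sketch for counting such entries. The `sample` frame is made up for illustration (only the column names come from the dataset); on the real data the same expression could be applied to `df`.

```python
import pandas as pd

# Hypothetical mini-sample: column names match the dataset, rows are illustrative
sample = pd.DataFrame({
    "HGB": [12.3, -10.0, 13.5],
    "MCV": [86.6, 81.2, -79.3],
    "PLT": [213.0, 157.0, 293.0],
})

# Count negative entries per numeric column; a blood count cannot be below zero
negatives = (sample.select_dtypes("number") < 0).sum()
print(negatives[negatives > 0])
```

Whether to drop, clip, or re-impute the flagged rows is a judgment call that depends on how the data was collected.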

Diagnoses count

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 7))

counts = df["Diagnosis"].value_counts()
counts.plot(kind="bar", ax=axes[0])
for container in axes[0].containers:
    axes[0].bar_label(container)

axes[1].pie(counts, autopct="%0.2f%%", labels=counts.index)
plt.tight_layout()
plt.show()

Correlation coefficients of patient features

corr = df[df.columns[:-1]].corr()
corr.style.background_gradient(cmap='coolwarm')

	WBC	LYMp	NEUTp	LYMn	NEUTn	RBC	HGB	HCT	MCV	MCH	MCHC	PLT	PDW	PCT
WBC	1.000000	-0.098685	-0.022998	0.379974	0.306216	0.029473	0.053450	-0.002289	-0.011190	-0.016255	-0.060900	0.211354	-0.026318	-0.004214
LYMp	-0.098685	1.000000	0.009017	0.465271	-0.296009	-0.031355	-0.012136	0.019576	-0.015431	-0.005443	-0.064747	0.047673	0.011715	-0.038542
NEUTp	-0.022998	0.009017	1.000000	-0.019970	-0.015514	-0.015369	-0.025144	-0.004367	0.000013	-0.003420	-0.004346	0.016326	-0.008418	-0.000491
LYMn	0.379974	0.465271	-0.019970	1.000000	0.047662	0.013261	0.016153	0.003046	-0.024092	-0.009649	-0.067572	0.072622	-0.014422	-0.019014
NEUTn	0.306216	-0.296009	-0.015514	0.047662	1.000000	0.029271	0.081665	0.091813	-0.014108	0.047501	-0.015920	0.022225	-0.038612	-0.043049
RBC	0.029473	-0.031355	-0.015369	0.013261	0.029271	1.000000	0.463446	-0.000816	-0.039550	-0.009109	-0.100370	0.004301	0.037160	-0.013501
HGB	0.053450	-0.012136	-0.025144	0.016153	0.081665	0.463446	1.000000	-0.000459	0.023494	0.001506	-0.029115	0.040231	0.135668	-0.053314
HCT	-0.002289	0.019576	-0.004367	0.003046	0.091813	-0.000816	-0.000459	1.000000	0.000813	0.608017	0.002065	-0.017894	0.088878	-0.012773
MCV	-0.011190	-0.015431	0.000013	-0.024092	-0.014108	-0.039550	0.023494	0.000813	1.000000	0.013114	0.094950	0.064139	0.021137	-0.043245
MCH	-0.016255	-0.005443	-0.003420	-0.009649	0.047501	-0.009109	0.001506	0.608017	0.013114	1.000000	0.015006	-0.032141	0.053705	0.008228
MCHC	-0.060900	-0.064747	-0.004346	-0.067572	-0.015920	-0.100370	-0.029115	0.002065	0.094950	0.015006	1.000000	0.062829	0.004920	-0.036572
PLT	0.211354	0.047673	0.016326	0.072622	0.022225	0.004301	0.040231	-0.017894	0.064139	-0.032141	0.062829	1.000000	0.087677	-0.059280
PDW	-0.026318	0.011715	-0.008418	-0.014422	-0.038612	0.037160	0.135668	0.088878	0.021137	0.053705	0.004920	0.087677	1.000000	-0.067446
PCT	-0.004214	-0.038542	-0.000491	-0.019014	-0.043049	-0.013501	-0.053314	-0.012773	-0.043245	0.008228	-0.036572	-0.059280	-0.067446	1.000000
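
A 14x14 table is hard to scan by eye, so here is a small sketch that lists the strongest pairs. The helper name `top_correlated_pairs` is an assumption made for this example; the demo matrix reuses four columns of the table above, and on the real data you could pass `corr` directly.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(corr: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n feature pairs with the largest absolute correlation."""
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Sub-matrix copied from the correlation table above (RBC, HGB, HCT, MCH)
cols = ["RBC", "HGB", "HCT", "MCH"]
demo_corr = pd.DataFrame(
    [
        [1.000000, 0.463446, -0.000816, -0.009109],
        [0.463446, 1.000000, -0.000459, 0.001506],
        [-0.000816, -0.000459, 1.000000, 0.608017],
        [-0.009109, 0.001506, 0.608017, 1.000000],
    ],
    index=cols,
    columns=cols,
)

strongest = top_correlated_pairs(demo_corr, n=2)
print(strongest)
```

In this subset the HCT/MCH pair (0.61) and the RBC/HGB pair (0.46) stand out; most other coefficients in the table are close to zero.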

Mean of each numerical feature per diagnosis, showing tendencies among diagnoses

index = 0
grouped = df.groupby("Diagnosis")
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15, 7))
    for j in range(7):
        mean = grouped[df.columns[index]].mean()
        sns.barplot(x=mean.index, y=mean, ax=axes[j])
        for container in axes[j].containers:
            axes[j].bar_label(container, label_type="center", rotation=90)
        axes[j].set_xticklabels(axes[j].get_xticklabels(), rotation=90)
        index += 1
    plt.tight_layout()
    plt.show()

General numerical density distribution

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15, 6))
    for j in range(7):
        sns.kdeplot(df, x=df.columns[index], ax=axes[j])
        index += 1
    plt.tight_layout()
    plt.show()

Density distribution among diagnoses

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(20, 6))
    for j in range(7):
        sns.kdeplot(df, x=df.columns[index], hue="Diagnosis", ax=axes[j])
        index += 1
    plt.tight_layout()
    plt.show()

Data distribution with outliers shown on boxplots

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15, 6))
    for j in range(7):
        sns.boxplot(df, x=df.columns[index], ax=axes[j])
        index += 1
    plt.tight_layout()
    plt.show()

Boxplots with categorical diagnostic distribution

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15, 6))
    for j in range(7):
        sns.boxplot(df, x="Diagnosis",y=df.columns[index], ax=axes[j])
        axes[j].set_xticklabels(axes[j].get_xticklabels(), rotation=90)
        index += 1
    plt.tight_layout()
    plt.show()

Encoding diagnoses

le = LabelEncoder()

df["Diagnosis"] = le.fit_transform(df["Diagnosis"].values)

x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

Scaling features

scaler = MinMaxScaler()
x = scaler.fit_transform(x)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)
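
In the cells above the scaler is fitted on the full dataset before the split, so the minimum and maximum of the test rows leak into training. A leakage-free variant would split first and fit the scaler on the training rows only. The sketch below uses made-up data and hypothetical names (`x_demo`, `demo_scaler`) so it does not overwrite the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
x_demo = rng.normal(size=(100, 3))       # made-up features
y_demo = rng.integers(0, 2, size=100)    # made-up binary labels

# Split first, so the test rows stay unseen during preprocessing
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = MinMaxScaler()
x_tr_scaled = demo_scaler.fit_transform(x_tr)  # min/max learned from training rows only
x_te_scaled = demo_scaler.transform(x_te)      # test rows reuse those statistics
```

With this ordering, scaled test values can fall slightly outside [0, 1], which is expected.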

Handling class imbalances applying SMOTE

smote = SMOTE()
print("Before: ", Counter(y_train))
x_train, y_train = smote.fit_resample(x_train, y_train)
print("After: ", Counter(y_train))

Before:  Counter({0: 255, 5: 223, 6: 223, 1: 155, 8: 56, 7: 45, 2: 42, 4: 17, 3: 8})
After:  Counter({5: 255, 6: 255, 0: 255, 8: 255, 7: 255, 1: 255, 2: 255, 4: 255, 3: 255})
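
To give a feel for what SMOTE does, here is a simplified NumPy sketch of its core idea: a synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the imbalanced-learn implementation, and the function name `smote_like_sample` is made up for this example.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    i = rng.integers(len(minority))                      # pick a random minority sample
    dists = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(dists)[1:k + 1]              # skip position 0, the sample itself
    j = rng.choice(neighbours)
    gap = rng.random()                                   # position along the segment, in [0, 1)
    return minority[i] + gap * (minority[j] - minority[i])

# Made-up minority class: the four corners of the unit square
minority_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_sample(minority_demo, k=2, rng=np.random.default_rng(0))
print(synthetic)
```

Because every new point is an interpolation, it always lies between two existing minority samples rather than in an unexplored part of feature space.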

Machine Learning Models

rfc = RandomForestClassifier()
abc = AdaBoostClassifier()
etc = ExtraTreesClassifier()
gbc = GradientBoostingClassifier()
mnb = MultinomialNB()
xgb = XGBClassifier()
lgb = LGBMClassifier()

models = [rfc, abc, etc, gbc,
         mnb, xgb, lgb]

names = ["Random Forest", "Ada Boost", "Extra Trees",
        "Gradient Boosting", "Naive Bayes", "XGBoost", "LightGBM"]

Neural Network architecture

model = keras.models.Sequential([
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(df["Diagnosis"].nunique(), activation="softmax")
])
model.compile(optimizer='adam',
              loss="categorical_crossentropy",
              metrics=['accuracy'])
model.summary()

Model: "sequential"

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ ?                      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ ?                      │   0 (unbuilt) │
└─────────────────────────────────┴────────────────────────┴───────────────┘

 Total params: 0 (0.00 B)

 Trainable params: 0 (0.00 B)

 Non-trainable params: 0 (0.00 B)

Training models

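# One-hot encode the integer labels, as categorical_crossentropy expects
y_cat = keras.utils.to_categorical(y_train, num_classes=df["Diagnosis"].nunique())
history = model.fit(x_train, y_cat, validation_split=0.1, epochs=50, batch_size=32)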

ANN model performance during training – history log

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 7))

axes[0].plot(history.history["loss"], label="Training loss")
axes[0].plot(history.history["val_loss"], label="Validation loss")
axes[0].legend()
axes[0].set_title("Loss log")


axes[1].plot(history.history["accuracy"], label="Training accuracy")
axes[1].plot(history.history["val_accuracy"], label="Validation accuracy")
axes[1].legend()
axes[1].set_title("Accuracy log")
plt.show()

Training ML models
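
The loop below calls a `training` helper whose definition is not shown in this post. The following is a hypothetical stand-in, assuming the helper fits the model on the resampled training data and returns the accuracy, the classification report, and the confusion matrix for the test set. The `evaluate` function is a name introduced here to keep the logic testable on its own.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def evaluate(model, x_tr, y_tr, x_te, y_te):
    """Fit a model and return (accuracy, text report, confusion matrix) on held-out data."""
    model.fit(x_tr, y_tr)
    pred = model.predict(x_te)
    # scikit-learn metrics take the true labels first and the predictions second
    return (
        accuracy_score(y_te, pred),
        classification_report(y_te, pred, zero_division=0),
        confusion_matrix(y_te, pred),
    )

def training(model):
    """Hypothetical stand-in for the missing helper; relies on the notebook's global split."""
    return evaluate(model, x_train, y_train, x_test, y_test)
```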

scores, reports, cms = [], dict(), dict()

for m, n in zip(models, names):
    score, report, cm = training(m)
    scores += [score*100]
    reports[n] = report
    cms[n] = cm
    
pred = model.predict(x_test)
pred = np.argmax(pred, axis=1)  # class with the highest softmax probability
# scikit-learn metrics expect (y_true, y_pred), in that order
scores += [accuracy_score(y_test, pred)*100]
reports["ANN"] = classification_report(y_test, pred)
cms["ANN"] = confusion_matrix(y_test, pred)
names += ["ANN"]
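
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy is symmetric, so the order does not change it, but precision and recall trade places when the arguments are reversed. A toy example with made-up labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true_demo = [0, 0, 0, 1]   # made-up ground truth
y_pred_demo = [0, 1, 1, 1]   # made-up predictions

# Accuracy is the same in either order
acc_ok = accuracy_score(y_true_demo, y_pred_demo)
acc_swapped = accuracy_score(y_pred_demo, y_true_demo)

# Correct order: precision 1/3, recall 1.0 for class 1
prec_ok = precision_score(y_true_demo, y_pred_demo)
rec_ok = recall_score(y_true_demo, y_pred_demo)

# Reversed order: the two values swap
prec_swapped = precision_score(y_pred_demo, y_true_demo)
rec_swapped = recall_score(y_pred_demo, y_true_demo)

print(acc_ok, acc_swapped)
print(prec_ok, rec_ok)
print(prec_swapped, rec_swapped)
```

The same applies to `classification_report`, where the support column also switches from counting true labels to counting predicted labels.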

Model accuracies from best to worst

dt = pd.DataFrame({"scores": scores}, index=names)
dt = dt.sort_values("scores", ascending=False)
dt["scores"] = round(dt["scores"], 2)
fig, axes = plt.subplots()
sns.barplot(x=dt.index, y=dt.iloc[:, 0], ax=axes)
for container in axes.containers:
    axes.bar_label(container)
axes.set_xticklabels(axes.get_xticklabels(), rotation=90)
plt.show()

Confusion matrices from best to worst performing model

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 7))
    for j in range(4):
        sns.heatmap(cms[dt.index[index]], annot=True, ax=axes[j])
        axes[j].set_title("{}: {}%".format(dt.index[index], dt.iloc[index, 0]))
        index += 1
    plt.tight_layout()
    plt.show()

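Classification reports

In the reports printed below, the predictions were passed as the first argument to classification_report. As a result, the support column counts predicted labels (which is why it differs between models), and the precision and recall columns are swapped relative to the usual reading.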

for i in dt.index:
    print("*"*30)
    print(i)
    print(reports[i])
    print("\n")

******************************
LightGBM
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        81
           1       1.00      1.00      1.00        34
           2       1.00      1.00      1.00         5
           3       1.00      0.75      0.86         4
           4       1.00      1.00      1.00         1
           5       1.00      1.00      1.00        56
           6       0.98      1.00      0.99        45
           7       1.00      1.00      1.00        14
           8       1.00      1.00      1.00        17

    accuracy                           1.00       257
   macro avg       1.00      0.97      0.98       257
weighted avg       1.00      1.00      1.00       257



******************************
Random Forest
              precision    recall  f1-score   support

           0       0.99      1.00      0.99        80
           1       1.00      1.00      1.00        34
           2       1.00      0.83      0.91         6
           3       1.00      1.00      1.00         3
           4       1.00      0.50      0.67         2
           5       1.00      1.00      1.00        56
           6       0.98      1.00      0.99        45
           7       1.00      1.00      1.00        14
           8       1.00      1.00      1.00        17

    accuracy                           0.99       257
   macro avg       1.00      0.93      0.95       257
weighted avg       0.99      0.99      0.99       257



******************************
XGBoost
              precision    recall  f1-score   support

           0       0.99      1.00      0.99        80
           1       1.00      1.00      1.00        34
           2       1.00      1.00      1.00         5
           3       1.00      0.75      0.86         4
           4       1.00      1.00      1.00         1
           5       1.00      1.00      1.00        56
           6       0.98      1.00      0.99        45
           7       1.00      0.93      0.97        15
           8       1.00      1.00      1.00        17

    accuracy                           0.99       257
   macro avg       1.00      0.96      0.98       257
weighted avg       0.99      0.99      0.99       257



******************************
Gradient Boosting
              precision    recall  f1-score   support

           0       0.99      1.00      0.99        80
           1       0.97      1.00      0.99        33
           2       1.00      1.00      1.00         5
           3       1.00      0.75      0.86         4
           4       1.00      1.00      1.00         1
           5       1.00      1.00      1.00        56
           6       0.98      1.00      0.99        45
           7       1.00      0.88      0.93        16
           8       1.00      1.00      1.00        17

    accuracy                           0.99       257
   macro avg       0.99      0.96      0.97       257
weighted avg       0.99      0.99      0.99       257



******************************
Extra Trees
              precision    recall  f1-score   support

           0       0.91      0.95      0.93        78
           1       0.91      0.91      0.91        34
           2       0.60      0.60      0.60         5
           3       0.67      1.00      0.80         2
           4       0.00      0.00      0.00         3
           5       0.91      0.81      0.86        63
           6       0.76      0.97      0.85        36
           7       0.93      0.93      0.93        14
           8       0.94      0.73      0.82        22

    accuracy                           0.88       257
   macro avg       0.74      0.77      0.74       257
weighted avg       0.88      0.88      0.87       257



******************************
Ada Boost
              precision    recall  f1-score   support

           0       1.00      0.75      0.86       108
           1       0.00      0.00      0.00         0
           2       0.00      0.00      0.00         0
           3       0.00      0.00      0.00         0
           4       0.00      0.00      0.00         0
           5       0.00      0.00      0.00         0
           6       0.98      0.30      0.46       149
           7       0.00      0.00      0.00         0
           8       0.00      0.00      0.00         0

    accuracy                           0.49       257
   macro avg       0.22      0.12      0.15       257
weighted avg       0.99      0.49      0.63       257



******************************
ANN
              precision    recall  f1-score   support

           0       0.69      0.71      0.70        79
           1       0.44      0.60      0.51        25
           2       1.00      0.36      0.53        14
           3       1.00      0.20      0.33        15
           4       1.00      0.02      0.04        52
           5       0.21      0.38      0.27        32
           6       0.22      0.48      0.30        21
           7       0.00      0.00      0.00        18
           8       0.00      0.00      0.00         1

    accuracy                           0.40       257
   macro avg       0.51      0.30      0.30       257
weighted avg       0.62      0.40      0.38       257



******************************
Naive Bayes
              precision    recall  f1-score   support

           0       0.68      0.63      0.65        87
           1       0.26      0.35      0.30        26
           2       0.60      0.25      0.35        12
           3       1.00      0.25      0.40        12
           4       1.00      0.04      0.07        27
           5       0.05      0.19      0.08        16
           6       0.17      0.40      0.24        20
           7       0.00      0.00      0.00         0
           8       0.94      0.28      0.43        57

    accuracy                           0.38       257
   macro avg       0.52      0.26      0.28       257
weighted avg       0.66      0.38      0.41       257


Anemia Types Using Ensemble Learning

About dataset

CBC (complete blood count) data labeled with the diagnosed anemia type. The records were collected from several CBC reports and diagnosed manually.
Data Dictionary:
HGB: The amount of hemoglobin in the blood, crucial for oxygen transport.
PLT: The number of platelets in the blood, involved in blood clotting.
WBC: The count of white blood cells, vital for immune response.
RBC: The count of red blood cells, responsible for oxygen transport.
MCV (Mean Corpuscular Volume): Average volume of a single red blood cell.
MCH (Mean Corpuscular Hemoglobin): Average amount of hemoglobin per red blood cell.
MCHC (Mean Corpuscular Hemoglobin Concentration): Average concentration of hemoglobin in red blood cells.
PDW (Platelet Distribution Width): A measure of the variability in platelet size.
PCT (Plateletcrit): The fraction of blood volume occupied by platelets. In a CBC panel this abbreviation does not refer to the procalcitonin test.
Diagnosis: Anemia type based on the CBC parameters.
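
The three red cell indices in the dictionary are not independent measurements: they are conventionally derived from RBC, HGB, and HCT (hematocrit, a dataset column not listed above). The sketch below applies the standard formulas to the first row shown by df.head() earlier. It assumes the usual units (RBC in 10^12 cells/L, HGB in g/dL, HCT in percent), which the dataset description does not state, and the function name is made up for this example.

```python
def red_cell_indices(rbc, hgb, hct):
    """Derive MCV (fL), MCH (pg) and MCHC (g/dL) from RBC, HGB and HCT."""
    mcv = hct / rbc * 10    # average red cell volume
    mch = hgb / rbc * 10    # average hemoglobin mass per red cell
    mchc = hgb / hct * 100  # hemoglobin concentration within the red cells
    return mcv, mch, mchc

# First row of the dataset: RBC 2.77, HGB 7.3, HCT 24.2
mcv, mch, mchc = red_cell_indices(rbc=2.77, hgb=7.3, hct=24.2)
print(round(mcv, 1), round(mch, 1), round(mchc, 1))  # prints 87.4 26.4 30.2
```

The row itself reports MCV 87.7, MCH 26.3, and MCHC 30.1, so the derived values agree closely. Rows where they disagree strongly are good candidates for data entry errors.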

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import warnings

df = pd.read_csv('/kaggle/input/anemia-types-classification/diagnosed_cbc_data_v4.csv')

df.sample(10)

	WBC	LYMp	NEUTp	LYMn	NEUTn	RBC	HGB	HCT	MCV	MCH	MCHC	PLT	PDW	PCT	Diagnosis
694	8.70	25.845	77.511	1.88076	5.14094	5.70	14.5	46.1526	95.0	31.0	33.0	380.0	14.700000	0.26028	Healthy
1077	4.30	25.845	77.511	1.88076	5.14094	2.71	8.0	46.1526	87.8	29.5	33.6	32.0	14.312512	0.26028	Normocytic normochromic anemia
445	8.50	27.800	62.400	2.40000	5.30000	3.85	11.2	34.8000	90.6	29.0	32.1	243.0	13.800000	0.23000	Normocytic normochromic anemia
339	7.40	25.900	67.100	1.90000	5.00000	5.29	10.3	34.8000	65.8	19.4	29.5	37.0	8.400000	0.03000	Iron deficiency anemia
719	8.70	25.845	77.511	1.88076	5.14094	5.70	14.5	46.1526	95.0	31.0	33.0	380.0	13.900000	0.26028	Healthy
730	7.20	25.845	77.511	1.88076	5.14094	5.50	13.5	46.1526	93.0	31.0	32.0	330.0	18.200000	0.26028	Healthy
438	4.20	9.500	84.700	0.40000	3.60000	3.34	9.2	28.4000	85.3	275.0	32.3	107.0	13.100000	0.10000	Normocytic normochromic anemia
770	8.20	25.845	77.511	1.88076	5.14094	5.30	14.8	46.1526	91.0	30.0	33.0	360.0	14.200000	0.26028	Healthy
320	6.20	16.300	77.200	1.00000	4.80000	3.85	9.3	31.5000	82.0	24.1	29.5	161.0	12.300000	0.15000	Normocytic hypochromic anemia
997	4.48	25.845	77.511	1.88076	5.14094	4.87	11.9	46.1526	82.5	24.4	29.6	384.0	14.312512	0.26028	Normocytic hypochromic anemia

df.shape

(1281, 15)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1281 entries, 0 to 1280
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   WBC        1281 non-null   float64
 1   LYMp       1281 non-null   float64
 2   NEUTp      1281 non-null   float64
 3   LYMn       1281 non-null   float64
 4   NEUTn      1281 non-null   float64
 5   RBC        1281 non-null   float64
 6   HGB        1281 non-null   float64
 7   HCT        1281 non-null   float64
 8   MCV        1281 non-null   float64
 9   MCH        1281 non-null   float64
 10  MCHC       1281 non-null   float64
 11  PLT        1281 non-null   float64
 12  PDW        1281 non-null   float64
 13  PCT        1281 non-null   float64
 14  Diagnosis  1281 non-null   object 
dtypes: float64(14), object(1)
memory usage: 150.2+ KB

df.describe()

	WBC	LYMp	NEUTp	LYMn	NEUTn	RBC	HGB	HCT	MCV	MCH	MCHC	PLT	PDW	PCT
count	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000	1281.0000	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000	1281.000000
mean	7.862717	25.845000	77.511000	1.880760	5.140940	4.708267	12.184551	46.1526	85.793919	32.084840	31.739149	229.981421	14.312512	0.260280
std	3.564466	7.038728	147.746273	1.335689	2.872294	2.817200	3.812897	104.8861	27.177663	111.170756	3.300352	93.019336	3.005079	0.685351
min	0.800000	6.200000	0.700000	0.200000	0.500000	1.360000	-10.000000	2.0000	-79.300000	10.900000	11.500000	10.000000	8.400000	0.010000
25%	6.000000	25.845000	71.100000	1.880760	5.100000	4.190000	10.800000	39.2000	81.200000	25.500000	30.600000	157.000000	13.300000	0.170000
50%	7.400000	25.845000	77.511000	1.880760	5.140940	4.600000	12.300000	46.1526	86.600000	27.800000	32.000000	213.000000	14.312512	0.260280
75%	8.680000	25.845000	77.511000	1.880760	5.140940	5.100000	13.500000	46.1526	90.200000	29.600000	32.900000	293.000000	14.700000	0.260280
max	45.700000	91.400000	5317.000000	41.800000	79.000000	90.800000	87.100000	3715.0000	990.000000	3117.000000	92.800000	660.000000	97.000000	13.600000

df.dtypes

WBC          float64
LYMp         float64
NEUTp        float64
LYMn         float64
NEUTn        float64
RBC          float64
HGB          float64
HCT          float64
MCV          float64
MCH          float64
MCHC         float64
PLT          float64
PDW          float64
PCT          float64
Diagnosis     object
dtype: object

df.isnull().sum()

WBC          0
LYMp         0
NEUTp        0
LYMn         0
NEUTn        0
RBC          0
HGB          0
HCT          0
MCV          0
MCH          0
MCHC         0
PLT          0
PDW          0
PCT          0
Diagnosis    0
dtype: int64

Data visualization

plt.figure(figsize=(10, 6))
sns.countplot(x='Diagnosis', data=df)
plt.title('Count of Each Diagnosis Category')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()

columns = ['WBC', 'LYMp', 'NEUTp', 'LYMn', 'NEUTn', 'RBC', 'HGB', 'HCT', 'MCV', 'MCH', 'MCHC', 'PLT', 'PDW', 'PCT']
for column in columns:
    plt.figure(figsize=(8, 5))
    sns.histplot(df[column], kde=True)  
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.xticks(rotation=90)
    plt.show()

import seaborn as sns
import matplotlib.pyplot as plt

# List of columns to plot
columns = ['WBC', 'LYMp', 'NEUTp', 'LYMn', 'NEUTn', 'RBC', 'HGB', 'HCT', 'MCV', 'MCH', 'MCHC', 'PLT', 'PDW', 'PCT']

# Loop through each column and create a boxplot
for column in columns:
    plt.figure(figsize=(8, 5))
    sns.boxplot(x=df[column])  # boxplot takes no kde argument, unlike histplot
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.show()

Train/test split

x = df.drop(columns = ['Diagnosis'])
y = df['Diagnosis']

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_label = label_encoder.fit_transform(y)

y_label

array([5, 5, 1, ..., 0, 0, 0])

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y_label,test_size =0.2,random_state =43)
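
Some diagnoses in this dataset have only a handful of rows, so a purely random split can leave a rare class under-represented in the test set. One option, not used in the original notebook, is the `stratify` argument of `train_test_split`, which keeps class proportions the same on both sides. A sketch with made-up imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 90 samples of class 0, 10 of class 1
x_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=43, stratify=y_demo
)

# Class counts in the 20-row test set follow the 90/10 ratio
test_counts = np.bincount(y_te)
print(test_counts)
```

On the real data the equivalent call would pass `stratify=y_label`, provided every class has at least two rows.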

Ensemble Learning

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=10)),
    ('gbdt',GradientBoostingClassifier())
]

from sklearn.ensemble import StackingClassifier

clf = StackingClassifier(
    estimators=estimators, 
    final_estimator=LogisticRegression(),
    cv=10
)

clf.fit(x_train,y_train)

StackingClassifier(cv=10,
                   estimators=[('rf',
                                RandomForestClassifier(n_estimators=10,
                                                       random_state=42)),
                               ('knn', KNeighborsClassifier(n_neighbors=10)),
                               ('gbdt', GradientBoostingClassifier())],
                   final_estimator=LogisticRegression())

y_pred = clf.predict(x_test)

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.980544747081712

XGBoost

import xgboost as xgb
xgb_model = xgb.XGBClassifier()
xgb_model.fit(x_train, y_train)
y_pred = xgb_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.980544747081712

Conclusion

Hey there, data warriors! We explored the CBC dataset and trained a range of models to classify anemia types. In the first notebook, the tree-based ensembles (LightGBM, Random Forest, XGBoost, and Gradient Boosting) all scored about 99% or higher on the 257-sample test set, while the ANN and Naive Bayes stayed around 40%. In the second notebook, a stacking ensemble and a plain XGBoost model both reached roughly 98% accuracy.

To the aspiring project managers in machine learning: this project is a reminder that clear objectives, well-structured processes, and effective communication matter as much as the algorithms.

If you are working on a machine learning project for a master's program, treat this as a starting point and use these skills to take on bigger challenges.

Our main goal was to show how routine CBC data can be used to tell anemia types apart. Keep your curiosity alive, and happy coding!
