
Machine Learning Project 9: Anemia Types Classification


Introduction

Hey everyone, data enthusiasts! Welcome to another thrilling exploration into the realm of machine learning.


Today, we’re delving into a critical health issue that impacts millions globally – anemia. And not just any anemia, but the various types of it.

In this project, we’ll be rolling up our sleeves and diving into top-notch Exploratory Data Analysis (EDA) and classification techniques.


This project is ideal for those seeking machine learning projects for their final year or for students aiming to make a tangible impact in the real world.

Additionally, we’ll be sharing some fantastic resources and code snippets related to machine learning projects on GitHub, so you can follow along and perhaps even contribute your own enhancements.

Anemia Types EDA

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns

from IPython.display import display, HTML

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from tensorflow import keras

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

from collections import Counter

Exploring data

df = pd.read_csv("/kaggle/input/anemia-types-classification/diagnosed_cbc_data_v4.csv")
df.shape
(1281, 15)
df.head()
[df.head() output: the first five rows across the 14 CBC features (WBC, LYMp, NEUTp, LYMn, NEUTn, RBC, HGB, HCT, MCV, MCH, MCHC, PLT, PDW, PCT) and the Diagnosis label; the sample includes "Normocytic hypochromic anemia" and "Iron deficiency anemia" cases.]
df.isna().sum()
WBC          0
LYMp         0
NEUTp        0
LYMn         0
NEUTn        0
RBC          0
HGB          0
HCT          0
MCV          0
MCH          0
MCHC         0
PLT          0
PDW          0
PCT          0
Diagnosis    0
dtype: int64
df.describe()
[df.describe() output: summary statistics for the 14 numerical features (count 1281 each). Several red flags stand out: HGB has a minimum of -10 and MCV a minimum of -79.3 (negative, physiologically impossible), while maxima such as NEUTp 5317, HCT 3715, MCV 990, and MCH 3117 point to data-entry errors. LYMp, NEUTp, LYMn, NEUTn, HCT, PDW, and PCT show identical mean and median values, which suggests missing entries were imputed with the column mean.]

Diagnoses count

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 7))

counts = df["Diagnosis"].value_counts()
counts.plot(kind="bar", ax=axes[0])
for container in axes[0].containers:
    axes[0].bar_label(container)

axes[1].pie(counts, autopct="%0.2f%%", labels=counts.index)
plt.tight_layout()
plt.show()

Correlation coefficients of patient features

corr = df[df.columns[:-1]].corr()
corr.style.background_gradient(cmap='coolwarm')
[Styled 14x14 correlation matrix. The strongest pairwise correlations are HCT vs. MCH (0.61), LYMp vs. LYMn (0.47), RBC vs. HGB (0.46), WBC vs. LYMn (0.38), WBC vs. NEUTn (0.31), and WBC vs. PLT (0.21), plus a negative LYMp vs. NEUTn (-0.30); most other pairs are near zero.]

Mean value of each numerical feature, showing tendencies among diagnoses

index = 0
grouped = df.groupby("Diagnosis")
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15, 7))
    for j in range(7):
        mean = grouped[df.columns[index]].mean()
        sns.barplot(x=mean.index, y=mean, ax=axes[j])
        for container in axes[j].containers:
            axes[j].bar_label(container, label_type="center", rotation=90)
        axes[j].set_xticklabels(axes[j].get_xticklabels(), rotation=90)
        index += 1
    plt.tight_layout()
    plt.show()

General numerical density distribution

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15, 6))
    for j in range(7):
        sns.kdeplot(df, x=df.columns[index], ax=axes[j])
        index += 1
    plt.tight_layout()
    plt.show()

Density distribution among diagnoses

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(20, 6))
    for j in range(7):
        sns.kdeplot(df, x=df.columns[index], hue="Diagnosis", ax=axes[j])
        index += 1
    plt.tight_layout()
    plt.show()

Data distribution with outliers shown on boxplots

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15, 6))
    for j in range(7):
        sns.boxplot(df, x=df.columns[index], ax=axes[j])
        index += 1
    plt.tight_layout()
    plt.show()

Boxplots of each feature split by diagnosis

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=7, figsize=(15, 6))
    for j in range(7):
        sns.boxplot(df, x="Diagnosis",y=df.columns[index], ax=axes[j])
        axes[j].set_xticklabels(axes[j].get_xticklabels(), rotation=90)
        index += 1
    plt.tight_layout()
    plt.show()

Encoding diagnoses

le = LabelEncoder()

df["Diagnosis"] = le.fit_transform(df["Diagnosis"].values)
x = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
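
If you later need to map the encoded integers back to diagnosis names (for example when labeling confusion matrices), the fitted encoder keeps that mapping. A quick sketch:

# le.classes_ lists the original labels in encoded order:
# the diagnosis at index i is the one encoded as integer i
for code, name in enumerate(le.classes_):
    print(code, "->", name)

# inverse_transform turns integer predictions back into diagnosis strings,
# e.g. le.inverse_transform([0, 5])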

Scaling features

scaler = MinMaxScaler()
x = scaler.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)
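
One caveat worth flagging: fitting the scaler on the full dataset before splitting lets the test rows influence the learned min/max values. A stricter, leakage-free variant (a sketch, assuming x still holds the unscaled feature matrix) splits first and fits the scaler on the training portion only:

# split the raw (unscaled) features first, then fit the scaler on the training set only
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42, test_size=0.2)
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)  # learn min/max from the training data only
x_test = scaler.transform(x_test)        # apply the same bounds to the test data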

Handling class imbalance with SMOTE

smote = SMOTE()
print("Before: ", Counter(y_train))
x_train, y_train = smote.fit_resample(x_train, y_train)
print("After: ", Counter(y_train))
Before:  Counter({0: 255, 5: 223, 6: 223, 1: 155, 8: 56, 7: 45, 2: 42, 4: 17, 3: 8})
After:  Counter({5: 255, 6: 255, 0: 255, 8: 255, 7: 255, 1: 255, 2: 255, 4: 255, 3: 255})
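
A small but important detail: SMOTE is fit only on the training split, after train_test_split. Resampling before the split would let synthetic points derived from test-set neighbors leak into training and inflate the scores.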

Machine Learning Models

rfc = RandomForestClassifier()
abc = AdaBoostClassifier()
etc = ExtraTreesClassifier()
gbc = GradientBoostingClassifier()
mnb = MultinomialNB()
xgb = XGBClassifier()
lgb = LGBMClassifier()

models = [rfc, abc, etc, gbc,
         mnb, xgb, lgb]

names = ["Random Forest", "Ada Boost", "Extra Trees",
        "Gradient Boosting", "Nave Bayes", "XGBoost", "LightGBM"]

Neural Network architecture

model = keras.models.Sequential([
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(df["Diagnosis"].nunique(), activation="softmax")
])
model.compile(optimizer='adam',
              loss="categorical_crossentropy",
              metrics=['accuracy'])
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                     Output Shape                  Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ ?                      │   0 (unbuilt) │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ ?                      │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ ?                      │   0 (unbuilt) │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 0 (0.00 B)
 Trainable params: 0 (0.00 B)
 Non-trainable params: 0 (0.00 B)
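
The parameter counts read 0 (unbuilt) because the model was defined without an input shape; Keras only materializes the weights on the first call to fit or predict. If you want summary() to report real counts up front, you can declare the input explicitly. A minimal sketch:

# declaring the input shape builds the weights immediately,
# so model.summary() reports actual parameter counts
model = keras.models.Sequential([
    keras.layers.Input(shape=(x_train.shape[1],)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(df["Diagnosis"].nunique(), activation="softmax"),
])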

Training the neural network

# one-hot encode the training labels to match the categorical_crossentropy loss
y_cat = keras.utils.to_categorical(y_train)
history = model.fit(x_train, y_cat, validation_split=0.1, epochs=50, batch_size=32)

ANN model performance during training – history log

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 7))

axes[0].plot(history.history["loss"], label="Training loss")
axes[0].plot(history.history["val_loss"], label="Validation loss")
axes[0].legend()
axes[0].set_title("Loss log")


axes[1].plot(history.history["accuracy"], label="Training accuracy")
axes[1].plot(history.history["val_accuracy"], label="Validation accuracy")
axes[1].legend()
axes[1].set_title("Accuracy log")
plt.show()

Training ML models
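
The loop below relies on a training helper that isn't shown in the notebook excerpt. A minimal version, consistent with how its three return values are used (accuracy score, classification report, confusion matrix) and following sklearn's (y_true, y_pred) argument order, might look like this:

def training(model):
    # fit on the SMOTE-balanced training data, evaluate on the untouched test set
    model.fit(x_train, y_train)
    pred = model.predict(x_test)
    return (accuracy_score(y_test, pred),
            classification_report(y_test, pred),
            confusion_matrix(y_test, pred))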

scores, reports, cms = [], dict(), dict()

for m, n in zip(models, names):
    score, report, cm = training(m)
    scores += [score*100]
    reports[n] = report
    cms[n] = cm
    
pred = model.predict(x_test)
pred = [np.argmax(i) for i in pred]  # softmax probabilities -> class indices
# note: sklearn expects (y_true, y_pred); predictions are passed first here, which
# leaves accuracy unchanged but swaps precision and recall in the reports below
scores += [accuracy_score(pred, y_test)*100]
reports["ANN"] = classification_report(pred, y_test)
cms["ANN"] = confusion_matrix(pred, y_test)
names += ["ANN"]

ML model accuracies from best to worst

dt = pd.DataFrame({"scores": scores}, index=names)
dt = dt.sort_values("scores", ascending=False)
dt["scores"] = round(dt["scores"], 2)
fig, axes = plt.subplots()
sns.barplot(x=dt.index, y=dt.iloc[:, 0], ax=axes)
for container in axes.containers:
    axes.bar_label(container)
axes.set_xticklabels(axes.get_xticklabels(), rotation=90)
plt.show()

Confusion matrices from best to worst performing model

index = 0
for i in range(2):
    fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 7))
    for j in range(4):
        sns.heatmap(cms[dt.index[index]], annot=True, ax=axes[j])
        axes[j].set_title("{}: {}%".format(dt.index[index], dt.iloc[index, 0]))
        index += 1
    plt.tight_layout()
    plt.show()

Classification reports

for i in dt.index:
    print("*"*30)
    print(i)
    print(reports[i])
    print("\n")
******************************
LightGBM
precision recall f1-score support

0 1.00 1.00 1.00 81
1 1.00 1.00 1.00 34
2 1.00 1.00 1.00 5
3 1.00 0.75 0.86 4
4 1.00 1.00 1.00 1
5 1.00 1.00 1.00 56
6 0.98 1.00 0.99 45
7 1.00 1.00 1.00 14
8 1.00 1.00 1.00 17

accuracy 1.00 257
macro avg 1.00 0.97 0.98 257
weighted avg 1.00 1.00 1.00 257



******************************
Random Forest
precision recall f1-score support

0 0.99 1.00 0.99 80
1 1.00 1.00 1.00 34
2 1.00 0.83 0.91 6
3 1.00 1.00 1.00 3
4 1.00 0.50 0.67 2
5 1.00 1.00 1.00 56
6 0.98 1.00 0.99 45
7 1.00 1.00 1.00 14
8 1.00 1.00 1.00 17

accuracy 0.99 257
macro avg 1.00 0.93 0.95 257
weighted avg 0.99 0.99 0.99 257



******************************
XGBoost
precision recall f1-score support

0 0.99 1.00 0.99 80
1 1.00 1.00 1.00 34
2 1.00 1.00 1.00 5
3 1.00 0.75 0.86 4
4 1.00 1.00 1.00 1
5 1.00 1.00 1.00 56
6 0.98 1.00 0.99 45
7 1.00 0.93 0.97 15
8 1.00 1.00 1.00 17

accuracy 0.99 257
macro avg 1.00 0.96 0.98 257
weighted avg 0.99 0.99 0.99 257



******************************
Gradient Boosting
precision recall f1-score support

0 0.99 1.00 0.99 80
1 0.97 1.00 0.99 33
2 1.00 1.00 1.00 5
3 1.00 0.75 0.86 4
4 1.00 1.00 1.00 1
5 1.00 1.00 1.00 56
6 0.98 1.00 0.99 45
7 1.00 0.88 0.93 16
8 1.00 1.00 1.00 17

accuracy 0.99 257
macro avg 0.99 0.96 0.97 257
weighted avg 0.99 0.99 0.99 257



******************************
Extra Trees
precision recall f1-score support

0 0.91 0.95 0.93 78
1 0.91 0.91 0.91 34
2 0.60 0.60 0.60 5
3 0.67 1.00 0.80 2
4 0.00 0.00 0.00 3
5 0.91 0.81 0.86 63
6 0.76 0.97 0.85 36
7 0.93 0.93 0.93 14
8 0.94 0.73 0.82 22

accuracy 0.88 257
macro avg 0.74 0.77 0.74 257
weighted avg 0.88 0.88 0.87 257



******************************
Ada Boost
precision recall f1-score support

0 1.00 0.75 0.86 108
1 0.00 0.00 0.00 0
2 0.00 0.00 0.00 0
3 0.00 0.00 0.00 0
4 0.00 0.00 0.00 0
5 0.00 0.00 0.00 0
6 0.98 0.30 0.46 149
7 0.00 0.00 0.00 0
8 0.00 0.00 0.00 0

accuracy 0.49 257
macro avg 0.22 0.12 0.15 257
weighted avg 0.99 0.49 0.63 257



******************************
ANN
precision recall f1-score support

0 0.69 0.71 0.70 79
1 0.44 0.60 0.51 25
2 1.00 0.36 0.53 14
3 1.00 0.20 0.33 15
4 1.00 0.02 0.04 52
5 0.21 0.38 0.27 32
6 0.22 0.48 0.30 21
7 0.00 0.00 0.00 18
8 0.00 0.00 0.00 1

accuracy 0.40 257
macro avg 0.51 0.30 0.30 257
weighted avg 0.62 0.40 0.38 257



******************************
Naive Bayes
precision recall f1-score support

0 0.68 0.63 0.65 87
1 0.26 0.35 0.30 26
2 0.60 0.25 0.35 12
3 1.00 0.25 0.40 12
4 1.00 0.04 0.07 27
5 0.05 0.19 0.08 16
6 0.17 0.40 0.24 20
7 0.00 0.00 0.00 0
8 0.94 0.28 0.43 57

accuracy 0.38 257
macro avg 0.52 0.26 0.28 257
weighted avg 0.66 0.38 0.41 257

Anemia Types Using Ensemble Learning

About the dataset

CBC (Complete Blood Count) data labeled with the diagnosed anemia type. The records were collected from several CBC reports and diagnosed manually.

Data dictionary:

  1. WBC: The count of white blood cells, vital for immune response.
  2. LYMp / LYMn: Lymphocytes as a percentage of white blood cells and as an absolute count, respectively.
  3. NEUTp / NEUTn: Neutrophils as a percentage of white blood cells and as an absolute count, respectively.
  4. RBC: The count of red blood cells, responsible for oxygen transport.
  5. HGB: The amount of hemoglobin in the blood, crucial for oxygen transport.
  6. HCT (Hematocrit): The proportion of blood volume occupied by red blood cells.
  7. MCV (Mean Corpuscular Volume): Average volume of a single red blood cell.
  8. MCH (Mean Corpuscular Hemoglobin): Average amount of hemoglobin per red blood cell.
  9. MCHC (Mean Corpuscular Hemoglobin Concentration): Average concentration of hemoglobin in red blood cells.
  10. PLT: The number of platelets in the blood, involved in blood clotting.
  11. PDW (Platelet Distribution Width): A measure of the variability in platelet size.
  12. PCT (Plateletcrit): The proportion of blood volume occupied by platelets. (The original dataset description glosses PCT as procalcitonin, but in a CBC panel alongside PLT and PDW it is the plateletcrit.)
  13. Diagnosis: Anemia type assigned from the CBC parameters.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import warnings
df = pd.read_csv('/kaggle/input/anemia-types-classification/diagnosed_cbc_data_v4.csv')
df.sample(10)
[df.sample(10) output: ten random rows. Several records repeat the exact values LYMp 25.845, NEUTp 77.511, LYMn 1.88076, NEUTn 5.14094, HCT 46.1526, and PCT 0.26028, echoing the mean-imputation pattern visible in df.describe(); the sampled diagnoses include Healthy, Normocytic normochromic anemia, Normocytic hypochromic anemia, and Iron deficiency anemia.]
df.shape
(1281, 15)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1281 entries, 0 to 1280
Data columns (total 15 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   WBC        1281 non-null   float64
 1   LYMp       1281 non-null   float64
 2   NEUTp      1281 non-null   float64
 3   LYMn       1281 non-null   float64
 4   NEUTn      1281 non-null   float64
 5   RBC        1281 non-null   float64
 6   HGB        1281 non-null   float64
 7   HCT        1281 non-null   float64
 8   MCV        1281 non-null   float64
 9   MCH        1281 non-null   float64
 10  MCHC       1281 non-null   float64
 11  PLT        1281 non-null   float64
 12  PDW        1281 non-null   float64
 13  PCT        1281 non-null   float64
 14  Diagnosis  1281 non-null   object 
dtypes: float64(14), object(1)
memory usage: 150.2+ KB
df.describe()
[df.describe() output: identical to the summary statistics shown in the first notebook, including the impossible extremes (HGB minimum -10, MCV maximum 990, MCH maximum 3117).]
df.dtypes
WBC          float64
LYMp         float64
NEUTp        float64
LYMn         float64
NEUTn        float64
RBC          float64
HGB          float64
HCT          float64
MCV          float64
MCH          float64
MCHC         float64
PLT          float64
PDW          float64
PCT          float64
Diagnosis     object
dtype: object
df.isnull().sum()
WBC          0
LYMp         0
NEUTp        0
LYMn         0
NEUTn        0
RBC          0
HGB          0
HCT          0
MCV          0
MCH          0
MCHC         0
PLT          0
PDW          0
PCT          0
Diagnosis    0
dtype: int64

Data visualization

plt.figure(figsize=(10, 6))
sns.countplot(x='Diagnosis', data=df)
plt.title('Count of Each Diagnosis Category')
plt.xlabel('Diagnosis')
plt.ylabel('Count')
plt.xticks(rotation=90)
plt.show()
columns = ['WBC', 'LYMp', 'NEUTp', 'LYMn', 'NEUTn', 'RBC', 'HGB', 'HCT', 'MCV', 'MCH', 'MCHC', 'PLT', 'PDW', 'PCT']
for column in columns:
    plt.figure(figsize=(8, 5))
    sns.histplot(df[column], kde=True)  
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.xticks(rotation=90)
    plt.show()
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
  with pd.option_context('mode.use_inf_as_na', True):
(The same FutureWarning is printed once per histogram; the repeats are omitted here.)
import seaborn as sns
import matplotlib.pyplot as plt

# List of columns to plot
columns = ['WBC', 'LYMp', 'NEUTp', 'LYMn', 'NEUTn', 'RBC', 'HGB', 'HCT', 'MCV', 'MCH', 'MCHC', 'PLT', 'PDW', 'PCT']

# Loop through each column and create a boxplot
for column in columns:
    plt.figure(figsize=(8, 5))
    sns.boxplot(x=df[column])  # boxplot (not histplot), so there is no kde argument
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.show()
projects on machine learning
machine learning project

Train_test_split

x = df.drop(columns = ['Diagnosis'])
y = df['Diagnosis']
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
y_label = label_encoder.fit_transform(y)
y_label
array([5, 5, 1, ..., 0, 0, 0])
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y_label,test_size =0.2,random_state =43)

Ensemble Learning

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=10)),
    ('gbdt',GradientBoostingClassifier())
]
machine learning projects github
machine learning projects for final year
machine learning projects for students
from sklearn.ensemble import StackingClassifier

clf = StackingClassifier(
    estimators=estimators, 
    final_estimator=LogisticRegression(),
    cv=10
)
clf.fit(x_train,y_train)

StackingClassifier(cv=10,
                   estimators=[('rf',
                                RandomForestClassifier(n_estimators=10,
                                                       random_state=42)),
                               ('knn', KNeighborsClassifier(n_neighbors=10)),
                               ('gbdt', GradientBoostingClassifier())],
                   final_estimator=LogisticRegression())
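
Under the hood, StackingClassifier trains each base estimator with 10-fold cross-validation (the cv=10 argument) and feeds their out-of-fold predictions to the logistic-regression meta-learner, which learns how much to trust each base model.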
y_pred = clf.predict(x_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)
0.980544747081712

XGBoost

import xgboost as xgb
xgb_model = xgb.XGBClassifier()
xgb_model.fit(x_train, y_train)
y_pred = xgb_model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
Accuracy: 0.980544747081712
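
Notably, the stacked ensemble and standalone XGBoost land on the same test accuracy here (about 98.05%), suggesting the boosted trees already capture most of the available signal on their own.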

Conclusion

Hey there, data warriors! Great news: we successfully tackled anemia-type classification with machine learning. We explored the CBC data in depth, balanced the classes with SMOTE, and trained a whole lineup of models, with LightGBM, Random Forest, XGBoost, and the stacked ensemble all landing around 98-100% test accuracy.

To all the aspiring project managers in the field of machine learning, this project highlighted the significance of having clear objectives, well-structured processes, and effective communication. It’s not only about algorithms but also about seamless coordination.


For master's students working on machine learning projects, this is just the beginning. Use these skills to take on even bigger challenges in the future.

Our main goal was to demystify anemia using data, and we absolutely achieved that. Keep your curiosity alive and happy coding!

