Introduction
Welcome to “Diversity Tech Company Best EDA,” an ML project that looks at the connection between data science and diversity in the tech industry.
Also check out the other machine learning projects in this series:
- Machine Learning Project 1: Honda Motor Stocks best Prices analysis
- Machine Learning Project 2: Diversity in Tech Companies Best EDA Analysis
- Machine Learning Project 3: Exploring Indian Cuisine Best Analysis
- Machine Learning Project 4: Exploring Video Game Data
- Machine Learning Project 5: Best Students Performance EDA
- Machine Learning Project 6: Obesity type Best EDA and classification
In this series, we will be exploring an ML project that analyzes diversity metrics in various tech companies using exploratory data analysis (EDA) techniques. Our aim is to uncover insights and patterns that can help shape better practices and policies for creating inclusive work environments.
By utilizing data visualizations, statistical analyses, and machine learning models, we hope to offer a more profound understanding of the current diversity landscape in the tech industry.
Whether you’re a data enthusiast, a tech professional, or simply passionate about diversity and inclusion, this blog provides unique perspectives and practical insights.
Machine Learning Project Source Code
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
Output: /kaggle/input/diversity-in-tech-companies/Diversity in tech companies.csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import keras
import sklearn
import os
Output:
2024-05-15 06:26:29.436227: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-15 06:26:29.436364: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-15 06:26:29.602469: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
df= pd.read_csv('/kaggle/input/diversity-in-tech-companies/Diversity in tech companies.csv')
df.head()
Output:
| | Year | Company | Female % | Male % | % White | % Asian | % Latino | % Black | % Multi | % Other | % Undeclared |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018 | Yahoo! | 37 | 63 | 45 | 44 | 4 | 2 | 2 | 3 | – |
1 | 2018 | Google | 31 | 69 | 53 | 36 | 4 | 3 | 4 | 0 | – |
2 | 2018 | Apple | 32 | 68 | 54 | 21 | 13 | 9 | 3 | 1 | 2 |
3 | 2018 | Cisco | 24 | 76 | 53 | 37 | 5 | 4 | 1 | <1 | – |
4 | 2018 | eBay | 40 | 60 | 50 | 39 | 6 | 3 | 1 | 1 | – |
df.columns
Output:
Index(['Year', 'Company', 'Female %', 'Male %', '% White', '% Asian', '% Latino', '% Black', '% Multi', '% Other', '% Undeclared'], dtype='object')
df.shape
(94, 11)
df.isnull().sum()
Output:
Year 0
Company 0
Female % 0
Male % 0
% White 0
% Asian 0
% Latino 0
% Black 0
% Multi 0
% Other 1
% Undeclared 0
dtype: int64
df.Company.unique()
array(['Yahoo!', 'Google', 'Apple', 'Cisco', 'eBay', 'HP', 'Indiegogo', 'Nvidia', 'Dell', 'Ingram Micro', 'Intel', 'Groupon', 'Amazon', 'Etsy ', 'Microsoft', 'Salesforce', 'Pandora', 'Uber', 'Slack', 'AirBnB ', 'Netflix', 'Yelp', 'Apple (excluding undeclared)'], dtype=object)
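The output above shows two labels with trailing spaces ('Etsy ' and 'AirBnB ') and a second Apple label. A minimal sketch of how such labels could be tidied before grouping, using a small made-up frame rather than the real CSV; the `Company_clean` and `is_alt_apple` column names are assumptions for illustration:

```python
import pandas as pd

# Toy frame that mimics the label problems visible in the output above
toy = pd.DataFrame({
    "Company": ["Etsy ", "AirBnB ", "Apple", "Apple (excluding undeclared)"],
    "Year": [2018, 2018, 2018, 2018],
})

# Strip surrounding whitespace so 'Etsy ' and 'Etsy' count as one company
toy["Company_clean"] = toy["Company"].str.strip()

# Flag the alternate Apple rows instead of silently merging them
toy["is_alt_apple"] = toy["Company_clean"].eq("Apple (excluding undeclared)")

print(toy["Company_clean"].unique())
```

Whether the alternate Apple rows should be dropped or kept depends on the question being asked, so the sketch only flags them.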
df.Year.unique()
Output: array([2018, 2017, 2016, 2015, 2014])
# Create a pivot table that counts the number of times each company appears in each year
pivot_table = df.pivot_table(index='Company', columns='Year', aggfunc='size', fill_value=0)
# Sort the years in descending order
pivot_table = pivot_table.sort_index(axis=1, ascending=False)
# Plot a stacked histogram
pivot_table.plot(kind='bar', stacked=True, figsize=(14, 8))
plt.title('Distribution of Companies Across Different Years (Descending Order)')
plt.xlabel('Company')
plt.ylabel('Count')
plt.legend(title='Year', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
This visual representation displays how data is distributed among different companies over the course of several years. The chart showcases data from 2014 to 2018, with the years listed in descending order.
Each bar represents a company, and the various colored sections within the bars indicate the amount of data for that particular company in a specific year.
Here are some key points to note:
- The stacked columns for each company consist of different colored sections representing data from 2014 (purple), 2015 (red), 2016 (green), 2017 (orange), and 2018 (blue).
- It is evident from the chart that not every company has data for all years. For instance, Netflix and Slack did not have data for certain years.
- Most companies have data available for 2018 (blue section), while the amount of data for 2014 (purple section) is relatively low.
- The height of each bar in the chart indicates the amount of data. This allows us to observe that some companies have data for multiple years, while others only have data for a select few.
In summary, this chart provides a clear overview of how data is distributed among different companies over the years. It helps us quickly understand the completeness and coverage of the data, as some companies have data for all years while others have data for only a portion of the time period.
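The coverage described above can also be read off as a table instead of a chart. A small sketch, assuming a frame with `Company` and `Year` columns like the one in this post; the rows below are made up for illustration and do not reflect the real coverage of any company:

```python
import pandas as pd

# Hypothetical rows: company names are from the post, the year coverage is invented
toy = pd.DataFrame({
    "Company": ["Apple", "Apple", "Apple", "Slack", "Netflix", "Netflix"],
    "Year": [2016, 2017, 2018, 2018, 2017, 2018],
})

# One row per company, one column per year, cell = number of records
coverage = pd.crosstab(toy["Company"], toy["Year"])

# Number of distinct years each company reports
years_covered = (coverage > 0).sum(axis=1)

print(coverage)
print(years_covered.sort_values())
```

Sorting `years_covered` puts the companies with the thinnest history first, which is usually what matters before comparing trends.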
df = df.replace('-', 0)
df = df.replace('<1', 0.5)
percentage_columns = ['% White', '% Asian', '% Latino', '% Black', '% Multi', '% Other', '% Undeclared']
for col in percentage_columns:
    df[col] = pd.to_numeric(df[col])
df_long = pd.melt(df, id_vars=['Year', 'Company'], value_vars=percentage_columns,
                  var_name='Race', value_name='Percentage')
g = sns.catplot(x='Year', y='Percentage', hue='Race', col='Company', col_wrap=3,
                data=df_long, kind='bar', height=4, aspect=1.5)
g.fig.subplots_adjust(top=0.92)
g.fig.suptitle('Racial Distribution of Companies Across Different Years', fontsize=20)
plt.show()
This graph provides a visual representation of the racial makeup of various companies over the years. Each individual chart represents a specific company, and the different colored bars indicate the percentage of each race within that company for a particular year.
Upon examining the data, it is evident that the majority of companies have a significant number of White employees, as indicated by the blue bars. The information presented spans from 2014 to 2018, allowing for a comprehensive view of the racial distribution over time.
Interestingly, one company stands out from the rest in terms of diversity. Nvidia exhibits a more balanced racial distribution among its employees, with a relatively equal proportion of Whites and Asians. This sets Nvidia apart from the other companies, where Whites dominate the workforce.
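One way to check a claim like this numerically is to compute the gap between the White and Asian shares per company. The sketch below uses only a handful of 2018 rows that appear in the tables printed in this post, so it is a partial check and not a result for the full dataset; the `gap` column name is an assumption:

```python
import pandas as pd

# 2018 values copied from the table output shown in this post
rows = [
    ("Yahoo!", 45, 44), ("Apple", 54, 21), ("Cisco", 53, 37),
    ("eBay", 50, 39), ("Nvidia", 37, 45),
]
sample = pd.DataFrame(rows, columns=["Company", "% White", "% Asian"])

# Absolute gap between the White and Asian shares, in percentage points
sample["gap"] = (sample["% White"] - sample["% Asian"]).abs()

print(sample.sort_values("gap"))
```

On these five rows, Yahoo! and Nvidia have the smallest gaps, and Nvidia is the only one where the Asian share is the larger of the two.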
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = df.replace('-', 0)
df = df.replace('<1', 0.5)
gender_columns = ['Female %', 'Male %']
for col in gender_columns:
    df[col] = pd.to_numeric(df[col])
df_long_gender = pd.melt(df, id_vars=['Year', 'Company'], value_vars=gender_columns,
                         var_name='Gender', value_name='Percentage')
g = sns.catplot(x='Year', y='Percentage', hue='Gender', col='Company', col_wrap=3,
                data=df_long_gender, kind='bar', height=4, aspect=1.5, margin_titles=True)
g.fig.subplots_adjust(top=0.92, hspace=0.4, wspace=0.3)
g.fig.suptitle('Gender Distribution of Companies Across Different Years', fontsize=16)
plt.show()
This chart shows the gender ratio of each company by year. Each sub-chart represents one company, and the colored bars give the female and male percentages for a particular year.
Overall observations:
- Most companies have a significantly higher percentage of male employees (Male %, orange bars) than female employees (Female %, blue bars).
- Each company is shown only for the years in which it has data, within the 2014 to 2018 range of the dataset.
Although the exact ratio varies from company to company, the male-majority pattern is consistent across most of them.
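The same observation can be expressed as a number. A minimal sketch using the 2018 gender shares that appear in the `df.head(10)` output later in this post; it covers only those rows, and the `gap` column and `male_majority_share` variable are assumptions for illustration:

```python
import pandas as pd

# 2018 gender shares copied from the df.head(10) table in this post
rows = [
    ("Yahoo!", 37, 63), ("Apple", 32, 68), ("Cisco", 24, 76),
    ("eBay", 40, 60), ("HP", 37, 63), ("Indiegogo", 50, 50),
    ("Nvidia", 17, 83), ("Dell", 28, 72), ("Ingram Micro", 31, 69),
]
gender = pd.DataFrame(rows, columns=["Company", "Female %", "Male %"])

# Positive gap = more men than women, in percentage points
gender["gap"] = gender["Male %"] - gender["Female %"]

# Fraction of these rows where men outnumber women
male_majority_share = (gender["gap"] > 0).mean()

print(gender.sort_values("gap", ascending=False))
print(f"Rows with a male majority: {male_majority_share:.0%}")
```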
Data Cleaning & Preparation EDA
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import pandas as pd
df = pd.read_csv('/kaggle/input/diversity-in-tech-companies/Diversity in tech companies.csv')
df.head(5)
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
| | Year | Company | Female % | Male % | % White | % Asian | % Latino | % Black | % Multi | % Other | % Undeclared |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018 | Yahoo! | 37 | 63 | 45 | 44 | 4 | 2 | 2 | 3 | – |
1 | 2018 | Google | 31 | 69 | 53 | 36 | 4 | 3 | 4 | 0 | – |
2 | 2018 | Apple | 32 | 68 | 54 | 21 | 13 | 9 | 3 | 1 | 2 |
3 | 2018 | Cisco | 24 | 76 | 53 | 37 | 5 | 4 | 1 | <1 | – |
4 | 2018 | eBay | 40 | 60 | 50 | 39 | 6 | 3 | 1 | 1 | – |
# 1. Check all columns
df.columns
Index(['Year', 'Company', 'Female %', 'Male %', '% White', '% Asian', '% Latino', '% Black', '% Multi', '% Other', '% Undeclared'], dtype='object')
# 2. Check for missing values
print(df.isnull().sum())
Year 0
Company 0
Female % 0
Male % 0
% White 0
% Asian 0
% Latino 0
% Black 0
% Multi 0
% Other 1
% Undeclared 0
dtype: int64
# 3. Summary statistics for numerical columns
print(df.describe())
| | Year | Female % | Male % | % White |
---|---|---|---|---|
count | 94.000000 | 94.000000 | 94.000000 | 94.000000 |
mean | 2016.106383 | 35.234043 | 64.744681 | 59.393617 |
std | 1.432856 | 9.446426 | 9.464065 | 9.897559 |
min | 2014.000000 | 16.000000 | 46.000000 | 37.000000 |
25% | 2015.000000 | 29.000000 | 57.250000 | 53.000000 |
50% | 2016.000000 | 33.000000 | 67.000000 | 60.000000 |
75% | 2017.000000 | 42.750000 | 71.000000 | 66.500000 |
max | 2018.000000 | 54.000000 | 84.000000 | 79.000000 |
# 4. Distribution of each numerical variable
import seaborn as sns
import matplotlib.pyplot as plt
df.hist(figsize=(8, 8))
plt.tight_layout()
plt.show()
# 5. Distribution of categorical variables
sns.countplot(x='Company', data=df)
plt.xticks(rotation=90)
plt.show()
# 7. Visualize relationships
sns.pairplot(df)
plt.show()
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
# 8. Checking data type
df.dtypes
Year int64
Company object
Female % int64
Male % int64
% White int64
% Asian object
% Latino object
% Black object
% Multi object
% Other object
% Undeclared object
dtype: object
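percentage_columns = ['Female %', 'Male %', '% White', '% Asian', '% Latino', '% Black', '% Multi', '% Other', '% Undeclared']
for col in percentage_columns:
    if col in ['Female %', 'Male %', '% White']:
        # These columns are already int64, so the conversion leaves them unchanged
        df[col] = pd.to_numeric(df[col].replace('<', '').replace('>', '').replace('-', '0'))
    else:
        # Series.replace matches whole cell values, not substrings, so '<1' is not
        # stripped here; errors='coerce' turns it (and any other non-numeric cell)
        # into NaN, and fillna(0) then sets it to 0
        df[col] = pd.to_numeric(df[col].replace('<', '').replace('>', '').replace('-', '0').replace('', '0'), errors='coerce').fillna(0).astype(int)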
df.head(10)
| | Year | Company | Female % | Male % | % White | % Asian | % Latino | % Black | % Multi | % Other | % Undeclared |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018 | Yahoo! | 37 | 63 | 45 | 44 | 4 | 2 | 2 | 3 | 0 |
1 | 2018 | Google | 31 | 69 | 53 | 36 | 4 | 3 | 4 | 0 | 0 |
2 | 2018 | Apple | 32 | 68 | 54 | 21 | 13 | 9 | 3 | 1 | 2 |
3 | 2018 | Cisco | 24 | 76 | 53 | 37 | 5 | 4 | 1 | 0 | 0 |
4 | 2018 | eBay | 40 | 60 | 50 | 39 | 6 | 3 | 1 | 1 | 0 |
5 | 2018 | HP | 37 | 63 | 73 | 12 | 8 | 4 | 2 | 0 | 0 |
6 | 2018 | Indiegogo | 50 | 50 | 58 | 28 | 7 | 4 | 0 | 3 | 0 |
7 | 2018 | Nvidia | 17 | 83 | 37 | 45 | 3 | 1 | 14 | 0 | 0 |
8 | 2018 | Dell | 28 | 72 | 69 | 9 | 11 | 10 | 0 | 1 | 0 |
9 | 2018 | Ingram Micro | 31 | 69 | 52 | 14 | 19 | 14 | 1 | 0 | 0 |
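In the table above, Cisco's '<1' in % Other has become 0. That follows from how `Series.replace` works: it compares whole cell values, while `Series.str.replace` works on substrings. The sketch below shows the difference on a made-up series and offers one possible alternative cleaner; `clean_percentage` is a hypothetical helper, and mapping '<1' to 0.5 simply mirrors the stand-in value used in the first notebook in this post:

```python
import pandas as pd

raw = pd.Series(["3", "<1", "-", None, " 12 "])

# Series.replace compares whole cell values, so '<1' is left untouched
whole_value = raw.replace("<", "")

# Series.str.replace works on substrings, so the '<' is stripped
substring = raw.str.replace("<", "", regex=False)

def clean_percentage(series):
    """Convert strings such as '<1', '-' or ' 12 ' to floats (NaN if not numeric)."""
    text = series.astype(str).str.strip()
    # '<1' becomes 0.5 (an assumption; any value between 0 and 1 is defensible)
    text = text.replace("<1", "0.5")
    # Dashes and missing cells are not numbers, so they end up as NaN
    return pd.to_numeric(text, errors="coerce")

cleaned = clean_percentage(raw)
print(cleaned.tolist())
```

Keeping dashes as NaN instead of 0 means "not reported" is not confused with "zero percent", at the cost of having to handle missing values in later plots.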
# Group by company and year
grouped = df.groupby(['Company', 'Year']).mean().reset_index()
# Set a color palette
palette = sns.color_palette("husl", len(grouped['Company'].unique()))
# Plotting trends for each diversity metric
for col in percentage_columns:
    plt.figure(figsize=(12, 6))
    for i, company in enumerate(grouped['Company'].unique()):
        company_data = grouped[grouped['Company'] == company]
        plt.plot(company_data['Year'], company_data[col], marker='o', label=company, color=palette[i])
    plt.title(f'Trend in {col} Across Companies Over Years')
    plt.xlabel('Year')
    plt.ylabel(f'{col}')
    plt.xticks(grouped['Year'].unique())
    plt.legend(bbox_to_anchor=(1, 1), loc='upper left')
    plt.grid(True)
    plt.show()
# Company Comparison for female population
plt.figure(figsize=(12, 6))
sns.barplot(x='Company', y='Female %', data=df, hue='Year')
plt.title('Female % Across Companies')
plt.xlabel('Company')
plt.ylabel('Female %')
plt.xticks(rotation=45)
plt.legend(title='Year')
plt.show()
# Company Comparison for black population
plt.figure(figsize=(12, 6))
sns.barplot(x='Company', y='% Black', data=df, hue='Year')
plt.title('Black % Across Companies')
plt.xlabel('Company')
plt.ylabel('% Black')
plt.xticks(rotation=45)
plt.legend(title='Year')
plt.show()
# Overall Diversity Distribution
plt.figure(figsize=(12, 6))
sns.boxplot(data=df[percentage_columns])
plt.title('Overall Diversity Distribution')
plt.ylabel('Percentage')
plt.xticks(rotation=45)
plt.show()
# Correlation Analysis
correlation_matrix = df[percentage_columns].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', square=True)
plt.title('Correlation Matrix of Diversity Metrics')
plt.show()
# Ranking Companies
# Note: the gender shares and the ethnicity shares each sum to roughly 100,
# so this total is close to 200 for most rows and separates companies only weakly
df['Total Diversity'] = df[['Female %', 'Male %', '% White', '% Asian', '% Latino', '% Black', '% Multi', '% Other', '% Undeclared']].sum(axis=1)
ranked_df = df.groupby('Company')['Total Diversity'].mean().sort_values(ascending=False).reset_index()
plt.figure(figsize=(12, 6))
sns.barplot(x='Company', y='Total Diversity', data=ranked_df)
plt.title('Ranking of Companies Based on Total Diversity')
plt.xlabel('Company')
plt.ylabel('Total Diversity')
plt.xticks(rotation=45)
plt.show()
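Summing percentage columns gives nearly the same total for every company, so it may help to compare with a measure that rewards an even mix. The sketch below swaps in the Gini-Simpson index (1 minus the sum of squared shares), which is not what the ranking code above computes. It uses two 2018 rows copied from the `df.head(10)` table; `simpson_index` and the `simpson` column are hypothetical names:

```python
import pandas as pd

ethnicity_cols = ["% White", "% Asian", "% Latino", "% Black", "% Multi", "% Other"]

# Two 2018 rows copied from the df.head(10) table in this post
sample = pd.DataFrame(
    [[37, 45, 3, 1, 14, 0], [73, 12, 8, 4, 2, 0]],
    index=["Nvidia", "HP"],
    columns=ethnicity_cols,
)

def simpson_index(row):
    """Gini-Simpson index: 0 for a single group, higher for a more even mix."""
    shares = row / row.sum()  # normalize so the shares sum to 1
    return 1.0 - float((shares ** 2).sum())

sample["simpson"] = sample[ethnicity_cols].apply(simpson_index, axis=1)
print(sample["simpson"].sort_values(ascending=False))
```

Normalizing by the row sum keeps the index comparable even when rounding makes a row add up to 99 or 101.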
# Company comparison for FAANG companies (Facebook, Amazon, Apple, Netflix, and Google)
# Facebook does not appear in df.Company.unique(), so only the other four are selected
selected_companies = ['Amazon', 'Apple', 'Netflix', 'Google']
selected_df = df[df['Company'].isin(selected_companies)]
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', y='Female %', hue='Company', data=selected_df)
plt.title('Trend in Female % Among FAANG Companies (2014-2018)')
plt.xlabel('Year')
plt.xticks(grouped['Year'].unique())  # Set x-axis ticks to be the unique years
plt.ylabel('Female %')
plt.legend(title='Company')
plt.grid(True)
plt.show()
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1075: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
data_subset = grouped_data.get_group(pd_key)
# Melt the DataFrame to have a single 'Ethnicity' column
# Restrict to the selected companies and the ethnicity columns so the plot matches its title
ethnicity_columns = ['% White', '% Asian', '% Latino', '% Black', '% Multi', '% Other', '% Undeclared']
melted_df = selected_df.melt(id_vars=['Year', 'Company'], value_vars=ethnicity_columns, var_name='Ethnicity', value_name='Percentage')
# Plotting
plt.figure(figsize=(12, 6))
sns.lineplot(x='Year', y='Percentage', hue='Ethnicity', data=melted_df, marker='o')
plt.title('Rise of Multiple Ethnicities in FAANG Companies (2014-2018)')
plt.xlabel('Year')
plt.xticks(grouped['Year'].unique())  # Set x-axis ticks to be the unique years
plt.ylabel('Percentage')
plt.grid(True)
plt.show()
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
/opt/conda/lib/python3.10/site-packages/seaborn/_oldcore.py:1075: FutureWarning: When grouping with a length-1 list-like, you will need to pass a length-1 tuple to get_group in a future version of pandas. Pass `(name,)` instead of `name` to silence this warning.
data_subset = grouped_data.get_group(pd_key)
Shifting Paradigms: Diversity Trends in Tech
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv("/kaggle/input/diversity-in-tech-companies/Diversity in tech companies.csv")
data.head()
| | Year | Company | Female % | Male % | % White | % Asian | % Latino | % Black | % Multi | % Other | % Undeclared |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2018 | Yahoo! | 37 | 63 | 45 | 44 | 4 | 2 | 2 | 3 | – |
1 | 2018 | Google | 31 | 69 | 53 | 36 | 4 | 3 | 4 | 0 | – |
2 | 2018 | Apple | 32 | 68 | 54 | 21 | 13 | 9 | 3 | 1 | 2 |
3 | 2018 | Cisco | 24 | 76 | 53 | 37 | 5 | 4 | 1 | <1 | – |
4 | 2018 | eBay | 40 | 60 | 50 | 39 | 6 | 3 | 1 | 1 | – |
plt.figure(figsize=(10, 6))
sns.lineplot(data=data, x='Year', y='Female %', hue='Company', marker='o')
plt.title('Gender Distribution Over the Years')
plt.xlabel('Year')
plt.ylabel('Female %')
plt.legend(title='Company', bbox_to_anchor=(1, 1))
plt.show()
plt.figure(figsize=(10, 6))
sns.boxplot(data=data, x='Company', y='% White')
plt.title('Ethnic Diversity Comparison Between Companies')
plt.xlabel('Company')
plt.ylabel('% White')
plt.xticks(rotation=90)
plt.show()
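ethnic_groups = ['% Asian', '% Latino', '% Black', '% Multi', '% Other', '% Undeclared']
# These columns are read from the CSV as strings (they contain dashes and '<1'),
# so coerce them to numbers first; non-numeric cells become NaN and are skipped
data[ethnic_groups] = data[ethnic_groups].apply(pd.to_numeric, errors='coerce')
plt.figure(figsize=(12, 8))
for group in ethnic_groups:
    sns.lineplot(data=data, x='Year', y=group, label=group)
plt.title('Ethnic Diversity Distribution Over the Years')
plt.xlabel('Year')
plt.ylabel('Percentage')
plt.legend(title='Ethnicity')
plt.show()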
plt.figure(figsize=(10, 6))
sns.barplot(data=data, x='Company', y='Female %', hue='Year')
plt.title('Gender Distribution by Company')
plt.xlabel('Company')
plt.ylabel('Female %')
plt.legend(title='Year', bbox_to_anchor=(1, 1))
plt.xticks(rotation=90)
plt.show()
What is the best automated EDA?
Automated EDA tools are your data’s best pals, making sense of numbers effortlessly. Picture a genius friend who uncovers patterns, spots outliers, and transforms your data into a compelling story.
Here’s the scoop:
- Pandas Profiling (now distributed as ydata-profiling): A long-standing tool that gives your data a thorough check-up. It delivers a detailed report with summaries and visualizations, perfect for a quick data overview.
- Sweetviz: This tool is as sweet as its name suggests. It creates stunning visualizations with ease, like a data Instagram feed highlighting key insights in vibrant colors.
- D-Tale: Imagine your data coming alive with this interactive tool. It’s like having a data whisperer that lets you explore, filter, and plot data without coding. Ideal for non-coders or those seeking a coding break.
- AutoViz: Need speed? AutoViz is your solution. It instantly visualizes datasets with a single line of code, no need for endless adjustments. Perfect for quick insights and overviews.
- Exploratory: The Swiss Army knife of EDA tools, offering simple summaries to advanced ML models in a user-friendly interface. Collaborative features allow you to share findings effortlessly.
These tools each bring something unique to the table, like different ice cream flavors. Give them a try and discover which one makes you say, “Wow, this is amazing!”
Hope that helps! If you need more deets, just holler! 🚀
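To get a feel for what these tools automate, here is a tiny hand-rolled profile in plain pandas. It is not how any of the tools above work internally, just a rough sketch of the kind of per-column summary they produce; `quick_profile` is a hypothetical helper and the toy values are made up:

```python
import pandas as pd

def quick_profile(frame: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, number of missing cells, number of distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "distinct": frame.nunique(),
    })

# Made-up rows, only to exercise the helper
toy = pd.DataFrame({
    "Year": [2017, 2018, 2018],
    "Company": ["Slack", "Slack", None],
    "Female %": [43.0, 44.0, None],
})

profile = quick_profile(toy)
print(profile)
```

The dedicated tools add histograms, correlations, and warnings on top of a summary like this, which is where most of their value lies.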
What is EDA analysis in machine learning?
Exploratory Data Analysis (EDA) is similar to being a detective for data. Instead of jumping straight into complex machine learning tasks, it’s important to familiarize yourself with the data first.
It involves exploring and comprehending your dataset, just like getting to know someone new. Here’s how it unfolds:
EDA Step | What it’s Like |
---|---|
Peek at Your Data | Opening a mystery box to see what’s inside. |
Summarize It | Reading the blurb on the back of a book to get the gist of the story. |
Draw Some Pictures | Doodling a map to see the lay of the land. |
Spot Connections | Discovering hidden friendships in a TV show. |
Clean It Up | Tidying up your room before a big game night with friends. |
Test Ideas | Making a guess about a movie plot and watching to see if you’re right. |
See? Easy peasy, just like planning a fun get-together. Each step gets you closer to understanding your data and making it work for you. 🎉📊🔍
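The steps in the table map onto a handful of pandas calls. A minimal sketch on a made-up frame; the company names "A" and "B" and all numbers are invented, and the plotting step is replaced by a text count so the example has no plotting dependency:

```python
import pandas as pd

# Made-up numbers, only here to make the steps concrete
toy = pd.DataFrame({
    "Year": [2016, 2017, 2018, 2016, 2017, 2018],
    "Company": ["A", "A", "A", "B", "B", "B"],
    "Female %": [30, 31, 33, 45, None, 47],
})

print(toy.head())                           # Peek at your data
print(toy.describe())                       # Summarize it
print(toy["Company"].value_counts())        # Text stand-in for "draw some pictures"
print(toy[["Year", "Female %"]].corr())     # Spot connections

clean = toy.dropna(subset=["Female %"])     # Clean it up: drop rows without a value

# Test an idea: does company B report a higher female share than company A?
by_company = clean.groupby("Company")["Female %"].mean()
print(by_company)
```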
Do a project on machine learning?
Welcome to the Movie Magic Predictor project! 🍿🎬 Let’s explore the fascinating world of machine learning to forecast the success of upcoming movies. Picture having a magical crystal ball that reveals which movies will be hits and which ones might not do so well!
Objective
Our aim is straightforward: to develop a machine learning model that can anticipate the box office performance of movies by considering various factors like genre, cast, budget, and more.
Through analyzing past movie data and deriving valuable insights, we strive to create a predictive model that can assist filmmakers, studios, and investors in making well-informed choices.
Dataset
We will be delving into a comprehensive dataset that contains details about thousands of movies released in recent decades.
This dataset comprises information such as movie title, release date, genre, cast and crew, budget, box office earnings, and critical reception.
Code
# Importing necessary libraries
import pandas as pd
# Step 1: Data Collection
# Load the movie dataset
movie_data = pd.read_csv('movie_dataset.csv')
# Display the first few rows of the dataset to understand its structure
print("First few rows of the dataset:")
print(movie_data.head())
# Step 2: Exploratory Data Analysis (EDA)
# Get basic information about the dataset
print("\nDataset information:")
print(movie_data.info())
# Summary statistics of numerical features
print("\nSummary statistics of numerical features:")
print(movie_data.describe())
# Summary of categorical features
print("\nSummary of categorical features:")
print(movie_data.describe(include=['object']))
# Check for missing values
print("\nMissing values in the dataset:")
print(movie_data.isnull().sum())
# Visualize the distribution of target variable (box office earnings)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.hist(movie_data['BoxOffice'], bins=20, color='skyblue', edgecolor='black')
plt.title('Distribution of Box Office Earnings')
plt.xlabel('Box Office Earnings')
plt.ylabel('Frequency')
plt.show()
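The code above stops at EDA. A first modeling step could be a one-feature baseline that predicts earnings from budget alone. The sketch below uses made-up numbers instead of `movie_dataset.csv`, fits a straight line with `numpy.polyfit`, and is only meant as a starting point; a `Budget` feature and the `predict_box_office` helper are assumptions for illustration:

```python
import numpy as np

# Made-up budgets and earnings (in millions); not taken from any real dataset
budget = np.array([10.0, 20.0, 30.0, 40.0])
box_office = 2.0 * budget + 5.0

# Fit box_office ≈ slope * budget + intercept by least squares
slope, intercept = np.polyfit(budget, box_office, deg=1)

def predict_box_office(new_budget):
    """Baseline prediction from budget alone."""
    return slope * new_budget + intercept

print(predict_box_office(25.0))
```

A baseline like this gives later models (with genre, cast, and other features) something concrete to beat.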
Which ML project is best?
HealthSage:
Objective: Deciphering clues from medical data to predict whether someone’s health will improve or decline.
Impact: Revolutionizing the way doctors approach patient care, providing insights to diagnose and treat conditions more effectively.
FinWise:
Objective: Uncovering financial mysteries to anticipate market trends and identify fraudulent activities.
Impact: Serving as a financial guardian, safeguarding investors and promoting fairness in the financial world.
EcoVision:
Objective: Monitoring the planet through advanced technology to track environmental changes and prevent degradation.
Impact: Acting as a protector of Earth, aiding in the preservation of natural resources and combating pollution.
EduSmart:
Objective: Tailoring education to individual learning styles, offering personalized lessons for academic success.
Impact: Acting as a personal mentor, guiding students to excel in their studies and fostering a love for learning.
JustiScan:
Objective: Investigating social media and news for injustices, highlighting disparities and advocating for fairness.
Impact: Serving as a champion of equality, bringing attention to issues that require attention for the betterment of society.
Each of these initiatives plays a crucial role in addressing significant challenges and making a positive impact on the world. Which one resonates with you the most?
Conclusion
In conclusion, our exploration of diversity within tech companies has been eye-opening and empowering. Through the lens of exploratory data analysis (EDA), we have delved deep into the numbers, uncovering valuable insights that shed light on the landscape of diversity in the tech industry.
From demographic trends over time to company-by-company comparisons, our EDA journey has revealed both successes and challenges. Some companies report close to gender parity or a more even ethnic mix, while in most the workforce remains majority male and majority White.
However, our work does not end here. With the power of data in our hands, we have the opportunity to drive meaningful change. By continuing to advocate for diversity, equity, and inclusion in the tech sector, we can shape a future where everyone is included.
As we conclude this chapter of our machine learning project, let’s carry forward the lessons we have learned and the insights we have gained. Let’s strive to build a tech industry that truly represents the diverse tapestry of humanity, where diversity is not just a buzzword but a fundamental pillar of success.
Together, we can create a future where diversity is not an afterthought, but the driving force behind innovation, creativity, and progress. Thank you for joining us on this journey, and here’s to a brighter, more inclusive future for all in the tech industry.
Keep coding, keep advocating, and keep pushing for change. The future is ours to shape.