
Welcome to SalesSight: Unveiling E-commerce Trends! Our Data Analysis project provides a comprehensive, insightful look at the ever-evolving landscape of online retail. With a focus on data-driven decision-making, SalesSight equips businesses with the knowledge to navigate the dynamic e-commerce world effectively. We examine sales patterns, consumer behavior, and market trends using established data analysis techniques to surface insights that drive growth and profitability.

By harnessing these analytics, SalesSight helps you stay ahead of the curve, optimize strategies, and make informed choices based on solid, data-driven evidence. Join us in rethinking how you perceive and capitalize on e-commerce opportunities. Elevate your business with SalesSight – where foresight meets success through the power of data!

1. Data Exploration and Understanding

1.1 Load the sales_data.csv dataset

This section is dedicated to importing necessary libraries and loading the dataset. It’s the first step in any data analysis project.

In [1]:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('/kaggle/input/sales-orders/sales_data.csv')

# Display the first few rows of the dataset
df.head()

Out[1]:

   Order Date           Order ID  Product                   Product_ean   catégorie     Purchase Address                       Quantity Ordered  Price Each  Cost price  turnover  margin
0  2019-01-22 21:25:00  141234    iPhone                    5.638009e+12  Vêtements     944 Walnut St, Boston, MA 02215        1                 700.00      231.0000    700.00    469.0000
1  2019-01-28 14:15:00  141235    Lightning Charging Cable  5.563320e+12  Alimentation  185 Maple St, Portland, OR 97035       1                 14.95       7.4750      14.95     7.4750
2  2019-01-17 13:33:00  141236    Wired Headphones          2.113973e+12  Vêtements     538 Adams St, San Francisco, CA 94016  2                 11.99       5.9950      23.98     11.9900
3  2019-01-05 20:33:00  141237    27in FHD Monitor          3.069157e+12  Sports        738 10th St, Los Angeles, CA 90001     1                 149.99      97.4935     149.99    52.4965
4  2019-01-25 11:59:00  141238    Wired Headphones          9.692681e+12  Électronique  387 10th St, Austin, TX 73301          1                 11.99       5.9950      11.99     5.9950

1.2 Performing preliminary analysis

Here, we’re getting a quick overview of the dataset. We’ll check its structure, get some basic statistics, and identify any missing values.

In [2]:

# Get a summary of the dataset
df.info()

# Descriptive statistics for numerical columns
df.describe()

# Check for missing values
df.isnull().sum()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185950 entries, 0 to 185949
Data columns (total 11 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Order Date        185950 non-null  object 
 1   Order ID          185950 non-null  int64  
 2   Product           185950 non-null  object 
 3   Product_ean       185950 non-null  float64
 4   catégorie         185950 non-null  object 
 5   Purchase Address  185950 non-null  object 
 6   Quantity Ordered  185950 non-null  int64  
 7   Price Each        185950 non-null  float64
 8   Cost price        185950 non-null  float64
 9   turnover          185950 non-null  float64
 10  margin            185950 non-null  float64
dtypes: float64(5), int64(2), object(4)
memory usage: 15.6+ MB

Out[2]:

Order Date          0
Order ID            0
Product             0
Product_ean         0
catรฉgorie           0
Purchase Address    0
Quantity Ordered    0
Price Each          0
Cost price          0
turnover            0
margin              0
dtype: int64
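
Note that a notebook renders only the last expression of a cell, so the df.describe() call above produces no visible output. A small optional follow-up (not part of the original cells) makes all three checks visible explicitly:

# Display each check explicitly; info() prints directly, the other two are rendered via display()
from IPython.display import display

df.info()                    # schema and non-null counts (prints to stdout)
display(df.describe())       # descriptive statistics for numeric columns
display(df.isnull().sum())   # per-column count of missing values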

1.3 Understanding the distribution of key columns

Visualizing the distribution of key columns helps in understanding the spread and central tendencies of the data.

In [3]:

import plotly.express as px
import plotly.graph_objects as go

In [4]:

fig1 = px.histogram(df, x='Quantity Ordered', marginal='box', nbins=50,
                    title='Distribution of Quantity Ordered')
fig1.show()

[Figure: histogram of Quantity Ordered with a box-plot marginal; x-axis: Quantity Ordered, y-axis: count]

In [5]:

fig2 = px.histogram(df, x='Price Each', marginal='box', nbins=50,
                    title='Distribution of Price Each')
fig2.show()

[Figure: histogram of Price Each with a box-plot marginal; x-axis: Price Each, y-axis: count]
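
Because prices are heavily right-skewed, a log-scaled count axis can make the long tail of higher-priced products easier to see. A small optional sketch (not part of the original cells):

# Same histogram with a log-scaled count axis to expose the long tail
fig3 = px.histogram(df, x='Price Each', nbins=50, log_y=True,
                    title='Distribution of Price Each (log-scaled counts)')
fig3.show()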

1.4 Identify patterns, correlations, or anomalies

A correlation matrix across the numerical columns helps reveal relationships between variables and can flag anomalies worth a closer look.

In [6]:

# Correlation matrix (numeric columns only)
correlation_matrix = df.corr(numeric_only=True)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

2. Data Preprocessing

2.1 Convert the ‘Order Date’ column to a date-time format

Converting ‘Order Date’ to a datetime format is essential for time series analysis. It allows for easier data manipulation and aggregation based on time periods.

In [7]:

# (Note: This step was already done in the previous phase, but for the sake of continuity, I'll include it again.)

# Convert 'Order Date' to datetime format
df['Order Date'] = pd.to_datetime(df['Order Date'])

2.2 Aggregate sales data on a monthly basis

Aggregating data on a monthly basis helps in understanding monthly sales trends and patterns.

In [8]:

# (Note: This step was also covered in the previous phase. However, I'll provide an alternative approach to aggregate the data.)

# Extract month and year from 'Order Date'
df['Month'] = df['Order Date'].dt.month
df['Year'] = df['Order Date'].dt.year

# Aggregate sales data by month
monthly_sales_data = df.groupby(['Year', 'Month']).agg({'Price Each': 'sum', 'Quantity Ordered': 'sum'}).reset_index()
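
One caveat with this aggregation: summing ‘Price Each’ adds up unit prices and ignores quantities, so multi-unit order lines are under-counted. If monthly revenue is the quantity of interest, the dataset’s existing turnover column (or Quantity Ordered × Price Each) can be summed instead; a minimal alternative sketch, not the aggregation used in the rest of this analysis:

# Alternative monthly aggregation based on revenue rather than summed unit prices
monthly_revenue = df.groupby(['Year', 'Month']).agg(
    {'turnover': 'sum', 'Quantity Ordered': 'sum'}).reset_index()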

2.3 Create new features

Creating new features can provide additional insights. Here, we’re calculating the total sales for each order.

In [9]:

# Create 'Total Sales' feature
df['Total Sales'] = df['Quantity Ordered'] * df['Price Each']
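
Since the dataset already ships a turnover column that appears to equal Quantity Ordered × Price Each, a quick optional consistency check confirms the new feature agrees with it:

# Largest absolute difference between the derived feature and the existing column
print((df['Total Sales'] - df['turnover']).abs().max())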

3. Data Analysis

3.1 Plot the aggregated sales data over time to visualize the trend

Visualizing aggregated sales over time helps in identifying trends, seasonality, and any anomalies in the data.

In [10]:

# Using Plotly for interactive visualization
fig4 = go.Figure(data=go.Scatter(x=pd.to_datetime(monthly_sales_data['Year'].astype(str) + '-' + monthly_sales_data['Month'].astype(str) + '-01'),
                                 y=monthly_sales_data['Price Each'],
                                 mode='lines+markers'))

fig4.update_layout(title='Aggregated Monthly Sales Over Time',
                   xaxis_title='Time',
                   yaxis_title='Total Sales',
                   xaxis=dict(showgrid=True, showline=True, showticklabels=True),
                   yaxis=dict(showgrid=True, showline=True, showticklabels=True))

fig4.show()

[Figure: line chart of aggregated monthly sales, Jan 2019 to Jan 2020; x-axis: Time, y-axis: Total Sales]

3.2 Identify significant seasonality or cyclical patterns in sales data

Decomposing the time series data helps in understanding its underlying patterns like trend, seasonality, and residuals.

In [11]:

# Using seasonal decomposition to identify patterns
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose the series with a 6-month seasonal period (a full 12-month cycle
# would require at least two years of monthly data)
result = seasonal_decompose(monthly_sales_data['Price Each'], model='multiplicative', period=6)

# Plot the decomposed components
result.plot()
plt.show()
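
A note on the period choice: seasonal_decompose needs at least two complete cycles, so a 12-month period would require at least 24 monthly observations, while this series covers roughly 13 months. As an optional cross-check (an addition, not part of the original cells), an additive model can be fitted over the same 6-month period; it is a reasonable comparison when the seasonal swing does not scale with the overall level of sales:

# Optional cross-check with an additive decomposition over the same period
result_additive = seasonal_decompose(monthly_sales_data['Price Each'],
                                     model='additive', period=6)
result_additive.plot()
plt.show()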

3.3 Use statistical methods to identify peak and low sales periods

Identifying peak and low sales periods can guide marketing and promotional strategies.

In [12]:

# Identify months with peak sales
peak_sales = monthly_sales_data[monthly_sales_data['Price Each'] == monthly_sales_data['Price Each'].max()]

# Identify months with lowest sales
low_sales = monthly_sales_data[monthly_sales_data['Price Each'] == monthly_sales_data['Price Each'].min()]

print(f"Peak Sales were in: {peak_sales['Month'].values[0]}, {peak_sales['Year'].values[0]}")
print(f"Lowest Sales were in: {low_sales['Month'].values[0]}, {low_sales['Year'].values[0]}")
Peak Sales were in: 12, 2019
Lowest Sales were in: 1, 2020
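
Max and min flag only a single month each. A hedged alternative sketch of a more statistical cut, flagging every month more than one standard deviation away from the mean (the threshold is an illustrative assumption):

# Flag unusually strong and weak months using a one-standard-deviation threshold
mean_sales = monthly_sales_data['Price Each'].mean()
std_sales = monthly_sales_data['Price Each'].std()

high_months = monthly_sales_data[monthly_sales_data['Price Each'] > mean_sales + std_sales]
low_months = monthly_sales_data[monthly_sales_data['Price Each'] < mean_sales - std_sales]

print(high_months[['Year', 'Month', 'Price Each']])
print(low_months[['Year', 'Month', 'Price Each']])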

3.4 Monthly Sales Growth Rate

This visualization will help in understanding the dynamics of sales growth and identifying periods of rapid growth or decline.

In [13]:

# Calculate the month-over-month growth rate
monthly_sales_data['Sales Growth'] = monthly_sales_data['Price Each'].pct_change() * 100

# Plotting the growth rate
plt.figure(figsize=(12, 6))
sns.lineplot(x=pd.to_datetime(monthly_sales_data['Year'].astype(str) + '-' + monthly_sales_data['Month'].astype(str) + '-01'),
             y='Sales Growth', data=monthly_sales_data)
plt.title('Monthly Sales Growth Rate')
plt.ylabel('Growth Rate (%)')
plt.xlabel('Date')
plt.grid(True)
plt.show()
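
To read the chart’s extremes numerically, the largest month-over-month swings can be printed directly (a small optional addition):

# Months with the sharpest growth and the sharpest decline
print(monthly_sales_data.nlargest(3, 'Sales Growth')[['Year', 'Month', 'Sales Growth']])
print(monthly_sales_data.nsmallest(3, 'Sales Growth')[['Year', 'Month', 'Sales Growth']])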

3.5 Product-wise Sales

Understanding which products are the best-sellers can guide inventory decisions and marketing strategies.

In [14]:

# Calculate total sales for each product
product_sales = df.groupby('Product')['Total Sales'].sum().sort_values(ascending=False)

# Plotting product-wise sales
plt.figure(figsize=(12, 8))
product_sales.plot(kind='bar', color='skyblue')
plt.title('Product-wise Sales')
plt.ylabel('Total Sales ($)')
plt.xlabel('Product')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()
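
Revenue and unit volume can rank products quite differently (inexpensive accessories may top the unit count while high-priced electronics dominate revenue), so it is worth looking at both. A small optional sketch:

# Top products by revenue versus by number of units sold
top_by_revenue = df.groupby('Product')['Total Sales'].sum().nlargest(5)
top_by_units = df.groupby('Product')['Quantity Ordered'].sum().nlargest(5)
print(top_by_revenue)
print(top_by_units)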

3.6 Sales by Category

Sales distribution across different categories can provide insights into which categories are most popular and deserve more attention.

In [15]:

# Calculate total sales for each category
category_sales = df.groupby('catรฉgorie')['Total Sales'].sum()

# Plotting sales by category
plt.figure(figsize=(10, 7))
category_sales.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel'))
plt.title('Sales by Category')
plt.ylabel('')
plt.show()

3.7 Geographical Sales Distribution

Knowing which regions or cities generate the most sales can guide regional marketing efforts and distribution strategies.

In [16]:

# Extract city from 'Purchase Address'
df['City'] = df['Purchase Address'].apply(lambda x: x.split(',')[1].strip())

# Calculate total sales for each city
city_sales = df.groupby('City')['Total Sales'].sum().sort_values(ascending=False)

# Plotting sales by city
plt.figure(figsize=(12, 7))
city_sales.plot(kind='bar', color='lightgreen')
plt.title('Geographical Sales Distribution')
plt.ylabel('Total Sales ($)')
plt.xlabel('City')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()
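
One caveat on the city extraction: cities in different states can share a name in this address format (for example, a Portland in Oregon and one in Maine would be merged). Grouping on city plus state avoids that; a minimal optional sketch:

# Group on "City (ST)" so same-named cities in different states stay separate
df['City State'] = df['Purchase Address'].apply(
    lambda x: f"{x.split(',')[1].strip()} ({x.split(',')[2].strip().split(' ')[0]})")
city_state_sales = df.groupby('City State')['Total Sales'].sum().sort_values(ascending=False)
print(city_state_sales)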

3.8 Day-wise Sales Distribution

Identifying peak sales days can help in optimizing promotions and marketing efforts for specific days of the week.

In [17]:

# Extract day of the week from 'Order Date'
df['Day of Week'] = df['Order Date'].dt.day_name()

# Calculate total sales for each day of the week
day_sales = df.groupby('Day of Week')['Total Sales'].sum().reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])

# Plotting day-wise sales distribution
plt.figure(figsize=(12, 6))
day_sales.plot(kind='bar', color='coral')
plt.title('Day-wise Sales Distribution')
plt.ylabel('Total Sales ($)')
plt.xlabel('Day of the Week')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()

4. Interpretation and Communication

4.1 Summarizing the main findings from the analysis

From our analysis, we can derive the following key insights:

  • The dataset is comprehensive, covering a wide range of products, and has no missing values, which supports a reliable analysis.
  • There is noticeable seasonality in the sales data, suggesting holiday or promotional effects.
  • The correlation matrix indicates relationships between the numerical variables, but further analysis is needed to turn them into actionable insights (sketched below).
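
For instance, the strongest pairwise correlations can be listed explicitly rather than read off a heatmap; a minimal sketch:

# List the strongest pairwise correlations between numeric columns
corr = df.corr(numeric_only=True)
pairs = corr.abs().unstack().sort_values(ascending=False)
pairs = pairs[pairs < 1.0]   # drop self-correlations
print(pairs.head(10))        # note: each pair appears twice (A-B and B-A)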

4.2 Monthly Sales Heatmap

A heatmap of monthly sales by year provides a visual summary that helps in spotting patterns and trends.

In [18]:

# Pivot the data to get months on one axis and years on the other
heatmap_data = monthly_sales_data.pivot(index="Month", columns="Year", values="Price Each")

plt.figure(figsize=(10, 8))
sns.heatmap(heatmap_data, cmap="YlGnBu", annot=True, fmt=".0f")
plt.title('Monthly Sales Heatmap')
plt.show()

4.3 Insights


Based on our analysis, here are some insights and recommendations:

  • Seasonal Trends: Sales tend to peak during specific months. Aligning marketing strategies with these months could be beneficial.
  • Product Analysis: Understanding the distribution of ‘Price Each’ can help in optimizing the product mix, focusing on products within the most common price range (a small sketch follows after this list).
  • Sales Peaks: Identifying months with peak and low sales can guide marketing and promotional efforts. For instance, introducing promotions during low sales months can boost revenue.
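
Tied to the Product Analysis point above, the most common price band can be computed directly. A minimal sketch with illustrative band edges (the bin boundaries are an assumption, not taken from the original analysis):

# Count order lines per illustrative price band to find the most common range
price_bands = pd.cut(df['Price Each'], bins=[0, 25, 100, 300, 600, 2000])
print(price_bands.value_counts().sort_index())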

