Welcome to SalesSight: Unveiling E-commerce Trends! ๐๏ธ๐ Our innovative Data Analysis project is at the forefront of providing a comprehensive and insightful analysis of the ever-evolving landscape of online retail. With a keen focus on data-driven decision-making, SalesSight empowers businesses with the knowledge to navigate the dynamic e-commerce world effectively. Our expert team meticulously examines sales patterns, consumer behavior, and market trends using advanced data analysis techniques to unearth valuable insights that drive growth and profitability.
1. Data Exploration and Understanding
By harnessing the power of cutting-edge analytics, SalesSight’s Data Analysis project is your ultimate companion in staying ahead of the curve, optimizing strategies, and making informed choices based on solid data-driven evidence. Join us in revolutionizing the way you perceive and capitalize on e-commerce opportunities. Elevate your business with SalesSight – where foresight meets success through the power of data! ๐๐๐
1.1 Load the sales_data.csv dataset
This section is dedicated to importing necessary libraries and loading the dataset. It’s the first step in any data analysis project.
In [1]:
# Import necessary libraries import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # Load the dataset df = pd.read_csv('/kaggle/input/sales-orders/sales_data.csv') # Display the first few rows of the dataset df.head()
Out[1]:
Order Date | Order ID | Product | Product_ean | catรฉgorie | Purchase Address | Quantity Ordered | Price Each | Cost price | turnover | margin | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-01-22 21:25:00 | 141234 | iPhone | 5.638009e+12 | Vรชtements | 944 Walnut St, Boston, MA 02215 | 1 | 700.00 | 231.0000 | 700.00 | 469.0000 |
1 | 2019-01-28 14:15:00 | 141235 | Lightning Charging Cable | 5.563320e+12 | Alimentation | 185 Maple St, Portland, OR 97035 | 1 | 14.95 | 7.4750 | 14.95 | 7.4750 |
2 | 2019-01-17 13:33:00 | 141236 | Wired Headphones | 2.113973e+12 | Vรชtements | 538 Adams St, San Francisco, CA 94016 | 2 | 11.99 | 5.9950 | 23.98 | 11.9900 |
3 | 2019-01-05 20:33:00 | 141237 | 27in FHD Monitor | 3.069157e+12 | Sports | 738 10th St, Los Angeles, CA 90001 | 1 | 149.99 | 97.4935 | 149.99 | 52.4965 |
4 | 2019-01-25 11:59:00 | 141238 | Wired Headphones | 9.692681e+12 | รlectronique | 387 10th St, Austin, TX 73301 | 1 | 11.99 | 5.9950 | 11.99 | 5.9950 |
1.2 Performing preliminary analysis
Here, we’re getting a quick overview of the dataset. We’ll check its structure, get some basic statistics, and identify any missing values.
In [2]:
# Get a summary of the dataset df.info() # Descriptive statistics for numerical columns df.describe() # Check for missing values df.isnull().sum()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 185950 entries, 0 to 185949 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Order Date 185950 non-null object 1 Order ID 185950 non-null int64 2 Product 185950 non-null object 3 Product_ean 185950 non-null float64 4 catรฉgorie 185950 non-null object 5 Purchase Address 185950 non-null object 6 Quantity Ordered 185950 non-null int64 7 Price Each 185950 non-null float64 8 Cost price 185950 non-null float64 9 turnover 185950 non-null float64 10 margin 185950 non-null float64 dtypes: float64(5), int64(2), object(4) memory usage: 15.6+ MB
Out[2]:
Order Date 0 Order ID 0 Product 0 Product_ean 0 catรฉgorie 0 Purchase Address 0 Quantity Ordered 0 Price Each 0 Cost price 0 turnover 0 margin 0 dtype: int64
1.3 Understanding the distribution of key columns
Visualizing the distribution of key columns helps in understanding the spread and central tendencies of the data.
In [3]:
import plotly.express as px import plotly.graph_objects as goIn [4]:
fig1 = px.histogram(df, x='Quantity Ordered', marginal='box', nbins=50, title='Distribution of Quantity Ordered') fig1.show()
2468050k100k150kDistribution of Quantity OrderedQuantity Orderedcount
In [5]:
fig2 = px.histogram(df, x='Price Each', marginal='box', nbins=50, title='Distribution of Price Each') fig2.show()
050010001500020k40k60k80k100kDistribution of Price EachPrice Eachcount
1.4 Identify patterns, correlations, or anomalies
Visualizing the distribution of key columns helps in understanding the spread and central tendencies of the data.
In [6]:
# Correlation matrix correlation_matrix = df.corr() sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm') plt.title('Correlation Matrix') plt.show()
/tmp/ipykernel_20/761733960.py:2: FutureWarning: The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.
2. Data Preprocessing
2.1 Convert the ‘Order Date’ column to a date-time format
Converting ‘Order Date’ to a datetime format is essential for time series analysis. It allows for easier data manipulation and aggregation based on time periods.
In [7]:
# (Note: This step was already done in the previous phase, but for the sake of continuity, I'll include it again.) # Convert 'Order Date' to datetime format df['Order Date'] = pd.to_datetime(df['Order Date'])
2.2 Aggregate sales data on a monthly basis
Aggregating data on a monthly basis helps in understanding monthly sales trends and patterns.
In [8]:
# (Note: This step was also covered in the previous phase. However, I'll provide an alternative approach to aggregate the data.) # Extract month and year from 'Order Date' df['Month'] = df['Order Date'].dt.month df['Year'] = df['Order Date'].dt.year # Aggregate sales data by month monthly_sales_data = df.groupby(['Year', 'Month']).agg({'Price Each': 'sum', 'Quantity Ordered': 'sum'}).reset_index()
2.3 Create new features
Creating new features can provide additional insights. Here, we’re calculating the total sales for each order.
In [9]:
# Create 'Total Sales' feature df['Total Sales'] = df['Quantity Ordered'] * df['Price Each']
3. Data Analysis
3.1 Plot the aggregated sales data over time to visualize the trend
Visualizing aggregated sales over time helps in identifying trends, seasonality, and any anomalies in the data.
In [10]:
# Using Plotly for interactive visualization fig4 = go.Figure(data=go.Scatter(x=pd.to_datetime(monthly_sales_data['Year'].astype(str) + '-' + monthly_sales_data['Month'].astype(str) + '-01'), y=monthly_sales_data['Price Each'], mode='lines+markers')) fig4.update_layout(title='Aggregated Monthly Sales Over Time', xaxis_title='Time', yaxis_title='Total Sales', xaxis=dict(showgrid=True, showline=True, showticklabels=True), yaxis=dict(showgrid=True, showline=True, showticklabels=True)) fig4.show()
Jan 2019Mar 2019May 2019Jul 2019Sep 2019Nov 2019Jan 202001M2M3M4MAggregated Monthly Sales Over TimeTimeTotal Sales
3.2 Identify significant seasonality or cyclical patterns in sales data
Decomposing the time series data helps in understanding its underlying patterns like trend, seasonality, and residuals.
In [11]:
# Using seasonal decomposition to identify patterns from statsmodels.tsa.seasonal import seasonal_decompose # Decompose the time series data with a bi-monthly period result = seasonal_decompose(monthly_sales_data['Price Each'], model='multiplicative', period=6) # Plot the decomposed components result.plot() plt.show()
3.3 Use statistical methods to identify peak and low sales periods
Identifying peak and low sales periods can guide marketing and promotional strategies.
In [12]:
# Identify months with peak sales peak_sales = monthly_sales_data[monthly_sales_data['Price Each'] == monthly_sales_data['Price Each'].max()] # Identify months with lowest sales low_sales = monthly_sales_data[monthly_sales_data['Price Each'] == monthly_sales_data['Price Each'].min()] print(f"Peak Sales were in: {peak_sales['Month'].values[0]}, {peak_sales['Year'].values[0]}") print(f"Lowest Sales were in: {low_sales['Month'].values[0]}, {low_sales['Year'].values[0]}")
Peak Sales were in: 12, 2019 Lowest Sales were in: 1, 2020
3.4 Monthly Sales Growth Rate
This visualization will help in understanding the dynamics of sales growth and identifying periods of rapid growth or decline.
In [13]:
# Calculate the month-over-month growth rate monthly_sales_data['Sales Growth'] = monthly_sales_data['Price Each'].pct_change() * 100 # Plotting the growth rate plt.figure(figsize=(12, 6)) sns.lineplot(x=pd.to_datetime(monthly_sales_data['Year'].astype(str) + '-' + monthly_sales_data['Month'].astype(str) + '-01'), y='Sales Growth', data=monthly_sales_data) plt.title('Monthly Sales Growth Rate') plt.ylabel('Growth Rate (%)') plt.xlabel('Date') plt.grid(True) plt.show()
3.5 Product-wise Sales
Understanding which products are the best-sellers can guide inventory decisions and marketing strategies.
In [14]:
# Calculate total sales for each product product_sales = df.groupby('Product')['Total Sales'].sum().sort_values(ascending=False) # Plotting product-wise sales plt.figure(figsize=(12, 8)) product_sales.plot(kind='bar', color='skyblue') plt.title('Product-wise Sales') plt.ylabel('Total Sales ($)') plt.xlabel('Product') plt.xticks(rotation=45) plt.grid(axis='y') plt.show()
3.6 Sales by Category
Sales distribution across different categories can provide insights into which categories are most popular and deserve more attention.
In [15]:
# Calculate total sales for each category category_sales = df.groupby('catรฉgorie')['Total Sales'].sum() # Plotting sales by category plt.figure(figsize=(10, 7)) category_sales.plot(kind='pie', autopct='%1.1f%%', startangle=140, colors=sns.color_palette('pastel')) plt.title('Sales by Category') plt.ylabel('') plt.show()
3.7 Geographical Sales Distribution
Knowing which regions or cities generate the most sales can guide regional marketing efforts and distribution strategies.
In [16]:
# Extract city from 'Purchase Address' df['City'] = df['Purchase Address'].apply(lambda x: x.split(',')[1].strip()) # Calculate total sales for each city city_sales = df.groupby('City')['Total Sales'].sum().sort_values(ascending=False) # Plotting sales by city plt.figure(figsize=(12, 7)) city_sales.plot(kind='bar', color='lightgreen') plt.title('Geographical Sales Distribution') plt.ylabel('Total Sales ($)') plt.xlabel('City') plt.xticks(rotation=45) plt.grid(axis='y') plt.show()
3.8 Day-wise Sales Distribution
Identifying peak sales days can help in optimizing promotions and marketing efforts for specific days of the week.
In [17]:
# Extract day of the week from 'Order Date' df['Day of Week'] = df['Order Date'].dt.day_name() # Calculate total sales for each day of the week day_sales = df.groupby('Day of Week')['Total Sales'].sum().reindex(['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']) # Plotting day-wise sales distribution plt.figure(figsize=(12, 6)) day_sales.plot(kind='bar', color='coral') plt.title('Day-wise Sales Distribution') plt.ylabel('Total Sales ($)') plt.xlabel('Day of the Week') plt.xticks(rotation=45) plt.grid(axis='y') plt.show()
4. Interpretation and Communication
4.1 Summarizing the main findings from the analysis
From our analysis, we can derive the following key insights:
- The dataset is comprehensive, covering a wide range of products, with no missing values ensuring a reliable analysis.
- There’s a noticeable seasonality in the sales data, suggesting potential holiday or promotional sales.
- The correlation matrix indicates relationships between numerical variables, but further analysis might be needed to derive actionable insights.
4.2 Visual Representations of Sales Trends
A heatmap provides a visual representation of monthly sales over the years, helping in identifying patterns and trends.
In [18]:
import seaborn as sns # Pivot the data to get months on one axis and years on another axis heatmap_data = monthly_sales_data.pivot("Month", "Year", "Price Each") plt.figure(figsize=(10, 8)) sns.heatmap(heatmap_data, cmap="YlGnBu", annot=True, fmt=".0f") plt.title('Monthly Sales Heatmap') plt.show()
/tmp/ipykernel_20/3132487066.py:4: FutureWarning: In a future version of pandas all arguments of DataFrame.pivot will be keyword-only.
4.3 Insights
linkcode
Based on our analysis, here are some insights and recommendations:
- Seasonal Trends: Sales tend to peak during specific months. Aligning marketing strategies with these months could be beneficial.
- Product Analysis: Understanding the distribution of ‘Price Each’ can help in optimizing the product mix, focusing on products within the most common price range.
- Sales Peaks: Identifying months with peak and low sales can guide marketing and promotional efforts. For instance, introducing promotions during low sales months can boost revenue.
1 Comment
Machine Learning Project 1: Honda Motor Stocks Best Prices · May 24, 2024 at 2:52 pm
[…] SalesSight […]