
1. Data Exploration and Understanding


1.1 Import Necessary Libraries

We began by importing essential libraries such as pandas, numpy, matplotlib, and seaborn to handle data manipulation, analysis, and visualization.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

1.2 Load the Dataset

The dataset, ‘reviews.csv’, contains reviews and ratings of various cafes. It was loaded into a pandas DataFrame.

# Assuming the dataset is named 'reviews.csv'
data = pd.read_csv('/kaggle/input/zomato-cafe-reviews/reviews.csv')

1.3 Initial Data Exploration

A preliminary examination of the dataset was conducted to understand its structure. The dataset contains columns such as ‘Name’ (cafe name), ‘Overall_Rating’, ‘Cuisine’, ‘Rate for two’, ‘City’, and ‘Review’.

# Display the first few rows of the dataset to get a feel for its structure
data.head()

   Index  Name          Overall_Rating  Cuisine                                          Rate for two  City       Review
0  0      Oliver Brown  3.9             Cafe, Coffee, Shake, Juices, Beverages, Waffle…  500           ahmedabad  Been to this place 3-4 times. Prakash is alway…
1  1      Oliver Brown  3.9             Cafe, Coffee, Shake, Juices, Beverages, Waffle…  500           ahmedabad  I recently visited Oliver Brown on a weekend f…
2  2      Crush Coffee  3               Cafe, Shake, Beverages, Desserts                 600           ahmedabad  Very watery ans thin shake
3  3      The Mohalla   3.8             Cafe                                             550           ahmedabad  it was not cheese burst pizza.. only cheeze wa…
4  4      The Mohalla   3.8             Cafe                                             550           ahmedabad  Yammi.,….test burger is best I love 💗 this B…
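
Beyond the first rows, the overall shape and the column types are worth a quick look. The following is a minimal sketch on a hypothetical two-row frame; the name `toy` and the assumption that ratings are stored as text are illustrative only. On the real data, `toy` would be replaced with `data`.

```python
import pandas as pd

# Hypothetical miniature of the dataset, for illustration only
toy = pd.DataFrame({
    'Name': ['Oliver Brown', 'Crush Coffee'],
    'Overall_Rating': ['3.9', '3'],   # assumed here to be stored as text
    'City': ['ahmedabad', 'ahmedabad'],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # column types; text columns are not numeric
```

A rating column that is not reported as numeric is a hint that it needs converting before any averaging.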

1.4 Examine the Distribution of ‘Overall Rating’

A distribution plot was created to visualize the frequency of different ratings given by users.

# Plotting the distribution of 'Overall Rating'
plt.figure(figsize=(10, 6))

# Sorting the ratings from low to high
sorted_ratings = sorted(data['Overall_Rating'].unique())

sns.countplot(data=data, x='Overall_Rating', order=sorted_ratings)
plt.title('Distribution of Overall Ratings')
plt.xlabel('Overall Rating')
plt.ylabel('Number of Reviews')
plt.show()
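
The counts behind the plot can also be inspected as numbers. Here is a small sketch using only the standard library and a hypothetical list of ratings. On the real DataFrame, `data['Overall_Rating'].value_counts().sort_index()` should give the equivalent table.

```python
from collections import Counter

# Hypothetical ratings; on the real data this would be data['Overall_Rating']
ratings = [3.9, 3.9, 3.0, 3.8, 3.8, 4.2]

# Count how often each rating occurs, then list them from low to high
counts = Counter(ratings)
for rating in sorted(counts):
    print(f"{rating}: {counts[rating]} review(s)")
```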

1.5 Analyze the ‘Review’ Column

Random reviews were displayed to understand the content and nature of user feedback.

# Displaying some random reviews to understand the content
random_indices = np.random.choice(data.index, 5, replace=False)
for idx in random_indices:
    print(f"Review {idx}:\n{data.loc[idx, 'Review']}\n{'-'*80}\n")
Review 64:
Good and Fresh staff members

Review 162:
Ambrosia is a really quiet and nice area, surrounded by lots of trees that make you feel like you're in the jungle. Nice location if you want to escape the bustle of the city. The food is top-notch. There is no compromising on food quality or quantity. Excellent vegetarian and non-vegetarian meal options, must visit place with friends and family!

Review 589:
Very testy Awsm 

Review 231:

Review 413:
Waste of time and money. Takes more than an hour for service even with late night ‘minimum rush hours’ . Its been the fourth experience (giving multiple chances). Sad to see waste of a beautiful space at FS!

1.6 Check for Missing Values or Anomalies

The dataset was checked for missing values and anomalies. The ‘Overall_Rating’ column was also examined for unique values to ensure data consistency.

# Checking for missing values in the dataset
missing_values = data.isnull().sum()
print("Missing Values per Column:")
print(missing_values)

# Checking for any anomalies in 'Overall Rating' (e.g., ratings outside the expected range)
print("\nUnique values in 'Overall Rating':", data['Overall_Rating'].unique())
Missing Values per Column:
Index             0
Name              0
Overall_Rating    0
Cuisine           0
Rate for two      0
City              0
Review            0
dtype: int64
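
The anomaly check described above can be made explicit by flagging entries that are not numbers or that fall outside the expected scale. This is a sketch on hypothetical values, assuming a 0 to 5 rating scale; the sample entries such as 'NEW' are invented for illustration.

```python
import pandas as pd

# Hypothetical sample; the real column comes from reviews.csv
ratings = pd.Series(['3.9', '4.2', 'NEW', '5.6', '-'], name='Overall_Rating')

# Coerce to numbers; anything that cannot be parsed becomes NaN
numeric = pd.to_numeric(ratings, errors='coerce')

# Entries that are non-numeric, and values outside the assumed 0-5 scale
non_numeric = ratings[numeric.isna()]
out_of_range = numeric[(numeric < 0) | (numeric > 5)]

print("Non-numeric entries:", non_numeric.tolist())
print("Out-of-range values:", out_of_range.tolist())
```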

2. Data Preprocessing

2.1 Tokenize and Clean the ‘Review’ Column

The ‘Review’ column was tokenized and cleaned by lowercasing the text and removing stopwords and non-alphabetic tokens.

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Downloading the stopwords and the tokenizer models from nltk
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

# Function to clean and tokenize reviews
def clean_review(review):
    tokens = word_tokenize(review)
    tokens = [word.lower() for word in tokens if word.isalpha()]  # Remove non-alphabetic tokens and convert to lowercase
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return ' '.join(tokens)

data['Cleaned_Review'] = data['Review'].apply(clean_review)
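
To see what the cleaning steps do to a single review, here is a rough standard-library approximation. It is not the NLTK pipeline itself: the regular expression stands in for `word_tokenize`, and `SAMPLE_STOPWORDS` is a tiny hypothetical stopword set, so results on real text will differ in the details.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer)
SAMPLE_STOPWORDS = {"the", "is", "and", "a", "was", "to", "this"}

def clean_review_sketch(review):
    """Keep alphabetic runs only, lowercase them, and drop stopwords."""
    tokens = re.findall(r"[A-Za-z]+", review)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in SAMPLE_STOPWORDS]
    return " ".join(tokens)

print(clean_review_sketch("The coffee was GREAT, and the staff is friendly!!! 10/10"))
# prints: coffee great staff friendly
```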

2.2 Convert Categorical Data

Categorical columns like ‘Cuisine’ and ‘City’ were one-hot encoded to convert them into a numerical format.

# One-hot encoding for 'Cuisine' and 'City' columns
data = pd.get_dummies(data, columns=['Cuisine', 'City'], drop_first=True)
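
The ‘Cuisine’ values in the preview are comma-separated lists, so one-hot encoding the whole string yields one column per distinct combination. If one column per individual cuisine is preferred, a possible alternative is sketched below on a hypothetical two-row frame, using pandas' `Series.str.get_dummies` with a separator.

```python
import pandas as pd

# Hypothetical rows modelled on the preview of the dataset
toy = pd.DataFrame({
    'Name': ['Crush Coffee', 'The Mohalla'],
    'Cuisine': ['Cafe, Shake, Beverages, Desserts', 'Cafe'],
})

# One indicator column per individual cuisine label
cuisine_flags = toy['Cuisine'].str.get_dummies(sep=', ')
print(cuisine_flags)
```

Which encoding is more useful depends on whether the combination of cuisines or each single cuisine should act as a feature.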

2.3 Extract Sentiment from the ‘Review’ Column

Sentiment scores were extracted from the cleaned ‘Review’ text using the TextBlob library. TextBlob's polarity score ranges from -1 (most negative) to +1 (most positive), so a score above zero indicates positive sentiment and a score below zero indicates negative sentiment.

from textblob import TextBlob

# Function to get the polarity of the review
def get_sentiment(review):
    analysis = TextBlob(review)
    return analysis.sentiment.polarity

data['Sentiment'] = data['Cleaned_Review'].apply(get_sentiment)

3. Feature Engineering

3.1 Generate a ‘Sentiment Score’ for Each Review

The sentiment of each review was quantified using the TextBlob library, resulting in a ‘Sentiment_Score’ for each review.

# Reuse the get_sentiment function defined in section 2.3

data['Sentiment_Score'] = data['Cleaned_Review'].apply(get_sentiment)

3.2 Create a ‘User Profile’

A ‘User_Profile’ value was created for each review as the mean of its rating and its sentiment score. Because the dataset has no user identifier, each row is treated as a separate user, so this value summarizes that user’s rating and sentiment in a single number.

3.2.1 Convert ‘Overall_Rating’ to Numeric

# Convert 'Overall_Rating' to numeric, setting errors='coerce' to turn problematic values into NaNs
data['Overall_Rating'] = pd.to_numeric(data['Overall_Rating'], errors='coerce')

3.2.2 Handle Potential NaN Values

# If there are any NaN values after conversion, we can handle them. 
# For this example, we'll replace NaNs with the column median.

data['Overall_Rating'] = data['Overall_Rating'].fillna(data['Overall_Rating'].median())

3.2.3 Recreate ‘User Profile’

# Now, create the 'User_Profile' feature
data['User_Profile'] = data[['Overall_Rating', 'Sentiment_Score']].mean(axis=1)
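
The mean above combines a rating on a roughly 0 to 5 scale with a polarity between -1 and 1, so the rating contributes most of the value. One possible refinement, sketched here with hypothetical helper names and an assumed 0 to 5 rating scale, is to rescale the polarity before averaging.

```python
def rescale_polarity(polarity, low=0.0, high=5.0):
    """Map a polarity in [-1, 1] linearly onto [low, high]."""
    return low + (polarity + 1.0) / 2.0 * (high - low)

def profile_score(rating, polarity):
    """Mean of the rating and the rescaled polarity, both on a 0-5 scale."""
    return (rating + rescale_polarity(polarity)) / 2.0

print(rescale_polarity(-1.0), rescale_polarity(0.0), rescale_polarity(1.0))  # 0.0 2.5 5.0
print(profile_score(4.0, 0.5))  # 3.875
```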

3.3 Develop a ‘Cafe Profile’

A profile was developed for each cafe, capturing the average ratings and sentiments associated with it.

# Grouping by cafe name to get the average rating and sentiment score for each cafe
cafe_profile = data.groupby('Name').agg({
    'Overall_Rating': 'mean',
    'Sentiment_Score': 'mean'
}).reset_index()

# Renaming columns for clarity
cafe_profile.columns = ['Name', 'Average_Rating', 'Average_Sentiment']

# Merging the cafe profile back to the main dataset
data = pd.merge(data, cafe_profile, on='Name', how='left')
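
The group-then-merge pattern is easier to follow on a small example. The sketch below uses a hypothetical three-row frame with made-up cafe names 'A' and 'B', and pandas named aggregation so the output columns get their final names in one step.

```python
import pandas as pd

# Hypothetical reviews for two cafes
toy = pd.DataFrame({
    'Name': ['A', 'A', 'B'],
    'Overall_Rating': [4.0, 4.0, 3.0],
    'Sentiment_Score': [0.5, 0.1, -0.2],
})

# One row per cafe with its average rating and average sentiment
cafe_profile = (
    toy.groupby('Name')
       .agg(Average_Rating=('Overall_Rating', 'mean'),
            Average_Sentiment=('Sentiment_Score', 'mean'))
       .reset_index()
)

# Attach the per-cafe averages to every review of that cafe
merged = toy.merge(cafe_profile, on='Name', how='left')
print(merged)
```

After the merge, both reviews of cafe 'A' carry the same average values, while the row count stays unchanged.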

4. Data Analysis & Model Building

4.1 Split the Data into Training and Testing Sets

The dataset was split into training and testing sets to build and validate the recommendation model.

from sklearn.model_selection import train_test_split

# Assuming each row is a unique user (since the dataset doesn't seem to have a user identifier)
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

4.2 Implement Collaborative Filtering

Collaborative filtering was implemented using the SVD algorithm from the Surprise library.

!pip install surprise

from surprise import Dataset, Reader
from surprise import SVD
from surprise.model_selection import cross_validate

# Define a reader and the rating scale (assuming rating scale is 0-5; adjust if different)
reader = Reader(rating_scale=(0, 5))

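# Create the dataset; load_from_df reads the three columns as (user, item, rating),
# so 'Name' is treated as the user id and 'Index' as the item id here
data_surprise = Dataset.load_from_df(train_data[['Name', 'Index', 'Overall_Rating']], reader)

# Use the SVD algorithm
algo = SVD()

# Train on the dataset
trainset = data_surprise.build_full_trainset()
algo.fit(trainset)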
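
For intuition about what the fitted model returns, Surprise's biased SVD estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors, leaving out the terms for ids it did not see during training. The plain-Python sketch below mirrors that rule. The function name and all parameter values are hypothetical, and the clipping to the rating scale that `predict` applies is omitted.

```python
def svd_estimate(user, item, global_mean, user_bias, item_bias,
                 user_factors, item_factors):
    """Biased-SVD style estimate; unknown ids simply drop their terms."""
    est = global_mean
    known_user = user in user_bias
    known_item = item in item_bias
    if known_user:
        est += user_bias[user]
    if known_item:
        est += item_bias[item]
    if known_user and known_item:
        est += sum(p * q for p, q in zip(user_factors[user], item_factors[item]))
    return est

# Hypothetical learned parameters, for illustration only
user_bias = {'u1': 0.2}
item_bias = {'Cafe A': -0.1}
user_factors = {'u1': [0.5, -0.2]}
item_factors = {'Cafe A': [0.4, 0.1]}

known = svd_estimate('u1', 'Cafe A', 3.8, user_bias, item_bias, user_factors, item_factors)
cold = svd_estimate('new_user', 'New Cafe', 3.8, user_bias, item_bias, user_factors, item_factors)
print(known, cold)
```

An id that never appeared in training contributes nothing, so the estimate for an unseen user and an unseen item is just the global mean.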

4.3 Incorporate Sentiment Scores to Weigh the Recommendations

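Sentiment scores were incorporated to weight the recommendations: each cafe's predicted rating is increased by sentiment_weight (0.2 by default) times its average sentiment score, so cafes with more positive reviews rank higher.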

def weighted_recommendation(user_id, algo, data, sentiment_weight=0.2):
    # Get the list of all cafe names
    cafes = data['Name'].unique()
    # Predict ratings for all cafes; algo.predict takes (user id, item id) in that order
    predictions = [algo.predict(user_id, cafe).est for cafe in cafes]
    # Incorporate sentiment scores
    average_sentiments = data.groupby('Name')['Sentiment_Score'].mean().to_dict()
    weighted_scores = [(pred + sentiment_weight * average_sentiments[cafe]) for pred, cafe in zip(predictions, cafes)]
    # Get the top cafes based on weighted scores
    top_cafes = sorted([(score, cafe) for cafe, score in zip(cafes, weighted_scores)], reverse=True)
    return top_cafes

# Example: Get top 5 recommendations for a user
user_id = train_data['Index'].iloc[0]
recommendations = weighted_recommendation(user_id, algo, train_data)
print("Top 5 Recommendations:", recommendations[:5])
Top 5 Recommendations: [(3.9993548387096776, 'Uphoria- Cafe & Restro'), (3.9993548387096776, 'Caffeinate And Chill'), (3.9693548387096773, 'TROT - The Republic Of Taste'), (3.9685215053763443, 'Pannacottas'), (3.961021505376344, 'Roastery Coffee House')]
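
The weighting itself is simple arithmetic, shown here on hypothetical numbers. The cafe names, predicted ratings, and sentiment values below are invented for illustration and are not taken from the dataset.

```python
def weighted_score(predicted_rating, avg_sentiment, sentiment_weight=0.2):
    """Predicted rating plus a sentiment bonus (a penalty if sentiment is negative)."""
    return predicted_rating + sentiment_weight * avg_sentiment

# Hypothetical (predicted rating, average sentiment) pairs
candidates = {
    'Cafe A': (3.9, 0.50),
    'Cafe B': (4.0, -0.25),
    'Cafe C': (3.9, 0.10),
}

# Rank cafes from highest to lowest weighted score
ranked = sorted(candidates, key=lambda name: weighted_score(*candidates[name]), reverse=True)
print(ranked)  # ['Cafe A', 'Cafe B', 'Cafe C']
```

With a weight of 0.2, a sentiment difference can reorder cafes whose predicted ratings are close, but it cannot outweigh a large gap in predicted rating.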

4.4 Validate the Model Using the Testing Set and Measure its Accuracy

The model was validated using the testing set, and its accuracy was measured using the RMSE metric.

from surprise import accuracy

# Create a test set
testset = [[row['Name'], row['Index'], row['Overall_Rating']] for _, row in test_data.iterrows()]
predictions = algo.test(testset)

# Calculate RMSE
rmse = accuracy.rmse(predictions)
print(f"RMSE: {rmse}")
RMSE: 0.3299
RMSE: 0.32993642711295224
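
RMSE is the square root of the mean squared difference between the true and the predicted ratings, so lower values mean closer predictions. A minimal standard-library sketch on hypothetical values:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true and predicted ratings
print(rmse([4.0, 3.5, 5.0], [3.5, 3.5, 4.0]))
```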

5. Interpretation and Communication of Results

5.1 Generate a List of Top Recommended Cafes

A function was developed to generate a list of top-recommended cafes for a given user based on their past ratings and the sentiment of their reviews.

def get_top_recommendations(user_id, algo, data, top_n=5):
    recommendations = weighted_recommendation(user_id, algo, data)
    return recommendations[:top_n]

user_id = train_data['Index'].iloc[0]
top_recommendations = get_top_recommendations(user_id, algo, train_data)
print("Top Recommended Cafes:")
for idx, (score, cafe) in enumerate(top_recommendations, 1):
    print(f"{idx}. {cafe} (Score: {score:.2f})")
Top Recommended Cafes:
1. Uphoria- Cafe & Restro (Score: 4.00)
2. Caffeinate And Chill (Score: 4.00)
3. TROT - The Republic Of Taste (Score: 3.97)
4. Pannacottas (Score: 3.97)
5. Roastery Coffee House (Score: 3.96)

5.2 Visualize User Preferences and How They Align with the Recommendations

A visualization was created to display a user’s ratings for various cafes, helping understand their preferences and how they align with the model’s recommendations.

def visualize_user_preferences(user_id, data):
    user_data = data[data['Index'] == user_id]
    plt.figure(figsize=(12, 6))
    # Plotting user ratings
    sns.barplot(x='Name', y='Overall_Rating', data=user_data, palette="Blues_d")
    plt.xticks(rotation=45, ha='right')
    plt.title(f"User {user_id}'s Ratings for Cafes")
    plt.xlabel('Cafe Name')
    plt.ylabel('Overall Rating')
    plt.tight_layout()
    plt.show()

visualize_user_preferences(user_id, train_data)


The cafe recommendation system combines user ratings and review sentiment to produce cafe recommendations. By understanding user preferences and sentiments, cafes can enhance their services, target users with personalized offers, and improve overall customer satisfaction. With additional data and refinement, the accuracy and usefulness of the system can be improved further.

