Customer Segmentation with Mixed Data Types – Advice Needed

Hi everyone!

I’m new to using unsupervised learning for customer segmentation. I usually handle segmentation by looking at demographic factors and creating bins for some continuous features (like RFM factors). My lead data scientist suggested that we try clustering to find patterns in our customers for more personalized messaging.

For this, I’d have to work with mixed data types. Would you approach this by ignoring categorical variables, or would you cluster categorical and numerical data separately? Or maybe combine them in some way?

Also, if anyone has examples of how machine learning-based segmentation added value beyond traditional RFM analysis, I’d love to hear your insights!

I’d suggest looking into customer lifetime value (CLV) models, which build on RFM data, to drive segmentation. They help identify low-, mid-, and high-value customers.
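As a starting point before a full CLV model, here’s a minimal sketch of quantile-based RFM scoring with pandas. The column names, the toy data, and the tier cut-offs are all made up for illustration:

```python
import pandas as pd

# Hypothetical per-customer transaction summary; column names are assumptions.
df = pd.DataFrame({
    "customer_id":  range(1, 9),
    "recency_days": [5, 40, 120, 10, 300, 60, 15, 200],
    "frequency":    [12, 3, 1, 8, 1, 4, 10, 2],
    "monetary":     [900, 150, 40, 600, 20, 200, 750, 80],
})

# Score each RFM dimension into terciles (1 = low, 3 = high).
# Recency is inverted: more recent purchases get a higher score.
df["R"] = pd.qcut(df["recency_days"], 3, labels=[3, 2, 1]).astype(int)
df["F"] = pd.qcut(df["frequency"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
df["M"] = pd.qcut(df["monetary"], 3, labels=[1, 2, 3]).astype(int)

# A simple proxy for value tiers: sum the scores and cut into three bands.
df["rfm_score"] = df[["R", "F", "M"]].sum(axis=1)
df["tier"] = pd.cut(df["rfm_score"], bins=[2, 4, 7, 9], labels=["low", "mid", "high"])
```

A probabilistic CLV model (e.g. BG/NBD plus Gamma-Gamma) would replace the score sum with an actual forecast of future spend, but the tiering idea is the same.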

For more insights, check out Byron Sharp’s work on customer growth, which emphasizes reaching a broad audience rather than focusing solely on high-value customers. Peter Fader also talks about customer-centricity and how mid-value customers can drive growth.

Byron Sharp TEDx talk

Peter Fader Google Talk

@Taran
Thanks! I’m thinking of targeting low-propensity users, who also have low CLV. I’ll try clustering within this group and use strategies from higher-propensity users in similar clusters to gradually increase their engagement.

You could start with RFM segmentation, then try clustering on top of that. Run an A/B test to compare results between RFM and clustering. In my experience, RFM is simple and often yields good outcomes for marketing campaigns. Stakeholders also find it easier to understand.

RFM is simple and reliable, but clustering can sometimes reveal new audience segments. It’s worth trying both to see if clustering brings any added insights.

I’m doing something similar. For categorical data, I include it in clustering along with continuous data. Are your categorical variables binary, or do they have a few categories?

Peyton said:
I’m doing something similar. For categorical data, I include it in clustering along with continuous data. Are your categorical variables binary, or do they have a few categories?

Most of them are binary or have 3-4 categories with a fairly even distribution.

We segmented by treating each month of a customer’s history as a separate entry to capture changes over time. DBSCAN and agglomerative clustering gave us interesting results.
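A rough sketch of that setup with scikit-learn, where each row is one customer-month rather than one customer (the features and distribution parameters below are invented for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN, AgglomerativeClustering

rng = np.random.default_rng(0)

# Each row is one customer-month. Here: hypothetical monthly spend and
# session counts for 50 customers over 6 months (300 rows).
X = rng.normal(loc=[100, 10], scale=[30, 3], size=(300, 2))

X_scaled = StandardScaler().fit_transform(X)

# DBSCAN finds dense behavioral regimes; label -1 marks noise points.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

# Agglomerative clustering gives a fixed number of segments for comparison.
agg_labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_scaled)
```

Since the same customer contributes multiple rows, one customer can move between clusters over time, which is exactly what makes the month-as-row framing useful for tracking changes.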

For mixed data types, I’ve found that k-prototypes (for both categorical and numerical data) or Gower distance work well. Preprocessing is key: I standardize numerical features and encode categorical ones based on the model. Some people run separate clusterings for categorical and numerical data, then combine results, or use dimensionality reduction like PCA.
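If you’d rather not pull in the `kmodes` or `gower` packages, the Gower distance itself is simple enough to sketch in plain numpy: range-normalized absolute differences for numeric columns, simple mismatch for categorical ones, averaged across all features. The toy data below is made up:

```python
import numpy as np

def gower_matrix(num, cat):
    """Pairwise Gower distance for mixed data.

    num: (n, p) array of numeric features
    cat: (n, q) array of categorical codes
    Returns an (n, n) matrix with values in [0, 1].
    """
    # Numeric part: absolute differences normalized by each feature's range.
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid divide-by-zero on constant columns
    num_d = np.abs(num[:, None, :] - num[None, :, :]) / ranges   # (n, n, p)
    # Categorical part: simple mismatch (0 if equal, 1 otherwise).
    cat_d = (cat[:, None, :] != cat[None, :, :]).astype(float)   # (n, n, q)
    # Gower distance: mean over all features.
    return np.concatenate([num_d, cat_d], axis=2).mean(axis=2)

# Toy example: 4 customers, 2 numeric features (age, spend) + 1 categorical.
num = np.array([[25, 100.0], [30, 120.0], [55, 400.0], [60, 380.0]])
cat = np.array([[0], [0], [1], [1]])
D = gower_matrix(num, cat)
```

The resulting matrix can be fed to any clusterer that accepts precomputed distances (e.g. agglomerative clustering with average linkage). For anything beyond a few thousand rows, the O(n²) matrix gets expensive, which is where k-prototypes tends to be the more practical choice.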

In my experience, machine learning-based segmentation helped us uncover unexpected customer personas, which led to more personalized strategies and better engagement.

@Vanya
In a case like an online gaming company where behavior patterns (e.g., looking at items but not purchasing) are already known, would you still find ML useful, or would rule-based targeting be enough?

I’m working on a similar project with mixed data. I normalize continuous features and drop highly correlated ones. For segmentation, we discuss with the business team to pick key features, build a base model, then iterate. Binning continuous features is interesting; it simplifies the job for the model.

@Kai
I used bins like 1-20 impressions, 20-80, etc., and combined them with other features (e.g., device, market). It’s not clustering but helps with basic grouping.
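That kind of rule-based grouping is a one-liner in pandas with `pd.cut`. The bin edges below mirror the ranges described above, but the data and exact cut-offs are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "impressions": [3, 15, 25, 60, 150, 500],
    "device": ["mobile", "desktop", "mobile", "mobile", "desktop", "tablet"],
})

# Bin impressions into ranges (right-inclusive by default: (0, 20], (20, 80], ...).
bins = [0, 20, 80, float("inf")]
labels = ["1-20", "21-80", "80+"]
df["impression_band"] = pd.cut(df["impressions"], bins=bins, labels=labels)

# Cross with a categorical feature to get simple rule-based segments.
df["segment"] = df["device"].str.cat(df["impression_band"].astype(str), sep="_")
```

Crossing two or three binned/categorical features like this gives interpretable segments (e.g. "mobile_1-20") that stakeholders can read directly, at the cost of the combinatorial growth that clustering avoids.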

One approach is to one-hot encode categorical data, then apply a dimensionality reduction technique like UMAP before clustering.
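A sketch of that pipeline with scikit-learn. The data is synthetic, and note one substitution: PCA is used for the reduction step so the example runs with scikit-learn alone; with `umap-learn` installed, `umap.UMAP(n_components=2)` would slot into the same place (and usually handles the one-hot geometry better than PCA):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
n = 200

# Hypothetical mixed data: one categorical and two numeric features.
device = rng.choice(["mobile", "desktop", "tablet"], size=n)
numeric = rng.normal(size=(n, 2))

# One-hot encode the categorical column, scale the numeric ones, and stack.
onehot = OneHotEncoder().fit_transform(device.reshape(-1, 1)).toarray()
scaled = StandardScaler().fit_transform(numeric)
X = np.hstack([onehot, scaled])

# Reduce to 2 dimensions, then cluster in the embedded space.
# umap.UMAP(n_components=2) is the drop-in alternative to PCA here.
embedding = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embedding)
```

One caveat worth knowing: distances in a UMAP embedding are not faithful to the original space, so clusters found there should be validated against the raw features before driving campaigns.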