With General Segmentation, you can uncover hidden similarities in data, such as the relationship between product prices and customer purchase histories. This unsupervised algorithm groups data based on similarities among numerical variables.
To run this model in Graphite, first identify an ID column to distinguish between values (e.g., customers or products within groups). Next, select the numeric columns (features) from your dataset for segmentation.
Now comes the tricky part: data preprocessing! We rarely encounter high-quality data, so we must clean and transform it for optimal model results. What should you do with missing values? Either remove them or replace them with relevant values, such as the mean or a prediction.
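The two options for missing values can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Graphite handles missing data internally; the column names (`customer_id`, `age`, `total_spend`) and the helper functions are hypothetical.

```python
from statistics import mean

def drop_missing(rows, columns):
    """Option 1: keep only rows where every listed column has a value."""
    return [r for r in rows if all(r[c] is not None for c in columns)]

def impute_mean(rows, column):
    """Option 2: replace missing values in `column` with the column mean."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    # Build new dicts so the original rows are left untouched.
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

# Hypothetical customer data with one missing age.
customers = [
    {"customer_id": 1, "age": 30, "total_spend": 120.0},
    {"customer_id": 2, "age": None, "total_spend": 80.0},
    {"customer_id": 3, "age": 50, "total_spend": 200.0},
]

dropped = drop_missing(customers, ["age", "total_spend"])
imputed = impute_mean(customers, "age")
```

Dropping rows is simpler but shrinks the dataset; filling with the mean keeps every customer at the cost of inventing a value.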
The scale of each column matters too. For instance, if you have chosen Age and Height as numeric columns, Age might range between 10 and 80, while Height could range from 100 to 210. The algorithm could give Height more weight simply because its values are larger. To avoid this, transform your data so the columns are on a comparable scale, for example by standardizing or normalizing it.
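As a rough sketch of what standardizing and normalizing mean in practice, the snippet below rescales the Age and Height ranges from the example. The function names are made up for illustration, and this is not Graphite's own preprocessing code.

```python
from statistics import mean, pstdev

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [10, 45, 80]         # illustrative values spanning the Age range
heights = [100, 155, 210]   # illustrative values spanning the Height range

ages_scaled = min_max(ages)
heights_scaled = min_max(heights)
ages_z = standardize(ages)
heights_z = standardize(heights)
```

After either transformation, Age and Height cover the same range, so neither column outweighs the other just because of its units.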
Finally, choose the number of groups you want. If you are not sure, Graphite will try to determine the best number of groups for you. But what about the model results? We will get to those a little further on.
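How Graphite picks the number of groups is not described here. One common approach, shown below purely as an assumption, is to run k-means for several candidate values of k and keep the one with the highest average silhouette score. The tiny k-means, the silhouette function, and the toy points are all written for this example.

```python
import math

def kmeans(points, k, iterations=50):
    """Tiny k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Average silhouette score: close to 1 means compact, well-separated clusters."""
    if len(set(labels)) < 2:
        return 0.0
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:
            scores.append(0.0)  # single-point clusters score 0 by convention
            continue
        a = sum(same) / len(same)
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: three visibly separate groups of already-scaled points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (1, 9), (2, 9)]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
```

On real data the scores are rarely this clear-cut, so it is worth looking at the neighboring values of k as well before settling on a number.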
After reviewing all the steps, you can finish and Run Scenario. The training duration may vary depending on the data volume, typically ranging from 1 to 10 minutes. The training will utilize 80% of the data to train various machine learning models and the remaining 20% to test these models and calculate relevant scores. Once completed, you will receive information about the best model based on the F1 value and details about training time.
Let's see how to interpret the results after we have run our model. The results consist of five tabs: Cluster Summary, By Cluster, By Numeric Value, Cluster Visualization, and Details.
The model has divided your data into clusters. A cluster is a group of objects that are more similar to each other than to objects in other clusters, so it is essential to compare the average values of the variables across all clusters. That's why the Cluster Summary Tab shows a graph of the differences between the clusters.
For example, in the picture above, you can see that customers in Cluster2 have the highest average Total spend, in contrast to the customers in Cluster0.
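The comparison behind the Cluster Summary graph is essentially an average per cluster for each variable. A minimal sketch follows, with invented rows and column names (`cluster`, `total_spend`, `orders`); it does not reproduce Graphite's actual output format.

```python
from collections import defaultdict
from statistics import mean

def cluster_means(rows, cluster_key, features):
    """Average each feature within each cluster."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[cluster_key]].append(row)
    return {
        cluster: {f: mean(r[f] for r in members) for f in features}
        for cluster, members in sorted(grouped.items())
    }

# Hypothetical segmentation output: one row per customer with its cluster label.
rows = [
    {"customer_id": 1, "cluster": "Cluster0", "total_spend": 100.0, "orders": 2},
    {"customer_id": 2, "cluster": "Cluster0", "total_spend": 140.0, "orders": 4},
    {"customer_id": 3, "cluster": "Cluster2", "total_spend": 900.0, "orders": 10},
    {"customer_id": 4, "cluster": "Cluster2", "total_spend": 1100.0, "orders": 14},
]

summary = cluster_means(rows, "cluster", ["total_spend", "orders"])
for cluster, averages in summary.items():
    print(cluster, averages)
```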
Wouldn't it be interesting to explore each cluster by a numeric value or each numeric value by a cluster? That's why we have the By Cluster and By Numeric Value Tabs: each variable and each cluster is analyzed by its minimum and maximum, its first and third quartiles, and so on.
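To make those statistics concrete, here is a small sketch that computes the minimum, quartiles, and maximum of one variable per cluster using Python's `statistics.quantiles`. The spend figures are invented, and the exact quartile method Graphite uses is not stated here, so the "inclusive" method is an assumption.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}

# Hypothetical Total spend values for two clusters.
spend_by_cluster = {
    "Cluster0": [10, 20, 30, 40, 50],
    "Cluster1": [100, 150, 200, 250, 300],
}

stats = {name: five_number_summary(vals) for name, vals in spend_by_cluster.items()}
for name, summary_row in stats.items():
    print(name, summary_row)
```

Reading the rows side by side gives the By Numeric Value view (one variable across clusters); repeating the call for every variable of a single cluster gives the By Cluster view.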
There is also a Cluster Visualization Tab that shows the relationship between two measures and how the clusters are distributed across them. You can change the measures to see different clusters and their distributions.
The devil is in the details: details are important, so be conscientious and pay attention to the small things. Last but not least, on the Details Tab, you can find a detailed table with all the relevant values that were used for the results above.
With the right dataset and a few clicks, you will get results that can considerably help your business: general segmentation helps you create marketing and business strategies for each detected group. It's all up to you now. Collect your data and start modeling.