Tuesday, May 26, 2015

Replicating PROC FASTCLUS in R using kmeans

A lot of businesses have bought into the idea of making decisions driven by data, and R is one of the foremost statistical tools helping executives make those ‘data-driven’ decisions. A flip side of being data-driven in your approach is that you get accustomed to looking at a certain type of data in a particular format, and even a slight deviation from this standard can be unsettling. Hence, even when far better methods are available to solve a problem, data scientists must usually subscribe to what is widely accepted in the industry.

It is in this context that we explore the classical analytics problem of classification. There are newer ‘state of the art’ machine learning algorithms in this space, like decision trees, random forests, Bayesian networks, fuzzy logic, etc., and R has an implementation for each of them. However, even before we venture into these classification techniques (which need labeled data), we would want to find labels in the data using segmentation techniques. To this end, the one that still finds wide use in the industry is the ‘k-means’ clustering solution, and its proprietary implementation in SAS: PROC FASTCLUS. In this post, we discuss how to replicate the FASTCLUS procedure in R. This post covers the following topics:

- Kmeans clustering in R
- Replicating SAS PROC FASTCLUS in R
- Statistics for variables, Total STD, Within STD, R-Square and RSQ/(1-RSQ) ratios in R kmeans
- Visualization of k-means cluster results

K Means clustering in R
R implements the k-means solution using the function kmeans. At the most basic level, a k-means algorithm is a minimization problem: it tries to partition ‘n’ observations into ‘k’ clusters such that the within-cluster sum of squares is minimized. It might not be the most efficient way to cluster data when you know nothing about the data, but if you have an idea that there is a finite number of segments or fixed patterns in the data, you can use k-means to validate this intuition. Usually, a k-means solution is run after a more generalized hierarchical clustering technique (a quick sketch of this follows). You can read more about the details of the algorithm, its drawbacks and overall efficiency here
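For instance, a hierarchical pass can be used to sanity-check the choice of ‘k’ before running kmeans. A minimal sketch on the iris data (illustrative, not from the original post):

# Hierarchical clustering first, to eyeball a sensible number of clusters
d <- dist(iris[-5])                 # Euclidean distances on the 4 measurements
hc <- hclust(d, method = "ward.D2") # Ward's linkage
plot(hc)                            # inspect the dendrogram for a sensible cut (here, k = 3)
kmeans(x = iris[-5], centers = 3, iter.max = 100)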

If you want to run k-means clustering on the famous Fisher’s Iris data in R, you just have to use the command:

kmeans(x = iris[-5], centers = 3, iter.max = 100)

and right away, you’ll have the following output on the console:


Note: You might get a different starting solution because k-means is an optimizing algorithm that is highly dependent on the initial seeds. Run the command two or three times and you will most likely converge to the same final solution. You could also call set.seed() before kmeans to always get the solution with 50, 62 and 38 observations in the three clusters.
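For example (the seed value 42 below is an arbitrary choice):

# Fixing the seed makes the random initial centers, and hence the run, reproducible
set.seed(42)
kmeans(x = iris[-5], centers = 3, iter.max = 100)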

The ‘Cluster means’ section shows the mean of each variable within each cluster. The clustering vector tells us the cluster to which every observation belongs. The ratio (between_SS / total_SS = 88.4%) tells us that 88.4% of the variance lies between the clusters and only 100 - 88.4 = 11.6% lies within the clusters. This tells us that the clusters are fairly tightly packed – a desired objective of k-means clustering. The higher the between_SS/total_SS ratio (also known as the overall R-square), the better our cluster solution.
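This ratio can be computed directly from the components of the kmeans result object (fit is just a name chosen here for the stored result):

fit <- kmeans(x = iris[-5], centers = 3, iter.max = 100)
fit$betweenss / fit$totss   # overall R-square, roughly 0.884 for this solution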

Replicating SAS PROC FASTCLUS in R
Due to its dominance in business circles, SAS’s ODS output is something that most people are accustomed to looking at. And if you ran PROC FASTCLUS in SAS on the same famous IRIS data, this is how you would do it:

proc fastclus data=sashelp.iris maxc=3 maxiter=10 out=clus;
var SepalLength SepalWidth PetalLength PetalWidth;
run;
And this is what the output would look like:



The first thing you notice is that there is a lot more output compared to R, and it might be a bit overwhelming! But if you look at it closely, the crux of the output remains that the final solution of 3 clusters has 38, 50 and 62 observations and the overall R-square value is 88.4%, both of which were already reported in R. However, the statistics for variables, the pseudo F statistic, the CCC, the cluster means/standard deviations, etc. are some of the additional outputs which SAS presents in a nice format and which businesses are used to looking at.

Although it is not possible to replicate the SAS result 100% because the initial seeds chosen by SAS and R vary considerably, there are some statistics, like the RSQ/(1-RSQ) ratio per variable and the pseudo F, which would definitely enhance the R output and help decide which variables perform better at separating the observations. I have searched for ready-made packages that could help, but there seem to be none as of now. However, there is some help in the materials presented here and here, which are the basis for the code below:

In order to replicate the exact output of SAS FASTCLUS, we first export SAS’s IRIS data (which measures SepalLength, SepalWidth, PetalLength and PetalWidth in mm, as opposed to the iris data in R’s datasets package, which measures them in cm), import it into R, and then run k-means:


# Import the IRIS data exported from SAS; the first column is the species
sas_iris <- read.csv(file = 'sas_iris.csv', header = TRUE, sep = ',', fill = TRUE)
# Cluster on the four measurement variables, dropping the species column
sas_clus <- kmeans(x = sas_iris[-1], centers = 3, iter.max = 100)
sas_clus

would produce the following output, similar to what we’ve seen before:


We see that although the absolute values of the within_SS have changed (due to the change in variable scale), the overall R-square value still remains at 88.4%, with the observations grouped into 3 clusters of 62, 38 and 50 as before.
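As an aside, if exporting from SAS is inconvenient, the mm scale can be approximated from R’s built-in iris, since the underlying Fisher data is the same (a shortcut, not part of the original workflow):

# R's iris is in cm; multiplying by 10 mimics the mm scale of sashelp.iris
sas_iris_alt <- iris[1:4] * 10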

Statistics for variables and RSQ/(1-RSQ) ratio
We now know that the variables SepalLength, SepalWidth, PetalLength and PetalWidth can together create a great deal of separation between the species. But is there a way to know statistically which of these variables gives the highest degree of separation and which gives the least? This is where the table of statistics for variables comes in – something the R output seems to miss out on. According to the post here, a good way to find that out would be to run a simple linear regression of each variable against the assigned cluster and use the adjusted R-square as a proxy for the strength of the variable:

Something like this:

sas_iris$classified <- sas_clus$cluster
summary(lm(formula = SepalLength ~ classified, data = sas_iris))
summary(lm(formula = SepalWidth ~ classified, data = sas_iris))
summary(lm(formula = PetalLength ~ classified, data = sas_iris))
summary(lm(formula = PetalWidth ~ classified, data = sas_iris))

Although this gives an R-square for each variable, the values vary too much to make any inference about the strength of the variables.
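One likely reason, not discussed in the original post: lm() treats the numeric cluster label as a continuous predictor. Converting it to a factor turns each regression into a one-way ANOVA, and its adjusted R-square then coincides with the 1 - (Within STD/Total STD)^2 figure derived below:

# Treat the cluster label as a factor: a one-way ANOVA per variable, whose
# adjusted R-square is a cleaner proxy for separating power
summary(lm(PetalLength ~ factor(classified), data = sas_iris))$adj.r.squared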

An almost exact replica of the SAS output table can be generated by computing the Total Standard Deviation (Total STD) and the pooled within-cluster Standard Deviation (Within STD), and then plugging these into the formula

Variable RSquare = 1 - (Within STD/Total STD)^2

and the RSQ/(1-RSQ) ratio as well. Although the exact SAS output cannot be reproduced to 4 decimal places because the exact formulae used by SAS are not readily available anywhere on the internet, the ones used here come close to the actual numbers and help us judge the strength of the individual variables.


sas_iris$classified <- sas_clus$cluster
variable.stats <- function(v, classified) {
  n <- length(v)
  k <- length(unique(classified))  # number of clusters
  tot <- sd(v)                     # total standard deviation (n - 1 denominator)
  # Pooled within-cluster standard deviation: total within-cluster SS over (n - k)
  wth <- sqrt(sum(tapply(v, classified, function(x) sum((x - mean(x))^2))) / (n - k))
  RSq <- 1 - (wth / tot)^2
  Ratio <- RSq / (1 - RSq)
  c(tot, wth, RSq, Ratio)
}


vapply(X = sas_iris[,2:5], FUN = variable.stats, FUN.VALUE = c(Tot.STD = 0, Within.STD = 0, RSQ = 0, RSQRatio = 0),
       classified = sas_iris$classified)

And this will give the results in a tabular format, similar to SAS:


By looking at this, we can now see that the variable ‘PetalLength’ produces the highest degree of separation, while ‘SepalWidth’ produces the least. So, if we were to iteratively drop variables from the clustering, we would do away with the ‘SepalWidth’ variable first, and so on, as sketched below.
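For illustration (this step is not in the original post; it assumes the exported columns keep their SAS variable names):

# Re-run k-means without the weakest variable and check the overall R-square
sas_clus2 <- kmeans(x = sas_iris[, c("SepalLength", "PetalLength", "PetalWidth")],
                    centers = 3, iter.max = 100)
sas_clus2$betweenss / sas_clus2$totss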

The cluster means are already stored in the k-means result object. But if we want to generate the means and standard deviations of each cluster ourselves, we can do it programmatically:


# Cluster means and standard deviations 
# SD 
sapply(X = sas_iris[,2:5],FUN = tapply,sas_iris$classified,sd)
# Mean
sapply(X = sas_iris[,2:5],FUN = tapply,sas_iris$classified,mean)
# Mean is same as cluster centers
sas_clus$centers

to produce results like this:


The pseudo F statistic (the ratio of the between-cluster mean square to the within-cluster mean square) can also be generated programmatically using the formula:


k <- length(sas_clus$size)  # number of clusters
n <- sum(sas_clus$size)     # number of observations
pseudo_F <- (sas_clus$betweenss / (k - 1)) / (sas_clus$tot.withinss / (n - k))

and the output would be:


> pseudo_F
[1] 561.6278

As for the cubic clustering criterion (CCC) and the approximate overall R-square, the formulae behind the SAS output seem to be a closely guarded secret: they are not exactly available on the internet, nor exactly reproducible in R. However, to get a CCC, we could use the package NbClust:


library(NbClust)
NbClust(data = sas_iris[,2:5],min.nc = 3,max.nc = 3,method = 'kmeans',index = "ccc")


$All.index
  nc.Ward index.CCC 
  3.00000  37.67012
The value of 37.67 is completely different from what the SAS output reports. However, there are a lot of arguments about which one is correct; refer to the link here for more details. For now, let us just make peace with whatever NbClust provides and move forward.

Visualization of k-means cluster results
An area where R has a definite edge over SAS is the visualization of results. Although ODS has improved the plotting options in SAS, R is way ahead when it comes to creating colorful plots. So, once we have a cluster solution, we can use R’s powerful visualization features to create pretty plots. The reference used for this section of the post can be found here

Using the code from the above article, we can create pretty plots of the IRIS clustering results like this:


As we can see, these plots use principal components decomposition to generate a 2-D plot of the 3 clusters.
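For reference, a common way to produce this kind of PCA-based cluster plot is clusplot() from the cluster package; a minimal sketch (the referenced article may differ in its details):

# Project the 4 measurement variables onto the first two principal
# components and draw the 3 clusters
library(cluster)
clusplot(sas_iris[, 2:5], sas_clus$cluster,
         color = TRUE, shade = TRUE, labels = 2, lines = 0)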
A pair-wise plot can be created to confirm the strength of each variable, like this:


# Color each point by its assigned cluster
pairs(x = sas_iris[,2:5], col = c(1:3)[sas_iris$classified])


The output confirms that PetalLength has very high separating power: the cluster corresponding to the species ‘Setosa’ (colored green) has PetalLengths between 10 and 20 mm, while Versicolor and Virginica (colored black and red respectively) have lengths from 30 to 70 mm.

Notes:
1. Although the results turn out to be similar in both software packages in this example, there might be cases where it is impossible to match the results, as the internal implementations are quite different. In some cases, even the cluster sizes will be a lot different when you run multiple iterations.
2. A major factor that influences the results is the scaling of variables. It is always recommended to have variables on the same scale in order to arrive at optimal results.
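For instance, variables can be standardized in R with scale() before clustering; a minimal illustration (not part of the original post):

# Standardize each variable to mean 0 and standard deviation 1
scaled_iris <- scale(sas_iris[, 2:5])
kmeans(x = scaled_iris, centers = 3, iter.max = 100)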