Tuesday, May 26, 2015

Replicating PROC FASTCLUS in R using kmeans

A lot of businesses have bought into the idea of making decisions driven by data, and R is one of the foremost statistical tools helping executives take those ‘data-driven’ decisions. A flip side of being data-driven is that you get accustomed to looking at a certain type of data in a particular format, and you balk at even a slight deviation from this standard. Hence, even when far better methods are available to solve a problem, data scientists must usually subscribe to what is widely accepted in the industry.

It is in this context that we explore the classical analytics problem of clustering. There are newer ‘state of the art’ machine learning techniques in this space, such as trees, random forests, Bayesian networks and fuzzy logic, and R has an implementation for each of these. However, the one which still finds wide use in the industry is the ‘k-means’ clustering solution and its proprietary implementation in SAS: PROC FASTCLUS. In this post, we discuss how to replicate the FASTCLUS procedure in R. This post covers the following topics:

- Kmeans clustering in R
- Replicating SAS PROC FASTCLUS in R
- Statistics for variables, Total STD, Within STD, R-Square and RSQ/(1-RSQ) ratios in R kmeans
- Visualization of k-means cluster results

K Means clustering in R
R implements the k-means solution using the function kmeans. At the most basic level, a k-means algorithm is a minimization problem: it tries to partition ‘n’ observations into ‘k’ clusters such that the within-cluster sum of squares is minimal. It might not be the most efficient way to cluster data when you know nothing about the data, but if you suspect there is a finite number of segments or fixed patterns in the data, you can use k-means to validate this intuition. Usually, a k-means solution is run after a more general hierarchical clustering technique. You can read more about the details of the algorithm, its drawbacks and overall efficiency here.

If you want to run k-means clustering on the famous Fisher’s Iris data on R, you just have to use the command:

kmeans(x = iris[-5],centers = 3,iter.max = 100)

and right away, you’ll have the following output on the console:


Note: You might see a different starting solution because k-means is an optimizing algorithm that is highly dependent on the initial seeds. Run the command two or three times and you will likely converge to the same solution. You could also use ‘set.seed(XX)’ to always get the solution with 50, 62 and 38 observations in the three clusters.

The ‘Cluster means’ section gives the mean of each variable within each cluster. The clustering vector tells us the cluster to which every observation belongs. The ratio (between_SS / total_SS = 88.4%) tells us that 88.4% of the variance lies between the clusters and only 100 - 88.4 = 11.6% within them. This indicates that the clusters are fairly tightly packed, a desired objective of k-means clustering. The higher the between_SS/total_SS ratio (also known as the overall R-square), the better the cluster solution.
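This ratio can also be reproduced by hand from the components of the kmeans result object, which is a useful sanity check (a minimal sketch; the seed value 42 is an arbitrary choice and your cluster sizes may differ with other seeds):

```r
# Recompute the between_SS/total_SS ratio from the kmeans result object
set.seed(42)
km <- kmeans(x = iris[-5], centers = 3, iter.max = 100)

# totss = betweenss + tot.withinss, so the overall R-square is simply:
overall_rsq <- km$betweenss / km$totss
round(overall_rsq, 3)   # about 0.884 when the 50/62/38 solution is found
```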

Replicating SAS PROC FASTCLUS in R
Due to its dominance in business circles, the ODS output of SAS is something that most people are accustomed to looking at. If you ran PROC FASTCLUS in SAS on the same famous IRIS data, this is how you would do it:

proc fastclus data=sashelp.iris maxc=3 maxiter=10 out=clus;
var SepalLength SepalWidth PetalLength PetalWidth;
run;
And this is what the output would look like:



The first thing you notice is that there is a lot more output compared to R, and it might be a bit overwhelming! But if you look closely, the crux of the output remains that the final solution of 3 clusters has 38, 50 and 62 observations and the overall R-square value is 88.4%, both of which were already reported in R. However, the statistics for variables, the pseudo-F, the CCC, and the cluster means/standard deviations are some of the additional outputs which SAS presents in a nice format and which businesses are used to looking at.

Although it is not possible to replicate the SAS result 100%, because the initial seeds chosen by SAS and R vary considerably, there are statistics, like the RSQ/(1-RSQ) ratio per variable and the pseudo-F, which would definitely enhance the R output and help decide which variables perform better at separating the observations. I have searched for ready-made packages that could help, but there seem to be none as of now. However, there is some help in the materials presented here and here, which are the basis for the code below:

In order to replicate the exact output of SAS FASTCLUS, we first export SAS’s IRIS data (which has the variables SepalLength, SepalWidth, PetalLength and PetalWidth in mm, as opposed to the iris data in R’s datasets package, which has them in cm), import it into R, and then run the k-means:


sas_iris<-read.csv(file = 'sas_iris.csv',header = T,sep = ',',fill = T)
sas_clus<-kmeans(x = sas_iris[-1],centers = 3,iter.max = 100)
sas_clus

would produce the following output as we’ve seen before:


We see that although the absolute values for the within_SS have changed (due to change in variable scale), the overall R-SQ value still remains at 88.4% with the observations grouped into 3 clusters of 62, 38 and 50 as before.

Statistics for variables and RSQ/(1-RSQ) ratio
We now know that the variables SepalLength, SepalWidth, PetalLength and PetalWidth can together create a great deal of separation between the species. But is there a way to know statistically which of these variables gives the highest degree of separation and which the least? This is where the table of statistics for variables comes in, something the R output seems to miss. According to the post here, a good way to find out would be to run a simple linear regression of each variable against the classified cluster and use the adjusted R-square as a proxy for the strength of the variable:

Something like this:

sas_iris$classified<-sas_clus$cluster
summary(lm(formula = SepalLength~classified,data = sas_iris))
summary(lm(formula = SepalWidth~classified,data = sas_iris))
summary(lm(formula = PetalLength~classified,data = sas_iris))
summary(lm(formula = PetalWidth~classified,data = sas_iris))

Although this gives an R-square per variable, the variation is just too high to make any inference with respect to the strength of the variable.

An almost exact replica of the SAS output table can be generated by computing the Total Standard Deviation (Total STD) and the pooled within-cluster Standard Deviation (Within STD), and then using these in the formula

Variable RSquare = 1 - (Within STD/Total STD)^2

along with the RSQ/(1-RSQ) ratio. Although the exact SAS output up to 4 decimal places cannot be reproduced, because the exact formulae used by SAS are not published anywhere on the internet, the ones used here come close to the actual numbers and help us judge the strength of the individual variables.


sas_iris$classified<-sas_clus$cluster
variable.stats<-function(v,classified){
  # Total standard deviation of the variable
  tot<-sd(v)
  # Pooled within-cluster standard deviation:
  # sum of within-cluster sums of squares divided by (n - number of clusters)
  k<-length(unique(classified))
  wth<-sqrt(sum(tapply(v,classified,FUN=function(x){sum((x-mean(x))^2)}))/(length(v)-k))
  RSq<-1-(wth/tot)^2
  Ratio<-RSq/(1-RSq)
  c(tot,wth,RSq,Ratio)
}


vapply(X = sas_iris[,2:5],FUN = variable.stats, FUN.VALUE = c(Tot.STD=0,Within.STD=0,RSQ=0,RSQRatio=0),
       classified=sas_iris$classified)
And this gives the results in a tabular format, similar to SAS:


Looking at this, we can now make out that the variable ‘PetalLength’ produces the highest degree of separation, while ‘SepalWidth’ produces the least. So, if we were to iteratively drop variables from the clustering, we would do away with the ‘SepalWidth’ variable first, and so on.
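As a quick illustration of this variable-dropping idea, we could re-run the k-means without ‘SepalWidth’ and compare the overall R-square before and after (a sketch on the iris data from R’s datasets package; the sas_iris frame would work the same way with its own column names):

```r
set.seed(42)
full_fit    <- kmeans(iris[, 1:4], centers = 3, iter.max = 100)
reduced_fit <- kmeans(iris[, c("Sepal.Length", "Petal.Length", "Petal.Width")],
                      centers = 3, iter.max = 100)

# If SepalWidth contributes little separation, the overall R-square
# should drop only marginally once it is removed
full_fit$betweenss / full_fit$totss
reduced_fit$betweenss / reduced_fit$totss
```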

The cluster means are already stored in the k-means result object. But if we want to generate the means and standard deviations of each cluster ourselves, we can do it programmatically:


# Cluster means and standard deviations 
# SD 
sapply(X = sas_iris[,2:5],FUN = tapply,sas_iris$classified,sd)
# Mean
sapply(X = sas_iris[,2:5],FUN = tapply,sas_iris$classified,mean)
# Mean is same as cluster centers
sas_clus$centers

to produce the result like this:


The Pseudo-F statistic can also be generated programmatically by using the formula:


pseudo_F <- (sas_clus$betweenss/(length(sas_clus$size)-1))/(sas_clus$tot.withinss/(sum(sas_clus$size)-length(sas_clus$size)))

and the output would be:


> pseudo_F
[1] 561.6278

As for the cubic clustering criterion (CCC) and the approximate overall R-square, the formulae behind the numbers displayed by SAS are hard to pin down, so they are neither well documented on the internet nor exactly reproducible in R. However, to get the CCC, we could use the package NbClust:


library(NbClust)
NbClust(data = sas_iris[,2:5],min.nc = 3,max.nc = 3,method = 'kmeans',index = "ccc")


$All.index
  nc.Ward index.CCC 
  3.00000  37.67012
The reported value of 37.67 is completely different from what the SAS output shows. There are, however, plenty of arguments about which one is correct; refer to the link here for more details. For now, let us make peace with whatever NbClust provides and move forward.

Visualization of k-means cluster results
An area where R has a definite edge over SAS is the visualization of the results. Although ODS has improved the plotting options in SAS, R is way ahead when it comes to creating colorful plots. So, once we have a cluster solution, we can use the powerful visualization features of R to create pretty plots. The reference used for this section of the post can be found here.

Using the code as mentioned in the above article, we can create pretty plots for the IRIS data results like this:


As we can see, these plots use a principal components decomposition to generate a 2-D plot of the 3 clusters.
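If you do not want to depend on an external package, a bare-bones version of such a plot can be sketched with the base prcomp function (assuming the sas_iris data frame and sas_clus result from earlier; the colours are arbitrary):

```r
# Project the four measurement variables onto the first two principal components
pca <- prcomp(sas_iris[, 2:5], scale. = TRUE)

# Plot the observations in PC space, coloured by assigned cluster
plot(pca$x[, 1:2], col = sas_clus$cluster, pch = 19,
     main = "k-means clusters on the first two principal components")

# Overlay the cluster centres projected into the same PC space
points(predict(pca, newdata = as.data.frame(sas_clus$centers))[, 1:2],
       pch = 8, cex = 2)
```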
A pair-wise plot can also be created to confirm the strength of each variable:


pairs(x = sas_iris[,2:5], col=c(1:3)[sas_iris$classified])


The output confirms that PetalLength has very high separating power: the species ‘Setosa’ (colored green) has petal lengths between 10 and 20 mm, while Versicolor and Virginica (colored black and red respectively) have lengths from 30 to 70 mm.

Notes:
1. Although the results turn out to be similar in both software packages in this example, there might be cases where it is impossible to match the results, as the internal implementations differ considerably. In some cases, even the cluster sizes will be a lot different, even across multiple runs.
2. A major factor that influences the results is the scaling of variables. It is always recommended to have variables on the same scale in order to arrive at optimal results.

Tuesday, February 18, 2014

Binary Logistic regression: Fast Concordance

This is a follow-up to an earlier article on concordance in binary logistic regression; you can find the original article here. In that post, I compared 2-3 different ways of computing concordance, discordance and ties when running a binary logistic regression model in R, and the conclusion was that OptimisedConc was an accurate yet fast way to get to concordance in R. In this post we cover the following topics:
- Function for Fast and accurate Concordance in logit models using R
- Comparison of the fastConc function against other methods

My analyst friend wrote to me and complained that even the optimized function was not so optimized when it came to large datasets! It seems the data frame he used had more than a million observations, and the function kept failing due to memory issues. It immediately occurred to me that the culprits were the huge matrices created in the function: it creates 3 matrices (initialized with zeroes), each of size (number of ones) * (number of zeros). So, if you had half a million ones and half a million zeroes in the dataset, you would need three matrices of (0.5M * 0.5M) each, even before the actual calculations in the ‘for’ loop began.

As we sat and discussed it, we knew that if we were to use this function on real data, the matrix allocations and the dual for-loops had to be optimized somehow. And being the geek that he is, my friend suggested an approach to reduce the number of ‘for’ loops from two to one. The function below, which I have called fastConc, reduces the number of ‘for’ loops to one and uses native subsetting inside the loop to count the concordant and discordant pairs. It is one of the fastest functions that gives you exact concordance values, and on the performance side it compares well even against the GitHub code, which only gives approximate concordance values:

###########################################################
# Function fastConc : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# It uses optimisation through subsetting
###########################################################
fastConc<-function(model){
  # Get all actual observations and their fitted values into a frame
  fitted<-data.frame(cbind(model$y,model$fitted.values))
  colnames(fitted)<-c('respvar','score')
  # Subset only ones
  ones<-fitted[fitted[,1]==1,]
  # Subset only zeros
  zeros<-fitted[fitted[,1]==0,]
  
  # Initialise all the values
  pairs_tested<-nrow(ones)*nrow(zeros)
  conc<-0
  disc<-0
    
  # Get the values in a for-loop
  for(i in 1:nrow(ones))
  {
    conc<-conc + sum(ones[i,"score"]>zeros[,"score"])
    disc<-disc + sum(ones[i,"score"]<zeros[,"score"])
  }
  # Calculate concordance, discordance and ties
  concordance<-conc/pairs_tested
  discordance<-disc/pairs_tested
  ties_perc<-(1-concordance-discordance)
  return(list("Concordance"=concordance,
              "Discordance"=discordance,
              "Tied"=ties_perc,
              "Pairs"=pairs_tested))
}

The output of the function is identical to that of the OptimisedConc function, and it returns the Concordance, Discordance, Ties, etc. as ratios rather than percentages, which can easily be changed.

Performance of the function
Intuitively, the function fastConc() seems to do better on memory than OptimisedConc simply because it stores the concordance and discordance values in count variables rather than in big matrices. So, how do all these functions compare on time? To check, I used a dataset with 20,000 observations which had 2,000 ones and 18,000 zeros (a very low response model, you might say). There would be a total of 2,000 * 18,000 = 36,000,000 pairs to be tested. And these are the results of the functions:

> system.time(bruteforce(logit_mod))
   user  system elapsed 
4291.10    6.12 4479.85 

> system.time(OptimisedConc(logit_mod))
   user  system elapsed 
 221.98    0.45  223.69 
 
> system.time(fastConc(logit_mod))
   user  system elapsed 
   0.69    0.00    0.69 

As can be seen, bruteforce() took more than an hour to give me the concordance results! I had almost given up when the system.time() function finally returned. OptimisedConc does a lot better in terms of time (under 4 minutes), but it is pathetic in terms of memory utilization! fastConc() gives the same result within a second, thanks to the native functions being used, and it consumes negligible memory.
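For anyone who wants to reproduce this comparison, a low-response dataset of the same shape can be simulated along these lines (a sketch; the intercept of -2.2 is just a convenient value that yields roughly 10% ones, and the exact timings will of course vary by machine):

```r
set.seed(1)
n <- 20000

# One predictor and a rare binary response (roughly 10% ones)
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-2.2 + 0.5 * x))

logit_mod <- glm(y ~ x, family = 'binomial')
system.time(fastConc(logit_mod))   # should return well within a second
```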

So, the verdict is clear. There will be many situations like this, where multiple approaches seem to work and produce the same output. However, it is always best to choose the one which uses native functions for its implementation rather than data-heavy or user-defined alternatives. If you ran logistic regression in other tools like SAS, you would not even worry about these functions, because they are already implemented using ready-made native routines, and hence they tend to be really optimised! As for concordance in R, fastConc() is now my go-to function every time I run a glm() model, because of its sheer efficiency. If you have been in a situation where you used non-native functions to accomplish a task, let me know in the comments. I’ll be back with more posts soon. Till then, take care!

Thursday, January 30, 2014

Excel style VLookup and RangeLookup in R

A friend of mine, also an R enthusiast, came to me with a task he was doing as part of a larger activity. The task seemed quite simple: assigning a bin value to each row of a dataset based on the information in a lookup table which describes the bins:

The data table (a large table of about 20,000+ rows) contains a variable called ‘indep1’, whose values range from -30 to 280. The information on the bins is contained in the lookup table, and the bin numbers are in increasing order of ‘min_value’/’max_value’. The required output would be something like this:


An extra column in the data table indicating which bin indep1 belongs to. Just to take care of the details: in case the value of the variable falls exactly on a border (say row number 4 here), it should go into the higher bin (bin 9 instead of bin 8).

Seemed like a simple but interesting puzzle to solve in R. This post covers the following topics:
- Excel style Vlookup in R
- Range lookup in R similar to Vlookup in Excel
- Comparison among all the lookup functions

Lookup on R
Due to paucity of time, the first solution the friend had tried was the non-algorithmic, brute-force approach: iterating through all the bins (1 to 10) for all 20,000 rows of data and assigning the bin numbers to the data table. Something like this:

### Brute force method
full_iterate_way<-function(data,lookup){
  data$bin_num<-0
  for(j in 1:nrow(lookup)){
    minval <- lookup[j, "min_value"]
    maxval <- lookup[j, "max_value"]
    label <- lookup[j, "bin_num"]
    
    for(k in 1:nrow(data)){
      if(data[k, "indep1"] >= minval & data[k, "indep1"] < maxval){
        data[k, "bin_num"] <- label
        }
    }
  }
  data
}

data_full<-full_iterate_way(data=data_table,lookup=lookup_table)

The function iterates through all 20,000 rows of the data table 10 times to assign the bin values. It does what is needed. However, any programmer who looked at the code would immediately have apprehensions about its performance, given the two for-loops. And thus began the programmer’s quest for faster alternatives that do the same thing.

If you had the data in Excel, you would immediately know that this can be achieved using the range-lookup mode of the powerful VLOOKUP function, as the lookup table is anyway in increasing order of bins. What’s more, if you needed a column other than bin_num (say bin_weight) on the data_table, it would be a matter of just changing argument 3 of VLOOKUP to get the desired column. So, the first improvement over brute force would be to replicate VLOOKUP’s range lookup (instead of the exact lookup) in R. Something like this:

rngLookup<-function(value, dataFrame,column){
  retVal<-dataFrame[value<dataFrame[,"min_value"],column][1]-1
  if(is.na(retVal)){retVal<-nrow(dataFrame)}
  retVal
}

lookup_way<-function(data,lookup){
  for(i in 1:nrow(data)){
    data$bin_num[i]<-rngLookup(data[i,2],dataFrame=lookup,column=2)
  }
  data
}

data_lookedup<-lookup_way(data_table,lookup=lookup_table)

Since we now have a function to do the lookup, we can call it for every row of the data frame, eliminating one ‘for’ loop. ‘data_lookedup’ now contains the same information as ‘data_full’. Just for the record, replacing the ‘<’ sign in the lookup function with ‘==’ gives you the exact VLOOKUP behaviour of Excel in R.
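For completeness, that exact-match flavour of VLOOKUP can also be written without any loop at all, using the native match() function (a sketch with hypothetical key/value column names; adapt them to your own lookup table):

```r
# Exact VLOOKUP: return 'return_col' from 'table' where 'key_col' equals the value
exact_vlookup <- function(values, table, key_col, return_col) {
  table[match(values, table[, key_col]), return_col]   # NA when no exact match
}

# e.g. pull bin weights by bin number (assuming these columns exist):
# data_table$bin_weight <- exact_vlookup(data_table$bin_num, lookup_table,
#                                        "bin_num", "bin_weight")
```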

Although performance improved slightly with the lookup function, this is still not an optimal way of going about things in R, mainly because we are still stuck in the looping paradigm instead of using the powerful vectorization and subsetting capabilities that R offers. So we explore further and arrive at the next way of doing the same thing: the ubiquitous and powerful SQL:

library(sqldf)
sql_way<-function(data,lookup){
  data<-sqldf("select A.*,B.bin_num from
              data A left join lookup B 
              ON (A.indep1 >= B.min_value and A.indep1 < B.max_value)")
  data
}

The library sqldf allows SQL queries directly on R data frames, and this is one of the most elegant solutions I have come across for a range lookup in R. The performance improvement is substantial, as can be seen in the summary in the performance section below. However, the one thing which made me explore further was that I was not convinced that R, acclaimed as one of the best languages for statistical analysis, had no native function for this simple task. And then I stumbled upon this:

How could I miss something as trivial and intuitive as this in the first place? It was the perfect answer to the question we had asked, and it is as native as R itself. So here is the simple code which does what we had been trying to achieve all along, and the one I would prefer when the bin numbers start from 1 and go up to 10, as in the example above:

find_interval<-function(data,lookup){
  data$label<-findInterval(x=data$indep1,vec=lookup$min_value)
  data
}
data_interval<-find_interval(data=data_table,lookup_table)

Comparison of the lookup functions
We now have 4 functions which do the exact same thing, and just by looking at them, we can assume the latter two to be more elegant than the earlier ones. However, let us also use the system.time function to see how each performs when run on a data frame of 25,000 rows:


As expected, the brute force method is the slowest owing to the double for-loops, and the lookup decreases the time only on a linear scale. The SQL and the native findInterval win hands down, bringing down the time for the same task by orders of magnitude. The dual-for-loop brute force approach took 37 seconds, while the lookup reduced it only to 25 seconds, a fractional improvement. The SQL got it down to 0.2 seconds, and findInterval did it in almost no time at all! The small overhead of the SQL compared to findInterval is likely due to the Cartesian-product join it needs to perform.

Concluding remarks
1. Although all the functions above achieve the same result, there could be slight differences. Some of them might produce unexpected results when the variable value is at the extreme ends of the lookup table. Say the variable value is 280 (the highest value): the brute force approach gives a bin_num of 0 (due to initialization) and the SQL method gives NA (because the join conditions do not match). findInterval, however, has no such problem, because it only compares against the bin lower bounds, and the 10th bin is simply anything greater than 230.
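The border behaviour of findInterval is easy to verify on a toy vector of bin lower bounds (a small sketch, independent of the actual lookup_table):

```r
# Toy bins starting at -30, 0, 100 and 230
breaks <- c(-30, 0, 100, 230)

findInterval(x = c(-40, -30, 50, 230, 280), vec = breaks)
# returns 0 1 2 4 4: a value below the first bin gives 0, the border
# value 230 falls into the higher bin, and 280 lands in the open last bin
```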

2. findInterval is not a complete lookup, because it can only return numbers from 0/1 up to the number of bins. Suppose, in the above example, we also wanted the ‘bin_weight’ variable along with ‘bin_num’ for all rows of indep1: findInterval alone could not achieve that, whereas the SQL method has no such problem. Suppose we wanted the desired output to include even the bin_weight column:


We could tweak the find_interval code to achieve this as well:

library(plyr)
find_interval<-function(data,lookup){
  data$bin_num<-findInterval(x=data$indep1,vec=lookup$min_value)
  data<-join(x=data,y=lookup,by="bin_num")[,c(1,2,3,7)]
  data
}
data_interval<-find_interval(data=data_table,lookup_table)

The addition of the join in 'find_interval' makes it almost equivalent to the SQL approach in terms of performance and functionality, and now either of them can be used in place of the earlier brute force approach.

Though this seemed like a very simple exercise, I found a lot of ways to do one particular thing in R while exploring it. The best part about R is that we still cannot conclude whether this is the best way of getting the desired outcome. Purists may go ahead and suggest a merge using ‘data.table’ (which seems to be way faster than the regular merge), the lookup function from the library qdap, some combination of match(), and so on. However, if the code you have does what you expect without being too heavy on resources, you can keep using what works rather than going for the kill on optimization. If you have used any better ways to achieve the same result, you are most welcome to share them. And if you found this post useful, please let me know that as well. I’ll be back writing more on these simple yet thought-provoking exercises. Have fun!

Wednesday, January 15, 2014

Binary logistic Regression on R : Concordance and Discordance

Logistic regression might not be the most trending technique in the analytics industry anymore, but it is still bread and butter for most analytics folks, especially in the marketing decision sciences. Most propensity models, survival analyses, churn measurements, etc. are exclusively driven by this traditional yet powerful statistical technique.

A lot of material is available online to get started with building logistic regression models and getting the model fit criteria satisfied. If you are totally new to building logistic regression models, an excellent place to start would be the UCLA help articles on building binary logit models. Even before the model building stage, certain pre-processing and variable selection procedures should be followed in order to get good results, which will be the subject of a separate post. In this post we cover some of the important model fit measures like concordance and discordance, and other association measures like Somers' D, gamma and Kendall's Tau-a, which compare the predicted responses to the actual responses.

The following questions will be answered during the course of this article:
- Measures for logistic regression: concordance and discordance in R
- Somers' D, gamma, Kendall's Tau-a statistics in R

Concordance and Discordance in R
The most widely used way to run a logit model in R is the glm() function with the ‘binomial’ family. So, if you wanted to run a logistic regression model on the hypothetical dataset (available on the UCLA website here), all you need to do is load the dataset into R and run the binary logit using the following code:

# Clear workspace objects
rm(list=ls())

# Load the modelling dataset into workspace
model_data<-read.csv('binary.csv',header=T,sep=',',fill=T)

# Run a binary logistic regression model
logit_mod<-glm(formula=admit~gre+gpa+rank,
               family='binomial',data=model_data)
# Display the summary
summary(logit_mod)

And this is how the model summary would look:

Since all the coefficients are significant and the residual deviance has reduced compared to the null deviance, we can conclude that we have a fair model. But looking at the model result this way, it is really difficult to say how well the model performs. In OLS regression, the R-squared (and its more refined measure, adjusted R-square) is the ‘one-stop’ metric which immediately tells us whether the model is a good fit, and since it is a value between 0 and 1, we can easily express it as a percentage and pass it off as ‘model accuracy’ for beginners and the not-so-math-oriented businesses. Unfortunately, looking at adjusted R-square is irrelevant in logistic regression, because we model the log odds and it becomes very difficult in terms of explainability.

This is where concordance steps in to help. Concordance tells us the association between actual values and the values fitted by the model, in percentage terms. It is defined as the ratio of the number of 1-0 pairs in which the 1 had a higher model score than the 0, to the total number of 1-0 pairs possible. A higher value for concordance (60-70%) means a better fitted model; however, a very large value (85-95%) could also suggest that the model is over-fitted and needs to be re-aligned to generalize to the entire population.

A straight-forward, non-optimal, brute-force approach to getting to concordance would be to write the following code after building the model:

###########################################################
# Function Bruteforce : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# It uses the brute force method of two for-loops
###########################################################
bruteforce<-function(model){
  # Get all actual observations and their fitted values into a frame
  fitted<-data.frame(cbind(model$y,model$fitted.values))
  colnames(fitted)<-c('respvar','score')
  # Subset only ones
  ones<-fitted[fitted[,1]==1,]
  # Subset only zeros
  zeros<-fitted[fitted[,1]==0,]
  
  # Initialise all the values
  pairs_tested<-0
  conc<-0
  disc<-0
  ties<-0
  
  # Get the values in a for-loop
  for(i in 1:nrow(ones))
    {
    for(j in 1:nrow(zeros))
      {
      pairs_tested<-pairs_tested+1
      if(ones[i,2]>zeros[j,2]) {conc<-conc+1}
      else if(ones[i,2]==zeros[j,2]){ties<-ties+1}
      else {disc<-disc+1}
      }
  }
  # Calculate concordance, discordance and ties
  concordance<-conc/pairs_tested
  discordance<-disc/pairs_tested
  ties_perc<-ties/pairs_tested
  return(list("Concordance"=concordance,
              "Discordance"=discordance,
              "Tied"=ties_perc,
              "Pairs"=pairs_tested))
  }
All this code does is iterate through every 1-0 pair to see if the model score of the ‘1’ is greater than that of the ‘0’, and based on this comparison it classifies the pair as concordant, discordant or tied. The final values for concordance, discordance and ties are expressed as fractions of the total number of pairs tested. When this code is run, we see the following output on the console:

As can be seen, the model reports a concordance of 69.2%, which tells us that the model is fairly accurate.
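Since the list of topics also promised Somers' D, gamma and Kendall's Tau-a, it is worth noting that all three fall straight out of the same pair counts. A sketch building on the list returned by bruteforce() (the helper name assoc_measures is mine; N is the number of observations in the model data):

```r
# Association measures derived from the concordance/discordance ratios
assoc_measures <- function(res, N) {
  P    <- res$Pairs
  conc <- res$Concordance * P    # back out the raw concordant pair count
  disc <- res$Discordance * P    # and the raw discordant pair count
  list(SomersD     = (conc - disc) / P,
       Gamma       = (conc - disc) / (conc + disc),
       KendallTauA = (conc - disc) / (0.5 * N * (N - 1)))
}

# e.g. assoc_measures(bruteforce(logit_mod), N = nrow(model_data))
```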

Although the above code gets the job done, it can be a real burden on system resources because of the two for-loops and the complete absence of optimization. As the modelling dataset grows, using this function can lead to a heavy toll on system resources, long waiting times and sometimes crashing the R process altogether.

Alternatively, the following function, provided by a fellow blogger, Vaibhav, here, can be used; it pre-allocates zero matrices instead of growing objects and gives the same result in less computation time. The code for the same (originally posted at the above link) is:

###########################################################
# Function OptimisedConc : for concordance, discordance, ties
# The function returns Concordance, discordance, and ties
# by taking a glm binomial model result as input.
# Although it still uses two-for loops, it optimises the code
# by creating initial zero matrices
###########################################################
OptimisedConc=function(model)
{
  Data = cbind(model$y, model$fitted.values) 
  ones = Data[Data[,1] == 1,]
  zeros = Data[Data[,1] == 0,]
  conc=matrix(0, dim(zeros)[1], dim(ones)[1])
  disc=matrix(0, dim(zeros)[1], dim(ones)[1])
  ties=matrix(0, dim(zeros)[1], dim(ones)[1])
  for (j in 1:dim(zeros)[1])
  {
    for (i in 1:dim(ones)[1])
    {
      if (ones[i,2]>zeros[j,2])
      {conc[j,i]=1}
      else if (ones[i,2]<zeros[j,2])
      {disc[j,i]=1}
      else if (ones[i,2]==zeros[j,2])
      {ties[j,i]=1}
    }
  }
  Pairs=dim(zeros)[1]*dim(ones)[1]
  PercentConcordance=(sum(conc)/Pairs)*100
  PercentDiscordance=(sum(disc)/Pairs)*100
  PercentTied=(sum(ties)/Pairs)*100
  return(list("Percent Concordance"=PercentConcordance,"Percent Discordance"=PercentDiscordance,"Percent Tied"=PercentTied,"Pairs"=Pairs))
}
This code does the same thing as before, but with matrices pre-initialised with zeroes. The output and the measures for concordance etc. are exactly the same as in the brute-force approach, but the toll on system resources is much lower, because the code works with R's memory model rather than against it. Just for the sake of comparison, let us see the savings by looking at the time taken to execute the two functions. We use the system.time() function to measure this:


The second function does the same thing as the first in only 10% of the time! That is what pre-allocating your result objects, instead of growing them inside a loop, can do in R.
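In fact, the two for-loops can be removed altogether. The sketch below is my own variation, not from the original post: it uses outer() to build the entire comparison matrix in one vectorised call, and assumes the same input as OptimisedConc, i.e. a fitted binomial glm.

```r
# A fully vectorised sketch (assumption: 'model' is a fitted binomial glm,
# the same input as OptimisedConc above). outer() compares every '1' score
# with every '0' score in a single call, with no explicit loops.
VectorisedConc <- function(model) {
  ones  <- model$fitted.values[model$y == 1]
  zeros <- model$fitted.values[model$y == 0]
  conc  <- outer(ones, zeros, ">")    # TRUE where the '1' outscores the '0'
  ties  <- outer(ones, zeros, "==")   # TRUE where the scores are exactly equal
  Pairs <- length(ones) * length(zeros)
  list("Percent Concordance" = 100 * sum(conc) / Pairs,
       "Percent Discordance" = 100 * sum(!conc & !ties) / Pairs,
       "Percent Tied"        = 100 * sum(ties) / Pairs,
       "Pairs"               = Pairs)
}
```

Note that, like OptimisedConc, this still materialises the full ones-by-zeros matrices in memory, so it trades memory for speed rather than eliminating the cost.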

Of course, other functions can be written that approximate the value of concordance instead of calculating it exactly over all the possible 1-0 pairs. One of the most frequently returned results when you search for concordance is the following link on GitHub. That code performs even better than the optimised function above, but the catch is that it is not exact: it approximates the number of 1-0 pairs on the assumption that the data has roughly as many ones as zeroes. If you calculate the concordance of the above model using this function, this is what you get:


The code reports a higher value for concordance (70.8%) than the actual value (69.2%). The estimate could become totally inaccurate if we had sorted the data so that all the top-scoring ones sat at the top of the data set, in which case concordance would reach an unusually high value. On the plus side, the code is very quick and can give an approximate idea of the range in which the actual concordance lies, in under a second. My vote would still be for the OptimisedConc function.
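If an approximation is acceptable, a safer route (a sketch of my own, not the GitHub code discussed above) is to estimate concordance from a random sample of 1-0 pairs; unlike the approximation above, the estimate does not depend on the proportion of ones to zeros or on the sort order of the data.

```r
# Sampling-based estimate of percent concordance (a sketch, not the GitHub
# code discussed above). 'model' is assumed to be a fitted binomial glm;
# n_sample random 1-0 pairs are drawn and compared instead of enumerating
# all length(ones) * length(zeros) pairs.
ApproxConc <- function(model, n_sample = 10000, seed = 1) {
  set.seed(seed)
  ones  <- model$fitted.values[model$y == 1]
  zeros <- model$fitted.values[model$y == 0]
  s1 <- sample(ones,  n_sample, replace = TRUE)  # scores of sampled ones
  s0 <- sample(zeros, n_sample, replace = TRUE)  # scores of sampled zeros
  100 * mean(s1 > s0)   # estimated percent concordance
}
```

The memory cost is fixed at n_sample comparisons regardless of how large the modelling data set is, and the sampling error shrinks as n_sample grows.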

Somers’ D, Gamma and Kendall’s Tau-a statistics in R
Once the total number of pairs, concordant pairs, discordant pairs and tied pairs are obtained, calculating the above statistics is pretty straightforward. Gamma (better known as the Goodman and Kruskal gamma) is a measure of association in a doubly ordered contingency table (refer here for more info). It can be calculated as:

Gamma = (P - Q) / (P + Q)

where P is the number of concordant pairs, Q is the number of discordant pairs and T is the number of tied pairs. Gamma measures how well the model distinguishes concordant pairs from discordant pairs.

Somers’ D is similar to gamma, except that its denominator also counts the tied pairs. So, if there are tied pairs in the model, Somers’ D is smaller than gamma. It can be calculated as:

Somers’ D = (P - Q) / (P + Q + T)

Both gamma and Somers’ D take values between -1 and 1, and higher values indicate a better distinguishing ability for the model.

Kendall’s tau-a is one more measure of association in the model. It can be computed using the following formula:

Tau-a = 2 (P - Q) / (N (N - 1))

where N is the total number of observations in the model. It too lies between -1 and 1; however, for any given model, Kendall’s tau-a will be much smaller than gamma or Somers’ D, because tau-a uses all possible pairs in its denominator while the others use only the 1-0 pairs.
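A small worked example with hypothetical counts (using the standard definitions above) makes the ordering tau-a < Somers' D < gamma concrete, and also shows the well-known link between Somers' D and the c-statistic (AUC), c = 0.5 * (D + 1):

```r
# Hypothetical counts: N = 100 observations with 40 ones and 60 zeros,
# giving 40 * 60 = 2400 usable 1-0 pairs in total.
conc  <- 1600   # concordant pairs (P)
disc  <- 700    # discordant pairs (Q)
ties  <- 100    # tied pairs (T)
Pairs <- 2400   # P + Q + T
N     <- 100    # observations

gamma    <- (conc - disc) / (conc + disc)      # ties excluded       -> ~0.391
Somers_D <- (conc - disc) / Pairs              # ties included       -> 0.375
k_tau_a  <- 2 * (conc - disc) / (N * (N - 1))  # all N(N-1)/2 pairs  -> ~0.182
c_stat   <- 0.5 * (Somers_D + 1)               # c-statistic (AUC)   -> 0.6875
```

As expected, tau-a has the largest denominator and therefore the smallest value, while gamma, which ignores ties entirely, is the largest of the three.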

Once we know these definitions, we can modify the OptimisedConc function above to return these values as well, by adding the following lines just before the return statement:


  PercentConcordance=(sum(conc)/Pairs)*100
  PercentDiscordance=(sum(disc)/Pairs)*100
  PercentTied=(sum(ties)/Pairs)*100
  N<-length(model$y)
  # Gamma excludes tied pairs from the denominator; Somers' D includes them
  gamma<-(sum(conc)-sum(disc))/(sum(conc)+sum(disc))
  Somers_D<-(sum(conc)-sum(disc))/Pairs
  k_tau_a<-2*(sum(conc)-sum(disc))/(N*(N-1))
  return(list("Percent Concordance"=PercentConcordance,
              "Percent Discordance"=PercentDiscordance,
              "Percent Tied"=PercentTied,
              "Pairs"=Pairs,
              "Gamma"=gamma,
              "Somers D"=Somers_D,
              "Kendall's Tau A"=k_tau_a))
And the call to the function would return:


This post covered one of the practical considerations to be taken into account while running predictive models in R. In upcoming posts, I plan to cover some ways the above outputs can be beautified using HTML, along with other practical considerations while modelling in R. If you liked this post or found it useful, you can give me a thumbs up in the comments. I’ll be back with more on these areas of predictive modeling soon. Till then, happy modeling :)

Update: 18 Feb 2014
A follow-up to this article has been published today. Although OptimisedConc works well to save time, it is very poor in terms of memory utilisation. Hence, a better function named 'fastConc' has been written, which makes use of R's native functionality.
You can find the new article and the function on this link.