Interview: Concordant, Discordant and Tied Pairs for Model Validation

What are Concordant, Discordant and Tied Pairs for model validation?

A friend who interviewed with Amazon for a data-related position was asked this question. Here is a very clear explanation:

http://www.listendata.com/2014/08/modeling-tips-calculating-concordant.html

 

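The basic idea is to put the observations whose actual value is 1 in one group (a observations) and those whose actual value is 0 in another group (b observations), then take the Cartesian product of the two groups to get a × b pairs. For each pair, compare the predicted value of the actual 1 with the predicted value of the actual 0. If the actual 1 has the higher predicted value, the pair is a concordant pair; if the actual 0 has the higher predicted value, it is a discordant pair; if the two values are equal, it is a tied pair.

What a good model looks like: more concordant pairs, and fewer discordant and tied pairs.

As a rule of thumb, it is good for concordant pairs to make up more than 80% of all pairs.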

 

Citation:

Steps to calculate concordance / discordance and AUC

  1. Calculate the predicted probabilities from the logistic regression model.
  2. Divide the data into two datasets. The first contains the observations whose actual value of the dependent variable is 1 (i.e. events), along with their predicted probabilities. The second contains the observations whose actual value is 0 (i.e. non-events), along with their predicted probabilities.
  3. Compare each predicted value in the first dataset with each predicted value in the second dataset.

Total Number of pairs to compare = x * y
x: Number of observations in the first dataset (actual value of 1 in the dependent variable)
y: Number of observations in the second dataset (actual value of 0 in the dependent variable)

In this step, we are performing a Cartesian product (cross join) of events and non-events. For example, if you have 100 events and 1000 non-events, this creates 100k (100*1000) pairs for comparison.
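To make the cross join concrete, here is a minimal sketch in Python using only the standard library. The function name `cross_join` and the score values are made up for illustration; they are not from the cited article.

```python
from itertools import product


def cross_join(event_scores, non_event_scores):
    """Pair every event score with every non-event score (Cartesian product)."""
    return list(product(event_scores, non_event_scores))


# Hypothetical predicted probabilities: 100 events and 1000 non-events,
# matching the sizes used in the example above.
event_scores = [0.5 + 0.004 * i for i in range(100)]
non_event_scores = [0.1 + 0.0005 * i for i in range(1000)]

pairs = cross_join(event_scores, non_event_scores)
print(len(pairs))  # 100 * 1000 pairs
```

Each element of `pairs` is a tuple `(event_score, non_event_score)`. Materializing every pair is fine at this size, but the number of pairs grows as x * y, so for large datasets it is usually better to count while iterating than to store the full list.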

  4. A pair is concordant if the 1 (observation with the desired outcome, i.e. event) has a higher predicted probability than the 0 (observation without the outcome, i.e. non-event).
  5. A pair is discordant if the 0 (observation without the desired outcome, i.e. non-event) has a higher predicted probability than the 1 (observation with the outcome, i.e. event).
  6. A pair is tied if the 1 (observation with the desired outcome, i.e. event) has the same predicted probability as the 0 (observation without the outcome, i.e. non-event).
  7. The final percent values are calculated using the formulas below:

Percent Concordant = (Number of concordant pairs) / Total number of pairs
Percent Discordant = (Number of discordant pairs) / Total number of pairs
Percent Tied = (Number of tied pairs) / Total number of pairs
Area under curve (c-statistic) = Percent Concordant + 0.5 * Percent Tied
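The whole calculation can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not the cited article's code: the function name `concordance_summary`, the dictionary keys, and the toy labels and scores are all hypothetical.

```python
def concordance_summary(y_true, y_score):
    """Count concordant, discordant and tied pairs and derive the percentages.

    y_true  : actual values (1 = event, 0 = non-event)
    y_score : predicted probabilities, in the same order as y_true
    """
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]

    total = len(events) * len(non_events)
    if total == 0:
        raise ValueError("need at least one event and one non-event")

    concordant = discordant = tied = 0
    for e in events:              # every event ...
        for n in non_events:      # ... against every non-event
            if e > n:
                concordant += 1
            elif e < n:
                discordant += 1
            else:
                tied += 1

    pct_concordant = concordant / total
    pct_tied = tied / total
    return {
        "concordant": concordant,
        "discordant": discordant,
        "tied": tied,
        "pct_concordant": pct_concordant,
        "pct_discordant": discordant / total,
        "pct_tied": pct_tied,
        "auc": pct_concordant + 0.5 * pct_tied,
    }


# Toy data: 3 events and 4 non-events, so 12 pairs in total.
y_true = [1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.4, 0.7, 0.2, 0.1]

result = concordance_summary(y_true, y_score)
print(result["concordant"], result["discordant"], result["tied"])  # 9 2 1
print(result["auc"])
```

For the toy data, 9 of the 12 pairs are concordant, 2 are discordant and 1 is tied, so Percent Concordant is 75% and the c-statistic is 0.75 + 0.5 * (1/12), roughly 0.79. The nested loop compares x * y pairs, which is easy to read but slow on large datasets.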

In general, higher percentages of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model.
