What are Concordant, Discordant and Tied Pairs for model validation?
A friend who interviewed with Amazon for a data-related position was asked this question. Here is a very clear explanation of it.
http://www.listendata.com/2014/08/modeling-tips-calculating-concordant.html
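The basic idea: put the observations whose actual value is 1 in one group (a of them) and the observations whose actual value is 0 in another group (b of them), then take the Cartesian product to get a x b pairs. For each pair, compare the predicted value of the observation that should be 1 with the predicted value of the observation that should be 0. If the one that should be 1 has the larger prediction, the pair is a concordant pair; if the one that should be 0 has the larger prediction, it is a discordant pair; if the two predictions are equal, it is a tied pair.
What a good model looks like: more concordant pairs, fewer discordant and tied pairs.
As a rule of thumb, it is good for concordant pairs to make up more than 80% of all pairs.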
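The pairing described above can be sketched in a few lines of Python. This is only an illustration: the function name `count_pairs` and the toy labels and scores are made up for this example and do not come from the cited article.

```python
from itertools import product

def count_pairs(actual, predicted):
    """Count concordant, discordant and tied pairs (hypothetical helper)."""
    # Group predicted scores by the actual outcome.
    events = [p for a, p in zip(actual, predicted) if a == 1]
    non_events = [p for a, p in zip(actual, predicted) if a == 0]

    concordant = discordant = tied = 0
    # Cartesian product: every event is compared with every non-event.
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            concordant += 1
        elif p_event < p_non_event:
            discordant += 1
        else:
            tied += 1
    return concordant, discordant, tied

# Toy data (assumed): 2 events and 3 non-events give 2 * 3 = 6 pairs.
actual = [1, 1, 0, 0, 0]
predicted = [0.9, 0.4, 0.4, 0.2, 0.7]
print(count_pairs(actual, predicted))  # (4, 1, 1)
```

The nested comparison is O(a x b), which is fine for small samples but gets slow when both groups are large.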
Citation:
“
Steps to calculate concordance / discordance and AUC
- Calculate the predicted probability in logistic regression model.
- Divide the data into two datasets. One dataset contains observations having actual value of dependent variable with value 1 (i.e. event) and corresponding predicted probability values. And the other dataset contains observations having actual value of dependent variable 0 (non-event) against their predicted probability scores.
- Compare each predicted value in first dataset with each predicted value in second dataset.
Total Number of pairs to compare = x * y
x: Number of observations in first dataset (actual values of 1 in dependent variable)
y: Number of observations in second dataset (actual values of 0 in dependent variable).
In this step, we are performing cartesian product (cross join) of events and non-events. For example, you have 100 events and 1000 non-events. It would create 100k (100*1000) pairs for comparison.
- A pair is concordant if 1 (observation with the desired outcome i.e. event) has a higher predicted probability than 0 (observation without the outcome i.e. non-event).
- A pair is discordant if 0 (observation without the desired outcome i.e. non-event) has a higher predicted probability than 1 (observation with the outcome i.e. event).
- A pair is tied if 1 (observation with the desired outcome i.e. event) has the same predicted probability as 0 (observation without the outcome i.e. non-event).
- The final percent values are calculated using the formula below –
Percent Concordant = (Number of concordant pairs)/Total number of pairs
Percent Discordant = (Number of discordant pairs)/Total number of pairs
Percent Tied = (Number of tied pairs)/Total number of pairs
Area under curve (c statistics) = Percent Concordant + 0.5 * Percent Tied
In general, higher percentages of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model.
”
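To make the formulas in the quoted steps concrete, here is a minimal sketch that turns the pair counts into percentages and the c statistic. The name `concordance_summary` and the sample data are assumptions for illustration; in practice the scores would be the predicted probabilities from a fitted logistic regression.

```python
def concordance_summary(actual, predicted):
    """Percent concordant / discordant / tied and AUC (hypothetical helper)."""
    events = [p for a, p in zip(actual, predicted) if a == 1]
    non_events = [p for a, p in zip(actual, predicted) if a == 0]

    total = len(events) * len(non_events)  # x * y pairs
    if total == 0:
        raise ValueError("need at least one event and one non-event")

    concordant = sum(e > n for e in events for n in non_events)
    discordant = sum(e < n for e in events for n in non_events)
    tied = total - concordant - discordant

    pct_concordant = concordant / total
    pct_discordant = discordant / total
    pct_tied = tied / total
    return {
        "percent_concordant": pct_concordant,
        "percent_discordant": pct_discordant,
        "percent_tied": pct_tied,
        # c statistic = Percent Concordant + 0.5 * Percent Tied
        "auc": pct_concordant + 0.5 * pct_tied,
    }

# Toy data (assumed): 4 concordant, 1 discordant, 1 tied out of 6 pairs.
summary = concordance_summary([1, 1, 0, 0, 0], [0.9, 0.4, 0.4, 0.2, 0.7])
print(summary)
```

On this toy data the AUC works out to 4/6 + 0.5 * 1/6 = 0.75. The three percentages always sum to 1, which is a handy sanity check.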