Measuring Inter-coder Agreement – Why Cohen’s Kappa is not a good choice
Written by: Dr. Susanne Friese
This paper outlines the well-known limitations of Cohen’s kappa as a measure of inter-coder agreement and introduces the family of alpha coefficients developed by Prof. Krippendorff in cooperation with a team of qualitative researchers and IT specialists at ATLAS.ti.
Limitations of Cohen’s Kappa
Not all qualitative researchers agree that inter-coder agreement is important or should be considered at all; after all, its roots lie in quantitative research. See, for example, the article by Daniel Turner.
If you decide that measuring inter-coder agreement is essential for your research, or you need to do it because your Ph.D. advisor asks for it or because a journal where you want to publish requires it, the next question is which measure to apply. You will probably turn to the literature to see what other researchers have used, or look at the textbooks in your field. You will find that many researchers use Cohen’s kappa and that this measure is also recommended in many popular textbooks. If you don’t look beyond that, you go with the flow and probably think that you cannot get it wrong, because this is how it seems to be done. However, there is also plenty of literature pointing out the limitations of Cohen’s kappa. Xie (2013), for instance, explains:
“It is quite puzzling why Cohen’s kappa has been so popular in spite of so much controversy with it. Researchers started to raise issues with Cohen’s kappa more than three decades ago (Kraemer, 1979; Brennan & Prediger, 1981; Maclure & Willett, 1987; Zwick, 1988; Feinstein & Cicchetti, 1990; Cicchetti & Feinstein, 1990; Byrt, Bishop & Carlin, 1993). In a series of two papers, Feinstein & Cicchetti (1990) and Cicchetti & Feinstein (1990) made the following two paradoxes with Cohen’s kappa well-known: (1) A low kappa can occur at a high agreement; and (2) Unbalanced marginal distributions produce higher values of kappa than more balanced marginal distributions. While the two paradoxes are not mentioned in older textbooks (e.g. Agresti, 2002), they are fully introduced as the limitations of kappa in a recent graduate textbook (Oleckno, 2008). On top of the two well-known paradoxes aforementioned, Zhao (2011) describes twelve additional paradoxes with kappa and suggests that Cohen’s kappa is not a general measure for interrater reliability at all but a measure of reliability under special conditions that are rarely held.
Krippendorff (2004) suggests that Cohen’s Kappa is not qualified as a reliability measure in reliability analysis since its definition of chance agreement is derived from association measures because of its assumption of raters’ independence. He argues that in reliability analysis raters should be interchangeable rather than independent and that the definition of chance agreement should be derived from estimated proportions as approximations of the true proportions in the population of reliability data. Krippendorff (2004) mathematically demonstrates that kappa’s expected disagreement is not a function of estimated proportions from sample data but a function of two raters’ individual preferences for the two categories.”
Xie concludes that if Cohen’s kappa “is ever used, it should be reported with other indexes such as percent of positive and negative ratings, prevalence index, bias index, and test of marginal homogeneity.”
Another limitation of Cohen’s kappa is that it can only be used for two coders and that it assumes an infinite sample size (Banerjee et al., 1999; Krippendorff, 2018). In many qualitative research studies, the two-coder limit is not really an issue, but an infinite sample size is a requirement that can never be fulfilled.
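The first paradox quoted above (a low kappa despite high agreement) can be illustrated with a small sketch. The function name `agreement_indexes` and both 2x2 tables are invented for illustration; the formulas are the standard definition of Cohen’s kappa and the prevalence and bias indexes described by Byrt, Bishop & Carlin (1993).

```python
def agreement_indexes(a, b, c, d):
    """Summarise a 2x2 table for two coders and one yes/no code.

    a: both coders applied the code      b: only coder 1 applied it
    c: only coder 2 applied it           d: neither coder applied it
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement as Cohen defines it: built from each coder's own marginals.
    p1_yes, p2_yes = (a + b) / n, (a + c) / n
    expected = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    kappa = (observed - expected) / (1 - expected)
    return {
        "percent_agreement": observed,
        "kappa": kappa,
        "prevalence_index": abs(a - d) / n,
        "bias_index": abs(b - c) / n,
    }


# Hypothetical data: both tables show 90% raw agreement on 100 segments.
skewed = agreement_indexes(90, 5, 5, 0)      # the code is applied almost everywhere
balanced = agreement_indexes(45, 5, 5, 45)   # the code is applied about half the time

for name, result in (("skewed", skewed), ("balanced", balanced)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Both invented tables have 90 percent raw agreement, yet kappa is about 0.80 for the balanced table and slightly below zero for the skewed one. The prevalence index (0.9 versus 0.0) exposes the reason, which is why reporting kappa together with the other indexes is advisable.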
Developing an Alternative Measurement
At ATLAS.ti, it was therefore decided not to implement Cohen’s kappa despite its popularity. Instead, we worked closely with Prof. Krippendorff to implement Krippendorff’s alpha as a measure of inter-coder agreement. One great advantage of implementing Krippendorff’s alpha was, and still is, that Prof. Krippendorff is still alive. We had long discussions with him to understand where he was coming from and the role of inter-coder agreement in quantitative content analysis.
Likewise, Prof. Krippendorff learned something about how researchers who analyze qualitative data with ATLAS.ti (or other QDAS) work and code. Based on the mutual understanding we developed, Prof. Krippendorff adapted the alpha coefficient he developed for use in qualitative data analysis. For instance, as it is common in qualitative data analysis to apply multiple codes to the same or overlapping data segments, he modified his measure to account for “multi-value” coding. Further, he extended the family of alpha coefficients so that you can drill down from the general to the specific.
At ATLAS.ti, we learned about the importance of mutually exclusive coding for the measure to be calculated, and we also introduced the concept of the semantic domain. A semantic domain is defined as a set of distinct concepts that share common meanings. You can also think of it as a category with sub codes. Currently, an ATLAS.ti user needs to create a semantic domain by labelling the codes that belong to it accordingly. See the example below.
Figure 1: Two semantic domains with their sub codes
As the requirement for mutually exclusive coding is somewhat artificial for qualitative researchers, it will be relaxed in a future implementation of the inter-coder agreement tool in ATLAS.ti.
Krippendorff’s family of alpha coefficients – developed for use in qualitative research
Krippendorff’s family of alpha coefficients offers various measures that allow you to carry out calculations at different levels. Currently, the first three coefficients are implemented in ATLAS.ti.
Figure 2: Krippendorff’s family of alpha coefficients
You can find more information on how the various coefficients are calculated here.
At the most general level, the alpha binary coefficient lets you measure whether different coders identify the same sections in the data as relevant for the topics of interest, represented by codes. You can, but do not need to, use semantic domains at this level. It is also possible to enter single codes into the analysis. You get a value for alpha binary for each code or domain entered into the analysis and a summary value for all items in the analysis. For this analysis, all text units are considered, both coded and uncoded.
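To give a feel for this kind of relevant / not relevant comparison, here is a minimal sketch of the general Krippendorff’s alpha formula for nominal data, restricted to two coders and no missing values. It is not the ATLAS.ti implementation: the function name `nominal_alpha`, the assumption of predefined, equally sized units, and the ten data points are all made up for illustration.

```python
from collections import Counter


def nominal_alpha(coder1, coder2):
    """Krippendorff's alpha for two coders, nominal values, no missing data."""
    if len(coder1) != len(coder2):
        raise ValueError("both coders must judge the same units")
    n = 2 * len(coder1)  # total number of values entered by both coders
    # Every disagreeing unit appears twice in the coincidence matrix (c-k and k-c).
    mismatches = 2 * sum(1 for x, y in zip(coder1, coder2) if x != y)
    totals = Counter(coder1) + Counter(coder2)  # how often each value was used overall
    # Sum of n_c * n_k over all ordered pairs of different values.
    expected = n * n - sum(count * count for count in totals.values())
    if expected == 0:
        raise ValueError("alpha is undefined when only one value occurs")
    return 1 - (n - 1) * mismatches / expected


# Hypothetical data: 1 = unit coded as relevant, 0 = unit left uncoded.
coder_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
coder_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = nominal_alpha(coder_a, coder_b)
print(round(alpha, 3))
```

The two invented coders agree on six of the ten units, yet alpha comes out at roughly 0.095, because most of that agreement concerns units that neither coder marked. Note the factor `n - 1` in the last line of the function: unlike kappa, alpha corrects for the finite number of values actually observed.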
cu-alpha and Cu-alpha
Another option is to test whether different coders were able to distinguish between the codes of a semantic domain. For example, if you have a semantic domain called EMOTIONS with the sub codes:
Figure 3: Sub codes of the semantic domain EMOTIONS
The cu-alpha coefficient gives you an indication of whether the coders were able to reliably distinguish between, for instance, ‘excitement’ and ‘surprise’, or between ‘anger’ and ‘sadness’. It gives you a value for the overall performance of the semantic domain. It will not, however, tell you which of the sub codes might be problematic. You need to look at the quotations and check where the confusion is. In a future update, ATLAS.ti will offer a calculation of the csu-alpha coefficient. See below for further information. Figure 4 illustrates various ways and levels of agreement and disagreement:
Figure 4: Illustrating various ways of agreement or disagreement
- Coder 1 and coder 2 have applied the same codes to the first two quotations, i.e. they agree on the domain, and on the sub code of the domain.
- To the third quotation, the two coders have applied a code from the same domain, but they do not agree on the sub code.
- To the fourth quotation, the two coders have applied codes from two different domains.
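The three situations listed above can be told apart mechanically. The sketch below assumes a hypothetical naming convention in which every code is labelled `DOMAIN: sub code` and each coder applies exactly one code per quotation; the function name `compare` and the REASONS domain are invented for illustration and do not come from the figure.

```python
def compare(code1, code2):
    """Classify how two coders' codes for the same quotation relate."""
    domain1, _, sub1 = code1.partition(": ")
    domain2, _, sub2 = code2.partition(": ")
    if domain1 != domain2:
        return "different domains"
    if sub1 != sub2:
        return "same domain, different sub code"
    return "same domain and sub code"


# Hypothetical codings of four quotations by two coders.
coder1 = ["EMOTIONS: excitement", "EMOTIONS: anger", "EMOTIONS: surprise", "EMOTIONS: sadness"]
coder2 = ["EMOTIONS: excitement", "EMOTIONS: anger", "EMOTIONS: excitement", "REASONS: money"]

levels = [compare(x, y) for x, y in zip(coder1, coder2)]
for number, level in enumerate(levels, start=1):
    print(f"Quotation {number}: {level}")
```

This only labels each quotation; it does not compute any alpha coefficient. It is meant to show that disagreement on a sub code within one domain and disagreement on the domain itself are different kinds of disagreement.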
Cu-alpha is the summary coefficient for all cu-alphas. It takes into account that you can apply codes from multiple semantic domains to the same or overlapping quotations. Thus, Cu-alpha is not just the average of all cu-alphas.
Further, Cu-alpha is an indicator of whether the different coders agree on the presence or absence of a specific domain. Expressed differently: could coders reliably identify that data segments belong to a specific semantic domain, or did the various coders apply codes from other semantic domains?
Once implemented, the csu-alpha coefficient will allow you to drill down a level deeper and check, for each semantic domain, which codes within the domain perform well and which do not. It indicates the agreement on coding within a semantic domain.
One scenario could be that the cu-alpha coefficient for the entire domain is 0.76. As this is satisfactory, you would not look further into the codings for this domain. However, coders might have applied two of the sub codes inconsistently. The csu-alpha coefficient for these two codes would be low, and you could further investigate why this is the case. Another scenario is that the cu-alpha coefficient is already low. Then the csu-alpha coefficient helps you to detect where the problems are.
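Until such a coefficient is available, checking the quotations by hand can be supported with a simple tally of which sub codes get mixed up. The sketch below is not csu-alpha and not an ATLAS.ti feature; it only counts, for invented data from the EMOTIONS domain, how often each pair of different sub codes was applied to the same quotation. The function name `confused_pairs` is hypothetical.

```python
from collections import Counter


def confused_pairs(coder1, coder2):
    """Count how often each pair of different sub codes met on the same quotation."""
    pairs = Counter()
    for x, y in zip(coder1, coder2):
        if x != y:
            # Sort so that (a, b) and (b, a) are counted as the same pair.
            pairs[tuple(sorted((x, y)))] += 1
    return pairs.most_common()


# Hypothetical sub codes applied by two coders to eight quotations.
coder1 = ["excitement", "surprise", "anger", "sadness",
          "excitement", "surprise", "anger", "excitement"]
coder2 = ["excitement", "excitement", "anger", "sadness",
          "surprise", "excitement", "anger", "excitement"]

ranking = confused_pairs(coder1, coder2)
print(ranking)
```

In this made-up example, all three disagreements involve ‘excitement’ and ‘surprise’, so those are the two code definitions to revisit first.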
Even though Cohen’s kappa is a popular measure, there are strong arguments against using it, as outlined in this paper. Not everyone will be able to follow the arithmetic problems of the measure. Even if you cannot, there are further convincing arguments why qualitative researchers should not use Cohen’s kappa: Cohen developed his kappa measure for a very different purpose and not with the qualitative researcher in mind. The cooperation with Prof. Krippendorff gave us the unique opportunity to build and expand on existing methods for measuring inter-coder agreement. The result is the family of alpha coefficients described here, which is tailored to the needs of qualitative researchers.
Take a look at this video tutorial to learn more about how to run an inter-coder agreement analysis in ATLAS.ti.
Agresti, A. (2002). Categorical data analysis (2nd ed.). John Wiley & Sons, Inc., Hoboken, New Jersey, USA.
Banerjee, M., Capozzoli, M., McSweeney, L., Sinha, D. (1999). Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics, Vol. 27 (1), 3-23.
Brennan, R. L., & Prediger, D. J. (1981). Coefficient Kappa: some uses, misuses, and alternatives. Educational and Psychological Measurement, 41 (3), 687-699.
Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence, and kappa. Journal of Clinical Epidemiology, 46 (5), 423-429.
Cicchetti, D. V. & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43 (6), 551-558.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement. 20 (1), 37–46.
Feinstein, A. R. & Cicchetti, D. V. (1990). High agreement but low kappa: I. The problems of the two paradoxes. Journal of Clinical Epidemiology, 43 (6), 543-549.
Krippendorff, K. (2004/2012/2018). Content Analysis: An Introduction to Its Methodology. 2nd/3rd/4th edition. Thousand Oaks, CA: Sage.
Krippendorff, K. (2004). Reliability in Content Analysis: Some Common Misconceptions and Recommendations. Annenberg School for Communication Departmental Papers (ASC). http://repository.upenn.edu/cgi/viewcontent.cgi?article=1250&context=asc_papers
Kraemer, H. C. (1979). Ramifications of a population model for k as a coefficient of reliability. Psychometrika, 44 (4), 461-472.
Maclure, M., & Willett, W.C. (1987). Misinterpretation and misuse of the kappa statistic. American Journal of Epidemiology, 126 (2), 161-169.
Oleckno, W. (2008). Epidemiology: Concepts and Methods. Long Grove, IL: Waveland Press, Inc., pp. 649.
Xie, Q. (2013). Agree or disagree? A demonstration of an alternative statistic to Cohen’s kappa for measuring the extent and reliability of agreement between observers. In Proceedings of the Federal Committee on Statistical Methodology Research Conference, The Council of Professional Associations on Federal Statistics, Washington, DC, USA, 2013.
Zhao, X. (2011). When to use Cohen’s K, if ever? International Communication Association 2011 Conference, Boston, Massachusetts, USA.
Zwick, R. (1988). Another look at interrater agreement. Psychological Bulletin, 103 (3), 374-378.
Friese, S. (2020). Measuring Inter-coder Agreement – Why Cohen’s Kappa is not a good choice. Retrieved from https://atlasti.com/2020/07/12/measuring-inter-coder-agreement/
About the author
Dr. Susanne Friese
Dr. Susanne Friese started working with computer software for qualitative data analysis in 1992. Her initial contact with CAQDAS tools was from 1992 to 1994, when she was employed at Qualis Research in the USA. In the following years, she worked with the CAQDAS Project in England (1994 – 1996), where she taught classes on The Ethnograph and Nud*ist (today NVivo). Two additional software programs, MAXQDA and ATLAS.ti, followed shortly. Susanne has accompanied numerous projects around the world in a consulting capacity, authored didactic materials, and is the author of the ATLAS.ti User’s Manual, sample projects, and other documentation. The third edition of her book “Qualitative Data Analysis with ATLAS.ti” was published in early 2019 with SAGE Publications. Susanne’s academic home is the Max Planck Institute for the Study of Religious and Ethnic Diversity in Göttingen (Germany), where she pursues her methodological interests, especially regarding qualitative methods and computer-assisted qualitative data analysis.