When conducting qualitative research, not all researchers agree that inter-coder agreement is important or should be considered at all; after all, its roots lie in quantitative research. See for example the article by Daniel Turner.
If you decide that measuring inter-coder agreement is essential for your research, or you need to do it because your Ph.D. advisor asks you to, or because a journal where you want to publish requires it, the next question is which measure to apply. You will probably turn to the literature to see what other researchers have used, or consult the textbooks in your field. There you will find that many researchers use Cohen’s kappa and that this measure is also recommended in many popular textbooks. If you don’t look beyond that, you go with the flow and probably think you cannot get it wrong, because this is how it seems to be done. But there is also plenty of literature pointing out the limitations of Cohen’s kappa. Xie (2013), for instance, explains:
“It is quite puzzling why Cohen’s kappa has been so popular in spite of so much controversy with it. Researchers started to raise issues with Cohen’s kappa more than three decades ago (Kraemer, 1979; Brennan & Prediger, 1981; Maclure & Willett, 1987; Zwick, 1988; Feinstein & Cicchetti, 1990; Cicchetti & Feinstein, 1990; Byrt, Bishop & Carlin, 1993). In a series of two papers, Feinstein & Cicchetti (1990) and Cicchetti & Feinstein (1990) made the following two paradoxes with Cohen’s kappa well-known: (1) A low kappa can occur at a high agreement; and (2) Unbalanced marginal distributions produce higher values of kappa than more balanced marginal distributions. While the two paradoxes are not mentioned in older textbooks (e.g. Agresti, 2002), they are fully introduced as the limitations of kappa in a recent graduate textbook (Oleckno, 2008). On top of the two well-known paradoxes aforementioned, Zhao (2011) describes twelve additional paradoxes with kappa and suggests that Cohen’s kappa is not a general measure for interrater reliability at all but a measure of reliability under special conditions that are rarely held.
Krippendorff (2004) suggests that Cohen’s Kappa is not qualified as a reliability measure in reliability analysis since its definition of chance agreement is derived from association measures because of its assumption of raters’ independence. He argues that in reliability analysis raters should be interchangeable rather than independent and that the definition of chance agreement should be derived from estimated proportions as approximations of the true proportions in the population of reliability data. Krippendorff (2004) mathematically demonstrates that kappa’s expected disagreement is not a function of estimated proportions from sample data but a function of two raters’ individual preferences for the two categories.”
Xie concludes: if Cohen’s kappa “is ever used, it should be reported with other indexes such as percent of positive and negative ratings, prevalence index, bias index, and test of marginal homogeneity.”
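The two paradoxes are easy to reproduce with a small worked example. The sketch below, using made-up counts, computes Cohen’s kappa for two 2x2 contingency tables that show identical observed agreement of 90% but very different marginal distributions:

```python
def cohens_kappa(table):
    """Cohen's kappa for a 2x2 contingency table [[a, b], [c, d]],
    where rows are coder 1's categories and columns are coder 2's."""
    n = sum(sum(row) for row in table)
    p_o = (table[0][0] + table[1][1]) / n               # observed agreement
    row = [sum(r) for r in table]                       # coder 1 marginals
    col = [sum(c) for c in zip(*table)]                 # coder 2 marginals
    p_e = sum(r * c for r, c in zip(row, col)) / n**2   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Both tables show 90% raw agreement between the two coders.
unbalanced = [[85, 5], [5, 5]]   # one category dominates the data
balanced   = [[45, 5], [5, 45]]  # both categories used evenly

print(cohens_kappa(unbalanced))  # ≈ 0.44 despite 90% agreement
print(cohens_kappa(balanced))    # ≈ 0.80 at the same 90% agreement
```

The unbalanced table illustrates the first paradox (a low kappa at high agreement), and the gap between the two values illustrates how strongly the marginal distributions, rather than the coders’ actual agreement, drive the coefficient.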
Another limitation of Cohen’s kappa is that it can only be used with two coders, and it assumes an infinite sample size (Banerjee et al., 1999; Krippendorff, 2018). In many qualitative research studies, the two-coder limit is not really an issue, but an infinite sample size is a requirement that can never be fulfilled.
At ATLAS.ti, it was therefore decided not to implement Cohen’s kappa despite its popularity. Instead, we worked closely with Prof. Krippendorff to implement Krippendorff’s alpha as a measure for inter-coder agreement. One great advantage of implementing Krippendorff’s alpha was, and still is, that Prof. Krippendorff is still alive. We had long discussions with him to understand where he was coming from and the role of inter-coder agreement in quantitative content analysis.
Likewise, Prof. Krippendorff learned something about how researchers who analyze qualitative data with ATLAS.ti (or other QDAS) work and code. Based on the mutual understanding we developed, Prof. Krippendorff adapted the alpha coefficient he developed for use in qualitative data analysis. For instance, as it is common in qualitative data analysis to apply multiple codes to the same or overlapping data segments, he modified his measure to account for “multi-value” coding. Further, he extended the family of alpha coefficients so that you can drill down from the general to the specific.
At ATLAS.ti, we learned about the importance of mutually exclusive coding for the measure to be calculated, and we also introduced the concept of the semantic domain. A semantic domain is defined as a set of distinct concepts that share common meanings. You can also think of it as a category with sub codes. Currently, an ATLAS.ti user needs to create a semantic domain by labelling the codes that belong to it accordingly. See an example below.
As the requirement for mutually exclusive coding is somewhat artificial for qualitative researchers, this requirement will be relaxed in a future implementation of the inter-coder agreement tool in ATLAS.ti.
Krippendorff’s family of alpha coefficients offers various measures that allow you to carry out calculations at different levels. Currently, the first three coefficients are implemented in ATLAS.ti.
You can find more information on how the various coefficients are calculated here.
At the most general level, you can measure whether different coders identify the same sections in the data as relevant for the topics of interest, represented by codes. You can, but do not need to, use semantic domains at this level; it is also possible to enter single codes into the analysis. You get a value for alpha binary for each code or domain entered into the analysis, and a summary value for all items in the analysis. For this analysis, all text units are considered, coded as well as uncoded data.
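To make this concrete, here is a minimal sketch of how Krippendorff’s alpha is computed for nominal data, following the general form α = 1 − D_o/D_e (observed over expected disagreement). Applied to binary relevant/not-relevant decisions per text unit, this corresponds to the idea behind alpha binary; it is an illustration of the published formula, not ATLAS.ti’s exact implementation:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.
    units: one list per data segment, holding the values the coders
    assigned to it; segments with fewer than two values are skipped."""
    o = Counter()  # coincidence matrix o[(c, k)]
    for values in units:
        m = len(values)
        if m < 2:
            continue  # a lone value cannot be paired
        for c, k in permutations(values, 2):
            o[(c, k)] += 1 / (m - 1)
    n_c = Counter()  # per-category totals
    for (c, _), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    D_o = sum(w for (c, k), w in o.items() if c != k) / n
    D_e = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k) / (n * (n - 1))
    return 1 - D_o / D_e

# Two coders marking each of four segments as relevant ("yes") or not ("no"):
print(krippendorff_alpha_nominal(
    [["yes", "yes"], ["yes", "no"], ["no", "no"], ["no", "no"]]))  # ≈ 0.53
```

Because the coincidence matrix pools all pairable values per unit, the same function handles more than two coders and missing ratings, which is one of the practical advantages of alpha over kappa.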
Another option is to test whether different coders were able to distinguish between the codes of a semantic domain. For example, you may have a semantic domain called EMOTIONS with sub codes such as ‘excitement’, ‘surprise’, ‘anger’, and ‘sadness’.
The coefficient gives you an indication of whether the coders were able to reliably distinguish between, for instance, ‘excitement’ and ‘surprise’, or between ‘anger’ and ‘sadness’. The cu-alpha coefficient gives you a value for the overall performance of the semantic domain. It does not, however, tell you which of the sub codes might be problematic; you need to look at the quotations and check where the confusion is. In a future update, ATLAS.ti will offer a calculation of the csu-alpha coefficient; see below for further information. Figure 4 illustrates various ways and levels of agreement and disagreement:
Cu-alpha is the summary coefficient for all cu-alphas. It takes into account that you can apply codes from multiple semantic domains to the same or overlapping quotations. Thus, Cu-alpha is not just the average of all cu-alphas.
Further, Cu-alpha is an indicator of whether the different coders agree on the presence or absence of a specific domain. Expressed differently: could coders reliably identify that data segments belong to a specific semantic domain, or did the various coders apply codes from other semantic domains?
Once implemented, this coefficient will allow you to drill down one level deeper and check, for each semantic domain, which code within the domain performs well or not so well. It indicates the agreement on coding within a semantic domain.
A scenario could be that the cu-alpha coefficient for the entire domain is 0.76. As this is satisfactory, you would not look further into the codings for this domain. However, coders might have applied two of the sub codes inconsistently. The csu-alpha coefficient for these two codes would be low, and you could further investigate why this is the case. Another scenario is that the cu-alpha coefficient is already low. Then the csu-alpha coefficient helps you to detect where the problems are.
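Until the csu-alpha coefficient is available, a quick way to spot such problem pairs is simply to tally which sub codes two coders confused. Here is a sketch with hypothetical coder assignments for an EMOTIONS domain (the labels and counts are invented for illustration):

```python
from collections import Counter

# Hypothetical sub-code assignments by two coders for six identical segments.
coder_a = ["anger", "sadness", "excitement", "surprise", "anger", "excitement"]
coder_b = ["anger", "anger", "excitement", "excitement", "anger", "surprise"]

# Count unordered pairs of differing codes to see where coders disagree most.
confusions = Counter(
    tuple(sorted(pair)) for pair in zip(coder_a, coder_b) if pair[0] != pair[1]
)
for (code1, code2), count in confusions.most_common():
    print(f"{code1} / {code2}: {count} disagreement(s)")
# excitement / surprise: 2 disagreement(s)
# anger / sadness: 1 disagreement(s)
```

A tally like this points you to the quotations worth re-reading: here the coders clearly struggle most to separate ‘excitement’ from ‘surprise’.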
Even though Cohen’s kappa is a popular measure, there are strong arguments against using it, as outlined in this paper. Not everyone may be able to follow and comprehend the arithmetic problems of the measure. Even if you don’t, there is a more convincing argument why qualitative researchers should not use Cohen’s kappa: Cohen developed his measure for a very different purpose, and not with the qualitative researcher in mind. The cooperation with Prof. Krippendorff gave us the unique opportunity to build and expand on existing methods for measuring inter-coder agreement. The result is the family of alpha coefficients described here, which is tailored to the needs of qualitative researchers.
Take a look at this video tutorial to learn more about how to run an inter-coder agreement analysis in ATLAS.ti.