Best Practice

This paper introduces the family of alpha coefficients developed by the renowned Prof. Krippendorff in cooperation with a team of qualitative researchers and IT specialists at ATLAS.ti. It also outlines why Cohen’s kappa is not an appropriate measure for inter-coder agreement.

- Introduction
- Selection of a suitable measuring instrument
- Developing an alternative measurement
- Krippendorff’s family of alpha coefficients – developed for use in qualitative research
- Alpha binary
- cu-alpha and Cu-alpha
- Conclusion


If you decide that measuring inter-coder agreement is essential for your research, the next question is which measure to use. You will probably turn to the literature to see what other researchers have used, or consult the textbooks in your field. You will find that many researchers use Cohen’s kappa and that this measure is also recommended in many popular books, even though there is plenty of literature pointing out its limitations. Xie (2013), for instance, explains:

**“It is quite puzzling why Cohen’s kappa has been so popular despite so much controversy with it**. Researchers started to raise issues with Cohen’s kappa more than three decades ago (Kraemer, 1979; Brennan & Prediger, 1981; Maclure & Willett, 1987; Zwick, 1988; Feinstein & Cicchetti, 1990; Cicchetti & Feinstein, 1990; Byrt, Bishop & Carlin, 1993). In a series of two papers, Feinstein & Cicchetti (1990) and Cicchetti & Feinstein (1990) made the following two paradoxes with Cohen’s kappa well-known:

(1) A low kappa can occur at a high agreement;

(2) Unbalanced marginal distributions produce higher values of kappa than more balanced marginal distributions.

While the two paradoxes are not mentioned in older textbooks (e.g., Agresti, 2002), they are fully introduced as the limitations of kappa in a recent graduate textbook (Oleckno, 2008). On top of the aforementioned well-known paradoxes, Zhao (2011) describes twelve other paradoxes with kappa and suggests that **Cohen’s kappa is not a general measure for inter-rater reliability but a measure of reliability that only holds under particular conditions, which are rarely met.**

Krippendorff (2004) suggests that **Cohen’s Kappa is not qualified as a reliability measure in reliability analysis**. Its definition of chance agreement is derived from association measures because it assumes raters’ independence. He argues that raters should be interchangeable in reliability analysis rather than independent. The definition of chance agreement should be derived from estimated proportions as approximations of the true proportions in the population of reliability data. Krippendorff (2004) mathematically demonstrates that kappa’s expected disagreement is not a function of estimated proportions from sample data but a function of two raters’ individual preferences for the two categories.”

Xie concludes: If Cohen’s kappa “is ever used, it should be reported with other indexes such as percent of positive and negative ratings, prevalence index, bias index, and test of marginal homogeneity.”
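The first paradox can be reproduced with a few lines of arithmetic. The following is a minimal sketch using the standard two-coder, two-category definition of Cohen’s kappa; the function name `cohen_kappa` and the counts in both tables are made up for illustration and are not taken from Xie (2013).

```python
def cohen_kappa(both_yes, only_a, only_b, both_no):
    """Return (observed agreement, kappa) for a 2x2 table of two coders.

    both_yes / both_no: units where the coders agree.
    only_a / only_b: units where only coder A (or only coder B) said "yes".
    """
    n = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / n

    # Chance agreement is built from each coder's own marginal proportions.
    a_yes = (both_yes + only_a) / n
    b_yes = (both_yes + only_b) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)

    return observed, (observed - expected) / (1 - expected)


# Hypothetical data: 100 units, 90 agreements in both tables.
balanced = cohen_kappa(both_yes=45, only_a=5, only_b=5, both_no=45)
skewed = cohen_kappa(both_yes=90, only_a=5, only_b=5, both_no=0)

print("balanced:", balanced)
print("skewed:  ", skewed)
```

Both tables show 90% observed agreement, yet kappa is about 0.80 for the balanced table and slightly negative (about -0.05) for the skewed one, because the skewed marginals push the chance-agreement term above the observed agreement.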

Another limitation of Cohen’s kappa is that it can only be used for two coders, assuming an infinite sample size (Banerjee et al., 1999; Krippendorff, 2018). In many qualitative research studies, the two-coder limit is not an issue, but **infinite sample size is a requirement that can never be fulfilled.**

Instead of using a faulty measure, we at ATLAS.ti implemented Krippendorff’s family of alpha coefficients. One great advantage was that we could discuss the specifics of coding in qualitative research with Prof. Krippendorff. We learned where he was coming from and how data are coded for quantitative content analysis. Likewise, Prof. Krippendorff learned from us how qualitative researchers code data. Based on the mutual understanding we developed, Prof. Krippendorff adapted the alpha coefficients for use in qualitative data analysis. For instance, because it is common in qualitative data analysis to apply multiple codes to the same or overlapping data segments, he modified his measure to account for multi-value coding. Further, he extended the family of alpha coefficients so that you can drill down from the general to the specific.

At ATLAS.ti, we learned about the importance of mutually exclusive coding for the measure to be calculated and introduced the semantic domain concept. A semantic domain is defined as a set of distinct concepts that share common meanings. You can also think about it as a category with subcodes. See an example below.

Krippendorff’s family of alpha coefficients offers various measurements that allow you to carry out calculations at different levels. Currently, the first three coefficients are implemented in ATLAS.ti.

You can find more information on how the various coefficients are calculated here.

At the most general level, you can measure whether different coders identify the same sections as relevant to the topics of interest, represented by codes. You can, but do not need to, use semantic domains at this level; it is also possible to use single codes for the analysis. You get an alpha binary value for each code or domain and a summary value for all items in the analysis. When calculating the alpha binary coefficient, all text units are considered, both coded and uncoded.
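To make the idea concrete, here is a minimal sketch of Krippendorff’s alpha for binary data and two coders, using the general form alpha = 1 - (n - 1) · o01 / (n0 · n1), where n is the total number of values, o01 the number of disagreeing units, and n0 and n1 the counts of each value. It assumes every text unit has been marked by both coders as coded (1) or not coded (0) for a single code or domain. The function name `alpha_binary` and the data are illustrative only; this is not the ATLAS.ti implementation, which may treat text units and segment lengths differently.

```python
def alpha_binary(coder_a, coder_b):
    """Krippendorff's alpha for two coders and binary (0/1) values.

    coder_a, coder_b: equal-length sequences with one entry per text unit,
    1 if the coder applied the code to that unit, 0 otherwise.
    """
    if len(coder_a) != len(coder_b):
        raise ValueError("both coders must cover the same text units")

    n = 2 * len(coder_a)                      # total number of values
    disagreements = sum(1 for a, b in zip(coder_a, coder_b) if a != b)
    ones = sum(coder_a) + sum(coder_b)        # pooled count of "coded"
    zeros = n - ones                          # pooled count of "not coded"

    if ones == 0 or zeros == 0:
        raise ValueError("alpha is undefined when only one value occurs")

    # Observed disagreement relative to the disagreement expected by chance,
    # with chance estimated from the pooled values of both coders.
    return 1 - (n - 1) * disagreements / (ones * zeros)


# Hypothetical example: ten text units, coded (1) or not coded (0).
coder_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
coder_2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(round(alpha_binary(coder_1, coder_2), 3))
```

Note that the uncoded units enter the calculation just like the coded ones, and that the expected disagreement comes from the pooled values of both coders rather than from each coder's separate marginals, as it does in Cohen’s kappa.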

Another option is to test whether different coders were able to distinguish between the codes of a semantic domain. For example, if you have a semantic domain called EMOTIONS with the subcodes:

- anger
- excitement
- fear
- joy
- sadness
- surprise

The coefficient indicates whether the coders were able to reliably distinguish between, for instance, ‘excitement’ and ‘surprise’, or between ‘anger’ and ‘sadness’. The cu-alpha coefficient gives you a value for the overall performance of the semantic domain. It will not, however, tell you which of the subcodes might be problematic. You need to look at the quotations and check where the confusion is.
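As a simplified illustration of what "distinguishing between the codes of a domain" means numerically, the sketch below computes the generic nominal form of Krippendorff’s alpha for two coders over segments to which both coders applied a code from the EMOTIONS domain. This is an assumption-laden stand-in, not the cu-alpha calculation used in ATLAS.ti, so its values will generally differ; the function name `nominal_alpha` and the coded pairs are hypothetical.

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(pairs):
    """Krippendorff's alpha for nominal data, two coders, no missing values.

    pairs: list of (code_from_coder_1, code_from_coder_2), one per segment.
    """
    n = 2 * len(pairs)                        # total number of values

    # Pooled frequency of each code across both coders.
    totals = Counter()
    for a, b in pairs:
        totals[a] += 1
        totals[b] += 1

    # Each disagreeing segment contributes two off-diagonal coincidences.
    mismatched = 2 * sum(1 for a, b in pairs if a != b)

    # Sum of n_c * n_k over all ordered pairs of distinct codes.
    expected = sum(totals[c] * totals[k] for c, k in permutations(totals, 2))
    if expected == 0:
        raise ValueError("alpha is undefined when only one code occurs")

    return 1 - (n - 1) * mismatched / expected


# Hypothetical codings of eight segments within the EMOTIONS domain.
segments = [
    ("anger", "anger"),
    ("joy", "joy"),
    ("fear", "fear"),
    ("excitement", "surprise"),   # confusion between two subcodes
    ("sadness", "sadness"),
    ("surprise", "surprise"),
    ("joy", "joy"),
    ("anger", "sadness"),         # confusion between two subcodes
]
print(round(nominal_alpha(segments), 3))
```

Like cu-alpha, a single value of this kind summarizes the domain as a whole; to see which subcodes are being confused you still have to inspect the disagreeing segments themselves.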

- Coder 1 and coder 2 have applied the same codes to the first two quotations, i.e., they agree on the domain and on the subcode of the domain.
- To the third quotation, the two coders have applied a code from the same domain, but they do not agree on the subcode.
- To the fourth quotation, the two coders have applied codes from two different domains.

Cu-alpha is the summary coefficient for all cu-alphas. It takes into account that you can apply codes from multiple semantic domains to the same or overlapping quotations. Thus, Cu-alpha is not just the average of all cu-alphas.

Further, Cu-alpha indicates whether the different coders agree on the presence or absence of a specific domain. Expressed differently: could the coders reliably identify that data segments belong to a specific semantic domain, or did they apply codes from other semantic domains?

The cooperation with Prof. Krippendorff gave us the unique opportunity to build and expand on existing methods for measuring inter-coder agreement. The result is the family of alpha coefficients described here, which is tailored to the needs of qualitative researchers.

Take a look at this video tutorial to learn more about running inter-coder agreement analysis in ATLAS.ti.