Inter-rater reliability
In statistics, inter-rater reliability, inter-rater agreement, or concordance is the degree of agreement among raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining whether a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.
There are a number of statistics that can be used to determine inter-rater reliability. Different statistics are appropriate for different types of measurement. Some options are: the joint probability of agreement, Cohen's kappa and the related Fleiss' kappa, inter-rater correlation, the concordance correlation coefficient, and the intra-class correlation.
Sources of inter-rater disagreement
Different raters can disagree about measurement results from the same object because of, for example, variations in the procedures for carrying out the experiment, in interpreting the results and, subsequently, in presenting them. All these stages may be affected by experimenter's bias, that is, a tendency to deviate towards what is expected by the rater. When interpreting and presenting the results, there may also be inter-rater variations in digit preference, that is, differing preferences for whether to round a value down or up.
The philosophy of inter-rater agreement
There are several operational definitions of "inter-rater reliability" in use by examination boards, reflecting different viewpoints about what constitutes reliable agreement between raters. There are three operational definitions of agreement:
1. Reliable raters agree with the "official" rating of a performance.
2. Reliable raters agree with each other about the exact ratings to be awarded.
3. Reliable raters agree about which performance is better and which is worse.
These combine with two operational definitions of behavior:
A. Reliable raters are automatons, behaving like "rating machines". This category includes the rating of essays by computer. This behavior can be evaluated by generalizability theory.
B. Reliable raters behave like independent witnesses. They demonstrate their independence by disagreeing slightly. This behavior can be evaluated by the Rasch model.
Joint probability of agreement
The joint probability of agreement is the simplest and least robust measure. It is the number of times each rating (e.g. 1, 2, ... 5) is assigned by each rater divided by the total number of ratings. It assumes that the data are entirely nominal. It does not take into account that agreement may happen solely based on chance. Some question, though, whether there is a need to 'correct' for chance agreement at all, and suggest that, in any case, any such adjustment should be based on an explicit model of how chance and error affect raters' decisions.
When the number of categories being used is small (e.g. 2 or 3), the likelihood of two raters agreeing by pure chance increases dramatically. This is because both raters must confine themselves to the limited number of options available, which inflates the overall agreement rate without necessarily reflecting their propensity for "intrinsic" agreement (agreement not due to chance). Therefore, the joint probability of agreement will remain high even in the absence of any "intrinsic" agreement among raters. A useful inter-rater reliability coefficient is expected to (1) be close to 0 when there is no "intrinsic" agreement and (2) increase as the "intrinsic" agreement rate improves. Most chance-corrected agreement coefficients achieve the first objective, but the second objective is not achieved by many known chance-corrected measures.
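As a quick illustration, the joint probability of agreement for two raters is simply the fraction of items to which both assign the same category. A minimal Python sketch with hypothetical ratings (the data and variable names are illustrative only):

# Joint probability of agreement between two raters:
# the fraction of items on which their ratings coincide.
rater_a = [1, 2, 2, 3, 1, 2, 3, 3]   # hypothetical ratings by rater A
rater_b = [1, 2, 3, 3, 1, 2, 2, 3]   # hypothetical ratings by rater B

agreements = sum(a == b for a, b in zip(rater_a, rater_b))
joint_agreement = agreements / len(rater_a)
print(f"Joint probability of agreement: {joint_agreement:.2f}")  # 0.75 for this data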
Kappa statistics
- Main articles: Cohen's kappa, Fleiss' kappa
Cohen's kappa, which works for two raters, and Fleiss' kappa, an adaptation that works for any fixed number of raters, improve upon the joint probability in that they take into account the amount of agreement that could be expected to occur through chance. They suffer from the same problem as the joint probability in that they treat the data as nominal and assume the ratings have no natural ordering. If the data do have an order, the information in the measurements is not fully exploited.
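For concreteness, here is a minimal Python sketch of Cohen's kappa for two raters on nominal data, using hypothetical ratings: the observed agreement is compared with the agreement expected by chance from each rater's marginal category proportions.

# Cohen's kappa for two raters on nominal data.
from collections import Counter

rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]   # hypothetical
rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]   # hypothetical

n = len(rater_a)
# Observed agreement: fraction of items rated identically.
p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: product of the raters' marginal proportions, summed over categories.
counts_a, counts_b = Counter(rater_a), Counter(rater_b)
p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in set(rater_a) | set(rater_b))

kappa = (p_o - p_e) / (1 - p_e)
print(f"observed = {p_o:.2f}, chance = {p_e:.2f}, kappa = {kappa:.2f}")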
Correlation coefficients
- Main articles: Pearson product-moment correlation coefficient, Spearman's rank correlation coefficient
Either Pearson's r or Spearman's ρ can be used to measure pairwise correlation among raters using a scale that is ordered. Pearson assumes the rating scale is continuous; Spearman assumes only that it is ordinal. If more than two raters are observed, an average level of agreement for the group can be calculated as the mean of the r (or ρ) values from each possible pair of raters.
Both the Pearson and Spearman coefficients consider only relative position. For example, (1, 2, 1, 3) is considered perfectly correlated with (2, 3, 2, 4).
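A minimal sketch of this pairwise averaging in Python, assuming SciPy is available and using hypothetical ratings for three raters on an ordered 1-5 scale:

# Average pairwise Pearson r and Spearman rho across several raters.
from itertools import combinations
from scipy.stats import pearsonr, spearmanr

ratings = {                                   # hypothetical data
    "rater_1": [1, 2, 3, 4, 5, 3],
    "rater_2": [2, 2, 3, 5, 5, 2],
    "rater_3": [1, 3, 3, 4, 4, 3],
}

pearson_vals, spearman_vals = [], []
for a, b in combinations(ratings.values(), 2):
    pearson_vals.append(pearsonr(a, b)[0])    # r for this pair of raters
    spearman_vals.append(spearmanr(a, b)[0])  # rho for this pair of raters

print("mean Pearson r:", sum(pearson_vals) / len(pearson_vals))
print("mean Spearman rho:", sum(spearman_vals) / len(spearman_vals))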
Intra-class correlation coefficient
Another way of performing reliability testing is to use the intra-class correlation coefficient (ICC). There are several types of ICC; one is defined as "the proportion of variance of an observation due to between-subject variability in the true scores". The range of the ICC may be between 0.0 and 1.0 (an early definition of the ICC could range between −1 and +1). The ICC will be high when there is little variation between the scores given to each item by the raters, e.g. if all raters give the same, or similar, scores to each of the items. The ICC is an improvement over Pearson's r and Spearman's ρ, as it takes into account the differences in ratings for individual segments, along with the correlation between raters.
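As one illustration, the one-way random-effects ICC (often denoted ICC(1,1)), one of the several ICC forms mentioned above, can be computed from the between-subject and within-subject mean squares of a one-way ANOVA. A minimal Python sketch with a hypothetical items-by-raters table:

# One-way random-effects ICC(1,1) from an items-by-raters score table.
import statistics

scores = [            # rows = items (subjects), columns = raters; hypothetical
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
n, k = len(scores), len(scores[0])                     # number of items, raters
grand_mean = statistics.mean(v for row in scores for v in row)
row_means = [statistics.mean(row) for row in scores]

# Between-subject and within-subject mean squares (one-way ANOVA).
ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
ms_within = sum((v - row_means[i]) ** 2
                for i, row in enumerate(scores) for v in row) / (n * (k - 1))

icc = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
print(f"ICC(1,1) = {icc:.2f}")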
Limits of agreement
Another approach to agreement (useful when there are only two raters and the scale is continuous) is to calculate the differences between each pair of the two raters' observations. The mean of these differences is termed the bias, and the reference interval (mean ± 1.96 × standard deviation) is termed the limits of agreement. The limits of agreement provide insight into how much random variation may be influencing the ratings. If the raters tend to agree, the differences between the raters' observations will be near zero. If one rater is usually higher or lower than the other by a consistent amount, the bias (mean of differences) will be different from zero. If the raters tend to disagree, but without a consistent pattern of one rating higher than the other, the mean will be near zero. Confidence limits (usually 95%) can be calculated for both the bias and each of the limits of agreement.
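A minimal Python sketch of the bias and limits-of-agreement calculation, using hypothetical paired measurements from two raters:

# Bias and 95% limits of agreement for two raters on a continuous scale.
import statistics

rater_a = [10.2, 11.5, 9.8, 12.1, 10.9, 11.3]   # hypothetical measurements
rater_b = [10.5, 11.1, 10.0, 12.6, 10.7, 11.8]  # hypothetical measurements

diffs = [a - b for a, b in zip(rater_a, rater_b)]
bias = statistics.mean(diffs)                   # mean difference
sd = statistics.stdev(diffs)                    # standard deviation of differences
lower, upper = bias - 1.96 * sd, bias + 1.96 * sd

print(f"bias = {bias:.2f}")
print(f"limits of agreement: [{lower:.2f}, {upper:.2f}]")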
Bland and Altman have expanded on this idea by graphing the difference of each point, the mean difference, and the limits of agreement on the vertical against the average of the two ratings on the horizontal. The resulting Bland–Altman plot demonstrates not only the overall degree of agreement, but also whether the agreement is related to the underlying value of the item. For instance, two raters might agree closely in estimating the size of small items, but disagree about larger items.
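The plot itself can be sketched with a plotting library; the following assumes matplotlib is available and uses the same kind of hypothetical paired data as above.

# Bland–Altman plot: difference between ratings against their average.
import statistics
import matplotlib.pyplot as plt

rater_a = [10.2, 11.5, 9.8, 12.1, 10.9, 11.3]   # hypothetical measurements
rater_b = [10.5, 11.1, 10.0, 12.6, 10.7, 11.8]  # hypothetical measurements

means = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
diffs = [a - b for a, b in zip(rater_a, rater_b)]
bias = statistics.mean(diffs)
sd = statistics.stdev(diffs)

plt.scatter(means, diffs)                        # one point per rated item
plt.axhline(bias, linestyle="--")                # mean difference (bias)
plt.axhline(bias + 1.96 * sd, linestyle=":")     # upper limit of agreement
plt.axhline(bias - 1.96 * sd, linestyle=":")     # lower limit of agreement
plt.xlabel("Average of the two ratings")
plt.ylabel("Difference between ratings")
plt.title("Bland–Altman plot")
plt.show()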
When comparing two methods of measurement, it is of interest not only to estimate both the bias and the limits of agreement between the two methods (inter-rater agreement), but also to assess these characteristics for each method in itself (intra-rater agreement). It might very well be that the agreement between two methods is poor simply because one of the methods has wide limits of agreement while the other has narrow ones. In this case the method with the narrow limits of agreement would be superior from a statistical point of view, while practical or other considerations might change this assessment. What constitutes narrow or wide limits of agreement, or large or small bias, is a matter of practical judgement in each case.
Krippendorff’s Alpha
Krippendorff's alpha is a versatile and general statistical measure for assessing the agreement achieved when multiple raters describe a set of objects of analysis in terms of the values of a variable. Alpha emerged in content analysis, where textual units are categorized by trained coders, and is used in counseling and survey research, where experts code open-ended interview data into analyzable terms, in psychometrics, where individual attributes are tested by multiple methods, or in observational studies, where unstructured happenings are recorded for subsequent analysis.
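For nominal data with no missing values, alpha can be computed as 1 − D_o/D_e, where D_o is the observed disagreement taken from a coincidence matrix and D_e is the disagreement expected by chance. A minimal Python sketch for a small hypothetical set of three coders (it does not handle missing data or non-nominal difference functions):

# Krippendorff's alpha for nominal data without missing values.
from collections import Counter
from itertools import permutations

reliability_data = [                 # rows = coders, columns = units; hypothetical
    ["a", "a", "b", "b", "c", "a"],
    ["a", "b", "b", "b", "c", "a"],
    ["a", "a", "b", "b", "c", "b"],
]
n_coders = len(reliability_data)
units = list(zip(*reliability_data))            # the values assigned to each unit

# Coincidence matrix: ordered pairs of values within a unit, weighted by 1/(m - 1).
coincidences = Counter()
for unit in units:
    for c, k in permutations(unit, 2):
        coincidences[(c, k)] += 1 / (n_coders - 1)

n_total = sum(coincidences.values())            # total number of pairable values
marginals = Counter()
for (c, _), weight in coincidences.items():
    marginals[c] += weight

d_observed = sum(w for (c, k), w in coincidences.items() if c != k) / n_total
d_expected = sum(marginals[c] * marginals[k]
                 for c, k in permutations(marginals, 2)) / (n_total * (n_total - 1))

alpha = 1 - d_observed / d_expected
print(f"Krippendorff's alpha = {alpha:.2f}")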
Further reading
- Gwet, Kilem L. (2010). Handbook of Inter-Rater Reliability (2nd edition). Gaithersburg: Advanced Analytics, LLC. ISBN 978-0970806222
- Gwet, K. L. (2008). "Computing inter-rater reliability and its variance in the presence of high agreement." British Journal of Mathematical and Statistical Psychology, 61, 29–48.
- Shoukri, M. M. (2010). Measures of Interobserver Agreement and Reliability (2nd edition). Boca Raton, FL: Chapman & Hall/CRC Press. ISBN 978-1-4398-1080-4
External links
- Statistical Methods for Rater Agreement by John Uebersax
- Inter-rater Reliability Calculator by Medical Education Online
- Online (Multirater) Kappa Calculator
- Handbook of Inter-Rater Reliability and AgreeStat (a point-and-click Excel VBA program for the statistical analysis of inter-rater reliability data)