Chapter 6: Assuring Reliability and Validity
|
I. The Notion of Measurement Acceptability --researchers present information about the consistency and accuracy of measures before assessing major research questions and hypotheses because: |
|
|
(1) to do a sound empirical study, all variables must be operationally defined; (2) measurement adequacy underlies the very ability to use statistical tools; and (3) inadequate measurement prevents capturing the true value of a variable.
|
operational definition: “a description of the way researchers will observe and measure a variable” (Vogt, 2005, p. 220).
manipulation check: for manipulated or experimental variables, methods used to assess the accuracy of the assumption that the variables were operating in the study.
measured value: equal to the true value of the variable plus measurement error. |
|
II. How to Do a Study of Measurement Adequacy
A. Isolating Measurements and Securing Baselines (baseline measures should be taken before the independent and experimental variables have their impact; using posttest scores to compute reliability will produce an inaccurate assessment of reliability because random error around a mean tends to increase as the mean does)
B. Assuring Consistency of Measurement Applications and Instructions
C. Securing Consistent Data with Consistent Measurement Elements (researchers must assure that the data examined for measurement adequacy are from the target population used in the full study; the sample size also must be big enough to justify the use of statistical tools, because small samples tend to inflate reliability coefficients)
D. Assessing Reliability and Validity While Looking for Signs of Abnormal Variability |
|
|
III. Reliability
--reliability coefficients range in value from 0, indicating no reliability, to 1.0, indicating perfect reliability.
--the square root of the reliability coefficient “is the correlation of a test with true scores” (Nunnally, 1978, p. 220).
--coefficients above .90 are considered “highly reliable,” between .80 and .89 are considered to have “good reliability,” between .70 and .79 are considered to have “fair reliability,” between .60 and .69 are considered to have “marginal reliability,” and coefficients under .60 are considered to have unacceptable reliability |
reliability: “the internal consistency of a measure” (Reinard, 2001, p. 441)
reliability coefficient: a form of correlation coefficient “that measures the degree of consistency of a measure” (Reinard, 2001, p. 441). |
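The interpretation bands and the square-root relationship above can be expressed as a short sketch. The function names are hypothetical, and treating a coefficient of exactly .90 as "highly reliable" is an assumption, since the outline only says "above .90."

```python
import math


def describe_reliability(coefficient):
    """Return the verbal label the outline attaches to a reliability coefficient."""
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("reliability coefficients range from 0 to 1.0")
    # Assumption: exactly .90 is grouped with "highly reliable".
    if coefficient >= 0.90:
        return "highly reliable"
    if coefficient >= 0.80:
        return "good reliability"
    if coefficient >= 0.70:
        return "fair reliability"
    if coefficient >= 0.60:
        return "marginal reliability"
    return "unacceptable reliability"


def true_score_correlation(coefficient):
    """Square root of the reliability: the correlation of a test with true scores."""
    return math.sqrt(coefficient)
```

For example, a coefficient of .81 falls in the "good reliability" band and implies a correlation of .90 between observed and true scores.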
|
A. Sources of Unreliability 1. Uncontrolled variation in the test setting or instrument (repeatability)
|
repeatability: variation in measures produced by uncontrolled variation in the test setting or measurement instrument |
|
2. Differences among respondents or raters (reproducibility) --sometimes it occurs because the researcher has accidentally sampled individuals from very different populations, because the language of the measurement items is understood by some, but not by others, or because of the use of different raters with different experience, training, or moods |
reproducibility: variation in measures produced by uncontrolled variation among respondents or raters |
|
3. Lack of precision in the measurement tool |
precision “how finely an estimate is specified” (Helberg, 1995, ¶ 30) |
|
4. Length of the measure --short tests tend to be less reliable than long ones
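--Spearman-Brown prophecy formula: $r_{nn} = \dfrac{n\,r_{11}}{1 + (n - 1)\,r_{11}}$, where $r_{11}$ is the reliability of the current measure, $n$ is the factor by which the length of the measure is changed, and $r_{nn}$ is the predicted reliability of the lengthened (or shortened) measure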
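A minimal sketch of the Spearman-Brown prophecy in its standard form: the predicted reliability equals the length factor times the current reliability, divided by one plus (length factor minus one) times the current reliability. The function and parameter names are illustrative assumptions.

```python
def spearman_brown(reliability, length_factor):
    """Predict reliability after changing test length by `length_factor`.

    length_factor = 2 means doubling the number of comparable items;
    length_factor = 0.5 means halving it.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)
```

Doubling a measure whose reliability is .60 gives a predicted reliability of .75, which illustrates why short tests tend to be less reliable than long ones.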
|
|
|
B. Methods to Determine Reliability
1. Test-retest reliability
--if researchers use small sample sizes (under 15, for instance), the Pearson product-moment correlation will tend to be inflated
--the intraclass correlation coefficient may be preferred; it also can be used when more than two tests are used (W. G. Hopkins, 2000a, ¶ 18). The formula for the intraclass coefficient requires the researcher to compute analysis of variance for repeated measures first.
|
test-retest reliability: giving a measure twice and reporting the correlation between scores. |
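The intraclass approach can be sketched as follows. This uses one common "consistency" form, (MS_people - MS_error) / (MS_people + (k - 1) MS_error), built from a repeated-measures analysis of variance; several intraclass formulas exist, and this may not be the exact one the chapter intends. The function name and data layout are assumptions.

```python
def intraclass_correlation(scores):
    """Consistency-type intraclass correlation for repeated administrations.

    scores[i][j] is the score of person i on administration j.
    """
    n = len(scores)        # number of people
    k = len(scores[0])     # number of administrations
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    trial_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Repeated-measures ANOVA sums of squares.
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_people = k * sum((m - grand) ** 2 for m in person_means)
    ss_trials = n * sum((m - grand) ** 2 for m in trial_means)
    ss_error = ss_total - ss_people - ss_trials

    ms_people = ss_people / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_people - ms_error) / (ms_people + (k - 1) * ms_error)
```

With scores of [[1, 2], [2, 3], [3, 4]], every person gains exactly one point on the retest, so their rank order and spacing are preserved and the coefficient is 1.0. Unlike the Pearson correlation, the same function accepts three or more administrations.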
|
2. Alternate forms reliability |
alternate forms reliability: constructing different forms of the same test from a common pool of measurement items. Then, one takes two (or three) forms, administers them to the same individuals (varying the order of presentation), and computes the correlation between the two measures, or the intraclass correlations in the case of more than two measures |
|
3. Split half reliability
--because the correlation reflects the relationship between two “shorter” tests rather than the reliability of the full measure, researchers apply the Spearman-Brown prophecy formula. In the case of the split half method, the formula is simplified to $r_{full} = \dfrac{2\,r_{hh}}{1 + r_{hh}}$, where $r_{hh}$ is the correlation between the two halves |
split half reliability: dividing a test into two parts after giving it to a group of people, scoring the two parts separately, and checking consistency between the two scores |
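A sketch of the split half procedure, assuming an odd-even split of the items (other ways of dividing the test are possible) and the doubled-length Spearman-Brown correction, 2r / (1 + r). The helper and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """item_scores[i][j] is the score of person i on item j."""
    # Assumption: odd-numbered items form one half, even-numbered the other.
    first_half = [sum(row[0::2]) for row in item_scores]
    second_half = [sum(row[1::2]) for row in item_scores]
    r_halves = pearson(first_half, second_half)
    # Spearman-Brown correction for a test twice as long as each half.
    return (2 * r_halves) / (1 + r_halves)
```

If the two halves correlate at .60, the corrected estimate for the full measure is .75.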
|
4. Item to total reliability |
item to total reliability: taking the average correlation of items with the total test score |
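Item to total reliability, as defined above, can be sketched by correlating each item with the total score and averaging the results. This version leaves each item inside the total (an "uncorrected" item-total correlation), which is an assumption; some analysts remove the item from the total first. Names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_to_total_reliability(item_scores):
    """Average correlation of each item with the total test score.

    item_scores[i][j] is the score of person i on item j.
    """
    totals = [sum(row) for row in item_scores]
    n_items = len(item_scores[0])
    correlations = [
        pearson([row[j] for row in item_scores], totals) for j in range(n_items)
    ]
    return sum(correlations) / n_items
```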
|
5. Intercoder reliability --Scott’s pi: Scott’s pi adjusts for the frequency with which categories may be used (what is called the degree to which agreement would be expected by chance) |
Intercoder reliability: determining the consistency of different raters who respond to the same events by using some sort of check sheet. |
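Scott's pi can be sketched as (observed agreement - expected agreement) / (1 - expected agreement), where expected agreement is the sum of squared category proportions pooled across both coders. The function name and list-based input are assumptions made for illustration.

```python
from collections import Counter


def scotts_pi(coder_a, coder_b):
    """Scott's pi for two coders who each assigned one category per unit."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of units")
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement uses category proportions pooled over both coders.
    pooled = Counter(coder_a) + Counter(coder_b)
    expected = sum((count / (2 * n)) ** 2 for count in pooled.values())
    # Undefined (division by zero) if every coding falls in a single category.
    return (observed - expected) / (1 - expected)
```

With coder A assigning x, x, y, y and coder B assigning x, x, y, x, raw agreement is .75, but pi is about .47 once chance agreement (about .53) is removed.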
|
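C. Other Statistical Shortcuts:
1. Coefficient alpha
$\alpha = \dfrac{k}{k - 1}\left(1 - \dfrac{\sum \sigma_i^2}{\sigma_X^2}\right)$
where k is the number of items on the index; $\sigma_i^2$ is the variance of each individual item on the index; and $\sigma_X^2$ is the variance of the total of all items on the index |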
Coefficient alpha is a measure of the “consistency of items in an index” (Vogt, 2005, p. 71). |
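A minimal sketch of coefficient alpha in its standard form, k / (k - 1) times (1 minus the sum of the item variances divided by the variance of the total score). The function name and row-per-respondent layout are assumptions; sample variances are used for both items and totals, which leaves the ratio unchanged.

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient alpha; item_scores[i][j] is the score of person i on item j."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([row[j] for row in item_scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)
```

For four respondents with item scores [[2, 3], [4, 3], [3, 5], [5, 5]], the item variances sum to 3 and the total-score variance is 13/3, so alpha is 8/13, or about .62.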
|
--In essence, coefficient alpha is an average of the correlations of pairs of items on a measure. Cronbach (1951) explains that coefficient alpha is equal to the average of all possible split-half reliabilities. Yet, this assertion is “true only when all the split-half coefficients are computed using r2 (Feldt & Brennan, 1989) or in the unlikely event that all the test items are classically parallel” (Charter, 2001, p. 693).
--Limitations:
(1) though robust to moderately heterogeneous data, when heterogeneity is large, it may underestimate (Osburn, 2000), overestimate, or both under- and overestimate (Raykov, 1998) reliability;
(2) it may be imprecise when sample sizes are small (Bonnett, 2002);
(3) if the items do not measure a single dimension, coefficient alpha may be inflated (Shevlin, Miles, Davies, & Walker, 2000) and the maximum possible value will not be 1 (Shojima & Toyoda, 2002);
(4) when the assumption of “essential tau-equivalence” is violated by a great amount, alpha tends to be underestimated;
(5) when measurement error terms are correlated, coefficient alpha is inflated (Komaroff, 1997);
(6) coefficient alpha measures more than internal consistency of a scale (Streiner, 2003); and
(7) coefficient alpha may not be applied to all types of indices, such as lists of symptoms (often called “effect indicators”). |
essential tau-equivalence: assuming “that the test measures a single trait or ability and that the parts have homogeneous true score variances [Feldt & Ankenmann, 1998].” |
|
2. K-R 20 (Kuder-Richardson formula 20)
$KR_{20} = \dfrac{k}{k - 1}\left(1 - \dfrac{\sum P_i Q_i}{\sigma_X^2}\right)$
where k is the number of items on the measure; $P_i$ is the proportion who answered the item "correctly"; $Q_i$ is the proportion who did not answer the item "correctly"; and $\sigma_X^2$ is the variance of the total of all items on the index. The sample variance ($s_X^2$) may be substituted as an unbiased estimator of the population variance of the total test score.
--used when items have responses scored as “passing” or “not passing” |
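A sketch of K-R 20 for items scored 1 ("correct") or 0 ("not correct"), using the standard form k / (k - 1) times (1 minus the sum of P times Q over items, divided by the total-score variance). The function name is hypothetical, and the population variance of the totals is used here; substituting the sample variance is the alternative the outline mentions.

```python
from statistics import pvariance


def kr20(item_scores):
    """K-R 20; item_scores[i][j] is 1 if person i passed item j, else 0."""
    n = len(item_scores)
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("K-R 20 needs at least two items")
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_scores) / n   # proportion passing item j
        sum_pq += p * (1 - p)                        # P times Q for item j
    total_variance = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum_pq / total_variance)
```

For the response patterns [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]], the P times Q terms sum to .625, the total-score variance is 1.25, and K-R 20 is .75.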
|
|
D. Attenuation in Measurement Reliability --“no test can correlate more highly with a criterion than the square root of its reliability” (Towers, 2003, ¶ 7).
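--correction for attenuation: $r_{x'y'} = \dfrac{r_{xy}}{\sqrt{r_x\, r_y}}$
where
$r_{x'y'}$ is the correlation corrected for attenuation; $r_{xy}$ is the observed correlation between variable x and variable y; $r_x$ is the reliability for measurement of variable x; and $r_y$ is the reliability for measurement of variable y |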
attenuation: “a reduction in a measure of association caused by measurement errors” (Vogt, 2005, p. 15) |
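The standard correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities. A minimal sketch, with hypothetical function and parameter names:

```python
import math


def correct_for_attenuation(r_xy, r_x, r_y):
    """Estimate the correlation between x and y free of measurement error.

    r_xy is the observed correlation; r_x and r_y are the reliabilities
    of the measures of x and y.
    """
    if r_x <= 0 or r_y <= 0:
        raise ValueError("reliabilities must be greater than zero")
    return r_xy / math.sqrt(r_x * r_y)


def max_observable_correlation(r_x):
    """Ceiling on a test's correlation with a criterion: sqrt of its reliability."""
    return math.sqrt(r_x)
```

An observed correlation of .40, with a reliability of .64 for x and a perfectly reliable y, corrects to .50. The same reliability of .64 caps the observable correlation with any criterion at .80.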
|
IV. Validity
|
validity: “the consistency of a measure with a criterion (the degree to which a measure actually assesses what is claimed)” (Reinard, 2001, p. 444)
bias: “when the expected value of a sample statistic tends to over- or underestimate a population parameter” (Vogt, 2005, p. 25).
|
|
A. Methods to Determine Validity 1. Face Validity
|
Face validity: “the practice of examining the content of measurement items and advancing an argument that, on its face, the measure identifies what is claimed” (Reinard, 2001, p. 435). |
|
2. Expert Jury Validity
|
Expert jury validity (or just jury validity): “having a group of experts in the subject matter examine a measurement device to judge its merit” (Reinard, 2001, p. 435) |
|
--Content Validity Index to reveal agreement:
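Several formulations of a content validity index exist, and the outline does not say which one it uses. The sketch below assumes one widely used version: each expert rates each item on a 4-point relevance scale, the item-level index is the proportion of experts giving a 3 or 4, and the scale-level index is the average across items. The function names, the 4-point scale, and the cutoff of 3 are all assumptions.

```python
def item_cvi(ratings, relevant_from=3):
    """Proportion of experts who rated one item as relevant.

    ratings holds one rating per expert; values at or above
    `relevant_from` count as "relevant" (assumed 4-point scale).
    """
    return sum(r >= relevant_from for r in ratings) / len(ratings)


def scale_cvi_average(ratings_by_item, relevant_from=3):
    """Average of the item-level indices across all items on the measure."""
    values = [item_cvi(ratings, relevant_from) for ratings in ratings_by_item]
    return sum(values) / len(values)
```

With four experts, ratings of [4, 3, 4, 2] for one item give an item-level index of .75; paired with a second item rated [4, 4, 3, 3], the scale-level average is .875.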
|
|
|
3. Criterion Validity
|
criterion (or criterion-related) validity presents an argument for validity of a measure by showing that it is related to some critical outside criterion |
|
a. Concurrent validity --the concurrent validity of the new measure can be only as strong as the evidence was for the validity of the previous measure |
concurrent validity: involves correlating a new measure with a previously validated measure of the same thing (Reinard, 2001, p. 434)
validity coefficient: in concurrent validity, a correlation computed between a new and previous version of a measure |
|
b. Predictive validity |
predictive validity: the degree to which a measure predicts known groups in which the construct must exist (Reinard, 2001, p. 440) |
|
4. Construct Validity
|
construct validity: involves studying the relationships between a new measure of a construct and its known properties in regard to another measure. In construct validity, researchers correlate “a measure with at least two other measures, one of which is a valid measure of a construct that is known conceptually to be directly related to the new measure, and another one of which is a valid measure of a construct that is known conceptually to be inversely related to the new measure” (Reinard, 2001, p. 433). |
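The construct validity check described above can be sketched as two correlations with opposite expected signs. The function names and the returned dictionary are hypothetical, and the sketch only inspects the direction of each correlation; it does not test statistical significance or judge the size of the coefficients.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def construct_validity_pattern(new_measure, direct_measure, inverse_measure):
    """Correlate a new measure with a directly and an inversely related measure."""
    r_direct = pearson(new_measure, direct_measure)
    r_inverse = pearson(new_measure, inverse_measure)
    return {
        "r_direct": r_direct,
        "r_inverse": r_inverse,
        # Expected pattern: positive with the direct measure, negative with the inverse.
        "pattern_supported": r_direct > 0 and r_inverse < 0,
    }
```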
|
V. The Relation of Validity to Reliability --a reliable measure may not be valid, but a valid measure must be reliable |
|