Validity and Reliability

Validity refers to the accuracy of an assessment: whether or not it measures what it is supposed to measure. Even if a test is reliable, it may not provide a valid measure.
Types of Validity
Content validity
The extent to which the content of the test matches the instructional objectives.
Example: A semester or quarter exam that only includes content covered during the last six weeks is not a valid measure of the course's overall objectives. It has very low content validity.
Criterion-related validity
The extent to which scores on the test agree with (concurrent validity) or predict (predictive validity) an external criterion.
Example: If the end-of-year math tests in 4th grade correlate highly with the statewide math tests, they have high concurrent validity.
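A concurrent validity check of this kind usually comes down to a correlation between the test and the external criterion. The sketch below assumes the Pearson correlation is used as the validity coefficient; the scores are invented for illustration and the names `classroom_test` and `statewide_test` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical scores for eight students (illustrative only, not real data).
classroom_test = [55, 62, 70, 71, 78, 84, 90, 95]
statewide_test = [50, 60, 68, 75, 74, 88, 85, 97]

# The closer this value is to 1, the higher the concurrent validity.
validity_coefficient = pearson_r(classroom_test, statewide_test)
print(f"concurrent validity coefficient: {validity_coefficient:.2f}")
```

For predictive validity the same calculation applies, except that the criterion scores are collected at a later point in time.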
Construct validity
A construct is a property that is offered to explain some aspect of human behavior, such as mechanical ability, intelligence, or introversion. Construct validity is the extent to which a test measures the construct it is intended to measure.
Example: In early self-esteem studies, self-esteem referred to a person's sense of self-worth or self-respect. Clinical observations in psychology had shown that people with low self-esteem often had depression. Therefore, to establish the construct validity of the self-esteem measure, the researchers showed that those with higher scores on the self-esteem measure had lower depression scores, while those with low self-esteem scores had higher rates of depression.

Factors Affecting Validity
Nature of the group
Consistency of the validity coefficient for subgroups which differ in any characteristic (e.g. age, gender, educational level, etc.).
Sample heterogeneity
A wider range of scores results in a higher validity coefficient; a narrow range of scores lowers it (range restriction phenomenon).
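Range restriction is easy to see in a small simulation. The sketch below is an illustration under stated assumptions, not real data: the criterion is simulated as the predictor plus random noise, and "restriction" means keeping only the top quarter of examinees on the predictor, as if only high scorers had been admitted.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simulated scores (an assumption for illustration): criterion = predictor + noise.
predictor = [rng.gauss(0, 1) for _ in range(2000)]
criterion = [p + rng.gauss(0, 1) for p in predictor]

# Validity coefficient in the full, heterogeneous sample.
r_full = pearson_r(predictor, criterion)

# Keep only examinees at or above the 75th percentile of the predictor.
cutoff = sorted(predictor)[int(0.75 * len(predictor))]
selected = [(p, c) for p, c in zip(predictor, criterion) if p >= cutoff]
r_restricted = pearson_r([p for p, _ in selected], [c for _, c in selected])

print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
```

The underlying relationship is the same in both cases; only the spread of predictor scores differs, which is why the restricted coefficient comes out lower.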
Criterion-predictor relationship
There must be a linear relationship between predictor and criterion. Otherwise the Pearson correlation coefficient, which measures only linear association, would be of no use!
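Why linearity matters can be shown with two made-up data sets: one where the criterion is a straight-line function of the predictor, and one where it depends perfectly on the predictor but along a U-shaped curve. The values below are invented for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

predictor = list(range(-5, 6))  # -5, -4, ..., 5

# Linear case: the criterion is a straight-line function of the predictor.
linear_criterion = [2 * p + 1 for p in predictor]

# Curvilinear case: the criterion is fully determined by the predictor,
# but the relationship is U-shaped rather than a straight line.
curved_criterion = [p * p for p in predictor]

r_linear = pearson_r(predictor, linear_criterion)
r_curved = pearson_r(predictor, curved_criterion)

print(f"linear relationship:   r = {r_linear:.2f}")
print(f"U-shaped relationship: r = {r_curved:.2f}")
```

In the U-shaped case the predictor determines the criterion exactly, yet the Pearson coefficient is zero, so it would wrongly suggest the predictor has no validity.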
Validity-reliability proportionality
Reliability has a limiting influence on validity: we simply cannot validate an unreliable measure!
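The limiting influence of reliability can be written as a standard result from classical test theory. It is not stated in these notes, so the symbols are assumptions: $r_{xy}$ is the validity coefficient, $r_{xx}$ the reliability of the predictor, and $r_{yy}$ the reliability of the criterion. The reliabilities in the worked line are hypothetical.

```latex
r_{xy} \le \sqrt{r_{xx}\, r_{yy}}

% Worked example with hypothetical reliabilities r_{xx} = 0.64 and r_{yy} = 0.81:
r_{xy} \le \sqrt{0.64 \times 0.81} = \sqrt{0.5184} = 0.72
```

So with these reliabilities the observed validity coefficient could not exceed 0.72, however well the test captures the criterion in principle.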

Moderator variables
Variables like age, gender, and personality characteristics may help to predict performance for particular variables only. Keep them in mind!
Criterion contamination
Get rid of bias by measuring the contaminating influences. Then correct for this influence statistically by use of partial correlation.
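One way to make such a statistical correction is a first-order partial correlation, which removes the influence of a third variable from the predictor-criterion correlation. The sketch below uses the standard formula; the scenario (supervisor ratings contaminated by seniority) and all three correlation values are hypothetical.

```python
from math import sqrt

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with the influence of z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical correlations (illustrative values only):
r_test_rating = 0.50       # predictor test vs. supervisor rating (criterion)
r_test_seniority = 0.60    # predictor test vs. seniority (contaminant)
r_rating_seniority = 0.70  # supervisor rating vs. seniority

corrected = partial_correlation(r_test_rating, r_test_seniority, r_rating_seniority)
print(f"uncorrected validity coefficient: {r_test_rating:.2f}")
print(f"corrected for seniority:          {corrected:.2f}")
```

With these made-up values most of the apparent validity disappears once seniority is controlled for, which is the pattern criterion contamination produces.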


Reliability refers to:
• The degree of consistency between two measures of the same thing (Mehrens and Lehmann, 1987).
• The measure of how stable, dependable, trustworthy, and consistent a test is in measuring the same thing each time (Worthen et al., 1993).

Test-retest reliability
The same form of a test is administered on two or more separate occasions to the same group of examinees.
One drawback: examinees may adapt to the test format and thus tend to score higher on later administrations. Hence, careful implementation of the test-retest approach is strongly recommended.

Alternate-forms reliability
Two different forms of a test, based on the same content, are administered on one occasion to the same examinees.
An examinee who took Form A earlier cannot share the test items with another student who might take Form B later, because the two forms have different items.

Internal consistency
The reliability coefficient computed from the scores obtained on a single administration of a test or survey.
The same principle can be applied to a test. When no pattern is found in the students' responses, the test is probably too difficult and students are just guessing the answers at random.
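The notes do not name a specific internal-consistency coefficient. One widely used choice is Cronbach's alpha, sketched below as an assumption rather than the method the notes have in mind. Both response matrices are invented: in the first, students who solve harder items also solve the easier ones; in the second, correct answers are scattered as if by guessing.

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores has one row per examinee, one column per item."""
    k = len(item_scores[0])                       # number of items
    columns = list(zip(*item_scores))             # scores regrouped by item
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)

# Hypothetical 0/1 item scores for five students on four items.
patterned_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
guessing_responses = [
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 1, 0],
]

alpha_patterned = cronbach_alpha(patterned_responses)
alpha_guessing = cronbach_alpha(guessing_responses)
print(f"patterned responses: alpha = {alpha_patterned:.2f}")
print(f"guessing responses:  alpha = {alpha_guessing:.2f}")
```

The function divides by the variance of the total scores, so it fails if every examinee has the same total; a real implementation would need to handle that case.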
Split-half reliability
A measure of consistency in which a test is split in two and the scores on the two halves are compared with one another.
Example: Suppose you have a math test and divide its items into two parts. If you correlate scores on the first half of the items with scores on the second half, they should be highly correlated if the test is reliable.
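The split-half procedure can be sketched in a few lines. The item scores below are invented, and the split is first half versus second half as in the example. The final Spearman-Brown step is an addition not mentioned in these notes: it is the usual adjustment for the fact that each half is only half as long as the full test.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def split_half_reliability(item_scores):
    """Return (half-test correlation, Spearman-Brown estimate for the full test)."""
    half = len(item_scores[0]) // 2
    first_half = [sum(row[:half]) for row in item_scores]
    second_half = [sum(row[half:]) for row in item_scores]
    r_half = pearson_r(first_half, second_half)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction
    return r_half, r_full

# Hypothetical 0/1 scores for six students on a six-item math test.
math_test = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]

r_half, r_full = split_half_reliability(math_test)
print(f"correlation between halves: {r_half:.2f}")
print(f"Spearman-Brown estimate:    {r_full:.2f}")
```

Splitting by odd and even items instead of first and second half is a common alternative when items get harder toward the end of the test.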
Inter-rater reliability
When multiple people are giving assessments of some kind or are the subjects of some test, similar people should produce the same resulting scores.
Example: Two people may be asked to categorize pictures of animals as being dogs or cats. A perfectly reliable result would be that they both classify the same pictures in the same way.
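Agreement between two raters can be quantified in more than one way. The sketch below shows simple percent agreement and Cohen's kappa, which corrects for the agreement expected by chance; kappa is an addition not named in these notes, and the ratings are invented.

```python
from collections import Counter

def percent_agreement(ratings_a, ratings_b):
    """Share of items that both raters put in the same category."""
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)

def cohens_kappa(ratings_a, ratings_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(ratings_a)
    observed = percent_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both raters pick the same category independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical classifications of ten pictures by two raters.
rater_1 = ["dog", "dog", "cat", "cat", "dog", "cat", "dog", "cat", "dog", "cat"]
rater_2 = ["dog", "dog", "cat", "cat", "cat", "cat", "dog", "cat", "dog", "cat"]

agreement = percent_agreement(rater_1, rater_2)
kappa = cohens_kappa(rater_1, rater_2)
print(f"percent agreement: {agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

Kappa is undefined when chance agreement equals 1 (for example, when both raters use a single category throughout), so the function would raise a division error in that case.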


  1. Insufficient number of tasks
Remedy: Accumulate results from several assessments.
  2. Poorly structured assessment procedures
Remedy: Carefully define the nature of the tasks, the conditions for obtaining the assessment, and the criteria for scoring and judging the results.
  3. Dimensions of performance are specific to the tasks
Remedy: Increase the generalizability of performance by selecting tasks that have dimensions like those in similar tasks.
  4. Inadequate scoring guides for judgemental scoring
Remedy: Use scoring rubrics or rating scales that specifically describe the criteria and levels of quality.
  5. Scoring judgements that are influenced by personal bias
Remedy: Check scores against those of an independent judge. Receive training in scoring and rating if possible.

The Relationship between Validity and Reliability
At best, we have a measure that has both high validity and high reliability. It yields consistent results in repeated application and it accurately reflects what we hope to represent.
It is possible to have a measure that has high reliability but low validity - one that is consistent in getting bad information or consistent in missing the mark. It is also possible to have one that has low reliability and low validity - inconsistent and not on target.
Finally, it is not possible to have a measure that has low reliability and high validity - you can't really get at what you want or what you're interested in if your measure fluctuates wildly.


