Measurement Toolkit

Contents

Introduction
Validity vs. reliability
Forms of validity in population health sciences
Assessing validity
Confusion about validity: validity for what?
References

Introduction

The aim of any assessment of diet [1,2], physical activity [4-7], or anthropometry [8,9] is to estimate the true value accurately. Any estimate, even one from the most accurate tool or overall method, consists of the true value plus error. Validity is the extent to which the estimated value matches the true value or, put another way, the extent to which a method measures what it is supposed to measure.

Estimated Value = True Value + Total Error

Since we cannot know the true value with absolute certainty, the interpretation of validity cannot be reduced to the question: is this method valid or not? Instead, validity differs according to the variable of interest, study design, population, and context. Validity can vary when:

Two different methods are used to assess the same phenomenon (e.g. self-report vs. laboratory weighing scale measures of adolescent body mass).
The same method is used to assess two different phenomena (e.g. accelerometer estimates of activity intensity during running vs. cycling).
The same method is applied in different contexts or populations (e.g. self-report of body mass in adolescent vs. adult populations).

Poor validity is typically the result of systematic error, which distorts the estimated value in a particular direction away from the true value. One example is measuring the height of study participants while they are still wearing shoes. The shoes consistently produce values greater than true height and thus reduce the truthfulness of the resulting data.

Validity vs. reliability

Validity is closely linked to reliability. However, whereas reliability relates to the consistency of results, validity relates to their accuracy. It is therefore possible for a highly reliable method to have limited validity.

In the height example above, repeated measurements of height of the same individual would be the same each time and therefore reliable. However, the measurement would not be valid due to the underlying poor agreement with the true height caused by the shoes. Reliability and validity are described visually by the target example in Figure C.2.1 below.

Figure C.2.1 Relationships between reliability and validity at an individual level.
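
The height example can be sketched as a small simulation. This is an illustrative sketch only: the true height, the shoe offset, and the size of the random error are hypothetical values chosen for the example, not figures from any study.

```python
import random
import statistics


def measure_height(true_height_cm, shoe_offset_cm, noise_sd_cm, n_repeats, rng):
    """Simulate repeated readings: true value + systematic error + random error."""
    return [
        true_height_cm + shoe_offset_cm + rng.gauss(0.0, noise_sd_cm)
        for _ in range(n_repeats)
    ]


rng = random.Random(42)   # fixed seed so the sketch is reproducible
true_height = 170.0       # hypothetical true height (cm)

readings = measure_height(
    true_height,
    shoe_offset_cm=2.5,   # hypothetical systematic error from shoes
    noise_sd_cm=0.2,      # hypothetical small random error
    n_repeats=200,
    rng=rng,
)

spread = statistics.stdev(readings)             # small spread: reliable
bias = statistics.mean(readings) - true_height  # far from zero: not valid

print(f"SD of repeated readings: {spread:.2f} cm")
print(f"Mean error vs. true height: {bias:.2f} cm")
```

The repeated readings cluster tightly, so the method looks reliable, yet their mean sits well above the true height because taking more readings does not remove the systematic error.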

It is possible for a method to be unreliable at the individual level, but provide valid estimates at the group level using the mean, as shown in Figure C.2.2 below. Such a method would not be valid at the individual level.

Figure C.2.2 Relationships between reliability and validity at the group level. Neither reliable nor valid at the individual level, but valid at the group level because the mean of all values matches the target.
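
The group-level case can be illustrated in the same way. The sketch below assumes a hypothetical method with no systematic error but a large random error, and hypothetical body mass values; it is not based on real data.

```python
import random
import statistics

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical true body mass (kg) for 2000 people.
true_values = [rng.gauss(70.0, 10.0) for _ in range(2000)]

# Hypothetical method: no systematic error, large random error (SD 8 kg).
estimates = [t + rng.gauss(0.0, 8.0) for t in true_values]

# Individual level: how far is each estimate from that person's true value?
individual_errors = [e - t for e, t in zip(estimates, true_values)]
mean_abs_individual_error = statistics.mean(abs(d) for d in individual_errors)

# Group level: how far is the estimated mean from the true mean?
group_mean_error = statistics.mean(estimates) - statistics.mean(true_values)

print(f"Typical individual error: {mean_abs_individual_error:.1f} kg")
print(f"Error in the group mean:  {group_mean_error:.2f} kg")
```

Individual estimates are often several kilograms off, while the random errors largely cancel in the group mean.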

Forms of validity in population health sciences

Validity is a broad concept that has been defined in different ways and for different purposes. Some of the more commonly used forms of validity are described below [4].

Face validity

The degree to which a method appears to provide the desired information about the variable it is designed to measure. This is typically a more qualitative judgement which, given the multi-dimensional nature of diet, physical activity, and anthropometry, can be an important step in determining whether a method is fit for purpose.

Content validity (also known as logical validity)

The extent to which the method is considered to assess the specific aspects of the phenomenon it is intended to assess. This is important when measuring health behaviours since they can be broken down into various dimensions and domains. Similar to face validity, this is a more qualitative judgement made by considering the target variable to be measured alongside the dimensions captured by the method.

Construct validity

The extent to which a method measures the theoretical construct it is designed to measure. It is demonstrated when the method yields data as might be expected, given its intended purpose.

For example, a questionnaire assessing occupational physical activity could be expected to produce higher values for bus conductors than for bus drivers, or lumberjacks as compared with office workers.
If the resulting data were to correlate well with an assessment of physical fitness – such as maximal oxygen consumption – this too would be considered evidence of construct validity.

Criterion-related validity

The extent to which estimated values relate to those derived from a comparison or ‘criterion’ method. The criterion is preferably a method of very high validity that is thought to provide the closest approximation of the true value, commonly referred to as a ‘gold standard’ method. For example:

The criterion validity of a new method to assess total body fat, such as a novel set of skinfold thickness equations, could be evaluated by comparing its data against scores derived by the 4-component body composition model, which has been used as the gold standard method.
By measuring the extent to which data derived by the new skinfold equations relate to and/or agree with those from the criterion method, we can better understand how to interpret data from the new method.

Convergent validity

Like criterion validity, this is the extent to which estimated values match those derived from a comparison method, but one that is not generally accepted to be the gold standard.

Assessing validity

A validity study can assess the extent to which a method produces estimated values which are consistent with ‘true’ values. Typically, the method being examined and another method - ideally a gold standard - are used to assess the same phenomenon, followed by evaluation of the data from each.

Validity for absolute and relative measures

The relationship between the two measures can be expressed in absolute or relative terms:

Absolute validity refers to the agreement between two sets of data measuring the same phenomenon with the same units.
Relative validity is the degree to which two methods, irrespective of units, rank individuals in the same order.

A type of measurement may not be valid for capturing absolute levels of exposure yet still be valid for capturing relative differences between individuals in a study population. For example, food frequency questionnaires, which assess how often selected foods are consumed, are often used without any assessment of portion sizes. Absolute levels of nutrient intake therefore cannot be validly estimated.

Even when absolute nutrient intakes are not valid, the ranking of individuals by nutrient intake can be valid and can therefore be used in a study of a lifestyle-disease association. Depending on the research question, validity for absolute measures is not always necessary.
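
The difference between absolute and relative validity can be sketched with simulated data. The example assumes a hypothetical questionnaire that captures only 60% of true intake plus a small random error; the intake values, the 60% figure, and the helper functions `ranks`, `pearson`, and `spearman` are all illustrative and written for this sketch.

```python
import math
import random
import statistics


def ranks(values):
    """Rank values from 1 to n (assumes no ties, as with continuous data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical true nutrient intake (g/day) for 500 people.
true_intake = [rng.gauss(80.0, 15.0) for _ in range(500)]

# Hypothetical questionnaire: captures 60% of intake plus small random error.
ffq_intake = [0.6 * t + rng.gauss(0.0, 3.0) for t in true_intake]

# Absolute validity: do the two sets of values agree in the same units?
mean_difference = statistics.mean(ffq_intake) - statistics.mean(true_intake)

# Relative validity: are individuals ranked in the same order?
rank_agreement = spearman(true_intake, ffq_intake)

print(f"Mean difference (questionnaire - true): {mean_difference:.1f} g/day")
print(f"Spearman rank correlation: {rank_agreement:.2f}")
```

Under these assumptions the questionnaire underestimates intake substantially, so it is not valid in absolute terms, but it still orders individuals in much the same way as the true values.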

Absolute validity can be separated according to whether the interpretations are to be made about groups or individuals.
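
One common way to summarise absolute agreement at both levels is the limits-of-agreement approach of Bland and Altman [11]: the mean difference describes agreement at the group level, and the limits (mean difference ± 1.96 standard deviations of the differences) describe how far individual values may disagree. The sketch below uses made-up paired body fat values, and the function name `bland_altman` is chosen for this example.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of the paired differences (a - b); the limits of
    agreement are bias +/- 1.96 standard deviations of the differences.
    """
    differences = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(differences)
    sd = statistics.stdev(differences)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Hypothetical paired body fat estimates (%): new method vs. criterion.
new_method = [24.1, 30.5, 18.2, 27.9, 35.0, 22.4, 29.3, 20.8]
criterion = [25.0, 29.0, 20.1, 27.0, 37.2, 21.5, 31.0, 19.9]

bias, lower, upper = bland_altman(new_method, criterion)

print(f"Group-level bias: {bias:.2f} percentage points")
print(f"Individual-level limits of agreement: {lower:.2f} to {upper:.2f}")
```

A small bias with wide limits of agreement would suggest a method that is acceptable for group means but not for interpreting individual values.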

What if no gold standard is available?

There are often circumstances in which no gold standard method is available for use as the criterion [10]. This may be because:

No widely accepted gold standard method exists (e.g. for habitual dietary energy intake).
Gold standard methods which do exist are inaccessible, impracticable, or unethical. For example, the use of computed tomography to quantify adipose tissue in childhood research studies is limited due to ionising radiation exposure.

In such instances, the validity of a method can only be estimated by comparing its data with those of another method with known systematic errors and biases [11]. This type of comparison indicates convergent validity.

When no gold standard method is available, the comparison method should ideally rely on a different type of measurement, to avoid introducing correlated errors. For example, comparing a 24-hour dietary recall with an estimated food diary carries the risk that both methods under-report in a similar way, so that their errors are correlated.
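
The effect of correlated errors can be sketched with a simulation. The example assumes that each person has a tendency to under-report that affects both self-report methods equally, and compares this with a hypothetical third method whose error is of the same size but independent. All values, and the helper `pearson`, are illustrative.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(3)  # fixed seed so the sketch is reproducible
n = 2000

# Hypothetical true energy intake (MJ/day).
true_energy = [rng.gauss(9.0, 1.5) for _ in range(n)]

# Hypothetical person-specific under-reporting shared by both self-reports.
shared_error = [rng.gauss(-1.5, 1.0) for _ in range(n)]

recall = [t + s + rng.gauss(0.0, 0.8) for t, s in zip(true_energy, shared_error)]
diary = [t + s + rng.gauss(0.0, 0.8) for t, s in zip(true_energy, shared_error)]

# Hypothetical comparison method with errors of the same size that are
# independent of the recall's errors.
independent = [
    t + rng.gauss(-1.5, 1.0) + rng.gauss(0.0, 0.8) for t in true_energy
]

r_recall_diary = pearson(recall, diary)
r_recall_independent = pearson(recall, independent)

print(f"Recall vs. diary (correlated errors):       r = {r_recall_diary:.2f}")
print(f"Recall vs. independent-error method:        r = {r_recall_independent:.2f}")
```

Under these assumptions, the recall correlates more strongly with the diary than with the independent-error method, even though all three methods are equally far from the truth. The shared error inflates the apparent agreement.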

Confusion about validity: validity for what?

Even if the validity of a tool has been assessed through comparison with a gold standard, it should not be assumed that the tool is appropriate for use in every research scenario. In practice this is rarely the case: validity is tied to the overall method and to the intended purpose, population, and context in which it is applied.

The following should be considered when assessing the validity of a method to measure any aspect of diet, physical activity or anthropometry:

The characteristics of the sample used in the validation study
The scientific rigour of the validity study
The dimension(s) measured in the validity study and those of interest for the research question – i.e. face/content validity
The study design being used to answer the research question, and whether absolute or relative validity is necessary
Agreement (absolute) or association (relative) between the comparison and the method being assessed – i.e. criterion/convergent validity

Internal and external validity

The sample used in a validation study, or in any other type of study, should be reviewed to ascertain whether the results are likely to be generalisable to other populations or contexts. This is known as external validity. Sample characteristics such as age, sex, ethnic origin, and socio-economic status may all limit generalisability. For example, a physical activity questionnaire that is valid for adults may not be suitable for use in a youth population.

In contrast, internal validity is the extent to which the study or estimate is free from bias or systematic error – i.e. the appropriateness and rigour of the study design, data collection protocols, and/or analysis.

Face and content validity

Another important consideration is whether the criterion used to evaluate a method would be suitable for answering your research question. For example, validity reported from a comparison with doubly labelled water (the gold standard estimate of overall energy expenditure) would not be sufficient evidence to support the use of a questionnaire to estimate subcategories of activity such as active commuting. A method with acceptable validity for one dimension of behaviour may not be relevant or generalisable to another dimension.

Suitability for study design and research question

It is very important to recognise that the degree of validity of a method may be more or less acceptable for studies designed for different purposes. Table C.2.1 illustrates different validity for different outcomes assuming use of a ‘gold standard’ method, such as:

Assessment of 24-hour urinary sodium excretion that precisely captures exposure to dietary sodium
24-hour calorimetry that examines energy expenditure

Table C.2.1 illustrates that even if a perfect method is used, validity of such methods varies by their application.

Table C.2.1 Theoretical validity of a ‘gold standard’ measurement by exposure type.

Number of assessments per person (number of participants)	Once (n = 5)	1000 times* (n = 5)	Once (n = 50,000)†	1000 times* (n = 50,000)†
Internal validity
Exposure on a specific day of each person	Valid‡	Valid‡	Valid‡	Valid‡
Habitual exposure* of each person	?	Valid‡	?	Valid‡
External validity
Average habitual exposure* of the population	?	?	Valid‡	Valid‡
Variation of habitual exposure* of the population	?	?	?	Valid‡
% of the population meeting a certain public guideline or clinical cut-off	?	?	?	Valid‡

* Assumed to be sufficient to represent a habitual condition over a long period in a person.
† Assumed to be sufficient to represent the source population.
‡ Assumed to have no change in participant’s characteristics in response to each measurement and to have no errors in measurement, processing, and analysis.

For example, gold standard measurement by 24-hour calorimetry in 50,000 people can capture the energy expenditure of each individual on a specific day. In addition, even though energy expenditure varies over time, the average of the 50,000 measurements can be a valid estimate of the average habitual energy expenditure of the source population.

However, those 50,000 measurements do not provide a valid measure of the variability of habitual energy expenditure between individuals. This is because variability estimated from a single measurement per person mixes between-person and within-person variability (the latter being a matter of reliability), so the between-person component cannot be isolated. If a measurement has little or no within-person variability (e.g. knee height), measuring many individuals just once does allow inference about between-individual variability.
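
The mixing of between-person and within-person variability can be sketched numerically. The simulation below assumes hypothetical standard deviations for the two sources of variation and an error-free measurement on a single day per person; none of the numbers come from real data.

```python
import random
import statistics

rng = random.Random(11)  # fixed seed so the sketch is reproducible
n_people = 5000

between_sd = 1.0  # hypothetical SD of habitual expenditure between people (MJ/day)
within_sd = 1.5   # hypothetical day-to-day SD within a person (MJ/day)

# Each person's habitual (long-term average) energy expenditure.
habitual = [rng.gauss(10.0, between_sd) for _ in range(n_people)]

# One error-free measurement on a single day for each person.
single_day = [h + rng.gauss(0.0, within_sd) for h in habitual]

# Expected to be near between_sd ** 2.
var_habitual = statistics.variance(habitual)

# Expected to be near between_sd ** 2 + within_sd ** 2.
var_single_day = statistics.variance(single_day)

# The population mean is still estimated well from single-day values.
mean_gap = statistics.mean(single_day) - statistics.mean(habitual)

print(f"Variance of habitual values:   {var_habitual:.2f}")
print(f"Variance of single-day values: {var_single_day:.2f}")
print(f"Difference in means:           {mean_gap:.3f}")
```

With one measurement per person the observed variance overstates the between-person variance, while the mean remains close to the habitual mean. Separating the two components would need repeated measurements on each person.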

References

1. Johnson F, Wardle J, Griffith J. The Adolescent Food Habits Checklist: reliability and validity of a measure of healthy eating behaviour in adolescents. Eur J Clin Nutr. 2002;56(7):644-9.
2. Charlton KE, Steyn K, Levitt NS, Jonathan D, Zulu JV, Nel JH. Development and validation of a short questionnaire to assess sodium intake. Public Health Nutr. 2008;11(1):83-94.
3. Albanes D, Conway JM, Taylor PR, Moe PW, Judd J. Validation and comparison of eight physical activity questionnaires. Epidemiology. 1990;1(1):65-71.
4. Kelly P, Fitzsimons C, Baker G. Should we reframe how we think about physical activity and sedentary behaviour measurement? Validity and reliability reconsidered. Int J Behav Nutr Phys Act. 2016;13:32.
5. Kurtze N, Rangul V, Hustvedt BE. Reliability and validity of the international physical activity questionnaire in the Nord-Trondelag health study (HUNT) population of men. BMC Med Res Methodol. 2008;8:63.
6. Ommundsen Y, Page A, Ku PW, Cooper AR. Cross-cultural, age and gender validation of a computerised questionnaire measuring personal, social and environmental associations with children's physical activity: the European Youth Heart Study. Int J Behav Nutr Phys Act. 2008;5:29.
7. Rennie KL, Wareham NJ. The validation of physical activity instruments for measuring energy expenditure: problems and pitfalls. Public Health Nutr. 1998;1(4):265-71.
8. Stolk RP, Wink O, Zelissen PM, Meijer R, van Gils AP, Grobbee DE. Validity and reproducibility of ultrasonography for the measurement of intra-abdominal adipose tissue. Int J Obes Relat Metab Disord. 2001;25:1346-51.
9. Rolfe EDL, Brage S, Sleigh A, Finucane F, Griffin SJ, Wareham NJ, Ong KK, Forouhi NG. Validity of ultrasonography to assess hepatic steatosis compared to magnetic resonance spectroscopy as a criterion method in older adults. PLoS One. 2018;13(11):e0207923.
10. Schmidt ME, Steindorf K. Statistical methods for the validation of questionnaires: discrepancy between theory and practice. Methods Inf Med. 2006;45(4):409-13.
11. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307-10.