A solution to limitations of cognitive testing in children with intellectual disabilities: the case of fragile X syndrome

Intelligence testing in children with intellectual disabilities (ID) has significant limitations. The normative samples of widely used intelligence tests, such as the Wechsler Intelligence Scales, rarely include enough subjects with ID to provide sensitive measurement in the very low ability range, and the tests are highly subject to floor effects. The IQ measurement problems in these children prevent characterization of strengths and weaknesses, yield poorer estimates of cognitive abilities in research applications, and, in clinical settings, limit the utility of testing for assessment, prognosis estimation, and intervention planning. Here, we examined the sensitivity of the Wechsler Intelligence Scale for Children (WISC-III) in a large sample of children with fragile X syndrome (FXS), the most common cause of inherited ID. The WISC-III was administered to 217 children with FXS (age 6–17 years, 83 girls and 134 boys). Using raw norms data obtained with permission from the Psychological Corporation, we calculated normalized scores representing each participant's actual deviation from the standardization sample using a z-score transformation. To validate this approach, we compared correlations of the new normalized scores and of the usual standard scores with a measure of adaptive behavior (Vineland Adaptive Behavior Scales) and with a genetic measure specific to FXS (FMR1 protein or FMRP). The distribution of WISC-III standard scores showed significant skewing with floor effects in a high proportion of participants, especially males (64.9%–94.0% across subtests). With the z-score normalization, the flooring problems were eliminated and scores were normally distributed. Furthermore, the correlations between cognitive performance and adaptive behavior, and between cognition and FMRP, were substantially stronger when using these normalized scores than when using the usual standardized scores.
The results of this study show that meaningful variation in intellectual ability in children with FXS, and probably other populations of children with neurodevelopmental disorders, is obscured by the usual translation of raw scores into standardized scores. A method of raw score transformation may improve the characterization of cognitive functioning in ID populations, especially for research applications.


Introduction
The accurate measurement of cognitive capacity in children with intellectual disabilities (ID) is important for determining appropriate diagnosis, service eligibility, individual strengths and weaknesses, treatment and education planning, and for research studies on these populations that rely heavily on IQ as a critical variable of interest. ID is a disability, originating before the age of 18, characterized by significant limitations both in intellectual functioning and in adaptive behavior as expressed in conceptual, social, and practical adaptive skills (American Association of Intellectual and Developmental Disabilities; www.aaidd.org). The Diagnostic and Statistical Manual of Mental Disorders (DSM-IV; American Psychiatric Association [1]) classifies ID in the following degrees of severity based on adaptive functioning and IQ: Mild (50-55 to approximately 70; 85% of the ID population), Moderate (35-40 to 50-55; 10% of ID), Severe (20-25 to 35-40; 3%-4% of ID), and Profound (below 20 or 25; 1%-2% of ID) [1]. Intellectual functioning is defined as IQ obtained by assessment with a standardized, individually administered intelligence test such as the Wechsler Intelligence Scales, the Stanford-Binet, or the Kaufman Assessment Battery. Although the DSM-IV includes classifications for more impaired individuals, it is very challenging to measure the IQ reliably and accurately in subjects with ID below the Mild range (IQ 50-70). Indeed, a major limitation of these tests is that they do not typically measure IQ below 40 or 50, and that subtest standardized scores, which contribute to the overall score, are highly subject to floor effects and poor estimates of true ability.
A further complication and limitation is that whereas IQ tests generally do not measure functioning more than 4 standard deviations below average (IQ=40), measures of adaptive behavior, such as the Vineland Adaptive Behavior Scales (VABS) [2], typically have a standard score floor of 20 (over 5 standard deviations below average), making comparisons between cognitive capacity and daily functioning impossible for these individuals. The lack of sensitivity of intelligence tests in this range of functioning is typically due to a relative dearth of children with ID of varying levels of severity in the standardization samples, and to limitations in the range of difficulty of test items and tasks that prevent measurement of lower levels of ability.
Notably, test publishers have recently made some improvements in the normative sampling of lower functioning children (Stanford-Binet, Fifth Edition [3]; Differential Ability Scales, Second Edition, DAS-II [4]), and one of these tests now has a lower IQ limit of 30 (DAS-II).
Clinical and research experience with intelligence testing in children with neurodevelopmental disorders shows that meaningful variation in performance is often obscured by flooring effects when raw scores are converted to standardized scores based on the normative data in test manuals. We can use the performance of two 15-year-old children with ID on the Wechsler Intelligence Scale for Children, Third Edition (WISC-III) and the VABS to illustrate this point. On both of these measures, IQ and VABS standardized scores have a mean of 100 and standard deviation of 15; on the WISC-III, subtest standardized scores have a mean of 10 and standard deviation of 3, with a range of 1 to 19. "Sam" is 15 years of age, speaks in one- to two-word utterances, receives a VABS Adaptive Behavior Composite (ABC) score of less than 20 (below the 0.1 percentile) and a Full Scale IQ (FSIQ) of 40 (the floor of the test). On the WISC-III Vocabulary subtest, for example, he obtains a raw score of 1, which converts to a standardized score of 1 (in response to "What is a clock?" he answers, "Time.", and then has no further correct responses). "Joe" is a verbally fluent 15-year-old with a VABS ABC score of 60. He obtains a Vocabulary raw score of 16 and responds to questions with complex phrases or complete sentences; however, his raw score also converts to a standardized score of 1, the same as Sam's. Joe obtains a FSIQ of 42, just 2 points higher than Sam.
Floor effects and other measurement problems in intelligence testing with children with ID are common; however, with a few exceptions such as those below, they are not often recognized or discussed in published studies. In a longitudinal study of a large sample of adults with mental retardation using the Wechsler Adult Intelligence Scale-Revised (WAIS-R), Facon [5] reported mean IQ scores between 54 and 58 for four different age bands; however, the scores and distributions were indicative of significant flooring effects that the authors acknowledged as a limitation in their discussion. In their analysis, the authors chose to use subtest raw scores instead of the standardized scores; they re-standardized the raw scores relative to their entire sample and summed these scores to create new composite verbal and performance scores for each subject. Another example comes from a study of 195 individuals with Down syndrome who were longitudinally assessed with the Stanford Binet, Fourth Edition. The authors reported that 37% of the available test results were assigned the lowest possible score of 36 [6] but that these individuals demonstrated highly variable levels of performance despite flat standardized score profiles.
Our research centers have been studying individuals with FXS, the leading cause of inherited ID, for the past 25 years. FXS is a single gene disorder caused by a mutation in the fragile X mental retardation 1 (FMR1) gene on the X chromosome at Xq27.3. This mutation results from a trinucleotide expansion preventing normal transcription, and leads to reduction or absence of the FMR1 protein (FMRP) [7,8] and consequent abnormal brain development, including aberrant dendritic arborization and synaptic plasticity [9][10][11][12][13]. In full mutation females, FMRP is usually expressed only by the normal allele carried on the active X chromosome. As a result, females tend to be higher functioning than males with FXS, although there is wide variability from significant ID to normal or above average IQ. Variable FMRP expression also results from mosaicism, where transcriptional silencing of the gene does not occur in all cells, either because of varying sizes of the repeat expansion or variation in methylation. Although more frequent in males, mosaicism also occurs in females with FXS. Individual differences in FMRP production in the brain as a result of these factors are thought to account for a significant proportion of the variability in IQ in individuals with FXS.
We have sought to understand the impact of gene function, brain function, and environmental variation on cognition and behavior in FXS, with the ultimate goal of identifying effective interventions based on this information. However, our research and clinical work has been significantly limited by a lack of IQ measurement sensitivity, as described above, in a substantial portion of individuals with this disorder. For example, in one study, designed to determine genetic and environmental factors contributing to IQ (as measured by the Wechsler scales), 43% of boys with FXS scored at the floor on all 12 subtests, and all of these children obtained a FSIQ of 40 [14][15][16]. Although these individuals demonstrated considerable variability in their cognitive abilities and level of adaptive behavior [15], their individual strengths and weaknesses and variation within the group were not reflected in their standardized scores. In an attempt to overcome this problem, in a recent study [17] we abandoned standard scores altogether, and employed raw WISC-III subtest scores to examine the development of intellectual functioning in children with FXS. Using raw scores, and covarying for age, we found that intellectual functioning in children with FXS developed approximately two times slower than typically developing siblings over the age range of 6 to 16 years. While raw scores may offer significant advantages over standard scores (e.g., no floor effect, normal distribution of scores), the WISC-III manual does not contain raw subtest scores from the normative population. Thus, investigators cannot use raw subtest scores in their analyses without the inclusion of a well-matched comparison group.
Fragile X offers a unique opportunity to examine the sensitivity of intelligence testing in an ID population. The specific genetic etiology has been identified, the neuroanatomical morphology has been well-described, and the cognitive and behavioral phenotype is well known and relatively consistent. Although there are differences in FMRP expression in the brain compared to blood, the gene-dose of the mutation can be estimated by measurement of FMRP in lymphocytes. The degree of FMRP deficit can then be correlated with the cognitive deficit as measured by standardized testing [18,19]. Thus, FXS is a model for examining assumptions about measurement of cognition of individuals with mental impairment that can then be tested in other neurodevelopmental disorders (e.g. autism, Down syndrome) and more heterogeneous populations (e.g. children with idiopathic ID).
Here, we examined the sensitivity of the WISC-III, one of the most widely used intelligence tests, in a large sample of children and adolescents with FXS. First, we show the distribution of the usual standard scores in this sample of boys and girls. Next, we present a method for calculating new normalized scores representing each child's actual deviation from the standardization sample, based on the raw score descriptive statistics obtained with permission from the publisher of the WISC-III (Psychological Corporation, San Antonio, TX). Finally, we compare the distribution of the normalized scores to the usual standardized scores, and correlate each of these with another measure of developmental level, the Vineland Adaptive Behavior Scales, and the degree of FMRP deficit.

Participants
Participants included 217 children with the fragile X full mutation ranging in age from 6 to 17 years (83 girls, mean age = 10.94 ± 3.01 years; 134 boys, mean age 11.04 ± 2.59 years). Twelve girls (14.5%) and 44 boys (32.8%) had repeat size mosaicism and 1 girl (1.2%) and 13 boys (9.7%) had methylation mosaicism. Sixty-nine girls and 103 boys participated in studies conducted at the Center for Interdisciplinary Brain Sciences Research at Stanford University (PI, A. Reiss) and 14 girls and 31 boys in studies at the M.I.N.D. Institute at University of California Davis (PIs R. Hagerman and D. Hessl). Note that participants resided in various locations throughout the United States and Canada and were from a wide range of socioeconomic backgrounds as previously described [14][15][16]. The ethnic distribution was 86.6% Caucasian, 5.1% Hispanic, 1.9% African American, 0.9% Asian, 0.5% Native American, and 5.1% other or unknown. The mothers' highest level of education obtained was 33.3% college degree, 32.8% partial college or specialized training, 17.7% high school degree or GED, 14.6% graduate professional training, and 1.6% partial high school. FSIQ ranged from 40 to 123 (mean 50.0, SD 19.5). The parents of all participants provided written consent, and participants provided assent when possible, according to protocols approved by Institutional Review Boards at Stanford University and U.C. Davis.

Measures
Wechsler Intelligence Scale for Children, Third Edition (WISC-III; [20]) The WISC-III is a standardized test of intellectual aptitude for children between ages 6 and 16 years, 11 months. It is an individually administered clinical instrument with 13 subtests (all but the optional Mazes subtest were used in the study), each of which assesses either Verbal or Performance (perceptual-motor) abilities. A description of abilities addressed by each subtest is shown in Table 1. Each subtest generates a raw score, which then yields a standardized score based on normative data, and these standardized scores are combined and translated into overall Verbal IQ (VIQ), Performance IQ (PIQ) and Full Scale (FSIQ) scores. The WISC-III standardization sample included 2200 individuals, including 200 children in each of 11 age groups between ages 6 and 16 years. The groups were stratified by sex, race/ethnicity, geographic region, and parent education based on the 1988 U.S. Bureau of the Census. In the standardization sample, 7% were classified as learning disabled, speech/language impaired, emotionally disturbed or physically impaired. Published materials do not include information on any children with ID in the sample.
Vineland Adaptive Behavior Scales (VABS; [2]) The VABS is a widely used tool for assessing an individual's ability to care for himself or herself personally and socially. The VABS was designed to be administered as a semi-structured informant interview for assessing strengths and weaknesses of individuals from birth through 18 years 11 months or low-functioning adults. Part of the utility of this measure is the ability to gain accurate reporting from a responder who is familiar with a person's behavior. The interview lasts approximately 60 min and contains 297 items. Adaptive behavior is measured in four to five domains: Communication (receptive, expressive and written), Daily Living Skills (personal, domestic, and community), Socialization (interpersonal relationships, play and leisure time, and coping skills), and Motor Skills (gross motor and fine motor; completed only for the youngest children). An Adaptive Behavior Composite (ABC) is yielded by combining scores on each of the four (or five) main domains. Standardization samples of handicapped and non-handicapped individuals provided normative data for the VABS and included 3,000 individuals between birth and 18 years 11 months, stratified by sex, race or ethnic group, community size, geographical region, and parents' education level.
Fragile X diagnosis and FMRP analysis Southern blot analyses were performed according to procedures described by Taylor and colleagues [21]. FMRP expression from peripheral blood was determined by immunocytochemistry as the percent of FMRP-positive lymphocytes [22][23][24].

Statistical methods
Normalized scores The normalized scores were obtained using an age-dependent (within each population age band) z-score transformation as follows. Descriptive statistics (means and standard deviations) of subtest raw scores for each age band (6 years, 0 months to 6 years, 3 months; 6 years, 4 months to 6 years, 7 months, etc.) from the WISC-III standardization sample [20] were obtained with written permission from the Psychological Corporation (San Antonio, TX) for the purposes of this study. (Standardization data from the Wechsler Intelligence Scale for Children - Third Edition. Copyright © 1990 by Harcourt Assessment, Inc. Used with permission. All rights reserved). Denote the mean and standard deviation of a specific WISC-III subtest raw score in the $j$th age band by $\mu_j$ and $\sigma_j$, respectively. The normalized score for individual $i$ falling into the $j$th age band is $z_{ij} = (r_{ij} - \mu_j)/\sigma_j$, where $r_{ij}$ is the subtest raw score (note that data for each population age band contain representation of sex, race/ethnicity, education level and geographic region).
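As an illustration only, the transformation can be sketched in Python. The norms table below holds placeholder values, not the actual WISC-III standardization statistics (which are not reproduced in this article), and the names `NORMS`, `age_band_start` and `normalized_score` are hypothetical. The 4-month band width is an assumption inferred from the two example age bands listed above.

```python
# Sketch of the age-banded z-score transformation z_ij = (r_ij - mu_j) / sigma_j.

FIRST_BAND_START_MONTHS = 6 * 12  # youngest band starts at 6 years, 0 months
BAND_WIDTH_MONTHS = 4             # ASSUMPTION: inferred from "6y0m-6y3m", "6y4m-6y7m"

# (subtest, age-band start in months) -> (mean, SD) of raw scores.
# PLACEHOLDER numbers for illustration; these are not real WISC-III norms.
NORMS = {
    ("Block Design", 144): (40.0, 10.0),  # band starting at 12 years, 0 months
}


def age_band_start(age_months):
    """Return the first month of the age band containing age_months."""
    if age_months < FIRST_BAND_START_MONTHS:
        raise ValueError("age is below the youngest age band")
    offset = age_months - FIRST_BAND_START_MONTHS
    return FIRST_BAND_START_MONTHS + (offset // BAND_WIDTH_MONTHS) * BAND_WIDTH_MONTHS


def normalized_score(raw, subtest, age_months, norms=NORMS):
    """Deviation of a raw score from the age-band mean, in SD units."""
    mu, sigma = norms[(subtest, age_band_start(age_months))]
    if sigma <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw - mu) / sigma


# A child aged 12 years, 1 month (145 months) with a Block Design raw score of 3.
z = normalized_score(3, "Block Design", 12 * 12 + 1)
print(round(z, 2))  # -3.7 with the placeholder norms above
```

Because the score is computed directly from the raw score, two children whose raw scores both fall below the lowest tabled standardized score still receive different normalized scores.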
For example, a 12 year, 1 month old child obtains a Block Design subtest raw score of 3. In the standardization sample, for children 12 years, 0 months to 12 years, [...]
Analysis Summary and graphical analyses were used to characterize the raw, standardized and normalized scores. Downstream bivariate association/correlation analyses of Vineland ABC score and FMRP with normalized subtest scores were based on Pearson correlation, as each variable is quantitative and continuous. We considered these correlative analyses to be descriptive, hence no formal p-value adjustment was used, although the majority of p-values based on normalized scores remained significant after false discovery rate (FDR) adjustment [25]. Finally, we compared the distribution of intellectual ability classifications (Mild ID, Moderate ID, Borderline, Low Average, etc.) determined by FSIQ with the assumed classification determined by the mean normalized score.
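The correlation step and the FDR adjustment can be sketched as follows. This is a plain-Python illustration, not the analysis code used in the study: the function names are hypothetical, the input values are made up, and the adjustment shown is the Benjamini-Hochberg step-up procedure, assumed here to be the FDR method cited as [25].

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up normalized scores and Vineland ABC scores for five children.
z_scores = [-4.1, -3.6, -3.0, -2.2, -1.5]
vineland = [24, 31, 38, 52, 60]
print(round(pearson_r(z_scores, vineland), 3))

# Made-up per-subtest p-values, adjusted for multiple comparisons.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

A floored standardized score is constant across many children, so it contributes no covariance to the numerator of the correlation; this is one way to see why associations shrink when floored scores dominate a sample.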

Flooring effect of standardized scores
As expected, examination of the subtest standardized and IQ scores demonstrated significant flooring effects. The effects of flooring of subtest raw scores, resulting from standardization (i.e. the use of standard scores), are summarized in Table 2. As can be seen, a wide range of raw scores for each subtest received a floored value of 1 as the standard score, resulting in a loss of information on low performance, a range of cognitive abilities of interest in FXS, and potentially other ID populations of interest. The percent of participants in the study with floored standard scores ranged from 40.1% (Picture Completion) to 70.0% (Arithmetic). Although the bulk (e.g. 75th percentile) of floored raw scores were low (e.g. typically 0-7), the range of raw scores floored (Table 2, second column) was quite wide. Thus, important variability in the measurement of ability on subtests in lower functioning individuals was lost by the standard score flooring. As was expected for FXS individuals, a greater proportion of raw scores for males were floored compared to females: for example, 94.0% compared to 31.3% for Arithmetic and 84.5% compared to 22.4% for Comprehension. Typically, the proportion of floored raw scores for males was many-fold higher than for females (see Table 2, column 1).
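The kind of summary reported in Table 2 can be sketched in a few lines of Python. The function name `floor_summary` and the data are hypothetical; the first two raw scores echo the Vocabulary example of "Sam" and "Joe" from the Introduction, and the remaining values are made up.

```python
def floor_summary(raw_scores, standard_scores, floor=1):
    """Percent of standard scores at the floor and the range of raw scores floored.

    Returns (percent_floored, (min_raw, max_raw)); the range is None when no
    score is at the floor.
    """
    if len(raw_scores) != len(standard_scores) or not raw_scores:
        raise ValueError("inputs must be non-empty and of equal length")
    floored_raw = [r for r, s in zip(raw_scores, standard_scores) if s == floor]
    percent = 100.0 * len(floored_raw) / len(standard_scores)
    raw_range = (min(floored_raw), max(floored_raw)) if floored_raw else None
    return percent, raw_range


# Raw scores of 1 and 16 both convert to a standardized score of 1.
raw = [1, 16, 25, 30]
standard = [1, 1, 4, 6]
print(floor_summary(raw, standard))  # (50.0, (1, 16))
```

The width of the returned raw-score range is a direct measure of how much performance information the conversion to standard scores discards.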

Characteristics and interpretation of normalized scores
The distributions of normalized scores for males and females are displayed in Fig. 1 for two representative subtests, Arithmetic and Vocabulary, alongside the standardized scores. Note that the flooring characteristic of standard scores was apparent for all subtests, as the distributions were non-normal with significant positive skewing (See supplementary materials for all figures at http://dnguyen.ucdavis.edu/.html/SUP_iq/Supplemental Figures.pdf). In contrast, the normalized scores exhibited more "normal" distributions with no flooring effects. The flooring from raw to standard scores induced skewed distributions of standard scores for both males and females, although to a lesser extent for females (Fig. 1). Note that the interpretation of the normalized scores is that they are standard deviation units away from the general population mean (for a specific age band). As expected for an ID population such as the FXS population, the mean was negative and the distribution was not symmetric about zero. The average profile of the normalized subtest scores for the FXS study cohort is displayed in Fig. 2. On average, the FXS cohort (males and females combined) performed worst on Arithmetic and best on the Similarities and Information subtests. With respect to the Arithmetic subtest, the group was about 4.3 standard deviations below the general population mean and about 2.1 standard deviations below the general population mean for Similarities and Information subtests (Fig. 2). All other subtests ranged between 2 and 3 standard deviations below the general population mean (Fig. 2). Similar profiles were observed for FXS males and females, with normalized scores ranging between −2 and −1 standard deviations across subtests (except for Arithmetic) for females and between −4 and −3 standard deviations for males.
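To make this interpretation concrete, a normalized score can be re-expressed on the familiar subtest metric (mean 10, SD 3, range 1 to 19). The sketch below is illustrative arithmetic only and is not part of the scoring procedure used in this study; the function names are hypothetical, and the linear mapping is an assumption that only approximates the publisher's norms tables.

```python
SUBTEST_MEAN, SUBTEST_SD = 10.0, 3.0  # WISC-III subtest standardized metric
SUBTEST_MIN, SUBTEST_MAX = 1, 19      # reportable range of standardized scores


def scaled_equivalent(z):
    """Unbounded subtest-metric equivalent of a normalized score (ASSUMED linear)."""
    return SUBTEST_MEAN + SUBTEST_SD * z


def reportable_scaled_score(z):
    """Same value rounded and clipped to the reportable 1-19 range."""
    return min(SUBTEST_MAX, max(SUBTEST_MIN, round(scaled_equivalent(z))))


def floor_in_sd_units():
    """Position of the minimum standardized score, in SD units from the mean."""
    return (SUBTEST_MIN - SUBTEST_MEAN) / SUBTEST_SD


print(floor_in_sd_units())            # -3.0
print(reportable_scaled_score(-4.3))  # 1 (group mean for Arithmetic hits the floor)
print(reportable_scaled_score(-2.1))  # 4 (Similarities and Information stay above it)
```

Under this approximation, any performance more than 3 standard deviations below the population mean collapses to the same standardized score of 1, which is consistent with the Arithmetic subtest showing the highest proportion of floored scores.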
Figure 3 provides two specific case examples of 14-year-old boys with FXS, "John" and "Max", to illustrate differences between the usual standardized scores and the normalized scores derived in the study. John is a boy with a fully-methylated full mutation and 2.5% FMRP, VABS ABC standard score of 20, and FSIQ of 40. Max has repeat size mosaicism, 17% FMRP, and a VABS ABC of 38 and FSIQ of 40. As can be seen in the figure, both boys obtained standardized scores of 1 on all subtests with no variability (overlapping horizontal lines at top of figure). In contrast, the normalized scores demonstrated increased variability within each case, and somewhat lower scores for John, who had lower FMRP and adaptive behavior.

Correlation of normalized and standardized subtest scores to Vineland and FMRP
The correlation/association between the clinical outcome, Vineland ABC, and normalized scores for each subtest was stronger with higher (positive) point estimates than with standard scores (Table 3, Combined data). The correlations ranged from a low of 0.58 to a high of 0.80 for Object Assembly and Information normalized scores, respectively (combined data; all correlations, p<0.001). For males, correlation analysis without the floored values led to appreciable loss of data, reduced power and many correlation estimates not statistically different from zero (e.g. Block Design, Comprehension, Object Assembly, Picture Arrangement, Picture Completion, Similarities and Vocabulary; Table 3, male data). Interestingly, in females with FXS, where the effect of flooring was smaller due to reduced disease severity, point estimates for correlations between normalized scores and Vineland were higher than corresponding subtest estimates based on standardized scores (Table 3, female data). See Fig. 4 for scatterplots displaying representative associations between standard versus normalized scores and Vineland ABC (scatterplots for all subtests can be found in the supplemental materials).
A similar pattern of positive association was found with normalized subtest scores and FMRP, although the overall strength of association was weaker relative to Vineland ABC scores (Table 4, Combined data, all correlations, p<0.001). No association between standardized scores (without flooring) and FMRP was observed for any subtest in males, except for Digit Span, Picture Arrangement and Picture Completion. Similar to the pattern of association with Vineland score described above, correlation estimates with normalized scores that were significantly different from zero were observed more broadly across subtests in males (Table 4, male data). In females, no association was observed between FMRP and standardized score across all subtests, although stronger and significant associations based on normalized scores were observed [...]
Similar to individual subtest standardized scores, IQ was predominantly floored, especially for males. Because subtest standardized scores for females were typically not floored, the correlation between Vineland outcome and the IQ computed from these standard scores closely tracked the corresponding correlation between Vineland and the normalized scores. This similarly held for the correlation with FMRP (see supplementary materials).
[...] normalized scores within the Borderline range whereas their FSIQ scores were in the mild range.

Discussion
The results of this study, using the Wechsler Intelligence Scale for Children, highlight significant floor effects and restricted sensitivity as major limitations of standardized intelligence testing of children with fragile X syndrome, one of the most common causes of intellectual disability. Despite the preponderance of floored standardized scores (up to 70% of the sample), we demonstrate that substantial and meaningful variability in performance of lower functioning individuals is lost in the standardization of raw scores. We show that renormalized scores that are based on the individual's actual deviation from the test normative data have a distribution and variability that is very much improved over the typical subtest standardized scores derived from norms tables. We show that relative to the usual subtest standardized scores, these normalized scores demonstrate more robust linear associations with a clinical measure of adaptive behavior (Vineland Adaptive Behavior Scale) and a genetic measure specific to FXS indicating the degree of FMR1 protein deficiency. The normalized scores appear to provide a profile of relative strengths and weaknesses in lower functioning individuals that is not reflected in the usual standardized scores. On a group level, the normalized scores show a substantial deficit on the Arithmetic subtest, which is consistent with prior research highlighting this aspect of the FXS cognitive phenotype and its neuroanatomical basis [26]. These results appear to have major clinical and research implications for intelligence testing of children with FXS and probably other types of ID. Although we have only documented this problem in one population and with one intelligence test, the WISC-III, the results suggest that cognitive tasks that are integral to the measurement of IQ can be sensitive to individual differences, even in very low functioning individuals.
The results of this study also have important research implications. IQ is an almost universal variable in developmental, neuroscience and genetic studies as an outcome of interest, a predictor variable, or as a critical tool for group matching. The use of IQ in lower functioning individuals, as currently derived by standardized tests, in such studies appears to lead to poor estimates of true level of cognitive ability and potential, an "even" profile that may obscure significant relative strengths and weaknesses, lower estimates of associations with other behavioral, biological, and genetic measures of interest, and samples that are inadequately matched on this dimension. Indeed, in FXS and perhaps other neurodevelopmental disorders, it will become increasingly important to utilize sensitive cognitive tests for tracking change as new targeted treatment trials are implemented.
The renormalization and improved sensitivity of intelligence testing for individuals with ID has implications for future research on the neuropsychology, neuroimaging, and genetic bases of neurodevelopmental disorders. For example, cognitive phenotyping studies and other research programs aimed at establishing links between genotype and specific cognitive patterns would greatly benefit from using individual scores that more accurately reflect the true deviation from normal as well as relative strengths and weaknesses. This has immediate implications for fragile X research as we develop and validate much more accurate measures of FMRP expression that could ultimately be used as prognostic indicators of developmental trajectory. In neuroimaging studies, efforts to determine the impact of brain morphological and functional abnormalities on neuropsychological deficits or relatively preserved abilities would also depend on cognitive scores that reflect the ability of individuals with ID (often in the experimental group) as accurately as those with typical development (often in the control group). Finally, from a study design perspective, it is important for many clinical studies of individuals with neurodevelopmental disorders to include comparison groups that are well-matched on cognitive ability so that results can be more confidently attributed specifically to the disorder in question and not confounded by more general developmental differences. A more accurate estimate of cognitive ability, as is presented here, would lead to improved matching and more powerful research designs. We emphasize that the concepts and methodological/statistical approaches proposed here may impact our ability to find other links between behavioral or cognitive phenotypes and biomarkers/genotypes. 
Although children with ID represent a small proportion of the population, they should receive intellectual assessments that are as sensitive and valid as those available to children who are higher functioning. Many intelligence tests currently report performance of children in special categories, such as those with mental retardation, autism, or specific learning disabilities; however these data are primarily for validation study and are separate from the normative sample. An ambitious but worthwhile solution to the sensitivity problem is to over-sample children who are lower functioning in the standardization studies and include tasks that can be completed across a broader range of developmental levels, including items designed for children with a mental age extending into toddlerhood. An oversampling of these children would yield enough normative data from children of varying levels of impairment, allowing a lower IQ floor. In the meantime, the publishers of widely-used standardized tests should consider releasing the raw data obtained from their standardization samples into the public domain so that more accurate estimates might be derived for lower functioning individuals, at least in research applications.
In summary, we show significant floor effects and lack of sensitivity of IQ measurement in children with FXS and mental impairment that can be substantially ameliorated by calculating each child's actual deviation from the normative sample. The validity of this approach was supported by our demonstration of stronger associations between these new normalized scores and another measure of development and a genetic measure specific to FXS, in contrast to similar correlations with the traditional standardized scores. We hope that our observations and conclusions will lead to future studies examining the sensitivity of intelligence testing in other populations of children with neurodevelopmental disorders and to improved tools for measuring cognitive abilities and patterns of strengths and weaknesses in lower functioning individuals.