Using generalizability theory to evaluate the comparative reliability of developmental measures in neurogenetic syndrome and low-risk populations

The lack of available measures that can reliably characterize early developmental skills in children with neurogenetic syndromes (NGS) poses a significant challenge for research on early development in these populations. Although syndrome-specific measures may sometimes be necessary, a more cost- and time-efficient solution would be to identify existing measures that are appropriate for use in special populations or optimize existing measures to be used in these groups. Reliability is an important metric of psychometric rigor to consider when auditing and optimizing assessment tools for NGS. In this study, we use Generalizability Theory, an extension of classical test theory, as a novel approach for more comprehensively characterizing the reliability of existing measures and making decisions about their use in the field of NGS research. We conducted generalizability analyses on a popular early social communication screener, the Communication and Symbolic Behavior Scales—Infant-Toddler Checklist (CSBS-ITC), collected on 172 children (41 Angelman syndrome, 30 Prader-Willi syndrome, 42 Williams syndrome, 59 low-risk controls). Overall, the CSBS-ITC demonstrated at least adequate reliability in the NGS groups included in this study, particularly for the Prader-Willi and Williams syndrome groups. However, the sources of systematic error variance in the CSBS-ITC varied greatly between the low-risk control and NGS groups. Moreover, as unassessed in previous research, the CSBS-ITC demonstrated substantial differences in variance sources among the NGS groups. Reliability of CSBS-ITC scores was highest when averaging across all measurement points for a given child and was generally similar or better in the NGS groups compared to the low-risk control group. Our findings suggest that the CSBS-ITC communicates different information about the reliability of stability versus change, in low-risk control and NGS samples, respectively, and that psychometric approaches like Generalizability Theory can provide more complete information about the reliability of existing measures and inform decisions about how measures are used in research on early development in NGS.


Background
Children with neurogenetic syndromes (NGS) demonstrate severe developmental delays that span multiple domains. Reliably operationalizing and measuring these delays is important for a number of reasons, including determining areas of need and monitoring response to intervention. While many of the most commonly used developmental measures function reliably in typically developing (TD) populations, a current challenge in NGS is determining whether these measures are similarly reliable when applied with children with severe and multidimensional delays. Importantly, measures with low reliability are more likely to either miss effects that are truly present (i.e., type II errors) or suggest the presence of statistically significant effects that are due to error (i.e., type I errors). These risks are particularly hazardous in NGS research, which often deals with very small, highly heterogeneous samples, and whose findings are often given substantial weight in determining the effectiveness of treatment protocols, especially in clinical trials [1]. Thus, there is a need for comprehensive psychometric evaluation of measurement tools in NGS. That is, can a measure be used reliably in its current form, does it require optimization for use in NGS, or is wholesale replacement by an instrument specifically designed to measure developmental skills in NGS justified? This critical evaluation of the reliability of existing measures in NGS is one aspect of this comprehensive evaluation that will provide better understanding of the strengths and limitations of common tools in the field.
In this study, we present generalizability theory (GT) as an underutilized and parsimonious method for evaluating the reliability of measures used in NGS research. GT is an extension of classical test theory [2] that allows for the evaluation of the effects of multiple sources of variance, and their interactions, on the reliability of a measure. We demonstrate the advantages of using GT by examining the variance decomposition and reliability of a popular early social communication screening measure, the Communication and Symbolic Behavior Scales-Infant Toddler Checklist (CSBS-ITC [3];), in three NGS populations and a low-risk comparison group.

Threats to reliability in NGS
Classical reliability is the ability to measure a certain phenomenon consistently, such that similar results should be obtained across repeated evaluations when the phenomenon of interest is stable. Any variation in scores across evaluations is attributed to error, which can occur randomly or systematically. While random error is difficult or impossible to avoid, systematic error, when identified, can be reduced by adjusting certain characteristics of the measure. Importantly, assessment (e.g., longitudinal) and methodological (e.g., GT) approaches can be used to parse sources of systematic and random measurement variability into random effects that inform how a given measure's reliability can be optimized. Using measures with low reliability (i.e., large error) increases the chances of error in statistical hypothesis testing. Thus, the goal in developing any measure is to reduce the overall amount of error, which in turn will improve the measure's reliability.
Several overarching attributes common to NGS groups increase the risk of measurement unreliability. First, a vast amount of heterogeneity in phenotypic presentation exists within most syndrome populations. Thus, even individuals with the same genetic diagnosis demonstrate immense variability across developmental domains, which may not necessarily be the target of measurement, but that may nevertheless differentially contribute to error. Second, early development in children with NGS is often severely delayed compared to that of similarly aged TD children who serve as the normative sample for most developmental measures. As such, many measures focus on skills that are more advanced than what would be expected for a child with NGS, which can lead to many children with NGS receiving scores that cluster around the measure floor. Relatedly, children with NGS may receive very similar scores across multiple measurement points over time due to exhibiting a slower rate of development than would be expected in TD populations. Combined, these factors contribute to many commonly used assessment tools lacking enough granularity to capture meaningful change in skills over time for children with NGS [4,5]. A lack of between-and within-person variability in scores limits the ability of researchers and clinicians to accurately estimate true skills, separate from random variation due to error, among children with NGS.
A major question in the field, then, is how to move forward with assessing early development in NGS given the limitations of current measurement options. One approach has been to move away from using existing standardized measures and instead turn to new measures that have either been tailored to specific syndromic populations or that draw from naturalistic samples of behavior with the ability to capture more variability within individuals (e.g., [6][7][8][9]). While sometimes necessary, this approach comes at a cost: validating new measures requires significant time, resources, and expertise, and can also limit the ability to compare with the vast amount of existing literature using more popular measures in typical development.
Another option is to critically evaluate existing measures to determine if and how these tools can be used appropriately in NGS. In fact, an NIH work group focused on identifying appropriate outcome measures for clinical trials in fragile X syndrome recommended an approach of "borrow, adapt, and evaluate," which involves examining measures being used in other clinical populations for feasibility in fragile X syndrome [10]. By embracing this approach as a first step for evaluating measurement tools in NGS, the field may potentially avoid unnecessary efforts towards collecting and analyzing intricate behavioral samples or developing new measures altogether. However, a first step to the "borrow, adapt, and evaluate" approach will be to assess the psychometric properties of existing measures in NGS populations, including one of the most commonly discussed metrics of psychometric rigor: reliability.

Generalizability theory
Reliability is the ability to measure a certain phenomenon consistently such that similar results should be obtained across repeated evaluations. Reliability is commonly measured using classical test theory methods, including internal consistency, inter-rater reliability, and test-retest reliability, all of which focus on the stability of point estimates at a single measurement point or across pairs of measurement points. However, these common metrics of reliability do not inform the reliability of measurements across expected periods of change-such as natural development or acute response to intervention. That is, classical test theory methods do not answer the question of whether change over time as measured by a particular instrument is considered true change or measurement error, and what sources of variance may be contributing to changes in scores over time.
GT is a more flexible statistical framework for evaluating the reliability of a measure [11]. As an expansion of classical test theory, GT acknowledges that variance in an observed score comes from multiple sources [12]. Classical test theory only considers a single source of variance, while GT allows for simultaneous consideration of between-person, within-person, and the remaining variance, which is a combination of unattributed variance and true error variance [13]. While GT can be used to construct various types of reliability estimates, it also provides insight into the sensitivity of a measure by estimating the amount of variance accounted for by individual and group factors.
There are two main components of a GT analysis: the generalizability study or "G Study" and the decision study or "D Study". The first step (G Study) estimates potential sources of variance in observed scores by computing linear combinations of analysis of variance (ANOVA) mean squares, which are estimates of variance across groups [11]. These analyses provide generalizability estimates, which represent a raw proportion of the total variance accounted for by each included factor. Such factors can be specified to include person, item content, time of measurement, or other variables that are thought to systematically relate to the construct of interest (e.g., a child's age would be considered when evaluating a developmental measure). The output of GT analyses is random effect variance components, which are commonly expressed as percentages (out of 100%), that describe the relative contribution of each factor to the total variance. For example, if the variance due to person is 12%, that means that 12% of the total variance is due to individual differences not accounted for by other included factors.
In the second step of GT (D Study), variance components are used to estimate various forms of reliability [14]. For longitudinal designs, four reliability estimates are typically computed using the procedure outlined by Cranford et al. [14]. R KF represents the reliability of the average scale ratings from all items and all measurement periods. If a researcher gave participants the same measure every day for 1 week, R KF would be the betweenperson reliability when the average was taken across all 7 days for each participant. R 1F represents the reliability of one fixed measurement point (e.g., when collected on the same day or at the same age). In a repeatedmeasures study, R 1F would be the reliability if the researcher selected one single date from a week-long data collection period and compared all participants on that single day, perhaps to control for effects of day of the week. In developmental research, studies often collect measures at specified ages (e.g., all participants are assessed when they are 6, 9, and 12 months old). In this case, R 1F would be the reliability when comparing all participants of the same age. R 1R represents the reliability of one randomly selected measurement point (e.g., measurement points that may be collected on different days or at different ages), as is commonly the case in developmental research. In this way, R 1R demonstrates how the level of a given measure can be impacted both by differences at measurement points and how participants may respond uniquely at that measurement point [13]. Finally, R C represents the reliability of change for a given measure and is particularly relevant to developmental research. Different from test-retest reliability, R C considers the proportion of variability that is due to systematic changes over time within individuals. High R C indicates that a measure can reliably detect variation in scores for the same person across time, whereas low R C may indicate a lack of reliable variation in scores over and above the common effect of time across all persons.
GT can address specific measurement challenges faced by early developmental research in NGS. First, GT allows researchers to characterize specific sources of measurement error variance inherent in NGS groups. For example, it is possible to determine how much variability stems from individual differences, item content or wording, and developmental trajectories. Importantly, GT also allows researchers to examine how interactions across these factors contribute to variance. For example, person-level differences and item content may interact, suggesting that certain individuals tend to systematically score higher (or lower) on certain items of the measure compared to others. This pattern may indicate that certain items are not capturing the intended construct for a certain group of individuals due to a secondary variable such as speech or motor ability. Second, GT demonstrates how different sources of variance affect the reliability of measurement tools when they are applied to NGS samples but were designed for use in TD populations. The GT reliability estimates can indicate whether a measure can reliably differentiate children within a certain syndromic population, as well as whether the measure can reliably detect withinindividual variation over time. These analyses can inform decisions about the use of a measure in NGS research; specifically, whether the measure is reliable enough to answer the particular research question of interest, or if forgoing the use of this measure is justified to instead devote efforts to developing measures explicitly designed to optimize reliability.
Importantly, the goal of GT is to characterize the reliability of a measurement tool, not the validity. That is, while GT can determine how consistently a measurement tool is capturing a certain skill, it cannot determine if the skill being captured is ultimately relevant to the outcome or effect of interest. For example, a child who is nonverbal may receive a low score on an anxiety screener that primarily measures symptoms of anxiety that are expressed verbally. This low score may not accurately reflect the child's anxiety level but instead the child's inability to express anxiety verbally. GT is unable to "diagnose" and "fix" this underlying cause of decreased validity, though the issue may be reflected in GT analyses of the measure. For example, a GT analysis might indicate that person-level differences account for only a small proportion of variance in scores among children who are nonverbal; however, it is up to the researcher running the GT analysis to identify verbal ability as a potentially relevant factor, and to structure their GT analysis in a way that can address this question. Current efforts are underway to address these issues of validity specifically for measures of early development in NGS (Kelleher et al., under review). Nevertheless, understanding the reliability of measurement tools is an important step in better understanding the validity of those tools. Thus, while using GT to characterize the reliability of measurement tools does not solve all the measurement challenges that the field of NGS research faces, it does advance the field by providing a mechanism to better understand how well the existing and commonly-used measurement tools are functioning in these unique populations.
In this study, we demonstrate the advantages of using GT in NGS research by evaluating the reliability of the CSBS-ITC in three distinct NGS groups, each with a unique phenotypic profile that demonstrates the heterogeneity often observed among NGS groups. We compare the reliability of the CSBS-ITC in these NGS groups to that which is observed in a low-risk control group, providing an anchor for what might be generally considered acceptable in non-syndromic populations. Together, this work aims to provide a model for using GT to critically evaluate measurement tools used in NGS research in order to better align our study designs with the strengths and limitations of the available measurement tools in the field.

Participants
Participants included 172 children enrolled in the Purdue Early Phenotype Study, a longitudinal survey examining the early development of children with NGS. The present study includes children who have a diagnosis of Angelman syndrome (AS; n = 41), Prader-Willi syndrome (PWS; n = 30), or Williams syndrome (WS; n = 42). We also included a group of children with no genetic diagnosis (low-risk controls or LRC; n = 59) to provide a comparison of the expected profiles and trajectories of early social communication development in typical development. While we expect that most children with a NGS will demonstrate atypical social communication development, the three NGS groups included in this study have unique phenotypes that will allow us to examine the reliability of the CSBS-ITC across a range of phenotypic profiles that vary in the severity and variability of intellectual disability and language skills. Figure 1 includes key phenotypic features of the NGS groups included in this study.
Participants' mothers completed surveys at 6-or 12month intervals depending on child age, for a total of 454 measurement points across participants. Across groups and measurement points, children's ages ranged from 1 to 60 months (M = 28.12, SD = 13.95). Table 1 includes demographic information across groups, as well as a breakdown of number of measurement points per participant by group. Cross-group analyses of CSBS-ITC scores have been previously published using a subset of the sample included in this report [15].
Inclusion criteria required that the child's family resided in the USA and that English was the primary language spoken in the home. LRC exclusion criteria included premature birth (< 37 weeks gestation), speech delay, or having a first-degree family member diagnosed with autism spectrum disorder. We confirmed genetic status for participants in the NGS groups by medical and genetic reports provided by the participant's family.

Communication and Symbolic Behavior Scales Developmental Profile-Infant-Toddler Checklist ([3])
The CSBS-ITC is a 24-item parent-report screening checklist that assesses skills pertaining to social responses, speech abilities, and symbolic comprehension to identify infants who may be at risk for developing social communication delays. Parents choose an option that best describes their child's behavior related to specific social communication skills. Most items are scored on a 0-2 scale (19 items; 0 = not yet, 1 = sometimes, 2 = often). Five items are scored on scales with more options (1 item with scale 0-3, 4 items with scale 0-4) that allow the parent to indicate a range that best describes their child's behavior (e.g., "About how many words does your child use meaningfully that you recognize?" where 0 = none, 1 = 1-3, 2 = 4-10, 3 = 11-30, 4 = Over 30). All item scores are summed to create a total raw score, which ranges from 0 to 57. While the normed age range for the CSBS-ITC is from ages 6 to 24 months, the CSBS-ITC can be used in children older than 24 months who have language delays (such as those with neurogenetic syndromes), particularly when using raw scores instead of norm-referenced scores [15]. Parents completed the CSBS-ITC 1 to 5 times per child, starting when the child's mother contacted the Neurodevelopmental Family Lab and continuing at approximately 6-or 12-month intervals, depending on the child's age. Reliability of the CSBS-ITC total raw score in past studies is α = .93 for a single fixed assessment [3]. In the present study, Cronbach's alpha was .96 for the CSBS-ITC total raw score of each participant's first measurement point in the overall sample (including NGS and LRC together). When calculated for NGS and LRC groups individually, Cronbach's alpha was .97 in LRCs, .90 in AS, .95 in PWS, and .95 in WS. 1

Vineland Adaptive Behavior Scales-Third Edition ([16])
The VL-3 is a semi-structured parent-report interview assessing adaptive skills for individuals aged birth through 89 years. Studies have applied previous versions of the VL-3 to neurogenetic syndrome populations, including in clinical trials [17]. In this study, we used the Adaptive Behavior Composite (ABC) score to characterize adaptive functioning level for a subset of participants whose mothers had completed a follow-up phone interview (LRC n = 42, 71%; AS n = 30, 73%; PWS n = 21, 70%; WS n = 26, 62%).

Procedure
All procedures were approved by the university's Institutional Review Board, and caretakers provided consent for participation. We recruited families primarily through postings on social media pages (i.e., Facebook and a lab webpage), as well as a subset of WS and AS participants through the Williams Syndrome Registry and Angelman Syndrome Foundation, respectively. We recruited LRC participants through advertisements on Facebook that targeted a nationally-representative sample of mothers with young children. Mothers interested in participating in the study contacted the lab and completed screening questions to determine eligibility. After confirming eligibility, we sent mothers the link to a password-protected online survey to complete the first measurement point of the study, which included the CSBS-ITC. Due to the likelihood of developmental 1 Cronbach's alpha should approximate R 1F estimates. Our Cronbach's alpha estimates for the first measurement point are generally higher than our R 1F estimates (which use all available measurement points), suggesting that the first measurement point may be the most reliable for our groups except for AS.   Demographics delays in the NGS group, we administered the CSBS-ITC to all age ranges to capture older participants whose developmental level matched the level of social communication skills included on the CSBS-ITC. After mothers completed their child's second measurement point, we invited them to participate in a phone interview about their child, which included the VL-3.

Data cleaning
We created an "age bin" variable as a categorical variable that indicated the developmental sequence of the participant's measurement points. This transformation ensured that age was partially crossed with individuals instead of nested within person. Age bins ranged from 0 to 60 months at 3-month intervals, for a total of 20 age bins. We chose this interval because 3 months is a small enough age range to be fairly certain that children will make meaningful advances in skills and is also consistent with typical age groupings commonly used in many studies in the field (e.g., [18]). Each individual had measurement points assigned to 1-5 different age bins. Age bin size is reported in Table S1 as supplementary material at https://osf.io/dsb68/. After running initial analyses, we re-ran our models using different sizes of age bins to determine whether the size of the age bin affected the reliability of the CSBS-ITC Total score and found that the effect of changing age bin size was minimal. These analyses are also presented as supplemental material in Table S2. While most items on the CSBS-ITC are rated on a scale of 0 to 2, one item is scored on a scale of 0 to 3, and four are scored on a scale from 0 to 4. Because we used raw scores as the dependent variable, it was important that all items were scored on the same scale to ensure that all items were given equal weight in models. Thus, we rescaled scores from items scored on a 0 to 3 scale by dividing the item score by 1.5, and items scored on a 0 to 4 scale by dividing by 2. These transformations resulted in all items being scored on a 0 to 2 scale.

Analytic plan
We completed statistical analyses using PROC MIXED in SAS Software Version 9.4. We used non-parametric bootstrapping to simulate 1000 iterations of each model, the results of which were compiled and used to calculate point estimates and 95% confidence intervals [19]. All models were fitted using restricted maximum likelihood. 2 SAS code for all analyses can be found at https:// osf.io/awbrn/. We ran all multi-level models for generalizability analyses based on models and procedures reported in Cranford et al. [14], which are summarized briefly here.

GT analysis 1
The goal of the first GT analysis was to conduct a G Study and a D Study to make decisions about the appropriate application of the CSBS-ITC total raw score in each risk group. For each group, we predicted variance in item-level scores of all 24 CSBS-ITC items from the random effects of person, age, item, and the interactions of these variables, based on variance components needed to calculate Cranford et al. [14]'s reliability estimates (Eq. 1). By predicting item-level scores of all 24 items of the CSBS-ITC, we can interpret variance estimates as the proportion of variance 2 We used the "nlminb" optimizer and set optCtrl = list(maxfun = 2e5), as several models did not converge when using the default optimizer. The model predicting speech composite scores in AS converged using these settings but was not positive definite, as the age*item variance component was estimated to be 0. Not reported 1 (2%) 0 (0%) 0 (0%) 0 (0%) Note. a Participants in the AS and WS groups were significantly younger than participants in the PWS and LRC groups at the first observation, F(3,168) = 6.42, p < .001 b Sample size decreases across measurement points reflect a combination of attrition, the ongoing nature of our study, and our flexible recruitment approach, which allows participants to enter the study at any age < 60 months, resulting in the variable accounted for in the CSBS-ITC total raw score.
Eq. 1: We then conducted D Studies to calculate 4 reliability estimates for each group. These reliability estimates (R 1F , R 1R , R KF , R C ) can be used to determine how reliably the CSBS-ITC can differentiate scores from different individuals at the same or different measurement points. We used classical test theory conventions for interpreting reliability estimates (i.e., 0.00-0.10 = virtually none, 0.11-0.40 = slight, 0.41-0.60 = fair, 0.61-0.80 = moderate, 0.81-1.00 = substantial [20];). In the equations below, m represents the number of items (in our study, m = 24) and k represents the modal number of measurement points (in our study, k = 4).
R 1F estimates the ability of the CSBS-ITC to differentiate between similarly-aged children (i.e., children within a single age bin) (Eq. 2).
Eq. 2: estimates the reliability of the CSBS-ITC when differentiating children at measurement points from randomly selected age bins (Eq. 3).
Eq. 3: R KF estimates the reliability of the CSBS-ITC to differentiate children when averaging scores across all available measurement points for each child (Eq. 4).
Eq. 4: R c estimates the reliability of the CSBS-ITC to detect systematic variation in skills over time (Eq. 5).

GT analysis 2
The goal of the second GT analysis was to conduct G Studies and D Studies for each individual CSBS-ITC subscale (i.e., social, speech, symbolic) to determine whether the reliability of the CSBS-ITC varies based on the type of skill being measured. We first estimated item-level variance for items in each composite of the CSBS-ITC (social 13 items; speech 5 items; symbolic 6 items) from the random effects of person, age, item content, and the interactions of these variables (Eq. 1). We then calculated reliability estimates for each composite by risk group (Eq. 2, 3,4,5) using the number of items within each subscale as the value for index m (i.e., social m = 13; speech m = 5; symbolic m = 6).

Results
Generalizability analyses GT analysis 1: CSBS-ITC total raw score by risk status We first conducted a G Study and a D Study to evaluate the reliability of the CSBS-ITC total raw score for each risk group separately. Variance decomposition of the CSBS-ITC total raw score from the G Study is reported in Table 2 and Fig. 2. Results of the G Studies suggested that groups differed in terms of which factors contributed the most variance in CSBS-ITC total raw scores. In the LRC group, age accounted for 50% of the variance and the ageby-item interaction accounted for 12% of the variance in CSBS-ITC total raw scores. Person (5%), item (7%), and the person-by-item interaction (4%) accounted for relatively less variance in CSBS-ITC total raw scores in the LRC group, and very little variance (< 2%) was accounted for the by person-by-age interaction. In contrast, variance decompositions for the NGS groups tended to show different patterns than the LRC group. Age accounted for much lower proportions of variance in the NGS groups compared to the LRC group, and while age still accounted for the highest proportion of variance within the PWS (27%) and WS (28%) groups, it accounted for very little variance in the AS group (2%). Consequently, personand item-level factors accounted for much more variance in NGS CSBS-ITC total raw scores. In AS, the majority of variance in scores was due to item (47%) and the person-by-item interaction (15%). These components similarly accounted for relatively high proportions of variance in the WS group (item = 18%; person-by-item = 10%), though person-level factors also contributed a relatively high proportion of variance (13%). In the PWS group, person accounted for the next highest proportion of variance (16%) after age, followed by item (15%) and the person-by-item interaction (9%). Thus, while age appeared to account for large portions of variance across most groups, item-and person-level factors were also important to the variance decomposition of CSBS-ITC total raw scores in the NGS groups, and their relative contributions tended to differ slightly among the NGS groups.
Point estimates and confidence intervals for the reliability coefficients of CSBS-ITC total raw score by group are reported in Table 3. In the LRC group, the reliability of the CSBS-ITC total raw score was substantial at a single fixed measurement point and when averaged across all measurement points (R 1F = .85 and R KF = .96, respectively). The CSBS-ITC total raw score had moderate reliability to detect systematic change over time among LRC participants (R C = .64) but was not reliable when taken from a single random measurement point (R 1R = .09). Patterns of reliability in the NGS groups were generally similar to the LRC group. R 1F and R KF were substantial across all NGS groups. The ability of the CSBS-ITC total raw score to detect systematic change over time was substantial in the PWS group (R C = .81), moderate in the WS group (R C = .63), and fair in the AS group (R C = .53). Its ability to differentiate children with AS at any single random measurement point was moderate (R 1R = .75) but was only fair in the PWS and WS groups. Thus, it appeared that for all groups, the CSBS-ITC total raw score was most reliable when it was averaged across all available measurement points. It was least reliable when comparing scores from randomly selected measurement points for all groups except the AS group, for which it was least reliable when detecting systematic change in scores over time. Given that CSBS-ITC items are divided into three composites that target related but distinct developmental domains (social, speech, and symbolic skills), we conducted a second set of GT analyses for these subsets of items to determine how reliability differed by CSBS-ITC composite in each group. Variance decomposition by composite and risk status are reported in Table 4 and Fig. 3. In the LRC group, variance accounted for by age varied the most across composites, accounting for the highest  proportion in the symbolic composite (64%) and a much lower proportion in the social composite (38%). All other factors accounted for similar proportions of variance across composites (all within 6% across composites). This fluctuation in variance due to age was observed in the PWS and WS groups as well. In the PWS group, similar to the LRC group, age accounted for the highest proportion of variance in the symbolic composite (36%) and the least proportion of variance in the social composite (19%). However, in the WS group, age accounted for the most variance in the speech composite (39%) and the least variance in the social composite (18%). In contrast, in the AS group, age accounted for a similarly low proportion of variance across all composites (range 0-2%). In fact, all variance components accounted for relatively similar proportions of variance across composites in the AS group, with person-level factors showing this highest fluctuation-person-level factors accounted for 20% variance in symbolic composite scores, while only accounting for 10% variance in speech composite scores (all other fluctuations between composites in AS were < 8%). Variance due to personlevel factors fluctuated by 10% in the PWS group (accounted for 11% variance in symbolic composite scores, 21% variance in social composite scores), but by less than 5% in the WS group. However, variance due to item content varied considerably in the WS group, accounting for 27% variance in symbolic composite scores but only 10% variance in speech composite scores. Variance estimates for item content varied by less than 4% across composites in the PWS group. Overall, variation accounted for by age appeared to vary the most based on the particular skill domain in question and also showed nuanced fluctuation among the composites for each group. Variance due to person-level factors was also relatively affected by composite for the NGS groups, as was variance due to item-level factors (despite the fact that items within composites should be more similar to each other than to those from the other composites). Point estimates and confidence intervals for reliability estimates of each composite by group are reported in Table 5. For the social composite in the LRC group, reliability estimates were poor for R 1R and fair for R C , but were moderate or better for R 1F and R KF . In all three NGS groups, the social composite had substantial R 1F and R KF . R C was moderate in the PWS group but was fair in the AS and WS groups. In contrast, R 1R for the social composite was fair in the PWS and WS groups, but moderate in the AS group. For the speech composite, R KF was substantial, R 1F was moderate, R C was fair, and R 1F was not reliable in the LRC group. In the NGS groups, R 1F , R KF , and R C were all moderate or better for all composites except for R C in AS. Finally, for the symbolic composite in the LRC group, R 1F and R KF were moderate or better, but R 1R and R C were slight or worse. For the symbolic composite in the NGS groups, R 1F and R KF were moderate or better for all groups, as was R 1R in the AS group. R C was fair or worse in all three NGS groups. R 1R was slight for the symbolic composite in both the PWS and WS groups. Thus, overall, the ability of the CSBS-ITC composites to differentiate children at fixed measurement points or by averaging scores across measurement points was generally moderate or better across composites and groups, whereas the ability to differentiate children at randomly selected measurement points was fair or worse across all composites for all groups except the AS group. The ability of the CSBS-ITC composites to detect change was the most variable across groups and composites, tending to be highest for the social and speech composites and for the PWS and WS groups.

Discussion
Characterizing early development in children with severe developmental delays is challenging for a number of reasons, including the limited repertoire of developmental tools appropriate to these populations, as well as the unclear psychometric rigor of existing measures. This study used Generalizability theory to evaluate the reliability of a popular early social communication screening measure, the CSBS-ITC, when used in NGS populations. We present three major findings. First, for all groups, reliability of the CSBS-ITC was best when averaging across all measurement points and was generally similar or better in the NGS groups compared to the LRC group. Second, the patterns of variance decomposition for the CSBS-ITC for NGS groups varied from patterns observed in the LRC group, particularly related to the amount of variance due to age, person, and item. Third, the CSBS-ITC performed differentially across the NGS groups in terms of reliability and variance decomposition, suggesting the CSBS-ITC may capture nuanced phenotypic features of the three NGS groups. Overall, despite differences in variance decomposition across groups, it appears that the CSBS-ITC has similar or better reliability in NGS compared to low-risk controls.
The CSBS-ITC generally demonstrated similar or better reliability in the NGS groups as it did in the LRC group. This suggests that the CSBS-ITC will be similarly reliable at detecting levels and changes in social communication skills in NGS populations as it is in LRC. The CSBS-ITC was substantially reliable when it was used among children with similar age ranges and when averaged across all available measurement points. The ability of the CSBS-ITC to detect systematic change over time was quite variable across the different groups and composites. It tended to be most reliable in the PWS group, with moderate or substantial R C reliability for the total and composite scores. This suggests that variation in CSBS-ITC scores over time reflected reliable withinperson change in the PWS group, and not change due to error. This was also true of the LRC and WS groups for the CSBS-ITC total score and for the WS group with the speech composite. In all other cases, though, R C reliability was fair or worse, particularly when using a single composite instead of the total score. This suggests that variation in these CSBS-ITC score trajectories reflected a large amount of error. Importantly, it does not imply that the average trajectory of change across individuals was low; it suggests that differences in trajectories were likely due to error. Thus, studies that are using the CSBS-ITC to measure change over time and are Note. 95% confidence intervals are included in brackets below each reliability estimate. See Table 3 for brief descriptions of interpretation for reliability coefficients.
interested in the heterogeneity of social communication skill trajectories would be best advised to use the total score instead of focusing on a single composite. The reliability of CSBS-ITC scores from randomly selected measurement points generally presented the worst reliability estimates across groups. This likely reflected the developmental nature of the CSBS-ITC. That is, because children are continuously developing new skills, it is not reasonable to assume their score from any random measurement point will reflect their overall ability, making it necessary to know the ages at which the scores were obtained in order to meaningfully compare two CSBS-ITC scores. Statistically adjusting for age in predictive models is one way to mitigate this problem, but even then our findings suggest that it could be problematic to compare the CSBS-ITC scores of two children whose ages are more than 3 months apart because it would not take into account the other sources of systematic variation. Thus, using the CSBS-ITC to differentiate between children with skills reported at randomly selected measurement points would not be recommended for any population. This study model is sometimes used in research with children with NGS due to the challenges of recruiting children with these rare syndromes, especially at very young ages, including our own work [15]. Our findings suggest that this approach may not be appropriate for future work using the CSBS-ITC, and-more broadly-that GT may help researchers make such determinations when deciding how to analyze NGS data on a case-by-case basis.
The AS group tended to demonstrate a different pattern of CSBS-ITC reliability than the other NGS groups, such that the ability to differentiate children from randomly selected measurement points was better than in the other NGS groups, and the ability to detect systematic change was lowest in the AS group. The higher R 1R suggests that children of varying ages can still be compared reliably within the AS group. Furthermore, the low R C in the AS group suggests that variation in CSBS-ITC scores likely reflected mostly error within the age range included in our study. This is consistent with the AS phenotype, where communication skills are often delayed beyond what might even be expected for other NGS groups and change more slowly over time. Our findings suggest that this phenotypic feature may be an obstacle to the reliability of the CSBS-ITC when used across time in AS.
With regard to variance decomposition, we found that age tended to contribute large proportions of variance across all groups except AS. Age accounted for the largest proportion of variance in the LRC group, whereas variance was more evenly distributed among other components in the PWS and WS groups. This finding suggests that in the LRC group, children are developing systematically as a function of their age with very little variation across individuals. That is, their development is relatively homogenous. This finding is consistent with our expectation that children with no known risk for developmental delay will acquire early social communication skills along very similar trajectories, regardless of other factors that may be specific to the child. This was not the case in the NGS groups. Here, we saw that individual factors and the type of early social communication skill being measured were similarly important in determining a child's score on any given item on the CSBS-ITC. Again, this finding is consistent with our understanding that early development in NGS is highly heterogeneous, thus it is unsurprising that factors other than developmental timelines-such as genetic status or other medical diagnoses-are going to more strongly influence the child's development of early social communication skills.
The fact that age accounted for such a small proportion of variance in CSBS-ITC scores in the AS group-who generally exhibit the greatest socio-communicative delays relative to the NGS groups-suggests that children in this group did not demonstrate much consistent developmental variability in their performance of these early social communication skills. Instead, item accounted for the highest proportion of variance, indicating that children with AS tended to consistently score higher or lower on certain items of the CSBS-ITC and varied minimally in their scores over time. This may suggest that for children with AS, it is not necessary to collect multiple reports of the CSBS-ITC across the age range reflected in our study, as scores may not be expected to change much with time. Notably, it is possible that age may account for more variability in CSBS-ITC scores for children older than those included in our study. However, overall, the differences in variance decomposition across the four groups demonstrated that different factors contributed to variation in CSBS-ITC scores differentially, both between populations that are and are not at risk for developmental delays and among three high-risk populations with different phenotypic profiles.
Overall, we were able to use GT to better understand what factors contribute most to variance in CSBS-ITC scores and to compare the reliability of the CSBS-ITC in NGS populations with that of the populations in which it is most commonly used. In this way, we were better able to understand the utility of this measure and whether it can be used appropriately in populations with known developmental delays. As our findings show, this is often a nuanced answer. The reliability of the CSBS-ITC score varied tremendously-from virtually none to substantial reliability-depending on the way the score was being used and in which group. This highlights the importance of exploring multiple facets of reliability, which are typically not obtained using standard reliability methods but which can be estimated using GT analyses, and determining which facets of reliability are relevant for a particular measure in a given population.
Importantly, there are several additional applications of GT which were not possible to explore with the data used in the current study but which may nonetheless be relevant to developmental research. First, researchers may wish to include rater (e.g., mother, father, teacher, clinician, self-report) as a variance component in their generalizability analyses to determine the amount of variance due to certain respondents tending to answer items about the child in a systematic way (e.g., [21]). Second, some developmental measures have different scoring formats (e.g., basal and ceiling rules, item sets) that may warrant calculating reliability estimates by item set or for items that were directly administered compared to those above or below the ceiling and floor items. Third, GT can be used to optimize reliability for certain measures. For example, behavioral assessments are common in developmental research but can be heavily influenced by the child's behavior on any given day. Using GT, researchers can determine how many measurement points are required to result in stable and substantial reliability of the measure they choose to assess the skill of interest (e.g., [22]). These additional applications can further assist in evaluating and improving the utility of common measures in NGS populations.
There are several limitations to this study. First, our sample was demographically homogeneous, preventing us from exploring the effects of race and ethnicity on our generalizability analyses. Second, while inclusion criteria required English be the primary language spoken in the home, we did not evaluate the multilingual status of children in the present study. Given the CSBS-ITC's focus on communication and language skills, particularly the speech composite, we were unable to evaluate this potentially important facet of variance. Third, due to the webbased nature of this study, we were unable to examine how the CSBS-ITC scores were related to validation measures, such as developmental or language testing, which are typically collected through in-person assessment. While these comparisons were reported in the initial CSBS-ITC development study, they have yet to be explored in populations with severe developmental delays and should therefore be included in future studies. Finally, while the three NGS groups chosen for this study represent a wide range of developmental profiles, it is likely that these findings may not generalize to other NGS populations with different phenotypes.

Conclusions
GT provides a method for better understanding measurement variance and score reliability and is particularly useful for research with NGS populations for which there are often a lack of well-validated measures that can be used to characterize early development. We found that CSBS-ITC item score variance tended to be differentially accounted for by factors such as child age, individual differences, and item content across the NGS and LRC groups. However, CSBS-ITC reliability tended to be as good or better in the NGS groups compared to the LRC group, suggesting the CSBS-ITC can generally be used to reliably measure early social communication levels and trajectories in NGS populations. Thus, GT is a useful tool for quantifying the reliability of both new and existing measures in NGS populations, particularly with regard to reliability across periods of change.
Additional file 1: Table S1. Age Bin Cell Size by Group.