Looking for consistency in an uncertain world: test-retest reliability of neurophysiological and behavioral readouts in autism

Background: Autism spectrum disorders (ASD) are associated with altered sensory processing and perception. Scalp recordings of electrical brain activity time-locked to sensory events (event-related potentials; ERPs) provide precise information on the time-course of related altered neural activity, and can be used to model the cortical loci of the underlying neural networks. Establishing the test-retest reliability of these sensory brain responses in ASD is critical to their use as biomarkers of neural dysfunction in this population.
Methods: EEG and behavioral data were acquired from 33 children diagnosed with ASD, aged 6 to 9.4 years, while they performed a child-friendly task at two different time-points, separated by an average of 5.2 months. In two blocked conditions, participants responded to the occurrence of an auditory target that was either preceded or not by repeating visual stimuli. Intraclass correlation coefficients (ICCs) were used to assess test-retest reliability of measures of sensory (auditory and visual) ERPs and of performance, for the two experimental conditions. To assess the reliability of within-individual response variability, this analysis was performed on the variance of the measurements in addition to their means. This yielded a total of 24 measures for which ICCs were calculated.
Results: The data yielded significant, good ICC values for 10 of the 24 measurements. These spanned behavioral and ERP data, experimental conditions, and mean as well as variance measures. Measures of the visual evoked response accounted for a disproportionately large number of the significant ICCs; follow-up analyses suggested that the greater number of trials contributing to the visual compared to the auditory ERP partially accounted for this.
Conclusions: This analysis reveals that sensory ERPs and related behavior can be highly reliable across multiple measurement time-points in ASD. The data further suggest that the inter-trial and inter-participant variability reported in the ASD literature likely represents replicable individual-participant differences in neural processing. The stability of these neuronal readouts supports their use as biomarkers in clinical and translational studies on ASD. Given the minimum interval between test/retest sessions across our cohort, we also conclude that for the tested age-range of ~6 to 9.4 years, these reliability measures are valid for at least a 3-month interval. Limitations related to EEG task demands and study length in the context of a clinical trial are considered.
Supplementary Information: The online version contains supplementary material available at 10.1186/s11689-021-09383-0.


Autism Spectrum Disorder (ASD) is defined by social-communication deficits and restricted and
repetitive patterns of behavior, and is often accompanied by sensory, motor, perceptual and cognitive atypicalities. Although well defined by clinical diagnostic criteria and assessed professionally through interviews and clinical observation, ASD is highly heterogeneous, with wide-ranging presentation and a variety of etiologies and developmental trajectories (1-3). As a neurodevelopmental condition, direct measures of brain activity provide for greater understanding of the underlying neuropathology and of how it impacts information processing. If robust, replicable and reliable neurophysiological measures of processing differences in ASD can be developed, these might then have utility in stratifying individuals at early stages of the condition, in optimizing targeted interventions, and as biomarkers for assaying treatment efficacy.
Scalp recordings of electrophysiological brain responses (electroencephalogram: EEG) provide a non-invasive readout of network-level neural processing with millisecond temporal resolution. EEG time-locked to stimulus presentation or to behavioral responses, referred to as event-related potentials (ERPs), is used to characterize the time-course of information processing (4,5), and can also be used to model the cortical loci of the underlying neural networks (6,7). EEG/ERPs are thus well-suited to the characterization of when and where cortical information processing might be altered in ASD, and have the potential to provide sensitive assays of treatments that are expected to act on processes with a clear neural signature. Additionally, since EEG/ERPs directly index neural function, they are likely more sensitive to initial treatment effects, given that they can measure site-of-action effects in real-time. This feature is particularly meaningful for clinical trials, which tend to be of relatively short duration, and would benefit from more sensitive and immediate outcome measures. In contrast, more typical clinical and behavioral assays might be expected to show somewhat delayed treatment-related changes, since neural changes due to intervention would only give rise to changes in behavioral outcomes after sufficient time has passed.
There is an accumulation of support for altered sensory-perceptual processing in ASD, with evidence for differential processing across all the major sensory modalities, including audition (8-12), vision (13-16), somatosensation (17,18) and multisensory integration systems (19). However, it bears mentioning that these differences, when present, can often be subtle, and that there has tended to be a high degree of inconsistency across the literature [see (20)]. Nonetheless, a promising development is that variance in sensory ERPs has been related to the severity of the clinical phenotype (12,21), speaking to their utility as potential biomarkers. A similarly promising development is work showing that both auditory and visual sensory responses can be modulated by training, signifying potential sensitivity to treatment effects (22,23). However, sensory ERPs have not yet been submitted to standard assessment of test-retest reliability in ASD, which is surely a minimal requirement in assessing their potential as sensitive biomarkers. Indeed, this seems particularly germane given often inconsistent findings across studies, and suggestions by some research groups of increased inter-trial variability of the sensory-evoked response in ASD ((2,24), but see (25,26)).
In the quest for reliable biomarkers to index brain function in ASD, we sought here to measure the test-retest reliability of auditory and visual evoked potentials and related task performance. High-density EEG recordings and behavioral responses were recorded from children with ASD while they engaged in a simple speeded reaction time task in response to visually cued and non-cued auditory stimuli. Intraclass correlation coefficients (ICCs) were calculated to assess the reliability of the sensory evoked responses and behavioral data recorded across two identical experimental sessions that were temporally separated by an average of about 5 months (27-29).
Participants: Data from 33 children diagnosed with ASD, ranging from 6.1 to 9.4 years of age, were included in this analysis (see Table 1 for participant characteristics). These came from a larger dataset (N=94) collected in the context of a clinical trial on the efficacy of different behavioral interventions, and constitute the subset of participants from whom we recorded EEG and behavioral data at two sessions, which we refer to as test (pre-intervention) and retest (post-intervention).
While we had full datasets from test and retest sessions in 40 participants, 7 (17.5%) were excluded from the current analysis because of insufficient data due to artifact contamination in one or both of the recording sessions. See Table 2 for reasons for exclusion and for a comparison of the demographics and characteristics of the included versus excluded participants; see also the Discussion section. The time between test and retest was 5.2 months on average (range, 2.9-10.4 months).
Participants were encouraged to take short breaks between blocks as needed. The entire experimental session lasted around 3 hours and, in addition to data acquisition, included cap application, frequent short breaks, lunch, and cap removal. Participants were seated at a fixed distance of 65 cm from the screen and responded with their preferred hand. In all trials, they were instructed to press a button on a response pad (Logitech Wingman Precision Gamepad) as soon as they heard the auditory tone. Responses occurring between 150 and 1500 ms after the auditory target stimulus were considered valid, and positive feedback was provided via presentation of a cartoon dog image and an uplifting sound. If the response fell outside this time window, a running dog cartoon with a sad sound indicated that the response was too fast, and a sitting dog image with the sad sound indicated that the response was too slow. Frequent breaks were given as needed to ensure maximal task concentration. Here we focus on the sensory evoked and behavioral responses to evaluate their reliability between two data recording sessions separated by a minimum of 10 weeks (2.5 months). Analyses testing the hypothesis that children with ASD do not use temporally predictive information in a typical way are presented in another report, in which we compare the ASD data to an age-matched typically developing control group (34).
Data processing: Data were processed and analyzed using custom MATLAB scripts (MATLAB r2017a, MathWorks, Natick, MA) and the FieldTrip toolbox (35). A minimum of 50 EEG trials per condition per analysis was set as a criterion for a participant to be included in the analysis; however, most participants had more than 100 trials in each condition (test: mean ± standard deviation (SD): 208±86; retest: 209±95). Due to occasionally long reaction times in some participants, only responses given within 1000 ms after stimulus presentation were considered valid for further analysis of the behavioral data.
Measurements that were used in the ICC calculations were calculated as follows:

Behavior:
Reaction times and sensitivity, indexed by d-prime (d') (36,37), were calculated from the behavioral data. Hits were defined as responses that occurred between 150 ms and 1500 ms following the auditory tone. False alarms were defined as responses to catch trials, i.e., pushing the button even though no auditory target stimulus occurred. d' was calculated for each participant as:

d' = Z(hit rate) - Z(false-alarm rate),

where Z denotes the inverse of the standard normal cumulative distribution function. ICCs were calculated for RT means and SDs, and for d', for both Cue and No-Cue conditions.
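The d' computation can be sketched as follows. Note that the paper does not state how hit or false-alarm rates of exactly 0 or 1 were handled, so the log-linear correction below is an illustrative assumption, not the authors' method.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index: d' = Z(hit rate) - Z(false-alarm rate),
    where Z is the inverse of the standard normal CDF."""
    # Log-linear correction (an assumption, not specified in the paper):
    # guards against rates of exactly 0 or 1, which Z maps to +/- infinity.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)
```

By construction, chance-level responding (equal hit and false-alarm rates) yields d' = 0, and higher values index better target/catch discrimination.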

EEG Data:
Continuous EEG data were down-sampled to 256 Hz, band-pass filtered between 0.1 and 55 Hz using a fifth-order Butterworth infinite impulse response (IIR) filter, and then epoched as specified below. Epochs were demeaned to normalize for DC shifts, and baseline-corrected using the 100 ms time window prior to stimulus onset.
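The epoching and baseline-correction step amounts to the following sketch (plain Python for illustration only; the actual pipeline used MATLAB and FieldTrip):

```python
def epoch_and_baseline(signal, fs, onset_idx, pre_s, post_s, baseline_s=0.1):
    """Cut an epoch spanning pre_s before to post_s after the stimulus-onset
    sample, then subtract the mean of the pre-stimulus baseline window
    (100 ms here, as in the paper)."""
    pre, post = int(pre_s * fs), int(post_s * fs)
    epoch = signal[onset_idx - pre : onset_idx + post]
    baseline = signal[onset_idx - int(baseline_s * fs) : onset_idx]
    mu = sum(baseline) / len(baseline)
    return [x - mu for x in epoch]
```

Demeaning against the pre-stimulus window removes slow DC offsets so that post-stimulus deflections are measured relative to the immediately preceding activity.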

i. Visual Evoked Response (VEP):
To derive the VEP, epochs from 200 ms before to 850 ms after visual stimulus presentation were generated, baseline-corrected to the 100 ms preceding stimulus onset, and then averaged across trials separately for each participant and recording session. Data were referenced to a midline frontal channel (AFz) to optimize visualization and measurement of the VEP over occipital scalp at channels O1, O2, and Oz.
Comparison of the VEP between the sessions was based on the voltage at the peak of the P1, N1, and P2 components for each participant, within time windows of 80 ms, 60 ms, and 100 ms, centered at 100 ms (visual P1), 180 ms (visual N1) and 350 ms (visual P2), respectively. Both means and SDs per participant were used to assess ICC for mean and inter-trial variability metrics of the VEP, respectively, for a total of 6 measures across the three visual components.
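The windowed peak measurement might be implemented as below; `peak_amplitude` is a hypothetical helper (not from the paper), assuming a trial-averaged epoch sampled at a fixed rate and beginning `pre_s` seconds before stimulus onset, as in the 200 ms pre-stimulus epochs described above.

```python
def peak_amplitude(epoch, fs, pre_s, center_s, width_s, polarity=+1):
    """Voltage at the component peak inside a window of width_s seconds
    centered on center_s (seconds post-stimulus). polarity=+1 selects the
    maximum (positive components such as P1, P2); polarity=-1 selects the
    minimum (negative components such as N1)."""
    lo = int((pre_s + center_s - width_s / 2) * fs)
    hi = int((pre_s + center_s + width_s / 2) * fs)
    window = epoch[lo:hi]
    return max(window) if polarity > 0 else min(window)
```

For example, the visual P1 window above would correspond to `peak_amplitude(epoch, fs, 0.2, 0.100, 0.080, +1)`.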

ii. Auditory Evoked Response (AEP):
To derive the AEP, epochs from 300 ms before to 850 ms after auditory stimulus presentation were generated and baseline-corrected to the 100 ms preceding stimulus onset. Peak voltages of the auditory P1, N1, and P2 components were then measured within component-specific time windows, analogously to the VEP. Both means and SDs per participant were used to assess ICC for mean and inter-trial variability metrics of the AEP. This yielded a total of 12 measures, since these were generated for both Cue and No-Cue conditions for each of the three components.
iii. Phase measurement: We were also interested in whether the mean phase angle at the Cue-condition stimulation frequency of 1.5 Hz was consistent for participants across testing sessions, since oscillatory phase alignment has been shown under rhythmic stimulation conditions (38-42). Phase was measured at the onsets of the visual stimuli, at 1.5 Hz, and yielded a single measure for the ICC analysis. It was calculated for each participant and session on a trial-by-trial basis, from the complex number obtained through Morlet-based wavelet convolution of the signal. To encompass the full sequence of stimuli comprising a trial, EEG data were epoched from 3000 ms before to 500 ms after the auditory event. Trials were baselined to the 100 ms before the onset of the first visual stimulus in the sequence, and referenced to a frontal channel (AFz). The phase analysis was performed on epochs that were low-pass filtered at 55 Hz and high-pass filtered at 0.1 Hz. Following previous studies showing posterior entrainment to rhythmically presented visual inputs (38,40), our channels of interest for phase alignment were parieto-occipital (PO3, POz, PO4).
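A simplified sketch of the phase estimate is given below. It correlates the signal with a complex exponential at the stimulation frequency over a rectangular window of a few cycles; the actual analysis used a Morlet wavelet, whose Gaussian taper this illustration omits, so treat it as a conceptual sketch rather than the authors' implementation.

```python
import cmath, math

def phase_at(signal, fs, freq, t_idx, n_cycles=3):
    """Phase angle (radians) of `signal` at sample t_idx and frequency
    `freq`, estimated from an n_cycles-wide rectangular window. A Morlet
    wavelet would additionally taper this window with a Gaussian."""
    half = int(n_cycles * fs / (2 * freq))
    acc = 0j
    for i in range(t_idx - half, t_idx + half):
        t = (i - t_idx) / fs  # time relative to the measurement point
        acc += signal[i] * cmath.exp(-2j * math.pi * freq * t)
    return cmath.phase(acc)
```

Averaging the resulting unit phasors across trials (rather than the raw angles, which wrap at ±π) gives the mean phase angle entered into the ICC analysis.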

Test-retest analysis
Our analyses focused on assessing the consistency of behavioral and electrophysiological responses across the two recording sessions. To do this, we performed intraclass correlation coefficient (ICC) analyses using a one-way mixed effect model with absolute agreement and multiple observations (27,29,43,44), according to the formula:

ICC(1,k) = (MSR - MSW) / MSR

where MSR is the mean square for rows (variance between participants), MSW is the mean square for residual sources of variance, and k is the number of raters (measurements). To measure possible associations between the Similarity Index (SI) and participant cognitive variables, Pearson correlation coefficients were calculated between SI, PIQ, VIQ, RBSR and ADOS scores, in the form of a correlation matrix. Results were then corrected for multiple comparisons (45).
Finally, to test for the possibility that test-retest reliability found for the participants was linked to the time that had passed between the sessions, which varied quite widely between 2.9 and 10.4 months, we measured the correlation between the participants' similarity index and the time interval between the test and the retest.

RESULTS
The auditory and visual sensory evoked responses at test and retest are illustrated in grand average VEP and AEP waveforms and topographic maps in Figure 2, and the individual participant VEP, AEP and behavioral responses in Figure 3. A striking similarity between the group mean responses can be observed in Figure 2, whereas at the individual participant level some variance is apparent (Figure 3). ICC analyses were performed to formally assess the consistency of responses at the individual participant level between test and retest.

Intraclass Correlation Coefficient (ICC) results
Twenty-five measures of the EEG and behavioral data (see Methods) were submitted to ICC analysis. ICC values are presented in Figure 4. ICC R values, p values, and lower and upper bounds of the 95% confidence intervals, calculated separately for each measurement, are presented in Table 3.
For ICC analysis, higher R values indicate stronger agreement between the two sessions. Several of the ICC values in Table 3 would be considered excellent, rather than good. Pearson correlations yielded significant correlations for all measurements but phase angle (uncorrected p<0.05; see Figure 5). d' Cue, d' No-Cue, RT Cue, RT No-Cue, RT ITV No-Cue, VEP P1, VEP P1 ITV, VEP N1, VEP N1 ITV, VEP P2, VEP P2 ITV, AEP N1, AEP N1 ITV, AEP P2, and AEP N1 No-Cue remained significant following correction for multiple comparisons.
Pearson correlations for the test-retest pairs are presented in Figure 5.

Similarity Index (SI) and clinical measures
SI was generated for each participant (see Methods) and tested for correlation with ADOS severity scores, PIQ, VIQ and RBSR in a correlation matrix (Figure 6). None of these correlations survived Bonferroni correction (45). Finally, there was no correlation between SI and the time between the sessions (Rho=0.006; p=0.97).

DISCUSSION
In autism research, several factors bring into question the possibility that brain measurements can serve as reliable markers of neurocognitive function. Basic findings on sensory processing from recordings of electrophysiological brain activity often differ across laboratories, and there is some evidence of higher inter-participant (48-50) and inter-trial ((51-53), but see (25,26)) variability within such recordings compared to control groups. This raises the possibility that such measurements may simply be too noisy to serve as reliable readouts of brain function in ASD.
Alternatively, differences in findings between laboratories may result from factors that do not have direct implications for the reliability of the scalp-recorded electrical brain response, such as differences in stimuli, task, EEG recording setup and analysis pipeline, ascertainment bias and clinical cohort. What is more, inter-participant variability may reflect a feature of the heterogeneity of the condition rather than random noise. Surprisingly few studies to date have sought to test the stability of these responses when participants, recording equipment, analytic approach, and stimulation parameters are held constant, which is particularly critical to establish if a biomarker is to be used as an outcome measure in a clinical trial, or as a reliable indicator of neural and neurocognitive dysfunction (54,55). Only two previous studies, as far as we are aware, examined the reliability of such measures across two recording sessions. Levin and colleagues (56) collected 5 minutes of resting state EEG from children with and without ASD at two intervals separated by ~6 days, and found good reliability of the center frequency and amplitude of the largest alpha-band peak.
Cremone-Caira and colleagues (57) found moderate to good consistency of the executive function related frontal-N2 response elicited during go/nogo and flanker tasks in children with ASD across two time points separated by ~3 months.
Here we add to this emerging literature with the finding that in children with ASD, auditory and visual sensory evoked responses and associated task performance show good, and in some cases excellent, test-retest reliability (46,47). Interestingly, these high ICC values were found not only for mean responses but also for the inter-trial variance of these responses.
Significance was found across data category (ERP and behavior), sensory domain (VEP and AEP), ERP component (P1, N1 and P2), experimental condition (Cue and No-Cue) and response metric (mean and SD). Among the 10 measurements for which significant ICCs were not found, 6 were from the AEP (representing 50% of the AEP-derived measures). In notable contrast, ICC was significant for all measurements of the VEP. Of the remaining four values that did not achieve significance, 3 were behavioral (RT ITV for the cued condition and d' for cued and non-cued conditions) and the remaining one was mean phase angle. From this we conclude that in the current setup, VEP and mean RT were most reliably consistent across test-retest measurements. Given the minimum interval between test/retest sessions across our cohort, we also conclude that for the tested age-range of ~6 to 9.4 years, these reliability measures are valid for at least a 3-month interval.
Atypicalities in sensory-evoked neural responses and behavioral performance have been widely reported in ASD, including altered responses to visual (2,49,58), auditory (59), and somatosensory stimuli (60-62). Moreover, in some studies, higher inter-participant (48-50) and within-participant inter-trial (51-53) variability of brain responses to sensory stimuli has been shown. This is in line with higher inter-trial behavioral variability observed in individuals with ASD, as measured by reaction times in executive function (63) and tactile judgment (64) tasks, as well as rhythmic tapping tasks (65). The higher variability between trials and between individuals with ASD has, in turn, been interpreted in the context of neuronal processing being "noisy" or "unreliable" (e.g., (49,50,52,66-68), but see (69,70) for reports of lower noise in ASD, and (25,26) for evidence of typical levels of noise in ASD). According to this view, high levels of endogenous neural noise in ASD render neural signals unreliable (53,71). Arguing against a pure noise account, here we see a stable pattern of both mean activity and inter-trial variability over time. The current data suggest that such variance likely represents replicable neural processing differences at the individual participant level in the clinical group, rather than noise. Hypo- and/or hyper-sensitivity of synaptic activity, for example, could lead to a higher than typical range of neuronal responses to a given stimulus (72). A possible result would be an increased range of neural activity across large-scale neural networks that is nevertheless stable over time (25). At the same time, given the consistency of individual responses within our clinical group, the inter-participant variability that has been observed (48,50) is likely to reflect that ASD has a variety of etiologies and developmental routes (1), which in turn lead to heterogeneous neural and behavioral phenotypes.
A number of notable recent reviews have focused on the promise of EEG-based biomarkers of intellectual and developmental disabilities (IDDs), and discussed the requirements and challenges therein (54,73-75). Biomarkers have the potential to serve many purposes, including assessment of risk, diagnosis, disease progression, intervention response, and mechanism of disease. Validity and reliability of the potential biomarker are critical to establish. Here, we find good-to-excellent reliability of sensory evoked responses to simple auditory and visual stimuli using an active paradigm suitable for children. Since auditory and visual sensory ERPs have been shown to differ in ASD, the additional finding that they can be reliably measured and show stability within individuals over time opens the door to their further development as biomarkers. Next steps will be to establish whether these measures are equally reliable in the absence of a task and how they are affected by state (e.g., drowsy versus alert), and to determine whether they can be applied in more severely affected individuals (76-78). Given the simplicity of the paradigm and stimuli, such biomarkers could also be suitable for translational studies in non-human models of ASD (see discussion by Modi & Sahin, 2017).
We should note that full datasets were collected for fewer than half of the potential cohort.
Consideration of the reasons for this, and the implications for EEG biomarker use in clinical studies, is worthwhile. The parent study, a clinical trial, required a minimum of ~40 lab visits. During these visits clinical assessments were performed, collection of primary outcome measures was made at three time-points, and therapy sessions occurred. Due to the already significant demands of the parent study, EEG recordings were not prioritized, since they did not provide a primary outcome measure. In this context, about half of the participants that completed the parent study did not yield full EEG datasets (N=31): 31% did not perform the task correctly or at all and so EEG data collection was terminated, 29% would not wear the cap, 16% refused to continue the EEG experiment partway into data collection, 13% did not sit still enough to acquire good EEG data and so data collection was terminated, and for <1% either no attempt at EEG data collection was made or hair style prevented adequate cap application. Participant characteristics for included and excluded participants are presented in Table 2. We note that when EEG data collection is primary to the study that the participant has been recruited for and therefore is prioritized, we typically have high levels of compliance of at least 85%.
In our EEG studies in high-functioning clinical populations, we achieve at least this rate of compliance even when we use paradigms that involve relatively complex tasks. What is more, we have similar compliance rates in our EEG studies in lower-functioning populations such as Rett syndrome (76-78) and Batten disease, where we use passive paradigms that do not require task performance, and in individuals with severe neuropsychiatric conditions (82,83).
ICC is a strong metric of the reliability of a response for a given group, but it does not provide individual scores that can be used to assess how test-retest similarity may vary as a function of another variable, such as the time interval between measures. We therefore generated a composite measure for each individual, the Similarity Index (SI), which is simply the mean of the variance between test/retest values across all of the z-scored ERP and behavioral measures. The SI may be considered a composite measure of the stability, within an individual, of the neuronal/behavioral readouts evoked by a given task. The more similar the test-retest readouts of a process are, the more stable and less variable the neuronal activity that underlies this process. The SI was used to test whether the reliability of the responses varied systematically with the length of the interval between test and retest. It is important to note that differences in intervals arose due to uncontrolled factors, such as appointments being rescheduled. In this context, there was no evidence for such a relationship, although we are reluctant to draw strong conclusions from this, since the study was not designed to test this hypothesis. Indeed, one might expect such a relationship, given that the brain is still immature and neuronal plasticity relatively high at these ages. We additionally tested for possible covariance of SI scores with participant traits, as represented by cognitive/clinical variables, but found no significant correlation between SI and RBSR, IQ or autism severity scores. Future work will be required to establish the validity of such a composite SI.
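One plausible implementation of the SI is sketched below, under assumptions the text leaves open (here, each measure is z-scored across the pooled test and retest values before the per-participant test/retest variance is taken; the exact normalization scheme is our assumption for illustration):

```python
def similarity_index(test, retest):
    """test and retest map each measure name to a list of per-participant
    values (same participant order in both). Returns one SI per participant:
    the mean, across measures, of the variance of that participant's
    z-scored test/retest pair. Lower SI = more stable readouts."""
    measures = list(test)
    n = len(test[measures[0]])
    z = {}
    for m in measures:
        # Pool both sessions so test and retest share one z-scale per measure.
        pooled = test[m] + retest[m]
        mu = sum(pooled) / len(pooled)
        sd = (sum((v - mu) ** 2 for v in pooled) / (len(pooled) - 1)) ** 0.5
        z[m] = [(v - mu) / sd for v in pooled]
    si = []
    for p in range(n):
        # Sample variance of a pair (a, b) reduces to (a - b)^2 / 2.
        si.append(sum((z[m][p] - z[m][n + p]) ** 2 / 2
                      for m in measures) / len(measures))
    return si
```

Because each measure is z-scored first, microvolt-scale ERP amplitudes and millisecond-scale RTs contribute on equal footing, and an SI of 0 corresponds to identical test and retest values on every measure.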

Study limitations
A potential limitation of the current analysis is that the data were collected in the context of a treatment study. Of the 33 participants included in these analyses, two-thirds (n=22) were in active treatment groups (applied behavior analysis (ABA), N=10; sensory integration therapy (SIT), N=12), and one-third were in a treatment-as-usual control group (N=11). Importantly, however, with regard to hypotheses for the parent study, there was no expectation that the treatments would influence the basic auditory and visual sensory responses or reaction times that we focused on here and for which we found good consistency.
Another limitation of these data is that EEG was not acquired for all participants, for reasons including hyperactive behavior that prevented them from being able to sit for EEG recordings, inability to do the task, and refusal to keep the cap on. As described earlier, this highlights possible challenges for research-grade EEG data collection for clinical trials (see earlier discussion, and Table 2).
Lastly, while our study finds strong consistency of neuronal and behavioral measurements in children with ASD, it does not include similar data from an age-matched typically developing (TD) control group, and thus we cannot draw conclusions regarding whether reliability differs from that of a TD control group. However, this does not detract from the evidence for remarkably good consistency of the responses between two recording sessions in children with ASD, and all its implications, which is a critical feature for a treatment biomarker.

Data Sharing: The authors will make the full de-identified dataset, with appropriate notation, and any related analysis code available in a public repository (Figshare), and will include digital object identifiers within the final text of the paper, so that any interested party can access them.