Top-down processing in the prefrontal cortex has been observed particularly when rich and unambiguous visual information is lacking19,22, as is the case in our dot-following task. Specifically, categorical prediction in the prefrontal cortex was activated by ambiguous visual information in early visual areas19. To test this constraint, we presented actual images of faces and houses in the MEG scanner as a control condition. In contrast to the dot-following task, here we expected a dominance of feedforward processing along the ventral pathway.
The study consisted of a behavioral experiment and an MEG experiment, with a one-week interval between the two. In the behavioral experiment (Fig. 1a), gaze tracks of all participants were recorded while they looked at images of faces and houses (see Table S1 in Supplementary Information for gaze parameters). In the MEG experiment, participants followed dot sequences derived from their own gaze tracks (self-face, SF; self-house, SH) or another participant's gaze tracks (other-face, OF; other-house, OH) with eye movements (Gaze Session, Fig. 1b). After the Gaze Session, participants took part in an Image Session where they viewed images of faces and houses while maintaining fixation on a central point (Fig. 1c). The behavioral performance is reported in Behavioral results of the Supplementary Information.
We first analyzed whether the face- and house-related fixation sequences obtained in the behavioral experiment showed distinct patterns. A machine-learning classification over the spatiotemporal parameters of the fixations (x and y coordinates, and fixation duration) discriminated the two categories of gaze sequences with high accuracy, 70.5 ± 11.4% (M ± SD), significantly above chance (p < 0.001; permutation-based significance testing). The same analysis on the eye movements collected in the MEG Gaze Session likewise yielded high prediction accuracies, 66.3 ± 10.2% for SF vs. SH (p < 0.001) and 66.5 ± 9.7% for OF vs. OH (p < 0.001). Note that the significant classifications cannot be due to the different sizes of the two image categories (see Spatial dispersion of fixations in Supplementary Information).
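The classification scheme above can be sketched as follows. This is a minimal, hypothetical illustration, not the study's actual pipeline: it assumes each trial's fixations are summarized as a fixed-length vector of (x, y, duration) triplets, and uses a linear SVM with a permutation test on synthetic data; the classifier type, feature encoding, and all numbers are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, permutation_test_score

rng = np.random.default_rng(0)

# Toy data: 40 trials x (10 fixations * 3 features); labels 0 = face, 1 = house
X = rng.normal(size=(40, 30))
y = np.repeat([0, 1], 20)
X[y == 1] += 0.8  # inject a category difference so the toy example is decodable

clf = SVC(kernel="linear")
acc = cross_val_score(clf, X, y, cv=5).mean()

# Permutation-based significance test: refit on label-shuffled data to build
# a null distribution of accuracies, then compare the observed accuracy to it
score, perm_scores, p_value = permutation_test_score(
    clf, X, y, cv=5, n_permutations=200, random_state=0
)
print(f"accuracy = {acc:.2f}, permutation p = {p_value:.3f}")
```

The permutation test is the key design choice here: it yields a significance level without assuming any parametric null distribution for decoding accuracy.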
We also performed cross-experiment classifications in which the classifier was trained on the gaze patterns from the behavioral experiment and used to predict the gaze patterns in the Gaze Session of the MEG experiment. The above-chance cross-experiment prediction accuracies confirmed that the distinct patterns of online eye movements in the Gaze Session were related to the face vs. house categories in the behavioral experiment. Moreover, the cross-experiment prediction accuracies were higher for self-generated than for other-generated gaze tracks (Fig. 1e), indicating that participants followed their own gaze tracks better (see Cross-experiment classification of gaze patterns in Supplementary Information for the statistics).
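The cross-experiment logic, training on one dataset and testing on the other, can be sketched as below. This is a hedged toy example on synthetic data, assuming the same feature encoding in both experiments; the shared 0.8 offset stands in for category structure that generalizes across experiments.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Toy fixation features: (trials, features); labels 0 = face, 1 = house
X_behav = rng.normal(size=(60, 30)); y_behav = np.repeat([0, 1], 30)
X_meg = rng.normal(size=(40, 30));   y_meg = np.repeat([0, 1], 20)
X_behav[y_behav == 1] += 0.8  # category structure shared across experiments
X_meg[y_meg == 1] += 0.8

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_behav, y_behav)            # train on the behavioral experiment only
cross_acc = clf.score(X_meg, y_meg)  # test on MEG Gaze-Session trials
print(f"cross-experiment accuracy = {cross_acc:.2f}")
```

Above-chance `cross_acc` is informative precisely because no MEG-session trial was ever seen during training: the decision boundary must rely on category structure common to both experiments.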
To reveal the face-related gaze representations, we compared the MEG signals of face-related gaze tracks with those of house-related gaze tracks. The house-related gaze tracks served as a control because they involved comparable eye movements and visual stimulation but lacked a structural pattern (for statistical evidence, see The structural pattern of face-related gaze tracks in Supplementary Information). For each condition (SF, SH, OF, OH) in the Gaze Session, the ERF signal from 0-1500 ms relative to the onset of the gaze track was calculated. The difference in the estimated cortical current maps was then computed for the contrasts 'SF - SH', 'OF - OH', and '(SF - SH) - (OF - OH)', revealing the temporal development of the brain networks involved in face-related gaze tracks and, for the interaction, the areas sensitive specifically to self-generated face-gaze tracks. The face-related gaze tracks elicited stronger ERF signals than the house-related gaze tracks in the orbitofrontal cortex (OFC) and the ventral anterior temporal lobe (ATL), extending to the medial temporal lobe ('SF - SH' and 'OF - OH' in Fig. 2a, Bonferroni-corrected for time and spatial clustering; see also Supplementary Videos 1-4). Although small, signal differences also reached back to the occipital cortex (OCC) (at 600 ms and 1400 ms, depending on the contrast). Importantly, the activities in the OFC and ATL emerged earlier than those in the medial temporal lobe and occipital cortex, suggesting a top-down prediction of the face category guided by the face-related gaze tracks.
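The three contrasts reduce to simple arithmetic on trial-averaged, source-space time courses. A minimal numpy sketch, with synthetic stand-ins for the 0-1500 ms signals (the effect sizes, dimensions, and noise level are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n_sources, n_times = 100, 150  # e.g., 150 samples covering 0-1500 ms

def erf(effect=0.0):
    """Synthetic trial-averaged ERF; `effect` adds a condition-specific offset."""
    return rng.normal(size=(n_sources, n_times)) * 0.1 + effect

sf, sh = erf(0.5), erf(0.0)  # self-generated face vs. house gaze tracks
of, oh = erf(0.3), erf(0.0)  # other-generated face vs. house gaze tracks

diff_self = sf - sh                   # 'SF - SH'
diff_other = of - oh                  # 'OF - OH'
interaction = diff_self - diff_other  # '(SF - SH) - (OF - OH)'

# The interaction is positive where the face effect is larger for
# self-generated than for other-generated gaze tracks
print(f"mean interaction = {interaction.mean():.2f}")
```

The interaction contrast is what isolates self-specificity: a region that responds more to face- than house-related tracks regardless of who generated them cancels out of `interaction`.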
Moreover, this network was more active while participants followed self-generated gaze tracks than another observer's gaze tracks, as revealed by the interaction contrast '(SF - SH) - (OF - OH)' (Fig. 2a and Supplementary Video 5). The source reconstruction was predominantly localized in the ventral stream, with the notable exception of dorsal-stream areas in the frontal cortex, including the frontal eye field (FEF) and supplementary eye field (SEF), in the interaction contrast (Fig. 2b and Supplementary Video 6).
The whole-brain source reconstruction was also performed on the signal difference between Face and House in the Image Session. In contrast to the Gaze Session, here the strongest signal difference was localized in posterior occipitotemporal areas, beginning as early as 200 ms post-stimulus onset (Fig. 3 and Supplementary Videos 7-8).
To provide statistical evidence for the temporal order of signal development observed in the event-related magnetic field analysis, we tested whether there was an information flow from anterior areas (e.g., OFC and ATL) to posterior areas (e.g., FFA) during gaze following. We performed a spatial gradient analysis, which quantified how the MEG signals changed gradually along a specific dimension (e.g., anterior-posterior) in brain space. Here, the top-down hypothesis of the gaze-track representations predicted an information flow from anterior to posterior areas along the ventral pathway. This can be probed as a gradual decrease of activity from anterior to posterior areas, and in particular by how the gradient became face-selective during gaze following. Because an area or neural network with stronger signals would be faster to exceed the neural threshold for maintaining sensory selectivity or perceptual preference, stronger signal changes in anterior areas indicate that the face-selective representation of the gaze tracks emerged there earlier than in posterior areas.
To test our hypothesis, the spatial gradient was analyzed based on the ERF signal differences (e.g., 'SF - SH', 'OF - OH') to show how the signal changed along the anterior-posterior dimension. The analysis was performed at each time point during gaze following to show how the gradient pattern became face-selective over time. To provide a complete gradient pattern at the whole-brain level, the analysis was also performed for the dorsal-ventral and left-right dimensions. For each of the three dimensions (x, y, z; hence left-right, anterior-posterior, and dorsal-ventral), we modeled the spatial coordinates as a function of the ERF difference at each time point. Note that the ERF difference served as the fixed factor (predictor) and the spatial coordinates as the model estimations, because the ERF difference at each time point was fixed across the three dimensions. The R of the model was calculated to assess the variance accounted for. The first-order derivatives of the estimated model were calculated to test whether the spatial gradient increased or decreased monotonically along a specific dimension.
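The gradient analysis at a single time point can be sketched as follows. This is a hedged toy version on synthetic source data: the model family (here a quadratic polynomial), the noise level, and the simulated anterior-weighted effect are all assumptions, but the three reported quantities, the fit, its R, and the sign of its first derivative, mirror the analysis described above.

```python
import numpy as np

rng = np.random.default_rng(2)

n_sources = 500
erf_diff = rng.uniform(-1, 1, n_sources)  # e.g., 'SF - SH' at one time point
# Simulate larger signal differences at more anterior (larger y) sources
y_coord = 0.6 * erf_diff + rng.normal(0, 0.2, n_sources)

# Model: spatial coordinate as the dependent variable, ERF difference as the
# fixed predictor (quadratic fit is an illustrative assumption)
coefs = np.polyfit(erf_diff, y_coord, deg=2)
fitted = np.polyval(coefs, erf_diff)

# R: correlation between fitted and observed coordinates (accounted variance)
R = np.corrcoef(fitted, y_coord)[0, 1]

# First-order derivative of the fitted model over the predictor's range;
# a dominant positive sign indicates a monotonically increasing gradient
grid = np.linspace(erf_diff.min(), erf_diff.max(), 200)
deriv = np.polyval(np.polyder(coefs), grid)
frac_positive = np.mean(deriv > 0)
print(f"R = {R:.2f}, {frac_positive:.1%} of derivative values > 0")
```

Repeating this fit at every time point, and against a baseline window, is what lets R be tracked over time and tested for significance, as in the results below.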
As shown in Fig. 4a (left), the ERF signal difference between SF and SH showed a significant gradient pattern along the y and z dimensions (cluster-based permutation correction at p < 0.05; see Fig. 4a for the significant time ranges), whereas the x dimension did not reach significance (no time ranges reached significance). The significant gradient pattern that emerged over time during gaze following (i.e., significantly higher R than baseline) indicated that the gradient pattern was not due to general intrinsic brain dynamics but rather to face-specific gaze following. Along both the y and z dimensions, the estimated model showed a monotonic characteristic, with 97.5% of the derivative values > 0 at the time point with the strongest gradient pattern along the y dimension and 98.9% of the derivative values < 0 along the z dimension (Supplementary Fig. S2). The signal difference between OF and OH showed a similar pattern, with 98.7% of the derivative values > 0 along the y dimension and 97.7% of the derivative values < 0 along the z dimension (Fig. 4a right, Supplementary Fig. S2). The significant results along the y dimension indicated that the signal differences between SF and SH, and between OF and OH, decreased along the anterior-to-posterior axis, as shown by the detailed pattern in Fig. 4b, c. The peak time points (marked by triangles in Fig. 4a) denote when the change of the MEG signal difference as a function of the spatial coordinates along a given dimension was strongest. Specifically, at the time point where the gradient pattern reached its peak along the y dimension (910 ms for 'SF - SH', 1485 ms for 'OF - OH'), the values of the y coordinates increased as a function of the amplitude of the MEG signal difference (marked in magenta). These results indicated that the neural representation of face-related gaze tracks was strongest in the anterior part of the brain and decreased along the anterior-posterior axis.
Importantly, this gradient pattern became more evident from the onset of the gaze tracks (i.e., the zero time point) to the 25%, 50%, 75%, and 100% peak time points, indicating that the gradient pattern emerged as gaze following proceeded. The significant results along the z dimension indicated that the signal differences between SF and SH, and between OF and OH, decreased along the ventral-to-dorsal axis of the brain, as shown by the detailed pattern in Fig. 4d, e. Specifically, at the time point where the gradient pattern reached its peak along the z dimension (670 ms for 'SF - SH', 650 ms for 'OF - OH'), the values of the z coordinates decreased as a function of the amplitude of the MEG signal difference (marked in magenta). This pattern likewise became more evident from the onset of the gaze tracks to the 25%, 50%, 75%, and 100% peak time points. However, there was no difference between the left and right hemispheres, as shown by the symmetric pattern of the spatial gradient along the x dimension (Fig. 4b-e).
Summarizing the results along the anterior-posterior and ventral-dorsal dimensions, these findings provided statistical evidence for the temporal sequence observed in the event-related magnetic field analysis, showing that the representations of face-related gaze tracks dynamically progressed along the ventral pathway, starting from the ventral anterior areas (e.g., OFC and ATL) via the medial temporal lobe (MTL) to the ventral occipitotemporal cortex.
The spatial patterns at the time point where the gradient reached its peak are shown in Fig. 4a (left: 'SF - SH', R peaked at 910 ms along the y dimension and at 670 ms along the z dimension; right: 'OF - OH', R peaked at 1485 ms along the y dimension and at 650 ms along the z dimension). Importantly, the spatial gradient emerged earlier for 'SF - SH' than for 'OF - OH' along the y dimension (i.e., the posterior-to-anterior dimension), mean latency difference = -310 ms, 95% CI = [-460 ms, -165 ms], but not along the z dimension (mean latency difference = 20 ms, 95% CI = [-360 ms, 450 ms]; Fig. 5a). Together with the cross-experiment classifications of gaze patterns, these results indicated that self-generated gaze tracks were more effective than other-generated gaze tracks in activating the face-selective neural representations.
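One common way to obtain a confidence interval on such a latency difference is to bootstrap across participants; the sketch below is a hypothetical illustration of that approach on synthetic latencies (the paper's actual resampling scheme and sample size are not specified here, so these are assumptions).

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic per-participant peak latencies (ms) for the two contrasts,
# centered on the reported group peaks (spread and n = 24 are assumptions)
lat_self = rng.normal(910, 120, size=24)    # 'SF - SH' peaks
lat_other = rng.normal(1485, 120, size=24)  # 'OF - OH' peaks

# Bootstrap the mean latency difference: resample participants with
# replacement and recompute the difference of means each time
boot = np.array([
    rng.choice(lat_self, 24).mean() - rng.choice(lat_other, 24).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
print(f"mean diff = {boot.mean():.0f} ms, 95% CI = [{ci_low:.0f}, {ci_high:.0f}]")
```

A 95% CI that excludes zero, as for the y dimension above, is what licenses the claim that the self-generated gradient emerged reliably earlier.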
In the Image Session, the ERF signal difference between Face and House also showed a significant gradient pattern along the y and z dimensions (cluster-based permutation at p < 0.05), whereas the x dimension did not reach significance (Fig. 4f). Along both the y and z dimensions, the estimated model showed a monotonic characteristic, with 99.4% of the derivative values < 0 along the y dimension and 82.3% of the derivative values < 0 along the z dimension (Supplementary Fig. S2). The spatial patterns at the time point where the gradient reached its peak are shown in Fig. 4g, h (R peaked at 260 ms along the y dimension and at 710 ms along the z dimension). While the gradient pattern along the z dimension was consistent with the Gaze Session, with the signal difference decreasing from the ventral to the dorsal part of the brain (Fig. 4h), the gradient pattern along the y dimension was reversed, with the signal difference decreasing from the posterior to the anterior part of the brain (Fig. 4g). The opposite anterior-posterior pattern between the Gaze Session and the Image Session can be seen from the positive function in the Gaze Session (y coordinates, marked in magenta in Fig. 4b, c) and the negative function in the Image Session (y coordinates, marked in magenta in Fig. 4g). Importantly, the reversed gradient pattern along the y dimension between the two sessions again indicated that the observed gradient pattern was not due to general intrinsic brain dynamics along the ventral pathway, but rather reflected the feedback-dominant vs. feedforward-dominant processing specific to the current task (i.e., gaze following vs. image processing).
The reversed pattern along the anterior-posterior direction between the Gaze Session and the Image Session was further confirmed by the statistical evidence that the spatial gradient along the y dimension showed a negative correlation between the two sessions, r = -0.99, 95% CI = [-0.994, -0.957], p < 0.001 at the peak time point for 'SF - SH', and r = -0.98, 95% CI = [-0.991, -0.934], p < 0.001 at the peak time point for 'OF - OH' (Fig. 5b, left). By contrast, the spatial gradient along the z dimension showed a positive correlation between the two sessions, r = 0.90, 95% CI = [0.863, 0.951], p < 0.001 at the peak time point for 'SF - SH', and r = 0.92, 95% CI = [0.856, 0.953], p < 0.001 at the peak time point for 'OF - OH' (Fig. 5b, right). Collectively, the reversed pattern along the anterior-posterior direction between the two sessions suggested a combination of feedback and feedforward processing in natural face perception (i.e., when we look at a face using eye movements).
A classification analysis was also performed to show how the different categories of gaze tracks could be distinguished by the multivariate MEG signals across channels. The classification was performed for 'SF vs. SH' and 'OF vs. OH', respectively, using all 306 channels as features. The whole-brain source reconstruction was conducted based on the coefficients of the linear regression between the predicted category and the features of MEG signals. For 'SF vs. SH', as shown in Fig. 6a, c, five spatio-temporal clusters were identified as significantly above chance, with the earliest cluster at 200-265 ms (p = 0.024) in right OFC, and later clusters at 495-545 ms (p = 0.024) in left OCC, 495-540 ms (p = 0.024) in right MTL and OCC, 505-575 ms (p = 0.033) in right OCC, and 700-790 ms (p = 0.021) in right MTL. For 'OF vs. OH', as shown in Fig. 6b, d, five spatio-temporal clusters were identified as significantly above chance, with the earliest cluster at 575-705 ms (p = 0.004) in right OFC, followed by two intermediate clusters at 645-740 ms (p = 0.018) in left ATL and MTL and 675-805 ms (p = 0.001) in left OCC, a later cluster at 820-870 ms (p = 0.020) in left OCC, and a final cluster at 1350-1400 ms (p = 0.011) in right OFC and ATL. These results are consistent with the ERF-based results above, indicating earlier involvement of OFC than ATL, MTL, and OCC in distinguishing the categorical gaze tracks. Note that, in contrast to the ERF-based results, OFC and ATL were more pronounced in the classification-based results. This might be because the ERF-based results revealed the neural patterns that were stronger for face-related than house-related gaze tracks, which relied on the interaction between the top-down prediction in OFC and ATL and the perceptual representation of faces in ventral occipitotemporal cortex.
By contrast, the classification-based results revealed the neural patterns that distinguished the two categories of gaze tracks, which relied more on the top-down prediction of the object categories.
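The channel-space decoding can be sketched as below. This is a hedged illustration on synthetic data: it assumes per-time-point logistic regression over all 306 channels, and converts the decoding coefficients into activation patterns by multiplying with the data covariance (a Haufe-style transform; the paper's exact coefficient-to-source mapping may differ).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

n_trials, n_channels, n_times = 80, 306, 20
X = rng.normal(size=(n_trials, n_channels, n_times))
y = np.repeat([0, 1], 40)
X[y == 1, :10, 8:12] += 0.8  # category signal in a few channels / time points

acc = np.zeros(n_times)
patterns = np.zeros((n_channels, n_times))
for t in range(n_times):
    Xt = X[:, :, t]
    clf = LogisticRegression(max_iter=1000)
    acc[t] = cross_val_score(clf, Xt, y, cv=5).mean()  # time-resolved decoding
    clf.fit(Xt, y)
    # Activation pattern: data covariance times the decoding weights, which
    # is interpretable (unlike raw weights) and can be source-projected
    patterns[:, t] = np.cov(Xt.T) @ clf.coef_.ravel()

print(f"peak decoding accuracy = {acc.max():.2f} at time index {acc.argmax()}")
```

Running this per time point and clustering the above-chance accuracies over channels and time is what yields spatio-temporal clusters of the kind reported above.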
The classification-based results in the Image Session showed a similar pattern to the ERF-based results (Fig. 7). The distinct neural activities between Face and House were strongest in the posterior occipitotemporal areas, peaking around 200 ms post-stimulus onset.