Anticipating others’ actions is innate and essential in order for humans to navigate and interact well with others in dense crowds. This ability is urgently required for unmanned systems such as service robots and self-driving cars. However, existing solutions struggle to predict pedestrian anticipation accurately, because the influence of group-related social behaviors has not been well considered. While group relationships and group interactions are ubiquitous and significantly influence pedestrian anticipation, their influence is diverse and subtle, making it difficult to explicitly quantify. Here, we propose the group interaction field (GIF), a novel group-aware representation that quantifies pedestrian anticipation into a probability field of pedestrians’ future locations and attention orientations. An end-to-end neural network, GIFNet, is tailored to estimate the GIF from explicit multidimensional observations. GIFNet quantifies the influence of group behaviors by formulating a group interaction graph with propagation and graph attention that is adaptive to the group size and dynamic interaction states. The experimental results show that the GIF effectively represents the change in pedestrians’ anticipation under the prominent impact of group behaviors and accurately predicts pedestrians’ future states. Moreover, the GIF contributes to explaining various predictions of pedestrians’ behavior in different social states. The proposed GIF will eventually be able to allow unmanned systems to work in a human-like manner and comply with social norms, thereby promoting harmonious human-machine relationships.
Xueyang Wang, Xuecheng Chen, Puhua Jiang, Haozhe Lin, Xiaoyun Yuan, Mengqi Ji, Yuchen Guo, Ruqi Huang, Lu Fang.
The Group Interaction Field for Learning and Explaining Pedestrian Anticipation.
Engineering, 2024, 34(3): 75-87 DOI:10.1016/j.eng.2023.05.020
Understanding pedestrian dynamics is critical in a variety of real-world tasks, such as autonomous driving [1], [2], robot navigation [3], [4], pedestrian flow analysis [5], [6], and crowd evacuation [7], [8]. Interestingly, humans have an instinctive ability to anticipate the future actions of other people while navigating in crowded spaces and interacting with other pedestrians [9], [10], [11], [12], [13], which permits them to avoid head-on collisions and keep pace with peer partners while maintaining a comfortable distance. As shown in Fig. 1(a), such an ability would allow unmanned systems to work in urban environments intelligently by comprehending and anticipating the actions of pedestrians.
In the past decades, pedestrian anticipation has been modeled using bidirectional flow [13], [14], cellular automata [10], [15], and time to collision [12], [16], [17] to simulate collective behaviors. Recently, machine learning technology has been utilized for this purpose, allowing the future states of pedestrians to be forecast [17], [18], [19], [20], [21], [22], [23]. In essence, the above methods model each individual’s behavior in collision avoidance without considering group-related social behaviors. However, humans are naturally social beings who gather to interact socially and thus form social groups [24], [25]; for example, up to 70% of the observed pedestrians on a street are in groups [26]. Pedestrians conform to expected social norms in groups and act accordingly under the influence of group neighbors [27], where intra-/inter-group interactions are considered to be critical influencers of pedestrians’ social cognition [27], [28] and behavior patterns [29], [30]. To model group or interaction information, state-of-the-art graph neural network (GNN) methods have been used to understand pedestrian/agent dynamics [9], [27], [31], [32], [33], [34], [35]. However, for pedestrians, the influence of group behaviors is not only diverse but also subtle, and different group relationships or interaction states will have very different impacts on pedestrians’ future states. For example, a family group (e.g., mother and daughter) and a tour group usually show quite different behaviors under similar scenarios, as the attention of children is less focused than that of adults. These subtle differences cannot be well modeled by simple relationship or interaction graphs. Because they fail to distinguish between different group relationships among pedestrians, existing methods are insufficient for accurately predicting the differences in pedestrian anticipation influenced by group behaviors [13], [19], [23], [27], [31], [34], [35], [36], [37].
In complex scenes, it is important yet challenging to understand the influence of group relationships and social interactions on pedestrian behaviors. For such contexts, we propose the group interaction field (GIF), a novel group-aware representation, to quantify implicit pedestrian anticipation. More specifically, the GIF consists of a proxemics field and an attention field, which respectively represent pedestrian anticipation using the probability fields of pedestrians’ future locations and their attention orientations. Moreover, we tailor GIFNet to estimate the GIF from explicit multidimensional observations, including the trajectory, visual orientation, and group interaction state. GIFNet can quantify the diverse and subtle influence of group behaviors by formulating a group interaction graph with propagation and group attention that is adaptive to the group size and dynamic interaction states. Our main contributions are threefold:
• We propose the GIF, a group-aware representation of pedestrian anticipation. It consists of a proxemics field and an attention field, which represent the variation of pedestrian anticipation, and thus delivers a comprehensive understanding of the social nature of pedestrians.
• We tailor GIFNet to estimate the GIF; taking explicit observations into consideration, GIFNet uses the advantages of long short-term memory (LSTM) and graph attention network (GAT) to learn implicit spatiotemporal representation and estimate the GIF.
• Extensive validation in various real-world scenarios shows that the GIF can effectively represent changes in pedestrian anticipation under the prominent impact of group behaviors and accurately predict pedestrians’ future states.
2. The GIF
As an estimation of implicit pedestrian anticipation, the GIF consists of a proxemics field and an attention field, which respectively represent predictions of pedestrians’ future location and visual attention. The GIF is estimated by means of GIFNet from explicit observations of pedestrians, including their trajectory, visual orientation, and state of group interaction (Fig. 1(b)). We generated pedestrian data from the PANDA dataset [38], which consists of large-scale natural outdoor scenes with a diversity of scenarios, as well as pedestrian density, trajectory distribution, and group activities. The proxemics field is a sequence of two-dimensional (2D) probabilistic distribution maps denoting the future locations of the pedestrian of interest (Fig. 1(c)) over a timespan T with temporal resolution R. Similarly, the attention field is a sequence of angular ranges representing the pedestrian’s possible orientation and range of visual attention. More formally, given the observation sequence of the pedestrian of interest up to its ending timestamp, the GIF is defined as the pair formed by the proxemics field and the attention field.
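As a minimal sketch of this definition, the GIF can be held in a small container that pairs the two fields. The class name, the nested-list layout of a probability map, and the (start, end) encoding of an angular range are illustrative assumptions rather than the data structures used in GIFNet.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical encodings: a 2D probability map as rows of cell probabilities,
# and an angular range as (start_deg, end_deg).
ProbabilityMap = List[List[float]]
AngularRange = Tuple[float, float]


@dataclass
class GroupInteractionField:
    """Pairs a proxemics field with an attention field, one entry per future step."""
    proxemics_field: List[ProbabilityMap]
    attention_field: List[AngularRange]

    def __post_init__(self) -> None:
        # Both fields cover the same future timesteps, so lengths must match.
        if len(self.proxemics_field) != len(self.attention_field):
            raise ValueError("both fields must cover the same number of timesteps")


def num_future_steps(timespan_s: float, resolution_s: float) -> int:
    """Number of future steps for a timespan T sampled at temporal resolution R."""
    return round(timespan_s / resolution_s)
```

With the setting used later in the experiments (T = 3 s, R = 1/3 s), `num_future_steps` yields nine future steps.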
As shown in Fig. 1(a), the GIFs of solitary pedestrians (cyan), grouped pedestrians without interaction (purple), and grouped pedestrians with interaction (orange) have apparent differences: The single pedestrian has a long and wide proxemics field, while the grouped pedestrians without interaction have shorter and narrower fields, and the grouped pedestrians with interaction tend to approach each other closely. As it can predict pedestrians’ future location and attention orientation, the GIF has great potential in unmanned system applications. Fig. 1(a) shows two representative applications of the GIF. The proxemics field can help an unmanned system (blue) plan its path to avoid disturbing pedestrians, while the attention field can guide an unmanned system (red) to approach a pedestrian from the orientation of attention.
3. GIFNet
To accurately estimate the GIF, we tailor GIFNet, as illustrated in Fig. 2(a). At each timestamp t, GIFNet takes three explicit observations as inputs, namely the trajectory of the pedestrian of interest, the visual orientation of the pedestrian of interest, and the neighbor trajectories in an interaction graph, and outputs the GIF of the pedestrian of interest. Given the pedestrian of interest (purple), the remaining pedestrians in the same group (other colors) are denoted as that pedestrian’s neighbors. More specifically, the group interaction graph is a graph sequence for organizing the group interaction state, whose edges represent whether the pedestrian of interest is interacting with neighbors at each timestep.
GIFNet consists of four modules: ① a trajectory encoder that models the historical trajectory of the pedestrian of interest, ② a visual orientation encoder that models the pedestrian of interest’s visual orientation information, ③ the GIF-GAT, which models the interaction information between the pedestrian of interest and that pedestrian’s neighbors, and ④ a proxemics decoder and a visual orientation decoder that respectively generate an estimation of the proxemics field and of the attention field of the pedestrian of interest. In GIFNet, three encoders, each composed of a fully connected (FC) layer and an LSTM unit, are used to extract features from the trajectory, the visual orientation, and the neighbor trajectories. For the neighbor trajectories, the encoder produces two embedding vectors for each neighbor, encoding the features of the neighbor’s absolute displacement and the displacement relative to the pedestrian of interest, respectively (Fig. 2(b)). The group interaction graph and the features of the neighbor trajectories are further processed by means of a graph attention module (the GIF-GAT; Fig. 2(c)). For each timestep t, an FC layer is used to calculate the weights of the neighbors from the relative displacement feature of the neighbors. The weights are multiplied by the group interaction graph to obtain the final weight of each neighbor. The absolute displacement features of the neighbors are then summed, weighted by these final weights, to form the final neighbor embedding vector. In this way, GIFNet propagates the influence of the neighbors and the group interactions through the graph to learn an embedding feature vector. Finally, the embedding feature vectors of the four kinds of explicit observations are input to the decoders for estimating the proxemics field and the attention field (Fig. 2(d)).
For the proxemics decoder, a Gaussian sampling module is added to learn the uncertainty of the proxemics field and produce a sequence of probability distribution maps representing the pedestrian of interest’s future location. In the following, we elaborate on the design of the trajectory encoder, the visual orientation encoder, the GIF-GAT, and the decoders.
3.1. Trajectory encoder
The purpose of the trajectory encoder is to encode the historical trajectory information and generate a trajectory embedding. The trajectory encoder consists of an FC layer and an LSTM unit. The past trajectory information of the pedestrian of interest is represented by the ordered set of the pedestrian’s displacements relative to the previous timestep (Fig. 2(b)); each displacement is the difference between the pedestrian’s spatial locations at the current and the previous timestamps.
For each observed timestep, the relative displacement is embedded into a fixed-length vector by the FC layer in Fig. 2(a), which is parameterized by an embedding weight. The embedding vector is then used as input to the LSTM cell, which updates its hidden state at that timestep using its own cell weight. These parameters are shared among all the pedestrians in the scene.
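The first two steps of the encoder can be sketched as follows, assuming 2D locations and a plain linear layer; the function names and the explicit weight and bias arguments are hypothetical, and the LSTM update is left out for brevity.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def relative_displacements(trajectory: Sequence[Point]) -> List[Point]:
    """Displacement of each location relative to the previous timestep."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(trajectory, trajectory[1:])
    ]


def embed(displacement: Point,
          weight: Sequence[Sequence[float]],
          bias: Sequence[float]) -> List[float]:
    """Linear (FC) embedding of one displacement into a fixed-length vector.

    `weight` has one row of two coefficients per output dimension; in a
    trained model these would be learned, here they are passed in explicitly.
    """
    dx, dy = displacement
    return [row[0] * dx + row[1] * dy + b for row, b in zip(weight, bias)]
```

Each embedded displacement would then be passed, in temporal order, to a recurrent cell whose final hidden state serves as the trajectory embedding.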
3.2. Visual orientation encoder
The purpose of the visual orientation encoder is to encode the historical visual orientation information and generate a visual orientation embedding. The past visual orientation information of the pedestrian of interest is represented by the ordered set of the pedestrian’s orientations, each expressed as a unit vector defined by the inner angle of the visual orientation with respect to the forward orientation.
As in the trajectory encoder module, the visual orientation sequence is embedded and fed, together with the hidden state, into the LSTM cell of the visual orientation encoder, which has its own cell weight. For simplicity, the notation for the embedding function, the embedding weight, and the hidden state is reused from the trajectory encoder. The final hidden vector aggregates the information from the visual orientation of the pedestrian of interest.
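The unit-vector representation can be illustrated with a short sketch. The (cos, sin) convention and the use of degrees are assumptions made for the example; the source only states that the orientation is a unit vector defined by the inner angle.

```python
import math
from typing import Tuple


def orientation_unit_vector(angle_deg: float) -> Tuple[float, float]:
    """Unit vector for a visual orientation given as an angle in degrees.

    Assumed convention: the angle is measured from the forward orientation,
    so 0 degrees maps to (1, 0).
    """
    theta = math.radians(angle_deg)
    return (math.cos(theta), math.sin(theta))
```

Because the result has the same two-component form as a relative displacement, the same FC-plus-LSTM encoder structure can consume it.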
3.3. GIF-GAT
For efficiency and simplicity, we adopt a mechanism similar to the trajectory encoder to encode the neighbor trajectories. For the pedestrian of interest, as shown in Fig. 2(b), in addition to the displacement of each neighbor relative to the previous timestep, we calculate the location of each neighbor relative to the pedestrian of interest at each timestep. We embed both quantities, which represent the absolute displacement and the relative displacement, respectively.
Then, by feeding the corresponding vectors to the neighbor encoder, we obtain two distance-sensitive context embeddings: the neighbor’s relative embedding and the neighbor’s absolute embedding. Here, Wa represents the weight of the corresponding LSTM cell.
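The two neighbor inputs described above can be sketched as follows. The function name and the choice to align both sequences from the second timestep onward are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def neighbor_inputs(ego: Sequence[Point],
                    neighbor: Sequence[Point]) -> Tuple[List[Point], List[Point]]:
    """Return (absolute displacements, relative locations) for one neighbor.

    Absolute displacement: the neighbor's motion since the previous timestep.
    Relative location: the neighbor's position minus the position of the
    pedestrian of interest at the same timestep.
    """
    absolute = [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(neighbor, neighbor[1:])
    ]
    # Skip the first timestep so both sequences have the same length.
    relative = [
        (nx - ex, ny - ey)
        for (ex, ey), (nx, ny) in zip(ego[1:], neighbor[1:])
    ]
    return absolute, relative
```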
We use a GAT as a sharing mechanism to aggregate the information on interactions between the pedestrian of interest and that pedestrian’s neighbors. As shown in Fig. 2, we consider the pedestrians in a scene as nodes and use edges on the graph to represent information on human-human interaction. The GAT is constructed by stacking graph attention layers. The group interaction graph of the pedestrian of interest is represented by a sequence of dummy variables, one per group neighbor and timestep, each indicating whether an interaction exists between the pedestrian of interest and that group neighbor. We apply temporal pooling to this sequence to generate a pooled context vector, which is composed of the interaction information across the observation period.
The final relative embedding and the final absolute embedding of each neighbor over the observation period are fed to the graph attention layer. The attention coefficient of each node pair is computed by multiplying the pooled interaction context by the weight derived from the neighbor’s final relative embedding.
The output of one graph attention layer for the node of the pedestrian of interest is obtained by applying a nonlinear function to the coefficient-weighted sum, over that node’s neighbors, of their linearly projected absolute embeddings. The parameter matrix of the linear projection is shared and applied to each neighbor separately, mapping the input dimension to the output dimension. The output is a fixed-length embedding for the pedestrian of interest over the observed time, representing the influence of all neighbors on the pedestrian of interest.
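The aggregation can be sketched in a few lines under several simplifying assumptions that are not taken from the source: mean pooling over time, non-negative per-neighbor scores standing in for the FC-layer output, sum normalization of the coefficients, an identity projection, and tanh as the nonlinearity.

```python
import math
from typing import List, Sequence


def temporal_pool(interaction_flags: Sequence[int]) -> float:
    """Pool one neighbor's per-timestep 0/1 interaction flags (mean is assumed)."""
    return sum(interaction_flags) / len(interaction_flags)


def aggregate_neighbors(scores: Sequence[float],
                        pooled: Sequence[float],
                        abs_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of neighbors' absolute embeddings, then a nonlinearity.

    scores: non-negative weight per neighbor (stand-in for the FC output
            computed from the relative embedding).
    pooled: pooled interaction context per neighbor.
    """
    if not abs_embeddings:
        return []
    dim = len(abs_embeddings[0])
    weighted = [s * p for s, p in zip(scores, pooled)]
    total = sum(weighted)
    if total == 0:
        # No interacting neighbor: return a neutral embedding.
        return [0.0] * dim
    coeffs = [w / total for w in weighted]
    summed = [
        sum(c * emb[d] for c, emb in zip(coeffs, abs_embeddings))
        for d in range(dim)
    ]
    return [math.tanh(v) for v in summed]
```

In this sketch, a neighbor whose pooled interaction context is zero contributes nothing to the output, which mirrors the role of the group interaction graph in gating neighbor influence.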
3.4. Proxemics and attention decoder
We use the decoders to generate the proxemics field and the attention field conditioned on the embeddings of the trajectory, the visual orientation, and the neighbors’ influences.
Then, we directly concatenate a noise vector sampled from a Gaussian distribution with the context embeddings as the input for the proxemics decoder.
Moreover, the attention field of the pedestrian of interest is updated using the attention decoder.
The decoders operate on the location and the visual orientation of the pedestrian of interest, and each decoder has its own hidden state and embedding weight.
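The construction of the proxemics decoder input can be sketched as below. The function name, the standard-normal noise, and the ordering of the concatenated parts are assumptions for illustration.

```python
import random
from typing import List, Sequence


def proxemics_decoder_input(traj_emb: Sequence[float],
                            orient_emb: Sequence[float],
                            neighbor_emb: Sequence[float],
                            noise_dim: int,
                            rng: random.Random) -> List[float]:
    """Concatenate the three context embeddings with Gaussian noise."""
    context = list(traj_emb) + list(orient_emb) + list(neighbor_emb)
    # Standard-normal noise is an assumption; it lets repeated decoding
    # produce different plausible futures from the same context.
    noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
    return context + noise
```

Drawing a fresh noise vector for each of several decoding passes is what yields multiple candidate locations per timestep, from which a probability map can be formed.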
4. Experiments
4.1. Experimental settings
4.1.1. Dataset
The performance of our models was evaluated on the PANDA dataset [38]. The videos in the PANDA dataset are captured by gigapixel cameras, and each video frame contains hundreds to thousands of pedestrians, with rich group interaction information. As our method only requires the trajectories, visual orientations, and group interaction information, we extracted this information from the PANDA labels and formed a new dataset with 21 704 trajectories. We divided the trajectories into training, testing, and validation sets, with 15 511, 3052, and 3141 trajectories, respectively. Next, we computed a homography matrix to map images to the top view in order to obtain the locations of the pedestrians in world coordinates.
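The mapping from image pixels to top-view coordinates can be sketched as follows, assuming a 3×3 homography matrix has already been estimated; the function name is hypothetical and the estimation step itself is not shown.

```python
from typing import Sequence, Tuple


def apply_homography(H: Sequence[Sequence[float]],
                     point: Tuple[float, float]) -> Tuple[float, float]:
    """Map an image point (u, v) to top-view coordinates with a 3x3 homography."""
    u, v = point
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to 2D.
    return (x / w, y / w)
```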
Unlike in the existing group-based trajectory-prediction datasets [22], each group in our dataset was assigned several category labels, denoting the kinds of group relationships (i.e., acquaintance and family) and interaction states (i.e., no interaction, non-physical interaction, and physical interaction). For example, eye contact, body language such as hand waving, and talking are non-physical interactions, while holding hands is a type of physical interaction. Group relationship information is identified through the interactions and characteristics of the members, such as appearance, gender, age, and exchanges.
4.1.2. Evaluation metrics
During the test time, we made k predictions of the future position of the pedestrian of interest i; we set k = 20. We then fitted a Gaussian model to all k predicted locations of pedestrian i at each time t and took the point with the highest probability under the fitted model as the optimal predicted location.
We used the average displacement error (ADE) [19] and the final displacement error (FDE) [38] to evaluate the predicted trajectory. The ADE is the Euclidean distance between the optimal predicted location and the ground-truth location of the pedestrian of interest i, averaged over the N predicted timesteps, and the FDE is the same distance at the final predicted timestep.
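These steps can be sketched as follows, assuming a single Gaussian fitted by maximum likelihood, whose density peaks at the sample mean; the function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def optimal_location(samples: Sequence[Point]) -> Point:
    """Most probable point of a Gaussian fitted to the k predicted locations.

    For a maximum-likelihood Gaussian fit, the mode equals the sample mean.
    """
    n = len(samples)
    return (sum(x for x, _ in samples) / n, sum(y for _, y in samples) / n)


def ade_fde(predicted: Sequence[Point],
            ground_truth: Sequence[Point]) -> Tuple[float, float]:
    """Average and final Euclidean displacement errors over the predicted steps."""
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors), errors[-1]
```

For example, a two-step prediction that is exact at the first step and 5 m off at the second has an ADE of 2.5 m and an FDE of 5 m.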
Similarly, we used the average angular error (AAE) and the final angular error (FAE) to evaluate the predicted visual orientation. The AAE is the angular difference between the optimal predicted visual orientation and the ground-truth visual orientation of the pedestrian of interest i, averaged over the N predicted timesteps, and the FAE is the same difference at the final predicted timestep.
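A minimal sketch of the angular metrics is given below, assuming orientations expressed in degrees and a wrap-around difference that never exceeds 180°; both choices are assumptions for the example.

```python
from typing import Sequence, Tuple


def angular_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def aae_fae(predicted_deg: Sequence[float],
            ground_truth_deg: Sequence[float]) -> Tuple[float, float]:
    """Average and final angular errors over the predicted timesteps."""
    errors = [angular_difference(p, g)
              for p, g in zip(predicted_deg, ground_truth_deg)]
    return sum(errors) / len(errors), errors[-1]
```

The wrap-around matters near the 0°/360° boundary: a prediction of 350° against a ground truth of 10° is an error of 20°, not 340°.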
4.1.3. Training details
In our experiments, we observed the trajectories and visual orientations of nine timesteps (3 s) and predicted the next N = 9 timesteps (3 s). The pedestrians’ visual orientation has the same form as the pedestrians’ relative location. Thus, a sequence-to-sequence model can be used to predict both the pedestrians’ locations and their visual orientations. For visual orientation training and prediction, we replaced the input of state-of-the-art trajectory-prediction methods with the visual orientation sequence. All experiments were performed on the same personal computer (PC) with an NVIDIA RTX 3090 graphics processing unit (GPU).
For training the proxemics field decoder, the variety loss was used, which is computed from the best of the k predicted locations of the pedestrian of interest i, that is, the prediction closest to the ground truth.
We also applied a loss to measure the difference between the prediction and the ground truth of the attention field.
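The variety loss can be sketched as below. Summing per-timestep Euclidean distances as the trajectory-level error is an assumption of this sketch; the essential property is that only the best of the k candidates contributes to the loss.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def variety_loss(candidates: Sequence[Sequence[Point]],
                 ground_truth: Sequence[Point]) -> float:
    """Error of the candidate trajectory closest to the ground truth."""
    def trajectory_error(trajectory: Sequence[Point]) -> float:
        # Sum of per-timestep Euclidean distances (assumed error measure).
        return sum(math.dist(p, g) for p, g in zip(trajectory, ground_truth))

    return min(trajectory_error(c) for c in candidates)
```

Penalizing only the best candidate encourages the decoder to produce diverse predictions instead of collapsing all k samples onto an average trajectory.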
4.2. Experimental discussion
4.2.1. Predicting the proxemics field
As the proxemics field represents the future location distribution of the pedestrian of interest, we evaluated our GIFNet using the accuracy of the predicted locations on the dataset. Recent studies on crowd forecasting have indicated that the short-term motion of pedestrians is highly predictable [39], [40]. Here, we adopt a similar setting with a timespan T = 3 s and temporal resolution R = 1/3 s. As shown in Fig. 3(a), the ADE and FDE (i.e., the displacement error at the endpoint, shown as stars in Fig. 3(a)) of the predicted locations are used as the evaluation metrics. For each timestep, the predicted location with the highest probability is used to calculate the ADE and FDE. As illustrated in Table 1 [19], [21], [34], [35], [41], [42], [43], [44], [45], [46], [47], [48], [49], GIFNet outperforms the state-of-the-art learning-based trajectory-prediction methods (SoPhie [21], spatial-temporal graph attention network (STGAT) [34], social generative adversarial networks (SGAN) [35], social spatial-temporal graph convolutional neural network (social-STGCNN) [19], sparse graph convolution network (SGCN) [41], etc.). Among these methods, only our GIFNet encodes all four kinds of features: trajectory, visual orientation, neighbor trajectory, and group interaction state. SoPhie, STGAT, SGAN, social-STGCNN, and SGCN encode only the trajectory and integrate the information of all the surrounding neighbors with a relative-distance-dependent method. The baseline method “Linear” is a linear regressor that takes only the past trajectory as input. A more detailed ablative analysis is provided in Section 4.2.3.
For a more in-depth analysis of the neighbor and group interaction information, we divide the pedestrians into several categories (i.e., solitary pedestrians, members of an acquaintance group, members of a family group, group members without interaction, group members with non-physical interaction, and group members with physical interaction) and plot the statistical analysis results in Figs. 3(b)-(f). We use a nonparametric one-sided Mann-Whitney U test to assess the statistical significance of the difference between the two groups of data. Figs. 3(b) and (c) illustrate the distribution of the ground truth versus the estimated forward (i.e., movement direction of the current timestep) and lateral (i.e., orthogonal to the forward direction) speeds. The prediction of GIFNet (red) shows a high consistency with the ground truth (black). The solitary pedestrians move faster than the grouped pedestrians in both directions (p < 0.001, N = 12 245), and the pedestrians in the acquaintance group move faster than those in the family group in both directions (p < 0.001, N = 12 245). However, grouped pedestrians with and without interactions show no significant difference, suggesting that group interactions have little effect on pedestrians’ walking speed. In addition, the proxemics fields of solitary pedestrians are more dispersed than those of grouped pedestrians; that is, the walking direction of solitary pedestrians has higher uncertainty. These results indicate that being in a group directly affects a pedestrian’s speed and walking direction.
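For readers unfamiliar with the test, a one-sided Mann-Whitney U comparison can be sketched in plain Python. This version uses the normal approximation with a continuity correction and no tie correction in the variance, which are simplifications chosen for the example; it is not the analysis code used for the reported p-values.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_greater(a: Sequence[float],
                         b: Sequence[float]) -> Tuple[float, float]:
    """U statistic of `a` and one-sided p-value for 'a tends to exceed b'."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean - 0.5) / sd
    # Upper-tail probability of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return u, p
```

Applied to, say, the forward speeds of solitary versus grouped pedestrians, a small p-value would support the claim that the first sample tends to be larger.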
The spatial organization of a walking pedestrian group can be measured by the angle (i.e., the inner angle between the neighbor and the forward orientation, in Fig. 3(d)) and the distance between the pedestrians in the group [26]. Figs. 3(e) and (f) illustrate the influence of interactions on the pair-wise distance and angle of pedestrian pairs within groups. The distance between pedestrian pairs increases significantly from physical interaction to non-physical interaction to no interaction (both p < 0.001, N = 3668). The angles of pedestrian pairs with physical and non-physical interactions are clustered at about 90°, while the angles of pedestrian pairs without interactions are smaller (p < 0.001, N = 3668) and dispersed with higher uncertainty. The distributions of pair-wise distance and angle are presented in Fig. 3(g). A total of 559 pairs of pedestrians with no interaction, 137 with non-physical interaction, and 199 with physical interaction are plotted on three 2D histograms: The pedestrian of interest is located in the center, with a 90° forward direction, and the neighbors are plotted based on distance and inner angle. Pedestrian pairs with interaction are more concentrated than those without interaction and tend to walk in parallel (i.e., angles clustered at 90°).
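The two pair-wise measures can be computed as in the following sketch, where the forward direction is supplied as a 2D vector; the function name and argument layout are hypothetical.

```python
import math
from typing import Tuple

Vec = Tuple[float, float]


def pair_geometry(ego: Vec, forward: Vec, neighbor: Vec) -> Tuple[float, float]:
    """Distance to a neighbor and inner angle (degrees) from the forward direction."""
    dx, dy = neighbor[0] - ego[0], neighbor[1] - ego[1]
    distance = math.hypot(dx, dy)
    norm = math.hypot(forward[0], forward[1]) * distance
    if norm == 0:
        raise ValueError("forward direction and neighbor offset must be non-zero")
    cosine = (forward[0] * dx + forward[1] * dy) / norm
    # Clamp to guard against rounding slightly outside [-1, 1].
    angle = math.degrees(math.acos(max(-1.0, min(1.0, cosine))))
    return distance, angle
```

A neighbor walking directly alongside the pedestrian of interest gives an inner angle of 90°, which is the parallel-walking pattern described above.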
Fig. 3(h) illustrates the changes in the proxemics field and the pair-wise distance at the initiating time, during, and at the ending time of physical and non-physical interaction. We randomly sample 400 pairs for each state to plot the time-distance curve (bottom part of Fig. 3(h)), and the proxemics field of a representative pair is plotted in the top part of Fig. 3(h) for each stage. Pedestrian pairs with both physical and non-physical interaction show similar changes: When the pedestrians initiate interaction, they move close to each other; during the interaction, the distance between them remains stable; and, when ceasing interaction, they tend to separate. Compared with non-physical interaction, pedestrian pairs with physical interaction have smaller pair-wise distances. The high correlation between the predicted curves and the ground truth shows that GIFNet can effectively capture the changes in the group interaction state and predict accurate future locations under all states.
4.2.2. Predicting the attention field
As depicted in Fig. 1(c), the attention field is an angular range denoting the visual attention of the pedestrian of interest. Here, we fix the angular range at 30°, corresponding to the aperture of the cone of visual attention [41], and predict its central orientation. The ground-truth attention fields are calculated from the annotated visual orientations in the dataset. Similarly, we set the timespan T = 3 s and the temporal resolution R = 1/3 s, and evaluate GIFNet using the AAE and FAE. Since no dedicated visual orientation prediction method exists, we modify state-of-the-art trajectory-prediction methods for visual orientation prediction, denoted as SoPhie, STGAT, SGAN, social-STGCNN, SGCN, and so forth. “Linear” denotes the linear regression method. Table 2 [19], [21], [34], [35], [41], [42], [43], [44], [45], [46], [47], [48], [49] shows that our GIFNet achieves the best AAE and FAE among all the methods. As in the proxemics field prediction, the group neighbor and interaction information have a notable impact on the attention field prediction.
For a more in-depth analysis of the influence of such information on pedestrian anticipation, we evaluated the forward-attention angle, the cross-attention angle, and the neighbor-attention angle (all illustrated in Fig. 4(a)). The forward-attention angle measures the consistency of the pedestrians’ attention orientation and forward direction, the cross-attention angle measures the consistency of the attention orientation of pedestrian pairs, and the neighbor-attention angle reflects whether pedestrians’ attention is attracted by their neighbors. We used a nonparametric one-sided Mann-Whitney U test to assess the statistical significance of the difference between the two groups of data. As illustrated in Figs. 4(b)-(f), all three angles predicted by our GIFNet (red) show good consistency with the ground truth (black).
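One way to compute the three angles from 2D direction vectors is sketched below. Treating the neighbor-attention angle as the angle between the attention orientation and the direction toward the neighbor is an interpretation of the description above, and all names are illustrative.

```python
import math
from typing import Dict, Tuple

Vec = Tuple[float, float]


def angle_between(u: Vec, v: Vec) -> float:
    """Inner angle between two non-zero 2D vectors, in degrees."""
    norm = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    if norm == 0:
        raise ValueError("vectors must be non-zero")
    cosine = (u[0] * v[0] + u[1] * v[1]) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def attention_angles(forward: Vec, attention: Vec,
                     partner_attention: Vec, to_neighbor: Vec) -> Dict[str, float]:
    """Forward-, cross-, and neighbor-attention angles for one pedestrian."""
    return {
        "forward": angle_between(attention, forward),
        "cross": angle_between(attention, partner_attention),
        "neighbor": angle_between(attention, to_neighbor),
    }
```

For two pedestrians walking side by side and both looking straight ahead, this gives forward and cross angles of 0° and a neighbor angle of 90°.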
As shown in Figs. 4(b), (d), and (f), the forward-attention angle increases significantly from pedestrians with physical interaction to those without interaction to those with non-physical interaction (both p < 0.001, N = 893), the cross-attention angle of pedestrian pairs with non-physical interaction is significantly smaller than that of pedestrian pairs without interaction (p < 0.001, N = 2986), and the neighbor-attention angle decreases significantly from pedestrian pairs with physical interaction to those with no interaction to those with non-physical interaction (both p < 0.001, N = 2986). These results indicate that grouped pedestrians in physical interaction tend to focus on the forward direction (forward-attention angles close to 0°, cross-attention angles close to 0°, and neighbor-attention angles close to 90°), while pedestrians with non-physical interaction are more likely to look at each other. This may be because non-physical interactions mainly include verbal communication and eye-to-eye behaviors that require visual attention, while pedestrians in physical interaction can be more focused on walking because they can effectively perceive the location of the partner through touch instead of sight. As shown in Figs. 4(c) and (e), the forward-attention angle increases significantly from pedestrians in a family group to solitary pedestrians to those in an acquaintance group (both p < 0.001, N = 4641), and the cross-attention angle of pedestrian pairs in an acquaintance group is significantly smaller than that of pedestrian pairs in the family group (p < 0.001, N = 12 946). This may be because family members are more likely to interact with each other physically, while acquaintances are almost equally likely to interact with each other physically and non-physically. Hence, group interactions have a significant and diverse effect on pedestrians’ visual attention.
We further demonstrate the changes in the attention field at the initiating time, during, and at the ending time of physical and non-physical interactions. Similar to Fig. 3(h), the top row of Fig. 4(g) shows the representative predicted attention fields for pedestrian pairs, and the bottom row shows the changes in the neighbor-attention angle. When the pairs start to interact, the pedestrians tend to look at each other; during the interaction, both pedestrians look forward, while sometimes looking at each other (more often during non-physical interactions); when interaction ceases, the pedestrians look at each other again, and then turn to forward orientations. Similar to Fig. 3(h), the curve in Fig. 4(g) shows the high correlation between the predicted results and the ground truth, demonstrating GIFNet’s ability to capture the influence of group interactions on the attention field.
4.2.3. Ablation study
We conducted a careful ablation study to demonstrate the capacity of GIFNet. As shown in Fig. 5, for the proxemics field prediction, the visual orientation information input and the group interaction information input improve the performance at every timestep. For the attention field prediction, the trajectory information input and the group interaction information input improve the performance in the first six timesteps.
From a technical perspective, the effective encoding of the group and group interactions contributes to the superior accuracy of our method. Although pooling-like operations [50] and GNNs [22], [34], [51], [52], [53] have been used in existing machine learning methods to encode influences among pedestrians, only the relative spatial distance between pedestrians is used in such studies, while the group and group interactions are not well considered. In addition to encoding physical features such as spatial distance, GIFNet introduces a group interaction graph with a graph attention module to propagate the group neighbor information. In this way, GIFNet explicitly reasons about the importance of each group neighbor from the relative displacement and dynamic interaction states to enable better quantification of the influence of group behavior. Anthropologists recognize that visual orientation is strongly related to pedestrians’ walking paths [54], and visual perception is conducive to forming group cohesion [55], [56]. However, most of the existing methods analyze the pedestrian trajectory and visual orientation separately. In contrast, GIFNet simultaneously encodes both types of information, which mutually improves the prediction accuracy of the proxemics field and the attention field.
4.2.4. GIFNet with group detection algorithm
With the development of computer-vision-based pedestrian motion perception technology (e.g., pedestrian detection and tracking), pedestrians can be accurately positioned in video and used as input for trajectory prediction. We consider that it is also possible to realize group perception/inference; in fact, there are numerous studies on this topic, including Refs. [57], [58]. Instead of taking group annotations as inputs, we conducted an additional series of experiments that used the result of the group perception algorithm as input, in order to test the usability of GIFNet. We tested two methods to detect pedestrian groups: self-supervised human group detection (SHGD) [61] and correlation clustering [62]. As shown in Table 3 [57], [58], compared with the baseline method that does not use group information, the application of both algorithms effectively improves GIFNet’s prediction accuracy. In particular, when using the SHGD algorithm results as input, the performance is very close to that of using group annotations as input. Based on these results, we believe that many computer vision algorithms for detecting group states, tracking pedestrians, and recognizing visual orientation can be easily combined with our algorithms, which implies the usability of our algorithms in the real world. In Sections 4.2.1 and 4.2.2, for the sake of a rigorous evaluation, we use annotations from datasets (which can be seen as “ideal” perception data) as input to our algorithms. In this way, noise from other models and several unknown uncertainties can be eliminated, allowing a more accurate assessment of our algorithm’s true performance.
4.2.5. GIFNet for understanding pedestrian anticipation in a small-scale scene
Although GIFNet is designed for understanding pedestrian anticipation in large-scale scenes (in the PANDA dataset), we also evaluated how GIFNet performs in small-scale scenes (in the ETH + UCY datasets). Since the ETH + UCY datasets contain only pedestrian trajectory information, we remove the pedestrian face orientation encoder in GIFNet and use the group states detected by the SHGD algorithm as input. Together, the ETH and UCY datasets comprise five different scenes: ETH and Hotel (from ETH), and Univ, Zara1, and Zara2 (from UCY). Table 4 compares the ADE and FDE of GIFNet with those of other state-of-the-art methods on the ETH + UCY datasets. The performance of GIFNet is comparable with those of SGCN and Social-Implicit in terms of the average error. Due to the lack of facial orientation information and the use of algorithmically inferred groups, GIFNet cannot take full advantage of its structural design in this setting. Nevertheless, these results demonstrate that GIFNet is a competitive pedestrian trajectory-prediction algorithm that is applicable to different datasets.
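For readers less familiar with the two metrics, the sketch below computes them for a single predicted trajectory under their standard definitions: the average displacement error (ADE) is the mean Euclidean distance between predicted and ground-truth positions over all predicted time steps, and the final displacement error (FDE) is that distance at the last time step. The function name `ade_fde` and the toy coordinates are illustrative assumptions, not part of the evaluation code used in the paper.

```python
import math


def ade_fde(pred, truth):
    """Return (ADE, FDE) for one trajectory.

    pred, truth: equal-length sequences of (x, y) positions, one per
    predicted time step.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and equally long")
    # Per-step Euclidean error between prediction and ground truth.
    errors = [math.dist(p, t) for p, t in zip(pred, truth)]
    return sum(errors) / len(errors), errors[-1]


pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, truth)  # per-step errors are 0, 1, 2
```

Here the per-step errors are 0, 1, and 2, so the ADE is 1.0 and the FDE is 2.0. Benchmark results are typically averaged over all pedestrians and test sequences.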
4.2.6. The GIF for crowd-aware robot navigation
With the rapid development of unmanned systems (e.g., autonomous driving and service robots), such systems’ environments are envisioned to expand from isolated areas to social spaces shared with humans. People expect unmanned systems to not only have powerful functions but also provide smart interactions with comfort, naturalness, and sociability [59]. Our proposed group-aware understanding of pedestrian anticipation may enable unmanned systems to work in a human-like manner and comply with social norms, as shown in Fig. 6. As a validation, we propose a new robot navigation method based on the GIF (Fig. 6(a)). Existing robot navigation approaches usually regard pedestrians as simple circular obstacles and avoid pedestrians based on their current locations [60], which makes it difficult to strike a balance between not disturbing pedestrians and maintaining navigation efficiency. In contrast, by imparting the robot with the human-like capability of anticipation, our method can adaptively plan the robot’s path according to the pedestrians’ proxemics field and attention field, which reflect the pedestrians’ behavioral intention in a fine-grained way. The robot is thus effectively prevented from disturbing pedestrians while maintaining its driving efficiency (Figs. 6(e)-(g)). We believe that the GIF can promote a harmonious human-machine relationship in broader applications.
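One simple way to picture field-aware planning is a shortest-path search over a grid whose cell costs grow with the anticipated presence of pedestrians. The sketch below is an assumption-laden illustration and not the navigation method of Fig. 6: it uses Dijkstra's algorithm on a 4-connected grid, treats `field` as a hypothetical discretized probability of future pedestrian occupancy, and uses an arbitrary `penalty` weight to trade path length against disturbance.

```python
import heapq


def plan_path(field, start, goal, penalty=10.0):
    """Dijkstra on a 4-connected grid.

    field[r][c]: hypothetical anticipated-occupancy probability in [0, 1].
    Entering a cell costs 1 + penalty * field[r][c], so the planner
    prefers short paths that stay out of high-probability regions.
    """
    rows, cols = len(field), len(field[0])
    best = {start: 0.0}
    parent = {}
    heap = [(0.0, start)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                new_cost = cost + 1.0 + penalty * field[nr][nc]
                if new_cost < best.get((nr, nc), float("inf")):
                    best[(nr, nc)] = new_cost
                    parent[(nr, nc)] = cell
                    heapq.heappush(heap, (new_cost, (nr, nc)))
    if goal not in best:
        return None
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]


# Toy field: a pedestrian is expected to occupy the centre cell.
field = [
    [0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0],
    [0.0, 0.0, 0.0],
]
path = plan_path(field, start=(1, 0), goal=(1, 2))
```

In this toy case the straight route through the centre cell costs 11, whereas a detour around it costs 4, so the planned path avoids the cell the pedestrian is anticipated to occupy. A planner that only looked at current pedestrian positions would have no such information about future occupancy.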
5. Conclusions
Understanding pedestrian anticipation is a long-standing problem with significant application value. In this paper, we studied how different group relationships influence pedestrian anticipation. More specifically, we proposed the GIF, a novel group-aware representation of pedestrian anticipation, which can quantitatively explain how people’s anticipation of others’ speed, others’ attention, and the spatial organization of groups is dynamically affected by group interactions. Furthermore, we tailored GIFNet to estimate the GIF based on the explicit observations of pedestrians. By encoding multidimensional data, including pedestrian trajectory, visual attention, and state of group interaction, GIFNet succeeds in representing changes in pedestrian anticipation under the prominent impact of group behaviors and in accurately predicting the future states of pedestrians.
In practice, the GIF will contribute to a group-aware understanding of pedestrian anticipation and pedestrian group behavior. The GIF can enable unmanned systems to accurately anticipate pedestrians’ actions and to interact with them safely and comfortably, thereby promoting a harmonious human-machine relationship. We believe that the GIF has enormous potential for application in interdisciplinary areas, such as intelligent unmanned systems, the social-aware modeling of pedestrian dynamics, and emergency evacuation.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC; 62125106, 61860206003, and 62088102), in part by the Ministry of Science and Technology of China (2021ZD0109901), and in part by the Provincial Key Research and Development Program of Zhejiang (2021C01016).
Compliance with ethics guidelines
Xueyang Wang, Xuecheng Chen, Puhua Jiang, Haozhe Lin, Xiaoyun Yuan, Mengqi Ji, Yuchen Guo, Ruqi Huang, and Lu Fang declare that they have no conflict of interest or financial conflicts to disclose.
References
[1]
A.Rasouli, I.Kotseruba, J.K.Tsotsos. Pedestrian action anticipation using contextual feature fusion in stacked RNNs. Proceedings of the 30th British Machine Vision Conference (BMVC 2019); 2019 Sep 9-12; Cardiff, UK, BMVA Press, London (2019).
[2]
Y.Luo, P.Cai, A.Bera, D.Hsu, W.S.Lee, D.Manocha. PORCA: modeling and planning for autonomous driving among many pedestrians. IEEE Robot Autom Lett, 3 (4) (2018), pp. 3418-3425.
[3]
P.Trautman, J.Ma, R.M.Murray, A.Krause. Robot navigation in dense human crowds: the case for cooperation. Proceedings of the IEEE International Conference on Robotics and Automation; 2013 May 6-10; Karlsruhe, Germany, IEEE, New York City (2013), pp. 2153-2160.
[4]
X.Yao, J.Zhang, J.Oh. Following social groups: socially compliant autonomous navigation in dense crowds. 2019. arXiv:1911.12063.
[5]
J.Zhou, Z.K.Shi. A new lattice hydrodynamic model for bidirectional pedestrian flow with the consideration of pedestrian’s anticipation effect. Nonlinear Dyn, 81 (3) (2015), pp. 1247-1262.
[6]
S.Hoogendoorn, P.H.L.Bovy. Simulation of pedestrian flows by optimal control and differential games. Optim Control Appl Methods, 24 (2003), pp. 153-172.
[7]
X.Zheng, Y.Cheng. Conflict game in evacuation process: a study combining Cellular Automata model. Physica A Stat Mech Appl, 390 (2011), pp. 1042-1050.
[8]
S.Bouzat, M.N.Kuperman. Game theory in models of pedestrian room evacuation. Phys Rev E Stat Nonlinear Soft Matter Phys, 89 (2014), 032806.
[9]
Q.Xu, M.Chraibi, A.Seyfried. Anticipation in a velocity-based model for pedestrian dynamics. Transp Res Part C Emerg Technol, 133 (2021), 103464.
[10]
Y.Suma, D.Yanagisawa, K.Nishinari. Anticipation effect in pedestrian dynamics: modeling and experiments. Physica A Stat Mech Appl, 391 (2012), pp. 248-263.
[11]
S.Nowak, A.Schadschneider. Quantitative analysis of pedestrian counterflow in a cellular automaton model. Phys Rev E Stat Nonlin Soft Matter Phys, 85 (6) (2012), 066128.
[12]
R.Bailo, J.A.Carrillo, P.Degond. Pedestrian models based on rational behaviour. L. Gibelli, N. Bellomo (Eds.), Crowd dynamics. Volume 1—modeling and simulation in science, engineering and technology, Springer, Berlin (2018).
[13]
H.Murakami, C.Feliciani, Y.Nishiyama, K.Nishinari. Mutual anticipation can contribute to self-organization in human crowds. Sci Adv, 7 (12) (2021), eabe7758.
[14]
H.Murakami, C.Feliciani, K.Nishinari. Lévy walk process in self-organization of pedestrian crowds. J R Soc Interface, 16 (153) (2019), 20180939.
[15]
R.M.Roe, J.R.Busemeyer, J.T.Townsend. Multialternative decision field theory: a dynamic connectionist model of decision making. Psychol Rev, 108 (2) (2001), pp. 370-392.
[16]
I.Karamouzas, B.Skinner, S.J.Guy. Universal power law governing pedestrian interactions. Phys Rev Lett, 113 (23) (2014), p. 238701.
[17]
F.Zanlungo, T.Ikeda, T.Kanda. Social force model with explicit collision prediction. EPL, 93 (6) (2011), p. 68005.
[18]
V.Kosaraju, A.Sadeghian, R.Martín-Martín, I.Reid, H.Rezatofighi, S.Savarese. Social-BiGAT: multimodal trajectory forecasting using bicycle-GAN and graph attention networks. Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 8-14; Vancouver, BC, Canada (2019).
[19]
A.Mohamed, K.Qian, M.Elhoseiny, C.Claudel. Social-STGCNN: a social spatio-temporal graph convolutional neural network for human trajectory prediction. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020 Jun 14-19; online, IEEE, New York City (2020), pp. 14424-14432.
[20]
A.Rudenko, L.Palmieri, A.J.Lilienthal, K.O.Arras. Human motion prediction under social grouping constraints. Proceedings of the IEEE International Workshop on Intelligent Robots and Systems (IROS 2018); 2018 Oct 1-5; Madrid, Spain, IEEE, New York City (2018), pp. 3358-3364.
[21]
A.Sadeghian, V.Kosaraju, A.Sadeghian, N.Hirose, S.H.Rezatofighi, S.Savarese. SoPhie: an attentive GAN for predicting paths compliant to social and physical constraints. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019 Jun 16-20; Long Beach, CA, USA, IEEE, New York City (2019), pp. 1349-1358.
[22]
J.Sun, Q.Jiang, C.Lu. Recursive social behavior graph for trajectory prediction. Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019 Jun 16-20; Long Beach, CA, USA, IEEE, New York City (2019), pp. 660-669.
[23]
K.Mangalam, H.Girase, S.Agarwal, K.H.Lee, E.Adeli, J.Malik, et al. It is not the journey but the destination: endpoint conditioned trajectory prediction. Proceedings of the 2020 European Conference on Computer Vision; 2020 Aug 23-28; Glasgow, UK, Springer, Berlin (2020), pp. 759-776.
[24]
T.Salzmann, B.Ivanovic, P.Chakravarty, M.Pavone. Trajectron++: dynamically-feasible trajectory forecasting with heterogeneous data. Proceedings of the 2020 European Conference on Computer Vision; 2020 Aug 23-28; Glasgow, UK, Springer, Berlin (2020), pp. 683-700.
[25]
C.Zhou, M.Han, Q.Liang, Y.F.Hu, S.G.Kuai. A social interaction field model accurately identifies static and dynamic social groupings. Nat Hum Behav, 3 (8) (2019), pp. 847-855.
[26]
M.Moussaïd, N.Perozo, S.Garnier, D.Helbing, G.Theraulaz. The walking behaviour of pedestrian social groups and its impact on crowd dynamics. PLoS One, 5 (4) (2010), e10047.
[27]
Y.Liu, Q.Yan, A.Alahi. Social NCE: contrastive learning of socially-aware motion representations. Proceedings of the 2020 IEEE/CVF International Conference on Computer Vision; 2020 Jun 13-19; Seattle, WA, USA, IEEE, New York City (2020), pp. 15118-15129.
[28]
H.De Jaegher, E.Di Paolo, S.Gallagher. Can social interaction constitute social cognition? Trends Cogn Sci, 14 (10) (2010), pp. 441-447.
[29]
L.Cheng, R.Yarlagadda, C.B.Fookes, P.K.Yarlagadda. A review of pedestrian group dynamics and methodologies in modelling pedestrian group behaviours. World J Mech Eng, 1 (2014), pp. 1-13.
[30]
Z.Yücel, F.Zanlungo, M.Shiomi. Modeling the impact of interaction on pedestrian group motion. Adv Robot, 32 (3) (2018), pp. 137-147.
[31]
R.Zhou, H.Zhou, H.Gao, M.Tomizuka, J.Li, Z.Xu. Grouptron: dynamic multi-scale graph convolutional networks for group-aware dense crowd trajectory forecasting. Proceedings of the 2022 International Conference on Robotics and Automation (ICRA 2022); 2022 May 23-27; Philadelphia, PA, USA, IEEE, New York City (2022), pp. 805-811.
[32]
S.Casas, C.Gulino, R.Liao, R.Urtasun. SpAGNN: spatially-aware graph neural networks for relational behavior forecasting from sensor data. Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA 2020); 2020 May 31-Aug 31; online, IEEE, New York City (2020), pp. 9491-9497.
[33]
H.Girase, H.Gang, S.Malla, J.Li, A.Kanehara, K.Mangalam, et al. LOKI: long term and key intentions for trajectory prediction. Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision; 2021 Oct 11-17; Montreal, QC, Canada, IEEE, New York City (2021), pp. 9803-9812.
[34]
Y.Huang, H.Bi, Z.Li, T.Mao, Z.Wang. STGAT: modeling spatial-temporal interactions for human trajectory prediction. Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision; 2019 Oct 27-Nov 2; Seoul, Republic of Korea, IEEE, New York City (2019), pp. 6272-6281.
[35]
A.Gupta, J.Johnson, F.F.Li, S.Savarese, A.Alahi. Social GAN: socially acceptable trajectories with generative adversarial networks. Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2018 Jun 18-23; Salt Lake City, UT, USA, IEEE, New York City (2018), pp. 2255-2264.
[36]
B.Zhang, W.Chen, X.Ma, P.Qiu, F.Liu. Experimental study on pedestrian behavior in a mixed crowd of individuals and groups. Physica A Stat Mech Appl, 556 (2020), 124814.
[37]
A.C.Gallup, J.J.Hale, D.J.Sumpter, S.Garnier, A.Kacelnik, J.R.Krebs, et al. Visual attention and the acquisition of information in human crowds. Proc Natl Acad Sci USA, 109 (19) (2012), pp. 7245-7250.
[38]
X.Wang, X.Zhang, Y.Zhu, Y.Guo, X.Yuan, L.Xiang, et al. PANDA: a gigapixel-level human-centric video dataset. Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2020 Jun 14-19; online, IEEE, New York City (2020), pp. 3268-3278.
[39]
P.Raksincharoensak, T.Hasegawa, M.Nagai. Motion planning and control of autonomous driving intelligence system based on risk potential optimization framework. Int J Automot Eng, 7 (AVEC14) (2016), pp. 53-60.
[40]
A.Alahi, V.Ramanathan, F.F.Li. Socially-aware large-scale crowd forecasting. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23-28; Columbus, OH, USA, IEEE, New York City (2014), pp. 2211-2218.
[41]
L.Shi, L.Wang, C.Long, S.Zhou, M.Zhou, Z.Niu, et al. SGCN: sparse graph convolution network for pedestrian trajectory prediction. Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2021 Jun 19-25; online, IEEE, New York City (2021), pp. 8994-9003.
[42]
A.A.A.Osman, T.Bolkart, M.J.Black. STAR: sparse trained articulated human body regressor. Proceedings of the Computer Vision-ECCV 2020: 16th European Conference; 2020 Aug 23-28; Glasgow, UK, Springer International Publishing, Berlin (2020), pp. 598-613.
[43]
Y.Yuan, X.Weng, Y.Ou, K.Kitani. AgentFormer: agent-aware transformers for socio-temporal multi-agent forecasting. Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021 Oct 10-17; Montreal, QC, Canada, IEEE, New York City (2021), pp. 9813-9823.
[44]
A.Mohamed, D.Zhu, W.Vu, M.Elhoseiny, C.Claudel. Social-Implicit: rethinking trajectory prediction evaluation and the effectiveness of implicit maximum likelihood estimation. Proceedings of the Computer Vision-ECCV 2022: 17th European Conference; 2022 Oct 23-27; Tel Aviv, Israel, Springer, Berlin (2022), pp. 463-479.
[45]
I.Bae, J.H.Park, H.G.Jeon. Learning pedestrian group representations for multi-modal trajectory prediction. Proceedings of the Computer Vision-ECCV 2022: 17th European Conference; 2022 Oct 23-27; Tel Aviv, Israel, Springer, Berlin (2022).
[46]
P.Xu, J.B.Hayet, I.Karamouzas. SocialVAE: human trajectory prediction using timewise latents. Proceedings of the Computer Vision-ECCV 2022: 17th European Conference; 2022 Oct 23-27; Tel Aviv, Israel, Springer, Berlin (2022), pp. 511-528.
[47]
T.Gu, G.Y.Chen, J.Li, C.Lin, Y.Rao, J.Zhou, et al. Stochastic trajectory prediction via motion indeterminacy diffusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022 Jun 21-24; New Orleans, LA, USA, IEEE, New York City (2022).
[48]
I.Bae, J.H.Park, H.G.Jeon. Non-probability sampling network for stochastic human trajectory prediction. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022 Jun 21-24; New Orleans, LA, USA, IEEE, New York City (2022).
[49]
Y.Chen, B.Ivanovic, M.Pavone. ScePT: scene-consistent, policy-based trajectory predictions for planning. Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2022 Jun 21-24; New Orleans, LA, USA, IEEE, New York City (2022).
[50]
P.Kothari, S.Kreiss, A.Alahi. Human trajectory forecasting in crowds: a deep learning perspective. IEEE Trans Intell Transp Syst, 23 (7) (2021), pp. 7386-7400.
[51]
C.Yu, X.Ma, J.Ren, H.Zhao, S.Yi. Spatio-temporal graph transformer networks for pedestrian trajectory prediction. Proceedings of the 2020 European Conference on Computer Vision; 2020 Aug 23-28; online, Springer, Berlin (2020), pp. 507-523.
[52]
J.Qiu, J.Tang, H.Ma, Y.Dong, K.Wang, J.Tang. DeepInf: social influence prediction with deep learning. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; 2018 Aug 19-23; London, UK, Association for Computing Machinery (ACM), New York City (2018), pp. 2110-2119.
[53]
C.Liu, Y.Chen, M.Liu, B.E.Shi. AVGCN: trajectory prediction using graph convolutional networks guided by human attention. Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA); 2021 May 30-Jun 5; Xi’an, China, IEEE, New York City (2021), pp. 14234-14240.
[54]
I.Hasan, F.Setti, T.Tsesmelis, A.Del Bue, M.Cristani, F.Galasso. “Seeing is believing”: pedestrian trajectory forecasting using visual frustum of attention. Proceedings of the 2018 IEEE Workshop on Applications of Computer Vision (WACV 2018); 2018 Mar 12-15; Lake Tahoe, NV, USA, IEEE, New York City (2018), pp. 1178-1185.
[55]
R.Bastien, P.Romanczuk. A model of collective behavior based purely on vision. Sci Adv, 6 (6) (2020), eaay0792.
[56]
F.A.Lavergne, H.Wendehenne, T.Bäuerle, C.Bechinger. Group formation and cohesion of active particles with visual perception-dependent motility. Science, 364 (80) (2019), pp. 70-74.
[57]
J.Li, R.Han, H.Yan, Z.Qian, W.Feng, S.Wang. Self-supervised social relation representation for human group detection. Proceedings of the Computer Vision-ECCV 2022: 17th European Conference; 2022 Oct 23-27; Tel Aviv, Israel, Springer, Berlin (2022).
[58]
F.Solera, S.Calderara, R.Cucchiara. Socially constrained structural learning for groups detection in crowd. IEEE Trans Pattern Anal Mach Intell, 38 (5) (2016), pp. 995-1008.
[59]
T.Kruse, A.K.Pandey, R.Alami, A.Kirsch. Human-aware robot navigation: a survey. Robot Auton Syst, 61 (12) (2013), pp. 1726-1743.
[60]
F.Gul, W.Rahiman, S.S.Nazli Alhady, K.Chen. A comprehensive study for robot navigation techniques. Cogent Eng, 6 (1) (2019), 1632046.