Humans achieve cognitive development through continuous interaction with their environment, enhancing both perception and behavior. However, current robots lack the capacity for human-like action and evolution, which is a bottleneck to improving robotic intelligence. Existing research predominantly models robots as one-way, static mappings from observations to actions, neglecting the dynamic processes of perception and behavior. This paper introduces a novel approach to robot cognitive learning that takes physical properties into account. We propose a theoretical framework wherein a robot is conceptualized as a three-body physical system comprising a perception-body (P-body), a cognition-body (C-body), and a behavior-body (B-body). Each body engages in physical dynamics and operates within a closed-loop interaction. Three crucial interactions connect these bodies. First, the C-body relies on the P-body’s extracted states and reciprocally offers long-term rewards, optimizing the P-body’s perception policy. Second, the C-body directs the B-body’s actions through sub-goals, and subsequent P-body-derived states facilitate the C-body’s cognition dynamics learning. Finally, the B-body follows the sub-goal generated by the C-body and performs actions conditioned on the perceptive state from the P-body, which leads to the next interactive step. These interactions foster the joint evolution of each body, culminating in an optimal design. To validate our approach, we employ a navigation task using a four-legged robot, D’Kitty, equipped with a movable global camera. Navigation demands intricate coordination of sensing, planning, and D’Kitty’s motion. Our framework yields superior task performance compared with conventional methodologies. In conclusion, this paper establishes a paradigm shift in robot cognitive learning by integrating physical interactions across the P-body, C-body, and B-body while considering physical properties.
Our framework’s successful application to a navigation task underscores its efficacy in enhancing robotic intelligence.
1. Introduction
Humans interact with the world in a complex and highly intelligent manner [1]; they employ their eyes for sight and their hands/legs for action, with both working collaboratively under brain control. Navigating a cluttered room, for example, necessitates specific abilities: perception, through which the individual detects doors, walls, and obstacles; behavior, which determines the person’s navigation while avoiding collisions; and cognition, in which the brain issues commands for perception and behavior. Acting as feedback systems [2], humans refine their cognitive abilities through the reciprocal processes of perception and behavior. Trial and error enhances planning when the observed outcomes deviate from the expected results.
Mimicking such closed-loop multifunctional systems in robots is challenging. Artificial intelligence and robotics researchers often model robots as agents [3], [4]. Observational input (e.g., camera images and force sensors’ tactile feedback) is mapped to actions through policies optimized using imitation learning (IL) [5] or reinforcement learning (RL) in order to maximize feedback rewards. While these methods enable robots to perform real-world tasks, such as navigation [6], locomotion [7], and Rubik’s cube manipulation [8], they have limitations. First, current approaches yield static, one-way mappings from observation to action, ignoring the intricate interaction between robot-embodied perception, behavior, and the environment. Human systems, in contrast, exhibit complex physical integration in which perception, behavior, and cognition interact and evolve holistically. Second, existing methods lack the capability to represent human-like cognition dynamics.
In this paper, we introduce the concept of physics-informed robot cognitive learning to address these limitations. Our approach boasts several advantages over previous methods. We conceptualize the robot as a three-body physical system comprising a perception-body (P-body), a cognition-body (C-body), and a behavior-body (B-body). Each body engages in physical dynamics, with the P-body receiving environmental information and altering its configuration for observation, the C-body handling cognition and generating sub-goals based on the P-body’s states, and the B-body executing behavior directed by the C-body’s sub-goals.
Significantly, our approach introduces three critical interactions among these bodies:
• C-body to P-body: The C-body relies on the P-body’s states, reciprocating with future sub-goals to enhance the P-body’s perception policy.
• C-body to B-body: The C-body guides the B-body’s actions through sub-goals, and P-body-derived states following the B-body’s actions facilitate cognition dynamics learning in the C-body.
• B-body to P-body: The B-body interacts with the environment and changes the state, causing the P-body to acquire new information that updates its perception.
Fig. 1 illustrates the overarching framework, encompassing the P-body, C-body, and B-body interactions.
This unique interaction framework fosters joint evolution, leading to optimal design. Although it shares similarities with multi-agent RL (MARL) [9], our integrated system presents distinct features. The three bodies—perception, cognition, and behavior—possess diverse dynamics and pursue different sub-goals, whereas regular MARL agents are homogeneous, with common decision processes and rewards.
To validate our approach, we apply it to a navigation task employing a four-legged robot, D’Kitty [10], with a movable global camera. The task demands intricate coordination between the camera and D’Kitty, showcasing the effectiveness of our formulation. The experimental results, while indicative, underscore the potential applicability of our method to diverse practical tasks, including air–ground robot systems for emergency rescue.
2. Related works
2.1. Vision perception
In recent years, encouraging advances have been demonstrated in various vision applications, such as image classification [11], [12], [13], object detection [14], [15], [16], [17], [18], and instance segmentation [19], largely thanks to the rapid development of deep learning. Here, we review recent developments in object-detection tasks, in which algorithms take an image as input and deliver the category and location of each object of interest. Typically, traditional object-detection algorithms fall into two categories: two-stage object detectors and single-stage object detectors. Both types involve the steps of extracting features, generating proposals, regressing position coordinates, and predicting classes. One representative two-stage object-detection method is the faster region-based convolutional neural network (Faster R-CNN) [20], which proposes an anchor mechanism for predicting object boundaries and objectness scores at each location, replacing selective search. Another example worth mentioning is the feature pyramid network (FPN) [21], which fuses features from layers of different depths to give better object-detection results at various scales. As a single-stage object-detection approach, the “you only look once” (YOLO) object-detection system [22] drastically reduces detection time compared with two-stage detection methods, albeit at the cost of lower accuracy.
To name a few more one-stage detection methods, the single-shot detector (SSD) [23] extracts feature maps and utilizes a prior box of different scales and aspect ratios for detection, while RetinaNet [24] introduces focal loss to resolve the extremely uneven sample distribution issue and thereby improves the accuracy of the single-stage detector. Anchor-free object-detector methods [25], [26], [27] also exist, which employ features of the network structure to replace the anchor. The recent development of transformers [28], [29] has also promoted research on object detection [30], [31].
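The focal loss mentioned above can be written compactly. The following is a minimal sketch of the binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the function name and the default values of `gamma` and `alpha` are illustrative assumptions rather than settings taken from this paper.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Per-example binary focal loss.

    p: predicted probability of the positive class.
    y: ground-truth label in {0, 1}.
    gamma, alpha: illustrative defaults (assumed, not from this paper).
    """
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1.0 - 1e-12)
    y = np.asarray(y)
    # p_t is the probability assigned to the true class.
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is one way to see how the extra factor reduces the influence of the many easy examples.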
The aforementioned methods come under the passive visual perception category, into which most existing algorithms fall. Unfortunately, passive visual perception methods depend on a well-captured still image as an input, which is not easy to obtain in practice. To be specific, passive visual perception methods struggle to locate and classify objects accurately in complex and dynamic circumstances, hindered by interference such as occlusion, overexposure, or moving objects. In contrast, humans are able to adjust their position and viewpoint to obtain images of interest, an observation that has inspired the study of active perception. In our previous works, we explored a robot with a fixed vision sensor [32], [33], [34], promoting the development of active object detection. However, these methods depend on a handcrafted reward design, such that the agent trained by RL needs to maximize the return to yield better perception performance [35], [36]. In our architecture, the P-body is able to move and seek a desirable configuration for better detection, which falls into the category of active perception methods. Moreover, our P-body is both data-driven and knowledge-driven. In particular, the perception policy of the P-body is not only learned from data by RL under the guidance of the perception reward but also influenced by the sub-goal guidance generated by the C-body. Conditioned on the generated sub-goals, the active perception process of the P-body leverages the knowledge of the C-body, and this knowledge is continuously updated and optimized through interactions between the B-body and the environment until the final task goal is attained.
2.2. Reinforcement learning
RL has made impressive advances in artificial intelligence, surpassing human performance in various domains (Atari, Go, etc.). It formalizes the problem of goal-seeking intelligence as maximizing the accumulated reward for a given task and environment [37]. RL algorithms are generally divided into two categories: model-free RL and model-based RL. In model-free RL, agent-training methods can be further divided into two predominant approaches:
(1) Value-based methods. In these methods, the agent learns a state ($s$)–action ($a$) value function approximator $Q_\theta(s, a)$, with $\theta$ as the neural network parameters, and selects an action accordingly. The deep Q-network (DQN) [38] was the first deep learning model to successfully learn the value function directly from high-dimensional sensory input.
(2) Policy-based methods. These methods focus on modeling and optimizing the policy directly. For example, REINFORCE [39] updates the policy parameter via gradient ascent on an estimated return (using episode samples collected by Monte–Carlo methods).
Actor-critic methods interpolate between policy evaluation and policy optimization. For example, deep deterministic policy gradient (DDPG) agents [40] concurrently learn a deterministic policy and a Q-function and use each to improve the other, while soft actor-critic (SAC) agents [41] learn a stochastic policy by incorporating entropy regularization, among other techniques. Unlike previous model-free RL methods, our B-body in the Bcent framework can be guided by the planning prompts from the C-body and thus achieve better exploration performance when interacting with the environment.
Model-based RL methods usually achieve better data efficiency but have a weaker asymptotic performance than their model-free counterparts. Model-based algorithms are typically grouped into three categories according to the various usages of the learned dynamical models [42]:
(1) Dyna-style algorithms. In Dyna algorithms, the learned dynamics are used to generate imagined data, and the training iterates between policy optimization with imagined data and model correction with real samples. Model-based policy optimization (MBPO) [43] utilizes an ensemble of neural networks to model the dynamics and adopts SAC [41] as the policy optimization algorithm.
(2) Policy search with back-propagation through time. These methods exploit the model derivatives and improve the policy based on its analytic gradient. The iterative linear quadratic Gaussian (iLQG) method [44] assumes the reward function to be quadratic and the dynamics to be linear and then derives a controller from these simple parametrizations using dynamic programming.
(3) Shooting algorithms. These algorithms sample candidate actions from a designed distribution, evaluate the sampled actions under a model, and choose the most promising action in order to deal with nonlinear dynamics and non-convex reward functions. In random shooting (RS) [45], the agent generates candidate action sequences from a uniform distribution, while the cross entropy method (CEM) [46] and probabilistic ensembles with trajectory sampling (PETS) [47] iteratively adjust the sampling distribution. The knowledge base in the C-body of Bcent can be regarded as an extension of these learned dynamics models. Moreover, our knowledge base can provide sub-goals to guide the P-body and B-body to improve the training efficiency.
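The difference between sampling once from a uniform distribution and iteratively refitting the sampling distribution can be sketched as follows. This is a toy illustration under stated assumptions: the `dynamics` and `reward` callables, the action bounds, and all hyperparameters are hypothetical, and the action dimension is assumed to equal the state dimension for brevity.

```python
import numpy as np

def rollout_return(dynamics, reward, s0, actions):
    """Sum of rewards obtained by applying an action sequence under the model."""
    s, total = s0, 0.0
    for a in actions:
        s = dynamics(s, a)
        total += reward(s)
    return total

def random_shooting(dynamics, reward, s0, horizon, n_samples, rng, low=-1.0, high=1.0):
    """Sample action sequences uniformly once and return the best one."""
    cands = rng.uniform(low, high, size=(n_samples, horizon, s0.shape[0]))
    returns = np.array([rollout_return(dynamics, reward, s0, c) for c in cands])
    return cands[returns.argmax()]

def cem(dynamics, reward, s0, horizon, n_samples, rng, n_iters=5, elite_frac=0.1):
    """Refit a Gaussian sampling distribution to the elite sequences each iteration."""
    dim = s0.shape[0]
    mean, std = np.zeros((horizon, dim)), np.ones((horizon, dim))
    n_elite = max(1, int(elite_frac * n_samples))
    for _ in range(n_iters):
        cands = mean + std * rng.standard_normal((n_samples, horizon, dim))
        returns = np.array([rollout_return(dynamics, reward, s0, c) for c in cands])
        elites = cands[np.argsort(returns)[-n_elite:]]
        # Small constant keeps the distribution from collapsing to zero width.
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean
```

Both planners only query the model through `dynamics`, which is why the quality of the learned model directly bounds the quality of the chosen actions.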
3. Methodology
As depicted in Fig. 1, our architecture comprises three essential physical components—P-body, C-body, and B-body—that intricately collaborate within a closed loop. In essence, the P-body acquires comprehensive information from the environment and distils it into low-dimensional states for the C-body. Subsequently, the C-body dispatches specific sub-goals to the B-body based on the received states. Eventually, the B-body enacts precise actions to follow the sub-goals designated by the C-body. This cyclical process persists as the environment evolves, iteratively advancing until the robot attains its ultimate objective.
The forthcoming sections delve into the comprehensive particulars of each body’s role and functionality.
3.1. The perception-body
The primary objective of the P-body is to provide the C-body with a low-dimensional representation after processing high-dimensional observations obtained from the environment. The physical implementation of the P-body may involve various types of sensors, such as cameras for visual input, force arrays for tactile feedback, and voice recorders for audio data collection. A fundamental characteristic of the P-body is its dynamic perception process. The term “dynamic” signifies that the P-body can adjust its configuration to enhance its observation quality if necessary. For example, if the P-body is realized as a movable camera, its observations are closely tied to its configuration, including the camera location and orientation. We refer to the policy governing configuration changes as the “perception policy.” Furthermore, as previously mentioned, the perception policy is influenced by cues (i.e., sub-goals) received from the C-body.
The computational flow within the P-body is mathematically defined as follows:

$$s_t = f(o_t; c_t) \tag{1}$$
$$a^p_t \sim \pi^p(\cdot \mid c_t, g_{t+\tau}) \tag{2}$$
$$c_{t+1} \sim \mathcal{P}^p(\cdot \mid c_t, a^p_t) \tag{3}$$

where $a^p_t$ is the action of the P-body’s policy. In the above equations, $t$ represents the time step ($t = 0, 1, \ldots, T$); $T$ is the task horizon; $o_t$ represents the high-dimensional and multi-modal observation from the environment, such as images, depths, and audio; $s_t$ is a low-dimensional abstract state vector that characterizes the perception state; $c_t$ is the perception configuration, defined as a combination of the sensors’ parameters and configurations, such as the position and the intrinsic and extrinsic parameters of the sensors, which are determined based on the policy $\pi^p$, where $g_{t+\tau}$ denotes the future sub-goal from the C-body after a delay time $\tau$; and $f$ is an embedding function that maps the high-dimensional observation vector $o_t$ to a low-dimensional and abstract state vector $s_t$, conditioned on the configuration $c_t$. Subsequently, the P-body transitions to a new configuration, $c_{t+1}$, based on the action $a^p_t$ taken from the current configuration $c_t$, as governed by the transition probability function $\mathcal{P}^p$. The underlying principle of this formulation is rooted in the concept of a Markov decision process (MDP).
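One step of this perception flow (state extraction, perception action, configuration transition) can be sketched as below. This is a toy instance under loud assumptions: the observation is taken to be the object's offset in the camera frame rather than an image, the configuration is a 2D camera position, the policy is a hand-written rule that moves toward the future sub-goal, and the transition is deterministic. All function names are hypothetical.

```python
import numpy as np

def embed(o_t, c_t):
    """Embedding: camera-frame offset plus camera position gives a world-frame state."""
    return o_t + c_t

def perception_policy(c_t, g_future, max_step=0.5):
    """Rule-based perception policy: move toward the future sub-goal, bounded per step."""
    delta = g_future - c_t
    norm = np.linalg.norm(delta)
    return delta if norm <= max_step else delta * (max_step / norm)

def transition(c_t, a_p):
    """Deterministic stand-in for the configuration transition function."""
    return c_t + a_p

def p_body_step(o_t, c_t, g_future):
    """Run state extraction, action selection, and configuration update once."""
    s_t = embed(o_t, c_t)
    a_p = perception_policy(c_t, g_future)
    c_next = transition(c_t, a_p)
    return s_t, a_p, c_next
```

A learned policy and a stochastic transition would replace the two hand-written functions without changing the shape of the loop.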
3.1.1. Model-free perception policy learning via P–C interaction
In the context of an MDP, training the policy typically involves two primary approaches: model-based methods [48] and model-free methods (e.g., proximal policy optimization (PPO) [49] and trust region policy optimization (TRPO) [50]). In the case of the P-body, we opt for the model-free strategy because of the potential complexity and cost associated with learning the transition probability of the observations $o_t$. An important question arises regarding the provision of the reward signal, $r^p_t$. While one option is to set $r^p_t$ as a measure of how accurately the perception state $s_t$ estimates the ground-truth state $s^{\mathrm{gt}}_t$, this criterion may not suffice in many scenarios.
3.1.2. P–C interaction
Given that robots typically operate with long-term goals, such as reaching a target position, it becomes essential to ensure minimal perception error throughout the entire movement trajectory, especially at the target point. Moreover, in the Bcent framework, the C-body will transmit sub-goals to the P-body in order to provide guidance and thus improve the perception performance. Compared with conventional active perception methods [50], [51], our P–C interaction offers distinct advantages:
• Efficiency: The P-body’s explicit modeling of the perception dynamics allows the C-body’s planning information to aid the P-body’s movement through the dynamic model, enhancing the interpretability and effectiveness of the perception. In contrast, active perception methods using RL often guide sensor movement solely based on maximizing the perceptive reward, resulting in high exploration costs.
• Flexibility: The generality of our framework (Eqs. (1), (2), (3)) accommodates various policy implementations, including predefined rules, control methods, and RL.
3.2. The cognition-body
The C-body plays a pivotal role in our model, akin to the human brain’s responsibility for coordinating and directing various bodily functions. More specifically, the C-body’s key tasks involve extracting and updating knowledge based on accumulated experiences gained through interactions with the environment by the other two bodies. In return, the C-body provides guidance and instruction to the other two bodies.
The computational flow within the C-body is defined as follows:

$$a^c_{kN} \sim \pi^c(\cdot \mid g_{kN}, s_{kN}; K)$$
$$g_{(k+1)N} \sim \mathcal{P}^c(\cdot \mid g_{kN}, a^c_{kN}; K)$$

In the above equations, $g_{kN}$ is the planning instruction from the C-body, aimed at guiding the exploration of the P-body and B-body; and $a^c_{kN}$ is the action of the C-body at each planning instant, which then leads to the generation of the next planning instruction. Furthermore, $N$ (a positive integer) denotes the update interval in the C-body, with $t = kN$ ($k = 0, 1, 2, \ldots$) as the possible time points; $\pi^c$ and $\mathcal{P}^c$ refer to the cognition policy and the cognition transition probability function in the C-body, respectively; and $K$ represents the knowledge base in the C-body.
It is important to note that the planning instruction $g_{kN}$ can take multiple forms to guide the exploration and execution of the P-body and B-body, representing planned sub-goals or instructions generated by the cognition dynamics. Unlike the P-body (and the upcoming B-body), which employs a unit update interval ($N = 1$), we deviate from this default setting by allowing for $N > 1$. This modification is motivated by the idea that the C-body focuses more on long-term, coarse-level goals and evolves over a larger temporal scale.
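The two time scales can be made concrete with a short scheduling sketch: the planner is queried only at multiples of the update interval, while the other two bodies act at every step using the most recent sub-goal. The callables `plan_subgoal` and `step_bodies` are hypothetical placeholders, not part of the original formulation.

```python
def run_episode(T, N, plan_subgoal, step_bodies):
    """Run T steps; refresh the sub-goal every N steps, act at every step.

    plan_subgoal(t) returns a sub-goal (hypothetical C-body call).
    step_bodies(t, subgoal) advances the P-body and B-body by one step.
    """
    subgoal = None
    log = []
    for t in range(T):
        if t % N == 0:
            # Coarse time scale: the C-body issues a new planning instruction.
            subgoal = plan_subgoal(t)
        # Fine time scale: perception and behavior act under the current sub-goal.
        step_bodies(t, subgoal)
        log.append((t, subgoal))
    return log
```

For example, with `T=6` and `N=3` the planner is called at steps 0 and 3 only, and each sub-goal is held for three consecutive steps.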
It is worth highlighting that both the cognition policy $\pi^c$ and the dynamics $\mathcal{P}^c$ are contingent on the knowledge of the C-body, denoted as $K$. While this paper currently defines $K$ as the parameters of the cognition transition probability function $\mathcal{P}^c$, it can also take the form of memory buffers, knowledge databases, knowledge graphs, or other informative constructs. The incorporation of knowledge enables the C-body to mimic the functioning of a human brain, encompassing knowledge utilization and knowledge updates.
We further specify the planning instructions in Figs. 2 and 3. In particular, on the right side of Fig. 3, a D’Kitty robot starts from the start point and aims to reach the target point, passing by an obstacle represented by an orange area. Then, in the knowledge base, we can divide this task into a sequence of three skills, “Move to obstacle,” “Move around obstacle,” and “Move to target,” each equipped with specified action primitives and parameter spaces.
More specifically, in the first skill, “Move to obstacle,” the action primitives include detecting the distance and shape of the obstacle, approaching turning point 1, and then arriving in front of the obstacle. It should be noted that the obstacle is a short wall or a pit that the robot cannot pass through. Thus, the robot must move around the obstacle to reach the target. When reaching the first turning point, the D’Kitty robot initiates the second skill, “Move around obstacle,” which includes detecting the obstacle again to obtain its coordinates, planning a path past the obstacle, and tracking the planned trajectory. Finally, the D’Kitty robot performs the third skill, “Move to target,” by detecting the target point and arriving at the target. By using the three skills in sequence, the D’Kitty robot can autonomously detect both the obstacle and the target point, plan a reasonable route based on the detection results, and finally perform actions to achieve the sub-goals one by one. In this case, the C-body acts as the skill policy in order to generate the planning instructions, and the B-body acts as the parameter sub-policy in order to eventually reach the target point.
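The three-skill decomposition described above can be represented as a small data structure. The sketch below is only one possible encoding: the `Skill` class, the primitive strings, and the `execute` callback are hypothetical names introduced for illustration, and the parameter spaces are omitted.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    primitives: list  # ordered action primitives, described here as plain strings

SKILLS = [
    Skill("Move to obstacle", [
        "detect obstacle distance and shape",
        "approach turning point 1",
        "arrive in front of obstacle",
    ]),
    Skill("Move around obstacle", [
        "detect obstacle coordinates",
        "plan path past obstacle",
        "track planned trajectory",
    ]),
    Skill("Move to target", [
        "detect target point",
        "arrive at target",
    ]),
]

def run_skills(skills, execute):
    """Execute skills in order; execute(skill, primitive) returns True on success.

    Returns the names of completed skills and whether the whole sequence succeeded.
    """
    completed = []
    for skill in skills:
        for primitive in skill.primitives:
            if not execute(skill, primitive):
                return completed, False
        completed.append(skill.name)
    return completed, True
```

In this encoding the skill order plays the role of the planning instruction, while `execute` stands in for the lower-level policy that carries out each primitive.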
3.2.1. Knowledge utilization
The objective of knowledge utilization within the C-body is to generate sub-goals $g_{kN}$, $k = 1, \ldots, \lfloor T/N \rfloor$, conditioned on the task goal $s^*$, where $\lfloor x \rfloor$ computes the greatest integer less than or equal to $x$, and $N$ is a positive integer. This is achieved through model-based RL in this paper. Given our focus on goal-based tasks, the reward is computed by measuring how closely the perception state $s_t$ at time $t$ approaches the target state $s^*$. The reward is defined as $r_t = -d(s_t, s^*)$, where $d$ represents the distance measure function in the perception state space, such as the Euclidean norm used in our implementation. We opt for model-based policy learning in the C-body for two main reasons:
• The cognition dynamics operate within a concise and low-dimensional state space, rendering it more tractable to learn a model in this abstract domain.
• Learning the cognition dynamics is imperative because, as outlined below, the C-body is responsible for supplying both the P-body and B-body with sub-goals, which depend on the outputs of the cognition model.
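Various model-based RL techniques have been proposed; we select the iterative linear quadratic regulator (iLQR) method [52], which has been widely applied in robot control. The generation of sub-goals involves formulating an optimal control problem that maximizes the accumulated reward $r$ over the C-body actions $a^c_{kN}$, subject to the cognition dynamics $\mathcal{P}^c$ with knowledge $K$:

$$\max_{\{a^c_{kN}\}} \sum_{k=1}^{\lfloor T/N \rfloor} r_{kN} \quad \text{s.t.} \quad g_{(k+1)N} \sim \mathcal{P}^c(\cdot \mid g_{kN}, a^c_{kN}; K) \tag{7}$$

The solution results in a sub-goal sequence $\{g_{kN}\}$, which is then conveyed as guidance to the P-body and B-body.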
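Sub-goal generation can be illustrated with a simplified stand-in for the planner. The sketch below is not the iLQR implementation used in the paper: it assumes linear cognition dynamics and a quadratic cost on the distance to the target, in which case iLQR reduces to the finite-horizon LQR solved here by a backward Riccati recursion. The matrices `A`, `B`, `Q`, `R` and the function names are hypothetical, and the target is assumed to be an equilibrium of the dynamics (`A @ target == target`).

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Backward Riccati recursion; returns feedback gains ordered from first to last step."""
    P = Q.copy()  # terminal cost-to-go
    gains = []
    for _ in range(horizon):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

def plan_subgoals(s0, target, A, B, Q, R, horizon):
    """Roll the linear model forward under LQR feedback and return the visited states."""
    gains = lqr_gains(A, B, Q, R, horizon)
    e = s0 - target  # error coordinates relative to the target
    subgoals = []
    for K in gains:
        u = -K @ e
        e = A @ e + B @ u
        subgoals.append(e + target)
    return np.array(subgoals)
```

The returned rows play the role of the sub-goal sequence: each row is the state the model expects to reach at the next planning instant, and the sequence contracts toward the target as the weight on the action cost `R` is reduced.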
3.2.2. Knowledge update
In the Bcent framework, initialization of knowledge can occur in two ways: learning from prior knowledge or random initialization. When prior knowledge exists, it can be encoded into the knowledge base via function approximation or hard coding. Otherwise, in general cases without prior knowledge, the knowledge base can be initialized with random parameters. Then, the question arises: How do we update the knowledge $K$ (more specifically, the parameters of the cognition model $\mathcal{P}^c$)? In essence, the sub-goal $g_{(k+1)N}$ signifies the expected state projected by the C-body and should closely approximate the actual feedback $s_{(k+1)N}$. In other words, the learning objective of the cognition model is to minimize the residual error between $g_{(k+1)N}$ and $s_{(k+1)N}$ via the mean squared error. Given that $g_{(k+1)N} \sim \mathcal{P}^c(\cdot \mid g_{kN}, a^c_{kN}; K)$, we propose to instantiate $g_{kN}$ with the actual state $s_{kN}$ during the model learning process, as our experiments have shown that this instantiation accelerates learning significantly. Overall, the model learning problem can be formulated as follows:

$$\min_{K} \sum_{k} \left\| \hat{g}_{(k+1)N} - s_{(k+1)N} \right\|_2^2, \quad \hat{g}_{(k+1)N} \sim \mathcal{P}^c(\cdot \mid s_{kN}, a^c_{kN}; K)$$

where $a^c_{kN}$ is derived from the knowledge utilization in Eq. (7).
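The knowledge update can be sketched for the simplest possible model class. The example below assumes a linear cognition model fitted by ordinary least squares, which minimizes the mean squared error between the predicted and the actual next state; the inputs are the actual states, in line with the instantiation described above. The linear form and the function names are assumptions made for illustration.

```python
import numpy as np

def fit_cognition_model(states, actions, next_states):
    """Fit next_state ~ [state, action] @ W by least squares (minimum mean squared error).

    states, next_states: arrays of shape (n, state_dim) holding actual perceived states.
    actions: array of shape (n, action_dim) holding the C-body actions.
    """
    X = np.hstack([states, actions])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def predict(W, state, action):
    """Predicted next state (the expected state, i.e., the next sub-goal) for one input."""
    return np.concatenate([state, action]) @ W
```

A neural network trained with a squared-error loss would follow the same pattern, with `W` replaced by the network parameters that make up the knowledge base.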
3.2.3. P–C interaction
Once the optimal cognition transition probability model is learned, it can facilitate policy learning in both the P-body and the B-body. The sub-goals are integrated into the perception policy (Eq. (2)) and the long-term reward for the P-body for active sensing.
3.2.4. C–B interaction
For the B-body, the sub-goals will be harnessed as the expected states to guide behavior policy optimization, which will be introduced in the subsequent subsection.
In the aforementioned P–C and C–B interaction processes, the C-body leverages historical information and prior knowledge to update its knowledge $K$. This feature underscores the advantages of our formulation over prior works:
• Imitation of human cognition: The C-body emulates the environment’s characteristics and refines the knowledge of the planning process, aligning more closely with human cognitive growth.
• Sample efficiency: The introduction of the knowledge $K$ taps into historical information and prior knowledge, enhancing the sample efficiency of cognition training.
• Cooperativity: The C-body imparts planning information to both the P-body and the B-body, enabling them to execute their actions with increased precision and efficacy.
3.3. The behavior-body
The B-body orchestrates its actions in accordance with the sub-goals obtained from the C-body. To streamline this explanation, we will outline the process of reaching the sub-goal $g_{(k+1)N}$; the computational flows for other sub-goals follow a similar pattern. We consider the MDP below, where $kN \le t < (k+1)N$:

$$s_t = f(o_t; c_t)$$
$$a^b_t \sim \pi^b(\cdot \mid s_t, g_{(k+1)N})$$
$$o_{t+1} \sim \mathcal{P}^b(\cdot \mid o_t, a^b_t)$$

We have duplicated the state extraction stage from the P-body for the sake of clarity. Apart from this, $a^b_t$ is the action of the B-body derived by the behavior policy, which is applied to the environment, leading to the next observation $o_{t+1}$; $\pi^b$ is the behavior-body’s policy; and $\mathcal{P}^b$ is the behavior dynamics model.
The B-body is motivated by two primary objectives. First, it aims to accurately track the sub-goals provided by the C-body. We refer to this reward as the goal-guided reward $r^g_t$, which measures, for each $k$th phase, the distance from the state $s_t$ to the sub-goal $g_{(k+1)N}$. Second, in addition to reaching the target position, the B-body must fulfil other requisites (e.g., in navigation, the robot needs to learn how to stand up and avoid obstacles before reaching the target location). Therefore, we introduce another reward, $r^e_t$, that can be obtained from the environment. In combination, we compute the reward $r^b_t$ as a weighted combination of $r^g_t$ and $r^e_t$, where $\lambda$ is a weight parameter that balances the significance of the two rewards. To facilitate the learning of the behavior policy $\pi^b$, we lean toward employing model-free RL techniques. This choice is influenced by the typical complexity of behavior dynamics, which are often challenging to characterize accurately. In cases where the dynamics are straightforward or known, opting for model-based policy learning can offer the advantage of enhanced exploration efficiency.
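The combination of the two rewards can be sketched as below. The exact functional form is an assumption: here the goal-guided reward is taken to be the negative Euclidean distance to the current sub-goal, and the environment reward is added with a weight `lam`. Both the form and the default weight are illustrative only.

```python
import numpy as np

def behavior_reward(s_t, subgoal, env_reward, lam=0.5):
    """Goal-guided reward plus weighted environment reward (assumed additive form).

    s_t: current perceived state; subgoal: sub-goal from the planner.
    env_reward: task reward returned by the environment (e.g., for staying upright).
    lam: weight balancing the two terms (hypothetical default).
    """
    goal_reward = -float(np.linalg.norm(np.asarray(s_t) - np.asarray(subgoal)))
    return goal_reward + lam * env_reward
```

For instance, a state at distance 5 from the sub-goal with an environment reward of 2 and `lam=0.5` gives a combined reward of -4.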
After receiving the guided sub-goals from the C-body, the B-body divides the original long-term complex task into short-term simple sequential subtasks. Within each subtask, the B-body performs actions to reach the given sub-goal. When completing the subtask, the B-body feeds the real exploration results back to the C-body in order to update the knowledge base.
We summarize the advantages of the B-body compared with conventional RL methods:
• Reducibility: By leveraging planning information, the B-body can decompose the original task into a series of simpler sub-tasks. The successful completion of each sub-task contributes to the successful achievement of the overarching goal.
• Interactivity: Following interaction with the environment, the interactive residual error computed by the B-body effectively contributes to the update of the C-body’s knowledge.
• Enhanced efficiency: The task decomposition into sequential sub-tasks enables the B-body to enhance the interaction efficiency, facilitating effective and streamlined execution.
4. Experiments
All experiments were conducted using a four-legged robot called D’Kitty. The primary objective of the robot was to navigate to a predefined target location. Below is a description of the implementation of our experimental framework:
• The overhead camera functions as the P-body and is responsible for tracking the location of D’Kitty. To enhance its tracking capabilities, the camera can move freely in three dimensions (3D): up and down, left and right, and forward and backward.
• D’Kitty, which has 12 degrees of freedom for locomotion, serves as the B-body. Its task includes learning how to stand up and move forward, based on the angles of its legs detected by internal sensors and the two-dimensional (2D) location coordinates provided by the P-body.
• The planner, acting as the C-body, is tasked with generating a virtual trajectory of sub-goals leading to the target location. This trajectory guides the movements of both the P-body and the B-body.
The subsequent sections present a series of experiments to demonstrate the necessity of interactions between the three bodies in the context of the navigation task. More specifically, we conduct evaluations for the perception task, followed by an investigation into the navigation task.
When conducting these experiments, we instantiate the symbols in the Bcent framework as follows: $c_t$ represents the intrinsic and extrinsic parameters of the camera; $o_t$ is a top-view image received by the camera; $s_t$ denotes the position and velocity of D’Kitty extracted from the image, as well as the angle and angular velocity of D’Kitty detected by the internal sensors; $\pi^p$ is the control strategy of the camera configuration; $K$ represents the parameters of the prior knowledge structure of D’Kitty and the task; $g_{(k+1)N}$ is the expected position and attitude of D’Kitty in the next step; $\pi^c$ is the cognition strategy for the position and attitude of D’Kitty; and $a^b_t$ is the control input of D’Kitty, such as the joint torques.
While our scenario may seem straightforward, with only one obstacle, it presents considerable complexity. Traditional control-based methods struggle to meet our task requirements due to two primary challenges. The first challenge involves the robot’s need to perceive its environment, such as identifying the obstacle’s location and size. Without the perception module (the P-body), relying solely on low-level sensing makes obstacle avoidance difficult. The second challenge arises from the unique configuration of the D’Kitty robot, which is equipped with four legs and 12 degrees of freedom. In our task, the robot must know how to stand up and move. It is difficult to control all four legs by means of traditional learning-free methods.
In general, both the perception and the behavior capabilities mentioned above are better achieved through learning-based methods rather than traditional control-based approaches.
4.1. Evaluations of perception
In our experiment, we enable the camera to move in three dimensions, allowing us to thoroughly assess the significance of active perception in the robot tasks and fully explore the potential of active perception in a perfect setting. At time , the P-body (represented by the camera) receives a top-view image of the environment denoted as , based on the internal and external parameters of the camera . Using this input, the P-body employs an object-detection neural network to detect the 2D state of the moving D’Kitty, as described by in Eq. (1). Our P-body exhibits two distinct advantages compared with previous object-detection methods:
•Active perception: The P-body has the capability to move and ensure that D’Kitty remains within its field of view. This ability is achieved through its perception policy .
•P–C interaction: The trajectory envisioned by the C-body serves as valuable guidance for the P-body, enabling it to better track D’Kitty’s behavior. This results in an improved policy , where represents the sub-goal at time in Eq. (4).
In this section, we will demonstrate the benefits of these specific designs. An overview of the experimental flowchart is provided in Fig. 4, Fig. 5.
We compiled two distinct datasets for training: an offline dataset and an online dataset. The offline dataset was constructed using DeepMind Control Suite and MuJoCo in combination with the ROBEL environment, with a fixed camera integrated into the setup. The implementation details of the P-body are shown in Fig. 6. After training on this offline dataset, the passive perception module was established and kept static for the subsequent training of the active perception module. The online dataset was generated through real-time interactions between the environment and the moving camera. The online images do not need to be stored, since the active perception model engages with the environment dynamically. All images have the same input size.
4.1.1. Evaluation metrics
Our evaluation employs five key metrics: precision, recall, F1 score, F2 score, and mean average precision (mAP). The utilization of the mAP is a prevalent practice in the computer vision research community for assessing the performance of object-detection models.
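For reference, the first four metrics follow directly from counts of true positives, false positives, and false negatives. The sketch below assumes those counts are already available; the matching step that produces them and the mAP computation are omitted. The function name and the example counts are illustrative and are not taken from our experiments.

```python
def detection_scores(tp: int, fp: int, fn: int) -> dict:
    """Precision, recall, F1, and F2 from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    def f_beta(beta: float) -> float:
        # F-beta weights recall beta times as heavily as precision.
        denom = beta ** 2 * precision + recall
        return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f_beta(1.0),
        "f2": f_beta(2.0),
    }


# Illustrative counts only.
scores = detection_scores(tp=8, fp=2, fn=4)
```

Because the F2 score weights recall more heavily than precision, a detector that over-collects candidates can score well on recall and F2 while still losing on precision and mAP.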
4.1.2. Results and analyses
We present the quantitative outcomes in Table 1 [33,52], highlighting the best results in bold font. In this table, the method labelled “fixed” [33] denotes the scenario where the camera is stationary. The “fixed” approach delivers the weakest performance because D’Kitty consistently moves out of the camera’s field of view. The term “random” signifies the case where the camera is set in motion by initializing three vectors that determine the extent of movement along the x, y, and z axes. The “active” method [52] refers to active perception, where the perception agent is trained by an RL method with hand-crafted rewards in order to adjust the sensors’ configurations and positions. The designation “policy” refers to a variation of our method that lacks the guidance of sub-goals, while “policy + sub-goal” represents the complete implementation of our proposed method.
Our method outperforms all three baselines on mAP, the most fundamental and rigorous of the evaluation metrics. This outcome substantiates the superiority of our approach. It is noteworthy that the recall of the “policy” method surpasses that of our full method. We speculate that, without assistance from sub-goals, the “policy” method tends to gather an excessive number of samples at the beginning. This likely contributes to its higher recall but lower precision in comparison with our method.
4.2. Evaluations of locomotion and navigation
The experiments conducted in this subsection provide a comprehensive validation of D’Kitty’s capabilities with respect to both locomotion (i.e., tasks such as standing up and walking) and navigation (i.e., strategies for reaching a designated target location).
4.2.1. Observation and sub-goal spaces
In these evaluations, the complete observation space for D’Kitty at a given time step comprises a rich set of variables. It includes D’Kitty’s current position and velocity, and the degree of alignment of its heading with respect to the target location. In addition, the 12 joint angles and their corresponding angular velocities are included in this space. The state vectors and the sub-goal are both tied to the 2D location of D’Kitty within the environment.
4.2.2. Predictive planning
Upon receiving the initial state and target position, the C-body formulates a predictive trajectory of states. Then, conditioned on each predicted state, the behavior policy of the B-body produces a series of actions with the objective of attaining the subsequent sub-goal within a fixed number of steps.
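The planning loop described above can be sketched with simple stand-ins. In the sketch below, straight-line interpolation replaces the learned C-body model and a proportional step replaces the learned B-body policy; both are assumptions made only to keep the example runnable, and every function name, gain, and step count is hypothetical.

```python
import numpy as np


def plan_subgoals(start, target, num_subgoals):
    """Stand-in for the C-body: evenly spaced 2D sub-goals from start to target."""
    start = np.asarray(start, dtype=float)
    target = np.asarray(target, dtype=float)
    fractions = np.linspace(0.0, 1.0, num_subgoals + 1)[1:]  # skip the start itself
    return [start + f * (target - start) for f in fractions]


def follow_subgoal(state, subgoal, steps_per_subgoal, gain=0.5):
    """Stand-in for the B-body: close a fraction of the remaining gap each step."""
    for _ in range(steps_per_subgoal):
        state = state + gain * (subgoal - state)
    return state


state = np.array([0.0, 0.0])
target = np.array([2.0, 1.0])
subgoals = plan_subgoals(state, target, num_subgoals=4)

residuals = []
for subgoal in subgoals:
    state = follow_subgoal(state, subgoal, steps_per_subgoal=5)
    # Gap between the sub-goal and the state actually reached in this window.
    residuals.append(float(np.linalg.norm(subgoal - state)))
```

The per-window gaps collected in `residuals` play the role of the error signal that is fed back to the planner after each interaction window.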
4.2.3. Knowledge update
Following each interaction window, the accumulated residual error is conveyed to the C-body, which uses it to update its knowledge parameters through Eq. (10). Concurrently, the B-body refines its policy to reduce this error, thereby improving performance in subsequent training iterations.
In essence, this cohesive interaction loop seamlessly integrates perception, planning, and action, fostering the acquisition of enhanced knowledge by the C-body while concurrently enabling the B-body to fine-tune its policy for optimized execution.
4.2.4. Implementation details
Within the C-body, the cognition transition probability model is instantiated as a multilayer perceptron (MLP). During each episode, the C-body provides sub-goals that guide the interactions of the B-body with the environment. The experiences obtained from these interactions are stored in a cache for self-supervised C-body training. To overcome non-stationarity issues caused by the simultaneous training of the P-body and B-body, an “all-in all-out” training strategy is employed. This involves the cyclic processing of the cache, wherein experiences are divided into training and test sets for iterative model training. The implementation details of the C-body and B-body, as well as their interaction, are illustrated in Fig. 7.
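One plausible reading of the “all-in all-out” strategy is sketched below: experiences are collected until the cache is full, the whole cache is then released at once as a training split and a test split, and collection starts over. The class name, capacity, and split ratio are hypothetical, and the actual cache handling may differ.

```python
import random


class AllInAllOutCache:
    """Fill the cache completely, release everything at once, then start over."""

    def __init__(self, capacity: int, test_fraction: float = 0.2, seed: int = 0):
        self.capacity = capacity
        self.test_fraction = test_fraction
        self._rng = random.Random(seed)
        self._items = []

    def add(self, experience) -> bool:
        """Store one experience; return True once the cache is full."""
        self._items.append(experience)
        return len(self._items) >= self.capacity

    def drain(self):
        """Shuffle, split into (train, test), and empty the cache."""
        items, self._items = self._items, []
        self._rng.shuffle(items)
        n_test = int(len(items) * self.test_fraction)
        return items[n_test:], items[:n_test]


cache = AllInAllOutCache(capacity=10)
full = False
for step in range(10):
    # Hypothetical (state, next_state) pair standing in for a real transition.
    full = cache.add((step, step + 1))
train_set, test_set = cache.drain()
```

Draining the whole cache before collecting again means each model update sees data gathered under a single version of the other bodies' policies, which is one way to limit non-stationarity.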
For the B-body, the task is divided into a sequence of goal-conditioned subtasks of fixed interval length. Each subtask incorporates a sub-goal from the C-body to facilitate interactions with the environment. The behavior policy is trained using PPO, where a three-layer MLP serves as the policy approximator. All hyper-parameters are listed in Table 2.
4.3. Evaluation results
To assess the training and testing stages, the D’Kitty robot’s initial location is fixed at the origin (0, 0) with a standard stance, and target locations are chosen randomly within the x–y coordinate space. After every 50 training episodes, ten tests are conducted, and the average cumulative reward and success rate are computed for presentation.
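The evaluation schedule can be sketched as follows. Only the interval of 50 episodes and the ten tests per evaluation come from the text; the helper names and the example outcomes are illustrative.

```python
def should_evaluate(episode: int, eval_every: int = 50) -> bool:
    """True after every `eval_every` completed training episodes."""
    return episode > 0 and episode % eval_every == 0


def summarize_tests(results):
    """Average cumulative reward and success rate over the test episodes.

    `results` is a list of (cumulative_reward, success) pairs.
    """
    n = len(results)
    mean_reward = sum(reward for reward, _ in results) / n
    success_rate = sum(1 for _, success in results if success) / n
    return mean_reward, success_rate


# Ten made-up test outcomes, used only to exercise the helper.
tests = [(100.0, True)] * 7 + [(40.0, False)] * 3
mean_reward, success_rate = summarize_tests(tests)
```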
4.3.1. Reward function and success indicator
The reward function r follows Ref. [54] and combines several terms: a reward that encourages the robot to keep upright, based on the cosine of the angle between the body normal and the ground; a term based on the distance from the current state to the target; a heading term given by the cosine of the angle of the robot’s orientation in relation to the target line; and two bonuses for reaching the desired target. From this reward, it can be seen that the task requires the simulated D’Kitty both to reach the target and to maintain its balance.
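The reward terms described above can be combined as in the sketch below. It is only an illustration of the structure: every weight, threshold, and bonus condition is a placeholder assumption rather than a value used in our experiments, and the function and argument names are hypothetical.

```python
def walk_reward(upright, target_dist, heading,
                upright_threshold=0.9, dist_threshold=0.5, heading_threshold=0.9,
                w_upright=1.0, w_fall=-100.0, w_dist=-4.0, w_heading=2.0,
                bonus_small=5.0, bonus_big=10.0):
    """Illustrative reward; all weights and thresholds are placeholders.

    upright:     cosine term measuring how upright the body is
    target_dist: distance from the current position to the target
    heading:     cosine of the angle between the heading and the target line
    """
    fallen = upright < upright_threshold
    near = target_dist < dist_threshold
    aligned = heading > heading_threshold

    reward = w_upright * upright                 # encourage staying upright
    reward += w_fall * fallen                    # penalty once the robot falls
    reward += w_dist * target_dist               # cost grows with distance
    reward += w_heading * heading                # encourage facing the target
    reward += bonus_small * near                 # bonus for reaching the target
    reward += bonus_big * (near and aligned)     # larger bonus when also aligned
    return float(reward)


at_target = walk_reward(upright=1.0, target_dist=0.0, heading=1.0)
far_away = walk_reward(upright=1.0, target_dist=2.0, heading=0.0)
fallen_far = walk_reward(upright=0.5, target_dist=2.0, heading=0.0)
```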
The success indicator checks whether, at the last step of the episode, the goal distance is within a certain threshold and D’Kitty is sufficiently upright. It is evaluated on the state-action trajectory τ of a single task, using the distance from the final position of D’Kitty to the target when the task completes and the cosine of the final angle when the task completes.
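A minimal sketch of such an indicator is given below. The two threshold values are placeholders chosen for illustration, and the function and argument names are hypothetical.

```python
def is_success(final_target_dist: float, final_upright: float,
               dist_threshold: float = 0.5, upright_threshold: float = 0.9) -> bool:
    """True if the final state is close to the target and sufficiently upright.

    Both thresholds are placeholder values, not the ones used in the experiments.
    """
    return final_target_dist < dist_threshold and final_upright > upright_threshold


reached_and_upright = is_success(final_target_dist=0.2, final_upright=0.98)
reached_but_fallen = is_success(final_target_dist=0.2, final_upright=0.3)
upright_but_far = is_success(final_target_dist=1.5, final_upright=0.98)
```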
4.3.2. Results and analysis
The evaluation results for the success rate and reward are depicted in Figs. 8 and 9, respectively. We also provide visualization results in Video S1 in Appendix A to show the dynamic motion process. Two variants of the method are compared: one employing the ground-truth states, denoted as CB-body (the C-body and the B-body), and the other employing the states predicted by the P-body, denoted as PCB-body (the P-body, C-body, and B-body). Several baseline methods utilizing PPO [49], SAC [41], or the temporal difference model (TDM) [53] for individual B-body control are also included. CB-body exhibits better control performance and success rates than the baselines and is resilient to local optima. PPO, however, converges to a lower reward due to limited exploration: the robot learns only to stand and orient itself toward the target, without mastering the walking motion.
When comparing PCB-body with CB-body, a minor performance drop is observed. This is because the states predicted by the P-body do not precisely match the ground truth. In PCB-body, ground-truth perception information, such as the position and velocity of the D’Kitty robot and the coordinates of the target and obstacle, is unavailable. Instead, the P-body extracts this information from the images, and the C-body and B-body then use it to drive the robot to the target. In short, PCB-body operates in a more challenging setting than CB-body, yet its performance degradation is slight, which demonstrates the effectiveness of the cooperation among the three bodies in the Bcent framework. Moreover, PCB-body significantly outperforms the baselines, highlighting the efficacy of the interplay among the three bodies in enhancing control performance and task success rate.
We also report the training wall time and the computing infrastructure in Table 3 to show the computational efficiency. The results show that, even with the synchronous training of the three bodies, Bcent can be trained efficiently on general computing infrastructure.
5. Discussion and Conclusions
In this paper, we introduce a pioneering decision-making framework that harnesses the collaborative development of a P-body, a C-body, and a B-body. Our framework transcends the conventional agent paradigm by incorporating the C-body, enabling a more faithful emulation of human cognitive processes and dynamic decision-making. The physical synergy among these three bodies contributes to their collective enhancement. Through extensive experiments involving D’Kitty’s navigation and locomotion, we validate the necessity and effectiveness of each proposed component compared with existing methods and alternative variations.
This work lays a foundation for numerous future explorations in cognitive robotics:
•The collaboration of multiple cognitive bodies: Drawing inspiration from the biology of creatures like octopuses, in which different brains serve distinct purposes, a fascinating avenue would be to explore how multiple cognitive bodies can synergize to achieve superior performance.
•Human–robot collaboration in dynamic environments: As robots increasingly collaborate with humans and fellow robots in diverse tasks, there is a compelling need to study how robots can effectively interact with dynamic environments while working alongside humans.
•Distributed sensing and learning in the cloud: The concept of cloud robots is gaining momentum. The integration of distributed local sensing with the global sharing of information through the cloud could revolutionize robotic capabilities.
•Non-Markov decision-making: Traditional decision processes are often Markovian, with the agent lacking the capability to remember past trajectories. Embracing non-Markovian decision processes could allow robots to leverage their complete history and experiences, leading to more informed and holistic decision-making.
In conclusion, our framework introduces a new perspective in the field of cognitive robotics, emphasizing collaboration among distinct cognitive bodies. This work opens the door to a multitude of avenues for future research and innovation. Our proposed Bcent framework is versatile and can be applied to various types of robots or tasks by selecting specific state and action spaces for the three bodies involved. In human–robot scenarios, potential challenges exist in two aspects. In terms of perception, in order to communicate better with humans, a robot should understand human intentions and know what the humans plan to do during human–robot interactions. In terms of behavior, the robot should be able to learn from human demonstrations and imitate human skills in order to improve its behavior capability. In our Bcent framework, in addition to perception and behavior, we utilize cognition to update the knowledge in order to realize continuous, life-long learning, which can help Bcent deal with complex human–robot interaction scenarios.
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was jointly funded by the National Science and Technology Major Project of the Ministry of Science and Technology of China (2018AAA0102900) and the “New Generation Artificial Intelligence” Key Field Research and Development Plan of Guangdong Province (2021B0101410002).
References
[1]
Miriyev A, Kovač M. Skills for physical artificial intelligence. Nat Mach Intell 2020;2(11):658–660.
[2]
Murray RM. Feedback systems: an introduction for scientists and engineers. Princeton: Princeton University Press; 2010.
[3]
Sünderhauf N, Brock O, Scheirer W, Hadsell R, Fox D, Leitner J, et al. The limits and potentials of deep learning for robotics. Int J Robot Res 2018;37(4–5):405–420.
[4]
Wang W, Siau K. Artificial intelligence, machine learning, automation, robotics, future of work and future of humanity: a review and research agenda. J Database Manage 2019;30(1):61–79.
Kretzschmar H, Spies M, Sprunk C, Burgard W. Socially compliant mobile robot navigation via inverse reinforcement learning. Int J Robot Res 2016;35(11):1289–1307.
[7]
Kohl N, Stone P. Policy gradient reinforcement learning for fast quadrupedal locomotion. In: Proceedings of the IEEE International Conference on Robotics and Automation; 2004 Apr 26–May 1; New Orleans, LA, USA. New York City: IEEE; 2004. p. 2619–24.
[8]
Akkaya I, Andrychowicz M, Chociej M, Litwin M, McGrew B, Petron A, et al. Solving Rubik’s cube with a robot hand. 2019. arXiv: 1910.07113.
[9]
Zhang K, Yang Z, Basar T. Multi-agent reinforcement learning: a selective overview of theories and algorithms. In: Vamvoudakis KG, Wan Y, Lewis FL, Cansever D, editors. Handbook of reinforcement learning and control. Berlin: Springer; 2021. p. 321–384.
[10]
Ahn C, Kim E, Oh S. Deep elastic networks with model selection for multi-task learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2019 Oct 27–Nov 2; Seoul, Republic of Korea. New York City: IEEE; 2019. p. 6529–38.
[11]
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016 Jun 27–30; Las Vegas, NV, USA. New York City: IEEE; 2016. p. 770–8.
[12]
Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. 2017. arXiv: 1704.04861.
[13]
Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Proceedings of the 26th Annual Conference on Neural Information Processing Systems; 2012 Dec 3–6; Lake Tahoe, NV, USA. Trier: the dblp computer science bibliography; 2012. p. 1097–105.
[14]
Girshick R. Fast R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision; 2015 Dec 7–13; Santiago, Chile. New York City: IEEE; 2015. p. 1440–48.
[15]
Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2014 Jun 23–28; Columbus, OH, USA. New York City: IEEE; 2014. p. 580–7.
[16]
He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 2015;37(9):1904–1916.
[17]
Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017 Jul 21–26; Honolulu, HI, USA. New York City: IEEE; 2017. p. 7263–71.
[18]
Redmon J, Farhadi A. YOLOv3: an incremental improvement. 2018. arXiv: 1804.02767.
[19]
He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision; 2017 Oct 22–29; Venice, Italy. New York City: IEEE; 2017. p. 2961–9.
[20]
Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. In: Proceedings of the Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems; 2015 Dec 7–12; Montreal, QC, Canada. Cambridge: The MIT Press; 2015. p. 91–9.
[21]
Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition-2017; 2017 Jul 21–26; Honolulu, HI, USA. New York City: IEEE; 2017. p. 2117–25.
[22]
Redmon J, Divvala S, Girshick R, Farhadi A. You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition-2016; 2016 Jun 27–30; Las Vegas, NV, USA. New York City: IEEE; 2016. p. 779–88.
[23]
Liu W, Anguelov D, Erhan D, Szegedy C, Reed S, Fu CY, et al. SSD: single shot multibox detector. In: Proceedings of the European Conference on Computer Vision; 2016 Oct 11–14; Amsterdam, the Netherlands. Berlin: Springer; 2016. p. 21–37.
[24]
Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision; 2017 Oct 22–29; Venice, Italy. New York City: IEEE; 2017. p. 2980–8.
[25]
Law H, Deng J. CornerNet: detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision (ECCV 2018); 2018 Sep 8–14; Munich, Germany. Berlin: Springer; 2018. p. 734–50.
[26]
Zhou X, Zhuo J, Krahenbuhl P. Bottom-up object detection by grouping extreme and center points. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition-2019; 2019 Jun 15–20; Long Beach, CA, USA. New York City: IEEE; 2019. p. 850–9.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16 × 16 words: transformers for image recognition at scale. 2020. arXiv: 2010.11929.
[29]
Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021 Oct 10–17; Montreal, QC, Canada. New York City: IEEE; 2021. p. 10012–22.
[30]
Fang Y, Liao B, Wang X, Fang J, Qi J, Wu R, et al. You only look at one sequence: rethinking transformer in vision through object detection. In: Proceedings of the 35th Annual Conference on Neural Information Processing; 2021 Dec 6–14; online. San Diego: Neural Information Processing Systems; 2021.
[31]
Song H, Sun D, Chun S, Jampani V, Han D, Heo B, et al. ViDT: an efficient and effective fully transformer-based object detector. 2021. arXiv: 2110.03921.
[32]
Jing M, Ma X, Huang W, Sun F, Yang C, Fang B, et al. Reinforcement learning from imperfect demonstrations under soft expert guidance. Proc Conf AAAI Artif Intell 2020;34(04):5109–5116.
Liu H, Wang F, Guo D, Liu X, Zhang X, Sun F. Active object discovery and localization using sound-induced attention. IEEE Trans Industr Inform 2021;17(3):2021–2029.
[35]
Bajcsy R, Aloimonos Y, Tsotsos JK. Revisiting active perception. Auton Robots 2018;42(2):177–196.
[36]
Liu H, Den Y, Guo D, Fang B, Sun F, Yang W. An interactive perception method for warehouse automation in smart cities. IEEE Trans Industr Inform 2021;17(2):830–838.
[37]
Silver D, Singh SP, Precup D, Sutton RS. Reward is enough. Artif Intell 2021;299:103535.
[38]
Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, et al. Human-level control through deep reinforcement learning. Nature 2015;518(7540):529–533.
[39]
Sutton RS, McAllester D, Singh S, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. In: Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS 1999); 1999 Nov 29–Dec 4; Denver, CO, USA. Cambridge: The MIT Press; 1999.
[40]
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, Tassa Y, et al. Continuous control with deep reinforcement learning. In: Proceedings of the 4th International Conference on Learning Representations, ICLR 2016; 2016 May 2–4; San Juan, Puerto Rico. Trier: the dblp computer science bibliography; 2016.
[41]
Haarnoja T, Zhou A, Abbeel P, Levine S. Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings of the 35th International Conference on Machine Learning; 2018 Jul 10–15; Stockholm, Sweden. New York City: Proceedings of Machine Learning Research; 2018. p. 1856–65.
Janner M, Fu J, Zhang M, Levine S. When to trust your model: model-based policy optimization. In: Proceedings of the Annual Conference on Neural Information Processing Systems; 2019 Dec 8–14; Vancouver, BC, Canada. San Diego: Neural Information Processing Systems Foundation, Inc.; 2019. p. 12498–09.
[44]
Tassa Y, Erez T, Todorov E. Synthesis and stabilization of complex behaviors through online trajectory optimization. In: Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems; 2012 Oct 7–12; Vilamoura, Portugal. New York City: IEEE; 2012. p. 4906–13.
[45]
Zhou Z, Yan N. A survey of numerical methods for convection-diffusion optimal control problems. J Numer Math 2014;22(1):61–85.
[46]
De Boer PT, Kroese DP, Mannor S, Rubinstein RY. A tutorial on the cross-entropy method. Ann Oper Res 2005;134(1):19–67.
[47]
Chua K, Calandra R, McAllister R, Levine S. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In: Proceedings of the Annual Conference on Neural Information Processing Systems; 2018 Dec 3–8; Montreal, QC, Canada. Red Hook: Curran Associates Inc.; 2018. p. 4759–70.
[48]
Yildiz C, Heinonen M, Lähdesmäki H. Continuous-time model-based reinforcement learning. In: Proceedings of the International Conference on Machine Learning; 2021 Jun 18–24; online. New York City: Proceedings of Machine Learning Research; 2021. p. 12009–18.
Schulman J, Levine S, Abbeel P, Jordan M, Moritz P. Trust region policy optimization. In: Proceedings of the International Conference on Machine Learning; 2015 Jul 6–11; Lille, France. New York City: Proceedings of Machine Learning Research; 2015. p. 1889–97.
[51]
Chang D, Johnson-Roberson M, Sun J. An active perception framework for autonomous underwater vehicle navigation under sensor constraints. IEEE Trans Control Syst Technol 2022;30(6):2301–2316.
[52]
Amos B, Jimenez I, Sacks J, Boots B, Kolter JZ. Differentiable MPC for end-to-end planning and control. In: Proceedings of the 32nd Annual Conference on Neural Information Processing Systems (NIPS 2018); 2018 Dec 2–8; Montreal, QC, Canada. San Diego: Neural Information Processing Systems; 2018.
[53]
Pong V, Gu S, Dalal M, Levine S. Temporal difference models: model-free deep RL for model-based control. In: Proceedings of the International Conference on Learning Representations; 2018 Apr 30–May 3; Vancouver, BC, Canada. Trier: the dblp computer science bibliography; 2018.
[54]
Ahn M, Zhu H, Hartikainen K, Ponte H, Gupta A, Levine S, Kumar V. ROBEL: robotics benchmarks for learning with low-cost robots. In: Proceedings of the Conference on Robot Learning; 2020 Nov 16–18; online; 2020.