Toward Trustworthy Decision-Making for Autonomous Vehicles: A Robust Reinforcement Learning Approach with Safety Guarantees

Xiangkun He , Wenhui Huang , Chen Lv

Engineering ›› 2024, Vol. 33 ›› Issue (2): 86–99. DOI: 10.1016/j.eng.2023.10.005

Research

Abstract

While autonomous vehicles are vital components of intelligent transportation systems, ensuring the trustworthiness of decision-making remains a substantial challenge in realizing autonomous driving. Therefore, we present a novel robust reinforcement learning approach with safety guarantees to attain trustworthy decision-making for autonomous vehicles. The proposed technique ensures decision trustworthiness in terms of policy robustness and collision safety. Specifically, an adversary model is learned online to simulate the worst-case uncertainty by approximating the optimal adversarial perturbations on the observed states and environmental dynamics. In addition, an adversarial robust actor-critic algorithm is developed to enable the agent to learn robust policies against perturbations in observations and dynamics. Moreover, we devise a safety mask to guarantee the collision safety of the autonomous driving agent during both the training and testing processes using an interpretable knowledge model known as the Responsibility-Sensitive Safety Model. Finally, the proposed approach is evaluated through both simulations and experiments. These results indicate that the autonomous driving agent can make trustworthy decisions and drastically reduce the number of collisions through robust safety policies.

Keywords

Autonomous vehicle / Decision-making / Reinforcement learning / Adversarial attack / Safety guarantee

Cite this article

Xiangkun He, Wenhui Huang, Chen Lv. Toward Trustworthy Decision-Making for Autonomous Vehicles: A Robust Reinforcement Learning Approach with Safety Guarantees. Engineering, 2024, 33(2): 86-99 DOI:10.1016/j.eng.2023.10.005


1. Introduction

In recent years, autonomous vehicles have gained momentum with the rapid development of emerging technologies such as advanced mobile communication [1] and artificial intelligence (AI) [2], and are expected to revolutionize human mobility and transportation systems [3], [4], [5]. However, real-world traffic scenarios involve unpredictable noise or uncertainties, making it challenging to ensure the robustness and safety of driving policies. Hence, the trustworthiness of autonomous driving raises major concerns for various institutions and the general public [6], [7], [8]. Given these intricate challenges, meeting the rigorous requirements and high expectations pertaining to autonomous driving remains a significant concern [9], [10], [11].

The decision-making system can be likened to the brain of an autonomous vehicle, primarily responsible for determining the optimal driving mode or policy based on perception information [12], [13], [14]. Numerous studies have reported advances in decision-making methods for autonomous driving [15], [16], [17]. The finite-state machine (FSM), a rule-based technique, is the most popular approach for developing decision-making systems [18], [19]. Although such a scheme is simple to implement and interpret, it relies heavily on the prior knowledge of specialists, thus making it difficult to design driving rules for complex traffic scenarios.

As a vital component of modern AI technologies, reinforcement learning (RL) provides a feasible and effective paradigm for solving complex sequential decision-making tasks via interactions with an environment [20], [21], [22]. Consequently, several studies have applied various RL methods to sequential decision-making tasks in autonomous driving [23], [24], [25]. Researchers have leveraged RL algorithms to learn lane-change policies for autonomous driving [26], [27]. For instance, a lane-change decision-making framework for autonomous vehicles was developed using a risk-awareness-prioritized replay deep Q-network (RA-PRDQN) method [28]. A safe lane-change decision scheme for autonomous driving was developed using an RL approach with rule-based safety verification [29]. Some studies have employed RL algorithms to learn optimal target speeds or speed patterns (e.g., acceleration, deceleration, and maintenance) of autonomous vehicles [30], [31]. For example, a cooperation-aware on-ramp merging decision-making scheme for autonomous vehicles was developed using the belief-state RL method [32]. The subgoal-based speed patterns of autonomous vehicles were determined using a state-attention-model-based hierarchical RL approach [33]. To ensure the robustness of on-ramp merging policies against environmental uncertainties, a robust decision-making solution for autonomous driving was proposed using a constrained adversarial RL technique [34]. Many researchers have leveraged RL algorithms to simultaneously learn optimal lane-change policies and speed patterns of autonomous vehicles [35], [36], [37]. For instance, longitudinal and lateral decision-making behaviors for autonomous driving can be learned via a double deep Q-network (DDQN) with a short-horizon safety checker [38], while target speeds and lane-change policies of autonomous vehicles can be determined using a hierarchical program-triggered RL technique based on multiple agents [39]. In another study, a trustworthy improvement RL scheme with a rule-based policy was developed to enable an autonomous driving agent to learn safe longitudinal and lateral driving velocities [40].

Although existing research on driving decisions has achieved numerous compelling results that can enhance the performance of autonomous vehicles, there is still room for improvement and perfection in terms of trustworthiness. Moreover, most studies assume that traffic scenarios are devoid of environmental uncertainty or involve only one specified type of uncertainty. Unfortunately, real-world scenarios involve substantial and inevitable uncertainties that can cause autonomous driving agents to make undesired or even unsafe decisions. In real-world traffic scenarios, multiple sources of uncertainty, such as observational noise and environmental changes, may coexist, leading to complex and challenging driving situations. Hence, policy robustness against multiple uncertainties should be considered in the autonomous driving domain. However, few studies have addressed the challenge of guaranteeing the safety of RL-based autonomous driving agents during training and testing in stochastic dynamic traffic flows with adversarial environmental uncertainties.

Consequently, all the above insights motivated us to explore a new technique to ensure the trustworthiness of autonomous driving decisions, including policy robustness and collision safety. In this study, we introduce a novel robust RL approach with safety guarantees (RRL-SG) aimed at achieving trustworthy decision-making for autonomous vehicles. The main contributions of this study are summarized as follows:

(1) An adversarial agent is trained online to model the worst-case multiple uncertainties by approximating the optimal adversarial perturbations for both observed states and environmental dynamics. An adversarial robust actor-critic (ARAC) algorithm is developed to enable the agent to learn robust policies against observational noises and environmental changes.

(2) Using an interpretable knowledge model proposed by Intel, Responsibility-Sensitive Safety (RSS) [41], [42], a safety mask is developed to guarantee the collision safety of the autonomous driving agent during both the training and testing processes, which can transform the probability corresponding to an unsafe decision into zero (i.e., a safe action space is formed by shielding risky actions).

(3) Numerical simulation results with Simulation of Urban Mobility (SUMO) [43] indicate that the proposed RRL-SG approach guarantees the trustworthiness of autonomous vehicles in stochastic dynamic traffic flows with adversarial environmental perturbations. Experiments using a real autonomous vehicle further confirm the effectiveness of the proposed technique.

The remainder of this paper is organized as follows. Section 2 describes the proposed RRL-SG solution. Section 3 presents details of the technical implementation. Section 4 details the simulations and experiments, and analyzes the resulting performance. Finally, Section 5 concludes the study.

2. Methodology

2.1. Overview

In this section, we provide an overview of the proposed technique. Fig. 1 illustrates a block diagram of our RRL-SG framework designed to realize trustworthy decision-making for autonomous vehicles. Δo and Δd represent the optimal adversarial perturbations on observed states and environmental dynamics, respectively. Ms, s, a, r, and π denote the safety mask, state, action, reward, and policy of the agent, respectively. πs represents a safe policy. t is the time step and T is the last time step. Δ, γ, β, and Qπ denote the environmental uncertainty, discount factor, weight, and action–value function in our optimization objectives, respectively.

The input of the adversary model is the state s of the agent, and its output contains adversarial perturbations Δo and Δd. Δo simulates the worst-case observational noise, which aims to maximize the average variation distance on perturbed policies. Moreover, Δd models the worst-case environmental dynamics uncertainty, which seeks to minimize the expected return of the agent.

The input to the RSS-based safety mask is the state s of the agent. The safety mask creates a safe action space by shielding risky actions. Hence, the autonomous driving agent interacts with the environment through actions sampled from the safe policy πs. The ARAC algorithm enables the agent to learn robust policies against perturbations in observations and dynamics.

Our autonomous driving agent is the gold-colored intelligent vehicle shown in Fig. 1. The surrounding vehicles of other colors are controlled using the intelligent driver model (IDM) in SUMO. The action space of our autonomous driving agent is discrete, encompassing five distinct decision-making behaviors: maintaining the current state, accelerating, decelerating, and changing lanes to either the left or the right.

2.2. Adversary model

The adversary model aims to generate optimal adversarial perturbations in the observed states and environmental dynamics.

To measure the variations in the policy caused by adversarial perturbations on observations, we leverage the Jensen–Shannon (JS) divergence, which can be considered a symmetrized and smoothed Kullback–Leibler (KL) divergence [44], [45]. One of its key characteristics is that the JS divergence bounds the distance between two probability distributions to within 1.0. Thus, the objective function related to the perturbations in the observations, Jo, can be defined as follows:

$$J_o(s,\pi,\Delta_o)=D_{\mathrm{JS}}\big[\pi(a|s)\,\|\,\pi(a|s+\Delta_o)\big]=\frac{1}{2}D_{\mathrm{KL}}\big[\pi(a|s)\,\|\,M\big]+\frac{1}{2}D_{\mathrm{KL}}\big[\pi(a|s+\Delta_o)\,\|\,M\big]$$
$$M=\frac{1}{2}\big[\pi(a|s)+\pi(a|s+\Delta_o)\big]$$

where DJS represents the distance based on the JS divergence, DKL denotes the distance based on the KL divergence, Δo represents the perturbation on observations, M is the mixture of the agent's policy and the perturbed policy, and s and a denote the state and action, respectively.
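As a concrete illustration, the JS divergence between a policy and its perturbed counterpart can be computed as follows; the five-way action distributions are hypothetical, and we use log base 2 so that the 1.0 bound holds:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q) in bits (log base 2)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL, bounded by 1.0 in bits."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]  # mixture M = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical policies over the five discrete driving actions, before and
# after an observation perturbation Delta_o.
pi_clean = [0.70, 0.10, 0.10, 0.05, 0.05]
pi_perturbed = [0.10, 0.70, 0.10, 0.05, 0.05]

assert js(pi_clean, pi_clean) == 0.0            # unperturbed policy: zero divergence
assert 0.0 < js(pi_clean, pi_perturbed) <= 1.0  # always bounded by 1.0
```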

In this study, the adversarial perturbation of the dynamics attempts to minimize the expected return of the agent. We leverage an action–value function Qπ(s) to estimate the expected return of the state–action pair (s, a) when the agent follows the policy π. As the action space of our agent is discrete, the input to the action–value function Qπ(s) does not include the action a. Hence, the objective function related to the perturbations on dynamics, Jd, can be designed as

$$J_d(s,Q^{\pi},\Delta_d)=\Delta_d\,Q^{\pi}(s)$$

where Δd represents the perturbation of dynamics in the form of a probability distribution. Furthermore, the objective function of the adversary, JΔ, can be defined as

$$J_{\Delta}(s,\pi,Q^{\pi},\Delta)=(\alpha-1)J_o(s,\pi,\Delta_o)+\alpha J_d(s,Q^{\pi},\Delta_d)$$

where α ∈ [0, 1] denotes a weight, and Δ = [Δo, Δd] represents the environmental uncertainty.

The optimization problem with regard to the adversary model can be formulated as

$$\Delta^{*}=\underset{\Delta}{\arg\min}\ \mathbb{E}\left[J_{\Delta}(s,\pi,Q^{\pi},\Delta)\right],\quad \text{subject to}\ \|\Delta_o\|\leq\eta_1,\ \|\Delta_d\|\leq\eta_2$$

where Δ* represents the optimal environmental uncertainty, argmin stands for the argument of the minimum, and η1 and η2 denote the bounds of the perturbations on observations and dynamics, respectively. Hence, the adversarial agent aims to maximize Jo and minimize Jd.

To simplify the aforementioned constrained optimization problem, we constrain the magnitude of the perturbations using the hyperbolic tangent and softmax functions. Specifically, the perturbations on observations and dynamics can be represented as Δo = η·tanh[x(s; θ̄)] and Δd = softmax[x(s; θ̄)], respectively, where η represents the scale factor, x denotes the output of the hidden layer of the adversary network, and θ̄ represents the adversary model parameters.
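This bounding scheme can be sketched in a few lines; the scale factor η and the hidden-layer values below are illustrative:

```python
import math

ETA = 0.1  # assumed perturbation scale factor (eta)

def softmax(x):
    e = [math.exp(v - max(x)) for v in x]
    s = sum(e)
    return [v / s for v in e]

def bounded_perturbations(hidden):
    """Map a raw adversary-network output x(s) to bounded perturbations:
    tanh keeps each observation perturbation inside [-eta, eta], while
    softmax turns the dynamics perturbation into a probability distribution."""
    delta_o = [ETA * math.tanh(v) for v in hidden]
    delta_d = softmax(hidden)
    return delta_o, delta_d

# Hypothetical hidden-layer output of the adversary network.
delta_o, delta_d = bounded_perturbations([2.0, -1.0, 0.5, 0.0, -3.0])
```

Because both constraints are built into the output layer, the adversary can be trained with plain unconstrained gradient descent.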

Consequently, to determine the optimal adversarial perturbation, Eq. 5 can be converted into

$$\bar{\theta}^{*}=\underset{\bar{\theta}}{\arg\min}\ \mathbb{E}\left[J_{\Delta}(s,\pi,Q^{\pi};\bar{\theta})\right]$$

where θ̄* represents the parameters of the optimal adversary model. Clearly, the optimal adversarial perturbations on observations and dynamics can be expressed as Δo* = η·tanh[x(s; θ̄*)] and Δd* = softmax[x(s; θ̄*)], respectively.

2.3. RSS-based safety mask

In this section, a safety mask is developed using an interpretable RSS model to guarantee the collision safety of autonomous vehicles.

To consider driving comfort, we leveraged the jerk-bounded RSS model [42] proposed by Intel to design a safety mask. This model describes the following braking processes: a vehicle starts decreasing its acceleration with a maximum jerk jmax until it reaches a minimum deceleration amin,r, and then the vehicle continues to brake with the deceleration amin,r until reaching a full stop. The jerk-bounded RSS model, DminRSS, yields the following expression for the minimum safe distance between front and rear vehicles:

$$D_{\min}^{\mathrm{RSS}}=\left|v_r\bar{T}+\frac{1}{2}a_r\bar{T}^2-\frac{1}{6}j_{\max}\bar{T}^3+\frac{\left(v_r+a_r\bar{T}-\frac{1}{2}j_{\max}\bar{T}^2\right)^2}{2a_{\min,r}}-\frac{v_f^2}{2a_{\max,f}}\right|$$

where ar is the initial acceleration of the rear vehicle; vf and vr denote the initial speeds of the front and rear vehicles, respectively; amax,f denotes the maximum deceleration of the front vehicle; and T̄ represents the time from the beginning of braking until the rear vehicle's deceleration first equals amin,r or its speed decreases to zero.
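The minimum safe distance above can be evaluated numerically; the sketch below treats amin,r and amax,f as positive braking magnitudes (an assumption about sign conventions), and all parameter values are illustrative:

```python
def rss_min_distance(v_r, a_r, v_f, t_bar, j_max, a_min_r, a_max_f):
    """Jerk-bounded RSS minimum safe distance between rear and front vehicles.
    v_r, a_r -- initial speed and acceleration of the rear vehicle;
    v_f      -- initial speed of the front vehicle;
    t_bar    -- time until the rear vehicle's deceleration first equals
                a_min_r or its speed reaches zero;
    j_max    -- maximum jerk magnitude;
    a_min_r, a_max_f -- braking decelerations as positive magnitudes."""
    # Rear-vehicle speed after braking with maximum jerk for t_bar seconds.
    v_brake = v_r + a_r * t_bar - 0.5 * j_max * t_bar ** 2
    d = (v_r * t_bar + 0.5 * a_r * t_bar ** 2 - j_max * t_bar ** 3 / 6.0
         + v_brake ** 2 / (2.0 * a_min_r)
         - v_f ** 2 / (2.0 * a_max_f))
    return abs(d)

# Illustrative values: rear vehicle at 30 m/s, front vehicle at 25 m/s.
d_min = rss_min_distance(v_r=30.0, a_r=0.0, v_f=25.0,
                         t_bar=1.0, j_max=2.0, a_min_r=2.0, a_max_f=4.0)
# About 161.79 m for these illustrative values.
```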

We illustrate the proposed safety mask technique using the cases shown in Fig. 2. As shown in Fig. 2(a), if the distance from the front vehicle in the same lane (denoted Df) is less than or equal to DminRSS, the mask will transform the probability corresponding to the acceleration decision (denoted a4) to zero (i.e., a safe action space comprising a1, a2, a3, and a5 is formed by shielding the risky action a4). Although only the minimum longitudinal safe distance model is provided in Ref. [42], we can still employ this model to evaluate the lane-change risk if we assume that the vehicle can move laterally to the target lane instantaneously. Such an assessment is risky because the distance between the two vehicles may be further shortened during lane changing. Here, we designed a simple minimum lateral safety distance model based on DminRSS as follows:

$$\bar{D}_{\min}^{\mathrm{RSS}}=\xi D_{\min}^{\mathrm{RSS}}$$

where ξ represents a scale coefficient greater than 1.0, and D¯minRSS denotes the minimum lateral safety distance model.

In Fig. 2(b), if the distance from the rear vehicle in the left lane (denoted Drl) is less than or equal to D¯minRSS, the mask transforms the probability corresponding to the left lane-changing decision (denoted a2) to zero.

In Fig. 2(c), when the distance from the rear vehicle in the left lane (denoted Drl), distance from the front vehicle in the right lane (denoted Dfr), and distance from the front vehicle in the same lane (denoted Df) are less than or equal to their corresponding minimum safety distances, the mask transforms the probability corresponding to the left lane-changing (denoted a2), right lane-changing (denoted a1), and accelerating (denoted a4) decisions to zero.

Algorithm 1 provides an overview of the design of our RSS-based safety-mask module, where Dr,Dfl,Dfr, and Drr represent the distances from the rear, front-left, front-right, and rear-right vehicles, respectively; Dmin,fRSS,Dmin,rRSS,D¯min,flRSS,D¯min,rlRSS,D¯min,frRSS, and D¯min,rrRSS denote the minimum safe distances from the front, rear, front-left, rear-left, front-right, and rear-right vehicles, respectively. Moreover, Ms[m] denotes the m-th element in safety mask Ms. The mask element associated with the hazardous action is assigned a negative infinity value.

Algorithm 1. RSS-based safety mask.
Input: State of the autonomous driving agent
Initialize a mask Ms = [0, 0, 0, 0, 0]
if Df ≤ Dmin,fRSS then
Ms[4] = −∞ *Mask accelerating decision-making
end if
if Dr ≤ Dmin,rRSS then
Ms[5] = −∞ *Mask decelerating decision-making
end if
if Dfl ≤ D¯min,flRSS or Drl ≤ D¯min,rlRSS then
Ms[2] = −∞ *Mask left lane-changing decision-making
end if
if Dfr ≤ D¯min,frRSS or Drr ≤ D¯min,rrRSS then
Ms[1] = −∞ *Mask right lane-changing decision-making
end if
Output: Ms
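A minimal sketch of Algorithm 1 and the masked softmax that yields the safe policy πs is given below; the 0-based action indexing, the distance values, and the use of raw logits (rather than probabilities) are our assumptions:

```python
import math

NEG_INF = float("-inf")
# Action indices (the paper's a1..a5, shifted to 0-based -- an assumption):
# 0: right lane change, 1: left lane change, 2: keep, 3: accelerate, 4: decelerate
A_RIGHT, A_LEFT, A_KEEP, A_ACC, A_DEC = range(5)

def rss_safety_mask(dist, d_min):
    """Build the mask of Algorithm 1: 0 for admissible actions, -inf for
    risky ones. `dist` and `d_min` map direction keys ('f', 'r', 'fl',
    'rl', 'fr', 'rr') to measured gaps and minimum safe distances."""
    mask = [0.0] * 5
    if dist["f"] <= d_min["f"]:
        mask[A_ACC] = NEG_INF    # mask accelerating
    if dist["r"] <= d_min["r"]:
        mask[A_DEC] = NEG_INF    # mask decelerating
    if dist["fl"] <= d_min["fl"] or dist["rl"] <= d_min["rl"]:
        mask[A_LEFT] = NEG_INF   # mask left lane change
    if dist["fr"] <= d_min["fr"] or dist["rr"] <= d_min["rr"]:
        mask[A_RIGHT] = NEG_INF  # mask right lane change
    return mask

def safe_policy(logits, mask):
    """pi_s = softmax(logits + mask): the -inf entries get probability zero."""
    z = [l + m for l, m in zip(logits, mask)]
    top = max(z)
    e = [math.exp(v - top) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Hypothetical situation: close front vehicle and close rear-left vehicle.
dist = {"f": 10.0, "r": 50.0, "fl": 50.0, "rl": 5.0, "fr": 50.0, "rr": 50.0}
d_min = {k: 20.0 for k in dist}  # hypothetical uniform minimum safe distances
probs = safe_policy([1.0, 2.0, 0.5, 3.0, 0.0], rss_safety_mask(dist, d_min))
```

Because exp(−∞) evaluates to exactly 0.0, the shielded actions can never be sampled, during either training or testing.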

2.4. ARAC algorithm

2.4.1. Safe robust Markov decision process (MDP)

An MDP provides a mathematical paradigm for RL problems, aiming to find optimal policies [46]. In this section, the existing standard MDP mathematical formalism is extended to explicitly model the behavior of an autonomous driving agent under adversarial perturbation and a safety mask. Here, we introduce a safe robust MDP (SR-MDP) defined as follows:

An SR-MDP can be defined via a seven-tuple [S, A, p, r, Δ, Ms, γ] with state space S, action space A, state transition probability p, reward function r, environmental uncertainty Δ, safety mask Ms, and discount factor γ ∈ (0, 1).

In our study, SR-MDP attempts to solve the following problem:

$$\max_{\pi}\min_{\Delta}\ \mathbb{E}\left\{\sum_{t=0}^{T}\gamma^{t}\Big[r(s_t,a_t)+\beta J_{\Delta}\big(s_t,\pi(s_t),Q^{\pi}(s_t),\Delta\big)\Big]\right\}$$

where T is the last time step, and β>0 is a trade-off coefficient.

We employ a novel policy iteration (PI) algorithm—known as the safe robust PI (SR-PI)—to solve the SR-MDP. The SR-PI method comprises two critical phases: safe–robust policy evaluation and robust policy improvement. Furthermore, both phases were updated iteratively until convergence was achieved.

2.4.2. Safe robust policy evaluation

In the safe robust policy evaluation stage, we aim to estimate the expected return of the policy π under environmental uncertainty Δ. For a fixed policy, the action–value function Qπ(·) can be approximated iteratively by employing the following Bellman backup operator Tπ,Δ:

$$\mathcal{T}^{\pi,\Delta}Q^{\pi}(s_t)=r(s_t,a_t)+\gamma\,\mathbb{E}\left[V^{\pi,\Delta}(s_{t+1})\right]$$

where

$$V^{\pi,\Delta}(s_{t+1})=\pi(s_{t+1})\,Q^{\pi}(s_{t+1})+\beta J_{\Delta}\big(s_{t+1},\pi(s_{t+1}),Q^{\pi}(s_{t+1}),\Delta\big)$$

denotes the value function of the agent based on π under the adversarial perturbations.

Here, we can rewrite Eq. 10 as:

$$\mathcal{T}^{\pi,\Delta}Q^{\pi}(s_t)=r_a(s_t,a_t)+\gamma\,\pi(s_{t+1})\,Q^{\pi}(s_{t+1})$$

where ra(st,at)=r(st,at)+γβJΔ(·) is the augmented reward. Hence, the convergence of our policy evaluation can be guaranteed by drawing upon findings related to policy evaluation convergence in standard RL algorithms.

To enhance the efficiency of model training, we employ two parameterized action–value functions with parameters ϕp, p ∈ {1, 2}. The parameters of the two action–value functions can be optimized by minimizing the following objective function concerning the critic network:

$$J_Q(\phi_p)=\mathbb{E}_{T_s\sim B}\left[\left(y_t^{\Delta}-Q^{\pi}(s_t;\phi_p)\right)^2\right]$$

where Ts represents state transitions sampled from the replay buffer B, ytΔ denotes the target value of the action–value function under the uncertainty at time step t, and JQ is the objective function for optimizing the critic network. The smaller of the two action–value estimates is used to mitigate the overestimation of the value function during the training of the critic network. As a result, ytΔ can be defined as:

$$y_t^{\Delta}=r(s_t,a_t)+\gamma\left[\pi(s_{t+1})\,\hat{Q}_{\min}^{\pi}(s_{t+1};\bar{\phi}_p)+\beta J_{\Delta}\big(s_{t+1},\pi(s_{t+1}),\hat{Q}_{\min}^{\pi}(s_{t+1};\bar{\phi}_p),\Delta\big)\right]$$

where Q̂π(s;ϕ̄p) is the target action–value function with parameters ϕ̄p, and Q̂minπ(s;ϕ̄p) represents the smaller of the two target action–value functions, that is, Q̂minπ(s;ϕ̄p) = min_{p∈{1,2}} Q̂π(s;ϕ̄p).

Here, the gradient of Eq. 13 can be derived as:

$$\nabla_{\phi_p}J_Q(\phi_p)=\nabla_{\phi_p}\mathbb{E}_{T_s\sim B}\left[\left(y_t^{\Delta}-Q^{\pi}(s_t;\phi_p)\right)^2\right]=-2\,\mathbb{E}_{T_s\sim B}\left[\left(y_t^{\Delta}-Q^{\pi}(s_t;\phi_p)\right)\nabla_{\phi_p}Q^{\pi}(s_t;\phi_p)\right]$$
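For a single transition, the target value ytΔ and the critic update can be sketched as follows; γ, β, and all numeric inputs below are assumed, and we take γ to multiply the β-weighted term, consistent with the augmented reward:

```python
GAMMA, BETA = 0.99, 0.1  # assumed discount factor gamma and trade-off weight beta

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def critic_target(r_t, pi_next, q1_next, q2_next, j_delta_next):
    """Target value for one transition: y = r + gamma * [pi^T min(Q1, Q2)
    + beta * J_Delta], using the element-wise minimum of the twin target
    critics to curb value overestimation."""
    q_min = [min(a, b) for a, b in zip(q1_next, q2_next)]
    return r_t + GAMMA * (dot(pi_next, q_min) + BETA * j_delta_next)

def critic_loss_and_grad(y, q_pred, dq_dphi):
    """Squared TD error and its gradient -2 * (y - Q) * dQ/dphi with
    respect to a single scalar critic parameter."""
    td = y - q_pred
    return td ** 2, -2.0 * td * dq_dphi

# Hypothetical next-state policy and twin critic outputs over the 5 actions.
y = critic_target(r_t=1.0,
                  pi_next=[0.2] * 5,
                  q1_next=[1.0, 2.0, 3.0, 4.0, 5.0],
                  q2_next=[1.5, 1.5, 2.5, 4.5, 4.0],
                  j_delta_next=-0.3)
loss, grad = critic_loss_and_grad(y, q_pred=2.0, dq_dphi=1.0)
```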

Furthermore, we can update ϕ¯p via Polyak averaging:

$$\bar{\phi}_p\leftarrow\mu\bar{\phi}_p+(1-\mu)\phi_p$$

where μ ∈ (0, 1) denotes a scale coefficient.
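Polyak averaging amounts to an exponential moving average of the online parameters; a minimal sketch with an assumed μ:

```python
MU = 0.995  # assumed Polyak coefficient mu in (0, 1)

def polyak_update(target_params, params, mu=MU):
    """Slowly track the online critic: phi_bar <- mu * phi_bar + (1 - mu) * phi."""
    return [mu * tp + (1.0 - mu) * p for tp, p in zip(target_params, params)]

target = polyak_update([0.0, 0.0], [1.0, -1.0])
```

Keeping μ close to 1 makes the target network change slowly, which stabilizes the bootstrapped target ytΔ.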

2.4.3. Safe robust policy improvement

In the safe robust policy improvement stage, we attempt to optimize the policy given the action–value function Qπ(·) under the adversarial perturbations. Since the action–value function Qπ(s) estimates the expected return of the state–action pair (s, a) when the agent follows the policy π, the optimization problem in Eq. 9 can be rewritten as:

$$\max_{\pi}\min_{\Delta}\ \mathbb{E}\left[J(\pi,\Delta)\right]$$

where J(·) represents the objective function of the proposed SR-MDP, and J(π,Δ)=π(s)Qπ(s)+βJΔ(s,π,Qπ,Δ).

Consequently, the optimal policy π* and the optimal adversarial perturbation Δ* for the observed states and environmental dynamics can be approximated using the following alternating procedure: first, fix the policy π and solve for the optimal adversarial perturbation Δ* by minimizing J(π, Δ); second, given Δ*, learn the optimal policy π* by maximizing J(π, Δ*). According to Eq. 17, the following relational expressions are derived:

$$\Delta^{*}=\underset{\Delta}{\arg\min}\ \mathbb{E}\left[J(\pi,\Delta)\right]$$
$$\pi^{*}=\underset{\pi}{\arg\max}\ \mathbb{E}\left[J(\pi,\Delta^{*})\right]$$

We observe that Eq. 17 represents a zero-sum game. In addition, theoretical results [47], [48], [49] have been established to guarantee the convergence of solutions for zero-sum games, which also ensures the convergence of our policy improvement.
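The alternating max–min procedure can be illustrated on a toy decoupled saddle objective (purely illustrative; in the actual method the objective is E[J(π, Δ)] over network parameters):

```python
# Toy saddle objective J(p, d) = -(p - 1)^2 + (d - 2)^2 standing in for
# E[J(pi, Delta)]: the policy player ascends in p (maximization), while
# the adversary descends in d (minimization).
LR = 0.1  # assumed step size

p, d = 0.0, 0.0  # initial "policy" and "perturbation" parameters
for _ in range(200):
    d -= LR * 2.0 * (d - 2.0)     # adversary step: minimize J over d
    p += LR * (-2.0) * (p - 1.0)  # policy step: maximize J over p

# The iterates converge to the saddle point (p, d) = (1, 2).
```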

To decrease the learning error of the policy π, we utilize the double Qπ(·) trick in Ref. [50]. Consequently, the policy model parameter θ can be learned by maximizing the following objective function concerning the actor network:

$$J_{\pi}(\theta)=\mathbb{E}_{T_s\sim B}\left[\pi(s_t;\theta)\,Q_{\min}^{\pi}(s_t;\phi_p)+\beta J_{\Delta}\big(s_t,\pi(s_t;\theta),Q_{\min}^{\pi}(s_t;\phi_p),\Delta\big)\right]$$

where Qminπ(s;ϕp) represents the smaller of the two action–value functions, that is, Qminπ(st;ϕp) = min_{p∈{1,2}} Qπ(st;ϕp), and Jπ is the objective function for optimizing the actor network.

We can derive the gradient of Eq. 20 as follows:

$$\begin{aligned}
\nabla_{\theta}J_{\pi}(\theta)&=\nabla_{\theta}\mathbb{E}_{T_s\sim B}\left[\pi(s_t;\theta)\,Q_{\min}^{\pi}(s_t;\phi_p)+\beta J_{\Delta}\big(s_t,\pi(s_t;\theta),Q_{\min}^{\pi}(s_t;\phi_p),\Delta\big)\right]\\
&=\mathbb{E}_{T_s\sim B}\left[\nabla_{\theta}\pi(s_t;\theta)\,Q_{\min}^{\pi}(s_t;\phi_p)+(\alpha-1)\beta\,\nabla_{\theta}J_o\big(s_t,\pi(s_t;\theta)\big)\right]\\
&=\mathbb{E}_{T_s\sim B}\Big[\nabla_{\theta}\pi(s_t;\theta)\,Q_{\min}^{\pi}(s_t;\phi_p)+\frac{1}{2}(\alpha-1)\beta\big(\nabla_{\theta}D_{\mathrm{KL}}\big[\pi(a|s;\theta)\,\|\,M(s;\theta)\big]+\nabla_{\theta}D_{\mathrm{KL}}\big[\pi(a|s+\Delta_o;\theta)\,\|\,M(s;\theta)\big]\big)\Big]
\end{aligned}$$

In addition, according to Eqs. 4, 5, the adversary’s model can be optimized by minimizing the following objective function:

$$\begin{aligned}
J_{\bar{\pi}}(\bar{\theta})&=\mathbb{E}_{T_s\sim B}\left[J_{\Delta}\big(s_t,\pi(s_t;\theta),Q_{\min}^{\pi}(s_t;\phi_p);\bar{\theta}\big)\right]\\
&=\mathbb{E}_{T_s\sim B}\left[(\alpha-1)J_o\big(s_t,\pi(s_t);\bar{\theta}\big)+\alpha J_d\big(s_t,Q_{\min}^{\pi}(s_t;\phi_p);\bar{\theta}\big)\right]
\end{aligned}$$

where θ̄ represents the adversary model parameters, and Jπ̄ is the objective function for optimizing the adversary network.

Here, the gradient of Eq. 22 can be derived as:

$$\begin{aligned}
\nabla_{\bar{\theta}}J_{\bar{\pi}}(\bar{\theta})&=\nabla_{\bar{\theta}}\mathbb{E}_{T_s\sim B}\left[J_{\Delta}\big(s_t,\pi(s_t;\theta),Q_{\min}^{\pi}(s_t;\phi_p);\bar{\theta}\big)\right]\\
&=\nabla_{\bar{\theta}}\mathbb{E}_{T_s\sim B}\left[(\alpha-1)J_o\big(s_t,\pi(s_t);\bar{\theta}\big)+\alpha J_d\big(s_t,Q_{\min}^{\pi}(s_t;\phi_p);\bar{\theta}\big)\right]\\
&=\nabla_{\bar{\theta}}\mathbb{E}_{T_s\sim B}\left[\frac{1}{2}(\alpha-1)\big(D_{\mathrm{KL}}\big[\pi(a|s)\,\|\,M(s;\bar{\theta})\big]+D_{\mathrm{KL}}\big[\pi(a|s+\Delta_o(s;\bar{\theta}))\,\|\,M(s;\bar{\theta})\big]\big)+\alpha\,\Delta_d(s;\bar{\theta})\,Q^{\pi}(s)\right]\\
&=\mathbb{E}_{T_s\sim B}\left[\frac{1}{2}(\alpha-1)\big(\nabla_{\bar{\theta}}D_{\mathrm{KL}}\big[\pi(a|s)\,\|\,M(s;\bar{\theta})\big]+\nabla_{\bar{\theta}}D_{\mathrm{KL}}\big[\pi(a|s+\Delta_o(s;\bar{\theta}))\,\|\,M(s;\bar{\theta})\big]\big)+\alpha\,\nabla_{\bar{\theta}}\Delta_d(s;\bar{\theta})\,Q^{\pi}(s)\right]
\end{aligned}$$

3. Technical implementation

3.1. Algorithm

Here, we provide a detailed introduction to the implementation specifications of the proposed technique. Algorithm 2 outlines the RRL-SG approach for trustworthy autonomous driving decision-making. The initial model parameters for the actor, adversary, and critic were set using a random distribution. In terms of interaction with the environment, our agent interacts with the environment based on actions sampled from the safety policy πs. In terms of policy learning, an agent policy can be optimized by combining Eqs. 13, 16, 20, and 22. dt represents a completion signal, indicating that the ego vehicle has encountered a collision at time step t. The details of the neural networks and hyperparameters are provided in Table S1 in Appendix A.

Algorithm 2. Robust RL with safety guarantees.
Initialize actor model parameters θ, adversary model parameters θ¯, critic model parameters φ1 and φ2, target action–value function parameters φ¯1 ← φ1 and φ¯2 ← φ2, and an empty replay buffer B
for episode step e = 1, 2,…, E do
Reset state s0
for time step in the environment t = 1, 2,…, T do
Determine a safe policy πs(st; θ) via Algorithm 1:
πs(st; θ) = softmax(π(st; θ) + Ms)
Select an action via the safe policy πs(st; θ):
at ∼ πs(st; θ)
Execute at in the environment and receive a transition:
st+1, rt, dt ∼ p(st+1|st, at)
Store the transition in the replay buffer B:
BB ∪ {(st, at, rt, st+1, dt)}
end for
for gradient step g = 1, 2,…, G do
Sample a batch of transitions from the replay buffer B
Update the actor model parameters via Eq. 21 (λ denotes the learning rate):
θ ← θ + λ∇θJπ(θ)
Update the critic model parameters via Eq. 15:
φ1 ← φ1 − λ∇φ1JQ(φ1), φ2 ← φ2 − λ∇φ2JQ(φ2)
Update the target action–value function parameters via Eq. 16:
φ¯1 ← μφ¯1 + (1 − μ)φ1, φ¯2 ← μφ¯2 + (1 − μ)φ2
if g mod δ = 0 then
Update the adversary model parameters via Eq. 23:
θ¯ ← θ¯ − λ∇θ¯Jπ¯(θ¯)
end if
end for
end for

3.2. State space and action space

Designing the state, action, and reward functions of the autonomous driving agent was essential to implement the proposed scheme. In this study, we consider the relevant states of the six nearest social vehicles in the ego vehicle lane and adjacent lanes as observations for the autonomous driving agent (i.e., the ego vehicle). The state space of the autonomous driving agent has 15 dimensions, including the relative distance and velocity of the surrounding social vehicles and the velocity, acceleration, and lane index of the ego vehicle. The lane index is the index of the lane where the ego vehicle is located.

The action space of our autonomous driving agent is discrete and contains five decision-making behaviors: changing lanes to the right, changing lanes to the left, maintaining the current state, accelerating, and decelerating. According to the research results in Ref. [51], typically, the acceleration of the vehicle operated by a normal driver does not exceed 1.47 m∙s−2, and the deceleration is not less than −2 m∙s−2. Consequently, when our autonomous driving agent executes acceleration decision-making, the ego vehicle will accelerate at a fixed acceleration of 1.47 m∙s−2. Moreover, if the agent performs the decelerating decision-making, the ego vehicle decelerates at a fixed deceleration of −2.00 m∙s−2.
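A minimal sketch of how the five discrete decisions could map to low-level commands, using the fixed accelerations above; the action names, the lane-index convention, and the one-second step are our assumptions:

```python
# Hypothetical mapping from the five discrete decisions to longitudinal and
# lateral commands, using the fixed accelerations cited above from Ref. [51].
ACC = 1.47   # m/s^2, fixed acceleration for the accelerating decision
DEC = -2.00  # m/s^2, fixed deceleration for the decelerating decision

def apply_decision(action, speed, lane, dt=1.0):
    """Return (speed, lane) after one decision step of dt seconds.
    The 'left = lane + 1' convention is an assumption."""
    if action == "accelerate":
        return speed + ACC * dt, lane
    if action == "decelerate":
        return max(0.0, speed + DEC * dt), lane  # never drive backward
    if action == "left":
        return speed, lane + 1
    if action == "right":
        return speed, lane - 1
    return speed, lane  # maintain the current state

speed, lane = apply_decision("accelerate", speed=20.0, lane=1)
```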

3.3. Reward function

The reward function plays a pivotal role in the performance of RL agents. Our reward function was designed by considering factors related to travel efficiency, driving safety, and passenger comfort. Specifically, we encouraged the autonomous driving agent to operate at high speeds. In addition, we penalized the agent if its driving policy caused a collision. The agent is also penalized if it performs high-speed lane-change maneuvers. Eq. 24 is the designed reward function r(·), where e denotes the base of the natural logarithm and v0 is the ego vehicle speed. Moreover, A = {vehicle changes lane}, B = {v0 > 30}, and C = {collision} are the event sets. Here, a collision refers to a collision between the ego vehicle and the surrounding social vehicles.

$$r(\cdot)=\begin{cases}
e^{v_0/35-1}-v_0/350, & \text{if } \mathbb{1}_{A\cap B\cap\neg C}=1\\
e^{v_0/35-1}-0.5-v_0/100, & \text{if } \mathbb{1}_{\neg(A\cap B)\cap C}=1\\
e^{v_0/35-1}-v_0/350-0.5-v_0/100, & \text{if } \mathbb{1}_{A\cap B\cap C}=1\\
e^{v_0/35-1}, & \text{otherwise}
\end{cases}$$
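As we read the piecewise definition, the reward combines a speed-tracking base term with a penalty for lane changes above 30 m∙s−1 (events A and B together) and a collision penalty (event C); a hedged sketch:

```python
import math

def reward(v0, lane_change, collision):
    """Sketch of the piecewise reward: a speed-tracking base term, a
    high-speed lane-change penalty, and a collision penalty. v0 is the
    ego vehicle speed in m/s."""
    r = math.exp(v0 / 35.0 - 1.0)  # encourage driving near the 35 m/s limit
    if lane_change and v0 > 30.0:   # A: lane change, B: v0 > 30
        r -= v0 / 350.0
    if collision:                   # C: collision with a social vehicle
        r -= 0.5 + v0 / 100.0
    return r

r_safe = reward(v0=28.0, lane_change=False, collision=False)
r_crash = reward(v0=28.0, lane_change=False, collision=True)
```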

4. Simulations and experiments

4.1. Baseline

We set up comparisons with state-of-the-art RL agents in both simulations and experiments to benchmark the RRL-SG approach for trustworthy autonomous driving decision-making.

As the dueling DDQN (D3QN) is a state-of-the-art Q-learning algorithm [52], [53], D3QN was adopted as one of the baselines in this study. Moreover, we leverage the proximal policy optimization (PPO) [54], soft actor-critic (SAC) [50], [55], and observation adversarial RL (OARL) [56] algorithms as competitive baselines, representing state-of-the-art on-policy, off-policy, and robust RL technologies, respectively.

4.2. Metric

We employed the expected return to assess the comprehensive performance of the autonomous driving agents. The average running speed and number of collisions were utilized to evaluate the travel efficiency and traffic safety of autonomous vehicles. In addition, Eq. 1 is used to measure policy robustness against adversarial perturbations: the smaller the policy change under adversarial attack, the stronger the policy's robustness.

In the on-ramp merging scenario, in addition to the above metrics, we assessed the vehicle performance using the merging success rate. In this study, a successful on-ramp merging was defined as a vehicle entering the main lane completely from the ramp without experiencing any collisions within a test episode.

4.3. Simulations with SUMO

To assess the performance of the proposed decision-making technique for autonomous vehicles, we implemented model training and testing using the SUMO simulator. We leveraged SUMO to create stochastic dynamic traffic flows with different densities in highway and on-ramp merging scenarios. In addition, we trained five runs of each approach with different random seeds for 400 episodes in a highway scenario with a normal-density traffic flow (P = 0.12), where P denotes the probability of a new vehicle entering the traffic flow each second. The maximum time step for each episode is 200 s. The maximum traffic speed for all lanes was set to 35.0 m∙s−1.

Unlike the highway scenario, the on-ramp merging scenario was utilized only for model testing.

4.3.1. Highway scenario

Fig. 3 illustrates our evaluation scheme for the highway scenario. The ego vehicle is the golden RL-driven autonomous vehicle. P is set as 0.06, 0.12, and 0.24 to produce traffic flows with low, normal, and high densities, respectively. Autonomous driving agents were trained only in traffic flows with normal density. In the model-testing phase, traffic flows with low, normal, and high densities were leveraged for assessment. Each trained agent (including different random seeds) was evaluated for over 100 episodes. Each evaluation calculated the average metrics over ten episodes in testing. As we employ stochastic dynamic traffic flows, the environmental dynamics are continuously changing. To further verify policy robustness, each autonomous driving agent was attacked by optimal adversarial observational perturbations from the trained adversary during model testing. In other words, unlike the model training phase, in the test case with adversarial attacks, the autonomous driving agent receives states perturbed by the adversary model.

Fig. 4 shows the learning curves of the proposed RRL-SG method and the baselines for normal-density stochastic dynamic traffic flows. Overall, the results indicate that the proposed scheme outperforms the baselines in terms of return and safety. Clearly, our autonomous driving agent drastically reduces the number of collisions and enhances the learning efficiency during model training compared with the baselines because the proposed RSS-based safety mask forms a safe action subspace by shielding risky actions. Thus, sampling actions from the safe action subspace ensures decision safety and avoids redundant exploration.

During model testing, the final policy models based on five random seeds were evaluated for each method. We report the average metrics of the model evaluation results in Table 1. Bold numbers indicate the best values for each metric. In general, the results indicate that the RRL-SG agent surpasses the baselines by a large margin for all tasks in terms of robustness and safety. In contrast to the baselines, the JS divergence is approximately zero for changes in the RRL-SG policy attacked by the adversary model in the three stochastic dynamic traffic flows with different densities, implying that the RRL-SG policies were hardly affected by adversarial attacks. Moreover, unlike the D3QN, PPO, SAC, and OARL autonomous driving agents, the RRL-SG agent did not cause collisions in any of the test cases.

More specifically, in low-density traffic flows with and without adversarial attacks, the RRL-SG autonomous driving agent performs comparably to the OARL agent and outperforms the D3QN, PPO, and SAC agents by a large margin in terms of return. In normal-density traffic flows without adversarial attacks, the RRL-SG agent improves the return over the D3QN, PPO, SAC, and OARL agents by approximately 22.31%, 7.22%, 10.34%, and 1.97%, respectively. In high-density traffic flows without adversarial attacks, the corresponding improvements are approximately 78.63%, 47.41%, 25.45%, and 13.84%; with adversarial attacks, they reach approximately 7669.57%, 2666.25%, 511.57%, and 8.99%, respectively.

Fig. 5 illustrates the performance of the D3QN, PPO, SAC, OARL, and RRL-SG autonomous driving agents in stochastic dynamic traffic flows with different densities and attack situations. As shown in Fig. 5, adversarial attacks based on the trained adversary models distinctly impact the comprehensive performance, travel efficiency, and safety of the autonomous vehicles driven by the baseline agents. For instance, in normal-density traffic flows, compared with the case without adversarial attacks, the numbers of collisions of the attacked D3QN, PPO, SAC, and OARL agents increased by approximately 358.82%, 583.33%, 1378.57%, and 5.71%, respectively. In contrast, the proposed RRL-SG autonomous driving agent performed consistently across all test cases, with zero collisions recorded.

Here, we empirically assessed policy robustness against perturbations in environmental dynamics by calculating the mean square deviation of returns for each method across all testing scenarios, including the various traffic densities and attack conditions. According to Table 1, the mean square deviations of the returns for the D3QN, PPO, SAC, OARL, and RRL-SG agents across all testing cases are 31.23, 16.11, 39.60, 12.73, and 7.50, respectively, indicating that the RRL-SG agent was the least affected by environmental changes. In other words, the RRL-SG policy is robust, safe, and stable, which highlights the primary contribution of this study toward trustworthy decision-making for autonomous vehicles.
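The dispersion measure used here is straightforward to compute: the mean of the squared deviations of the per-scenario returns from their mean. The example return values below are hypothetical, not those in Table 1.

```python
import numpy as np

def mean_square_deviation(returns):
    """Mean squared deviation of per-scenario returns from their mean.

    A lower value means the policy's return varies less across traffic
    densities and attack conditions, i.e., it is more robust to changes
    in environmental dynamics.
    """
    r = np.asarray(returns, float)
    return float(np.mean((r - r.mean()) ** 2))

# Hypothetical per-scenario returns for a stable and a volatile agent.
stable   = [30.0, 29.0, 31.0, 30.0]
volatile = [40.0, 10.0, 35.0, 15.0]
```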

4.3.2. On-ramp merging scenario

To further evaluate the trustworthiness of the autonomous driving agents' decisions, on-ramp merging was added as an additional testing scenario. Inappropriate merging behaviors can lead to congestion, collisions, and increased travel time.

The proposed evaluation scheme—based on an on-ramp merging scenario—is illustrated in Fig. 6(a). We directly deployed the model trained in the highway scenario to the on-ramp merging scenario for testing purposes. All autonomous driving agents were assessed in stochastic dynamic traffic flows with high density (i.e., P=0.24) under different attack situations across a total of 100 episodes. Similar to the highway scenario, each model evaluation computed the average metrics over ten testing episodes with a maximum of 200 time steps in each episode.

As seen in Fig. 6(b) and (c), the RRL-SG autonomous driving agent outperforms the baselines by a significant margin, with or without adversarial attacks, in terms of both travel efficiency and merging success rate.

Table 2 presents the average metrics for the results of the model evaluation in the on-ramp merging scenario. Bold numbers represent the best in each column. For instance, without adversarial attacks, compared to the D3QN, PPO, SAC, and OARL agents, the RRL-SG agent gains approximately 16.97%, 21.91%, 29.63%, and 21.18% improvements with respect to return, respectively. Without adversarial attacks, compared with D3QN, PPO, SAC, and OARL, the speed of the RRL-SG agent increased by approximately 31.84%, 42.60%, 62.62%, and 40.69%, respectively. As shown in Table 2, the robustness of the RRL-SG policy was significantly better than that of the baseline policies.

As shown in Fig. 6(c) and Table 2, our RRL-SG agent can complete the on-ramp merging task with a probability of 100.00%, regardless of the presence or absence of adversarial attacks. In other words, adversarial attacks on observations have almost no impact on the RRL-SG policy. In addition, with adversarial attacks, compared to the D3QN, PPO, SAC, and OARL agents, the RRL-SG agent gains approximately 13.00%, 9.00%, 6.00%, and 1.00% improvements in the merging success rate, respectively.

The environmental dynamics of the on-ramp merging scenario differ notably from those of the highway scenario, and because we utilize stochastic dynamic traffic flows, the environmental dynamics are subject to continuous change. Here, we again empirically evaluate policy robustness against perturbations in environmental dynamics using the mean square deviation of returns for each agent under the different attack conditions. According to Table 2, the mean square deviations of the returns for the D3QN, PPO, SAC, OARL, and RRL-SG agents under the different attack conditions were 19.81, 7.15, 20.51, 6.59, and 4.04, respectively, implying that the RRL-SG agent was the least susceptible to environmental changes among all the agents. These results highlight our pivotal contribution toward trustworthy decision-making for autonomous vehicles.

4.4. Experiments with a real autonomous vehicle

We conducted physical platform experiments using a real low-speed autonomous vehicle, Hunter (AgileX Robotics, China), to further verify the trustworthiness of the proposed approach. As shown in Fig. 7(a), Hunter is equipped with a 16-channel light detection and ranging (LiDAR) sensor, two stereo cameras, eight ultrasonic sensors, and a Jetson Xavier NX 16 GB edge computing system (NVIDIA, USA). Hence, the RL policy model can generate decision commands in real time based on the states perceived by the onboard sensors, with all computations performed on the NVIDIA Jetson platform. All models trained in the SUMO simulator were deployed directly on Hunter and tested in a laboratory environment with a free space measuring 8 m × 8 m. Only the trained policy models were tested; they were not trained further (i.e., the model parameters were fixed). The policy model required approximately 0.002 s to perform a single inference. Because Hunter's sampling frequency is 30 Hz and the evaluated policy model executes a decision as soon as it receives a set of sampled states, Hunter's decision-making frequency is also 30 Hz.
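The deployment loop just described can be sketched as a fixed-rate decision loop in which one inference is issued per sampled state. The `policy`, `get_state`, and `send_command` callables are placeholders for the deployed model and the vehicle I/O, and the rate and duration are illustrative rather than taken from Hunter's software stack.

```python
import time

def decision_loop(policy, get_state, send_command, hz=30, duration_s=1.0):
    """Run a fixed-rate decision loop: one policy inference per sampled state."""
    period = 1.0 / hz                      # ~33 ms budget per decision at 30 Hz
    steps = 0
    t_end = time.monotonic() + duration_s
    while time.monotonic() < t_end:
        t0 = time.monotonic()
        send_command(policy(get_state()))  # a ~2 ms inference fits well inside the budget
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
        steps += 1
    return steps
```

Since the inference time (about 2 ms) is far below the 33 ms sampling period, the decision rate is bound by the sensor sampling frequency, matching the 30 Hz figure above.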

Fig. 7(b) and (c) illustrate the experimental schemes. Similar to model testing in the simulator, we instantiated the five final policy models trained by each algorithm with five different random seeds and evaluated each model both with and without adversarial attacks. In the experimental case shown in Fig. 7(b), Hunter's perception information consists only of the original environmental observations, without any adversarial observational perturbations. In contrast, as shown in Fig. 7(c), Hunter senses driving environment information containing both the original environmental observations and the adversarial perturbations generated by the trained adversary models. Additionally, the environmental dynamics change significantly from the simulation environment to the real-world physical platform.

The experimental space was free of static or dynamic obstacles, implying that Hunter should be able to maintain a straight run without attacks from an adversary model. We assessed each policy model during the period (150 time steps) in which Hunter drove from one side to the other. In the test case with adversarial attacks, the attacks started at the 75th time step. Hunter can execute five decision-making behaviors: turning right, turning left, maintaining the current state, accelerating, and decelerating.

Fig. 8 shows the global motion trajectories of the autonomous vehicles driven by the different agents under the different attack situations; all policy models enable Hunter to run straight when no adversarial attacks occur. However, in the test case with adversarial attacks, the performance of the baseline models was affected to varying degrees. Specifically, all of the D3QN agents, four-fifths of the PPO agents, all of the SAC agents, and one-fifth of the OARL agents made turning decisions under adversarial attacks. In contrast, the five proposed RRL-SG policy models performed consistently in all cases; for example, the RRL-SG-driven Hunter maintains a straight run even when attacked by the adversary model. For more visual results, please refer to Video S1 in Appendix A.

To illustrate the impact of adversarial attacks on the policy model, Fig. 9 shows the probability distributions of actions under the D3QN and RRL-SG policies before and after encountering adversarial perturbations. We leverage the softmax function to convert the output of the D3QN policy model, which consists of a Q value for each action, into a probability distribution over the actions. In contrast to the D3QN policy, the action distribution of the RRL-SG policy shows hardly any change under attack. Specifically, in the absence of adversarial attacks, the probabilities of the five decision actions under the D3QN policy were approximately 12.77%, 19.55%, 20.61%, 34.45%, and 12.63%, respectively. Under adversarial attacks, they became approximately 38.14%, 15.30%, 17.95%, 15.40%, and 13.21%, respectively, which explains why adversarial attacks can cause the D3QN-driven Hunter, while running in a straight line, to suddenly make a turn. In addition, without adversarial perturbations, the probabilities of the five decision actions under the RRL-SG policy were approximately 3.97 × 10⁻¹³%, 1.74 × 10⁻¹³%, 2.48 × 10⁻¹²%, 100.00%, and 5.57 × 10⁻¹³%, respectively. With adversarial perturbations, they were approximately 6.83 × 10⁻⁸%, 3.47 × 10⁻⁸%, 2.09 × 10⁻⁷%, 100.00%, and 6.45 × 10⁻⁸%, respectively. Therefore, Hunter, under the RRL-SG policy, maintains its running status without being affected by adversarial perturbations.
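The Q-value-to-probability conversion used above can be written as a numerically stable softmax. The Q-values and the implicit temperature of 1 below are assumptions for illustration; the paper does not report either.

```python
import numpy as np

def action_probs(q_values, temperature=1.0):
    """Convert a Q-value vector into a probability distribution over actions.

    Subtracting the max before exponentiating avoids overflow without
    changing the result; `temperature` is a hypothetical knob, with
    lower values sharpening the distribution.
    """
    z = np.asarray(q_values, float) / temperature
    z -= z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

q = [1.0, 2.0, 3.0, 0.5, 1.5]   # hypothetical Q-values for the five actions
p = action_probs(q)
```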

5. Conclusions

In this study, we introduce the RRL-SG technique, which empowers autonomous vehicles to make trustworthy decisions. The proposed paradigm attempts to ensure trustworthiness in terms of policy robustness and collision safety. Specifically, the adversary model is trained online to simulate the worst-case uncertainty by generating optimal adversarial perturbations on observed states and environmental dynamics. Meanwhile, the adversarial robust actor-critic (ARAC) algorithm is developed to facilitate the agent in learning robust policies against the multiple uncertainties introduced by the adversary. In addition, we devise a safety mask to ensure the collision safety of the autonomous driving agent during both the training and testing processes using the interpretable Responsibility-Sensitive Safety (RSS) knowledge model.

The evaluation results of the simulations with stochastic dynamic traffic flow and the experiment with a real autonomous vehicle indicate that the proposed RRL-SG scheme enables the autonomous driving agent to learn trustworthy policies against adversarial environmental uncertainties. In addition, compared with the four baselines, the RRL-SG driving policies ensure superior robustness and safety. Notably, our autonomous agent consistently delivers a more stable performance than the baselines in both simulations and experiments.

Although we have demonstrated the potential of the proposed approach, one limitation remains: while the RRL-SG solution leverages a worst-case setting and an interpretable knowledge model, providing theoretical guarantees for the robustness and safety of autonomous driving models remains a critical subject for future research. Consequently, we will investigate certifiable and interpretable decision-making techniques to further enhance the trustworthiness of autonomous driving systems.

Acknowledgment

This work was supported in part by the Start-Up Grant-Nanyang Assistant Professorship Grant of Nanyang Technological University; the Agency for Science, Technology and Research (A*STAR) Advanced Manufacturing and Engineering (AME) Young Individual Research Grant (A2084c0156); the MTC Individual Research Grant (M22K2c0079); the ANR-NRF Joint Grant (NRF2021-NRF-ANR003 HM Science); and the Ministry of Education (MOE) Tier 2 Grant (MOE-T2EP50222-0002).

Compliance with ethics guidelines

Xiangkun He, Wenhui Huang, and Chen Lv declare that they have no conflict of interest or financial conflicts to disclose.

