Who Is the Strongest Enemy? Towards Optimal and Efficient Evasion Attacks in Deep RL
Yanchao Sun, Ruijie Zheng, Yongyuan Liang, Furong Huang
University of Maryland, College Park · Sun Yat-sen University
[Paper][Code]
Episode rewards of well-trained agents: clean, under the strongest previous attack, and under our PA-AD attack.

Environment (agent)        Clean reward   Strongest previous attack   Our PA-AD attack
Hopper (PPO)               3104           636                         171
Walker (PPO)               4630           1048                        813
Pong (DQN, right paddle)   21             -14                         -21
Boxing (DQN, left boxer)   94             69                          8
Evaluating the worst-case performance of a reinforcement learning (RL) agent under the strongest/optimal adversarial perturbations on state observations (within
some constraints) is crucial for understanding the robustness of RL agents. However, finding the optimal adversary is challenging, in terms of both whether we
can find the optimal attack and how efficiently we can find it. Existing works on
adversarial RL either use heuristics-based methods that may not find the strongest
adversary, or directly train an RL-based adversary by treating the agent as a part of
the environment, which can find the optimal adversary but may become intractable
in a large state space. This paper introduces a novel attacking method to find the
optimal attacks through collaboration between a designed function named “actor”
and an RL-based learner named “director”. The actor crafts state perturbations for
a given policy perturbation direction, and the director learns to propose the best
policy perturbation directions. Our proposed algorithm, PA-AD, is theoretically
optimal and significantly more efficient than prior RL-based works in environments with large state spaces. Empirical results show that our proposed PA-AD
universally outperforms state-of-the-art attacking methods in various Atari and
MuJoCo environments. By applying PA-AD to adversarial training, we achieve
state-of-the-art empirical robustness in multiple tasks under strong adversaries.
Prior work on evasion attacks in RL generally falls into two categories: Heuristic Attack and RL-based Attack.
Heuristic Attack: at every step, set an attacking objective by heuristics. For example, use the fast gradient sign method (FGSM) to find a perturbation that makes the agent select the action with the minimum Q value at the current state (MaxWorst); or use FGSM to keep the agent from choosing its best action (MinBest); or find a perturbation that makes the perturbed action distribution have the largest KL divergence from the original distribution (MaxDiff). Heuristic attacks are usually fast to compute and work decently well in practice, but they offer no optimality guarantee. The MinBest variant is sketched below.
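For concreteness, here is a minimal sketch of the MinBest-style heuristic attack. It assumes a differentiable victim Q-network `q_net` and an l-infinity perturbation budget `epsilon`; the function name and interface are our own illustration, not code from the paper.

```python
import torch
import torch.nn.functional as F

def minbest_fgsm(q_net, state, epsilon):
    """One FGSM step that discourages the victim from picking its clean best action.

    Assumes `state` is a float tensor of shape (1, obs_dim) and `q_net(state)`
    returns Q-values of shape (1, num_actions). Hypothetical interface.
    """
    state = state.clone().detach().requires_grad_(True)
    q_values = q_net(state)
    best_action = q_values.argmax(dim=-1)        # action the clean policy would take
    # Treat Q-values as logits; raising this loss lowers the (softmax)
    # probability of the originally best action.
    loss = F.cross_entropy(q_values, best_action)
    loss.backward()
    # FGSM ascends the loss within the l-infinity ball of radius epsilon.
    adv_state = state + epsilon * state.grad.sign()
    return adv_state.detach()
```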
RL-based Attack: Zhang et al. establish that learning the optimal attacker in RL corresponds to learning an optimal policy in an adversary-induced MDP. Based on this insight, they use an end-to-end RL formulation to learn the optimal state-adversary policy. However, because both the observation space and the action space of the attacker's policy have the same size as the state space of the original environment, this formulation becomes intractable when the original environment has a large state space, such as the image observations in Atari games. The sketch below makes the end-to-end formulation concrete.
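The following is a rough sketch of that end-to-end formulation, written as a hypothetical Gymnasium wrapper (class name, interface, and reward sign convention are our own illustration, not Zhang et al.'s code). It makes explicit that the adversary's action is a full-size state perturbation, which is why the approach scales poorly to image observations.

```python
import numpy as np
import gymnasium as gym

class AdversaryEnv(gym.Env):
    """Adversary-induced MDP around a fixed victim policy (hypothetical wrapper)."""

    def __init__(self, env, victim_policy, epsilon):
        self.env, self.victim, self.eps = env, victim_policy, epsilon
        self.observation_space = env.observation_space
        # The adversary's action is a full-size perturbation of the state,
        # so its action space has the dimensionality of the original state space.
        self.action_space = gym.spaces.Box(-epsilon, epsilon,
                                           shape=env.observation_space.shape,
                                           dtype=np.float32)
        self._state = None

    def reset(self, **kwargs):
        self._state, info = self.env.reset(**kwargs)
        return self._state, info

    def step(self, perturbation):
        # The victim acts on the perturbed observation; the adversary is
        # rewarded for making the victim perform poorly.
        perturbed = np.clip(self._state + perturbation,
                            self.env.observation_space.low,
                            self.env.observation_space.high)
        victim_action = self.victim(perturbed)
        self._state, reward, terminated, truncated, info = self.env.step(victim_action)
        return self._state, -reward, terminated, truncated, info
```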
Our Approach (PA-AD):
Instead of learning an end-to-end adversary policy that maps observations directly to observation perturbations, we decouple the attack into two simpler components: policy perturbation and state perturbation, handled by a "director" and an "actor" respectively. The director learns the optimal policy perturbation direction with RL methods, while the actor crafts adversarial states at every step so that the victim policy is pushed towards the given direction. In this way, we obtain an attack algorithm that is both optimal and efficient, and is especially powerful in environments with large observation spaces. A simplified sketch of the director-actor interaction is given below.
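Below is a simplified sketch of the director-actor interaction under our own assumptions: a PGD-style actor, a victim exposing hypothetical `logits`/`act` methods, and a `director` network that outputs a vector over victim actions. The real PA-AD actor solves a somewhat different optimization, so treat this as illustrative only.

```python
import torch
import torch.nn.functional as F

def actor_craft(victim_logits_fn, state, direction, epsilon, steps=10):
    """PGD-style actor: bounded state perturbation that pushes the victim's
    action distribution towards the director's target direction."""
    target = F.softmax(direction, dim=-1).detach()   # target action distribution
    delta = torch.zeros_like(state, requires_grad=True)
    alpha = 2.0 * epsilon / steps                    # PGD step size
    for _ in range(steps):
        logits = victim_logits_fn(state + delta)
        # Cross-entropy between the director's target and the victim's
        # perturbed policy; the actor minimizes it.
        loss = -(target * F.log_softmax(logits, dim=-1)).sum()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()       # descend the loss
            delta.clamp_(-epsilon, epsilon)          # stay in the l-inf ball
        delta.grad.zero_()
    return (state + delta).detach()

def pa_ad_attack_step(director, victim, state, epsilon):
    """One attack step: the director proposes a policy-perturbation direction,
    the actor realizes it as a state perturbation, and the victim acts on it."""
    direction = director(state)                      # vector in R^{num_actions}
    adv_state = actor_craft(victim.logits, state, direction, epsilon)
    return victim.act(adv_state)
```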
Empirically, we verify that our PA-AD algorithm almost always outperforms all existing attacking methods by a large margin on various OpenAI Gym environments, including Atari and MuJoCo tasks.
Here is a comparison of PA-AD against existing attack methods on seven Atari games and four MuJoCo tasks.
We report the average episode reward plus/minus standard deviation of vanilla DQN, A2C, and PPO agents under different evasion attacks.
In addition to the epsilon values used in the table above, we also compare PA-AD against other attackers under different choices of epsilon on three Atari games: Boxing, Pong, and RoadRunner.
In addition, when combined with existing robust training methods, our PA-AD algorithm significantly improves the robustness of the trained RL policies; a high-level sketch of such a training loop is given below.
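As a rough illustration of how PA-AD plugs into robust training, here is a high-level alternating loop. The trainer methods `train_against` and `train_under_attack` are hypothetical placeholders for standard RL updates, and this schedule is not the paper's exact procedure.

```python
def robust_training_with_pa_ad(victim, director, env, epsilon,
                               iterations=100, adv_steps=10_000, victim_steps=10_000):
    """Alternate between strengthening the adversary and hardening the victim.

    `victim` and `director` are assumed to be RL agents exposing the
    hypothetical training helpers used below; this is an outline, not
    the released code.
    """
    for _ in range(iterations):
        # 1) Freeze the victim; train the PA-AD director in the
        #    adversary-induced MDP, where its reward is the negative
        #    of the victim's reward.
        director.train_against(victim, env, epsilon, steps=adv_steps)
        # 2) Freeze the director; train the victim on observations
        #    perturbed by the current PA-AD adversary.
        victim.train_under_attack(director, env, epsilon, steps=victim_steps)
    return victim
```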