Introduction

Authors: Shariq Iqbal and Fei Sha, University of Southern California.

  • Decentralized policies and centralized critics that share an attention mechanism to select the relevant information for each agent.
  • Based on MADDPG, but reduces communication and filters out irrelevant information.
  • Tested on multi-agent particle environments.
  • paper: Actor-Attention-Critic for Multi-Agent Reinforcement Learning (MAAC)

Approach

Attention

  • An MLP encodes each agent's observation and action $(o_i, a_i)$ into an embedding $e_i$.
  • $v_j = h(V g_j(o_j, a_j))$ is agent $j$'s value embedding, where $h$ is an activation function and $V$ is a matrix shared across agents.
  • $\alpha_j \propto \exp(e_j^\top W_k^\top W_q e_i)$ is the attention weight ($W_q$ transforms an embedding into a query, $W_k$ into a key).
  • $x_i = \sum_{j \neq i} \alpha_j v_j$ is the attended context for agent $i$, which feeds into its critic; see the sketch below.
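A minimal single-head PyTorch sketch of this attention step; the module and parameter names (`CriticAttention`, `embed_dim`, `attend_dim`) are my own, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CriticAttention(nn.Module):
    """One attention head over the other agents' embeddings (single-head sketch)."""

    def __init__(self, embed_dim: int, attend_dim: int):
        super().__init__()
        self.W_q = nn.Linear(embed_dim, attend_dim, bias=False)  # query transform W_q
        self.W_k = nn.Linear(embed_dim, attend_dim, bias=False)  # key transform W_k
        self.V = nn.Linear(embed_dim, attend_dim, bias=False)    # shared value matrix V

    def forward(self, e):
        # e: (batch, n_agents, embed_dim); row i holds e_i = g_i(o_i, a_i)
        q, k = self.W_q(e), self.W_k(e)
        v = F.leaky_relu(self.V(e))                      # v_j = h(V g_j(o_j, a_j))
        logits = torch.bmm(q, k.transpose(1, 2))         # entry [i, j] = e_j^T W_k^T W_q e_i
        logits = logits / k.size(-1) ** 0.5              # scaled dot product
        eye = torch.eye(e.size(1), dtype=torch.bool, device=e.device)
        logits = logits.masked_fill(eye, float("-inf"))  # an agent never attends to itself
        alpha = torch.softmax(logits, dim=-1)            # attention weights alpha_j
        return torch.bmm(alpha, v)                       # x_i = sum_{j != i} alpha_j v_j
```

The paper runs several such heads in parallel and shares the attention parameters across all agents; $x_i$ together with $e_i$ then feeds agent $i$'s Q-value head.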

Loss and Gradient

For the global critics: all critics are trained jointly to minimize a regression loss against a target $y_i$ (joint training is possible because they share the attention parameters). For the individual policies: each policy is updated with a policy gradient through its own critic.

  • The target $y_i$ is computed as in SAC, so the agents maximize entropy as well as reward: $y_i = r_i + \gamma\,\mathbb{E}\big[Q_i^{\bar\psi}(o', a') - \alpha \log \pi_{\bar\theta_i}(a_i' \mid o_i')\big]$, where $\bar\psi, \bar\theta$ denote target-network parameters; the full losses are written out below.
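Written out, as I recall them from the paper (the symbols $\psi$, $\bar\psi$ for critic and target-critic parameters and $D$ for the replay buffer are my notation where the note doesn't define them):

```latex
% Joint regression loss for all N critics (they share attention parameters):
\mathcal{L}_Q(\psi) = \sum_{i=1}^{N}
  \mathbb{E}_{(o,a,r,o') \sim D}\!\left[ \big( Q_i^{\psi}(o,a) - y_i \big)^2 \right]

% Soft policy gradient for agent i, with a counterfactual baseline
% b(o, a_{\setminus i}) that marginalizes out agent i's action:
\nabla_{\theta_i} J(\pi_\theta) =
  \mathbb{E}_{o \sim D,\; a \sim \pi}\!\Big[
    \nabla_{\theta_i} \log \pi_{\theta_i}(a_i \mid o_i)\,
    \big( -\alpha \log \pi_{\theta_i}(a_i \mid o_i)
          + Q_i^{\psi}(o,a) - b(o, a_{\setminus i}) \big) \Big]
```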

Algorithm

  • The training loop is essentially MADDPG's, but the updates follow Soft Actor-Critic: stochastic policies with entropy-regularized targets.
  • Attention is added inside MADDPG's centralized critics, as shown in the architecture figure; a schematic update step follows below.
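A schematic sketch of one update, under assumed interfaces (`critic(obs, acts)` returns per-agent Q-values through the shared attention critic, `agent.policy` / `agent.target_policy` return an action and its log-probability; all of these names are hypothetical, and the paper's counterfactual baseline is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def maac_update(batch, agents, critic, target_critic, gamma=0.99, alpha=0.2):
    """One MADDPG-style update with SAC-flavored targets (schematic sketch)."""
    obs, acts, rews, next_obs = batch  # each: a list of per-agent tensors

    # Critic step: regress every Q_i toward its soft target in one joint pass,
    # since all critics share the attention parameters.
    with torch.no_grad():
        next_pairs = [ag.target_policy(o) for ag, o in zip(agents, next_obs)]
        next_acts = [a for a, _ in next_pairs]
        next_logps = [lp for _, lp in next_pairs]
        next_qs = target_critic(next_obs, next_acts)
        targets = [r + gamma * (nq - alpha * lp)
                   for r, nq, lp in zip(rews, next_qs, next_logps)]
    qs = critic(obs, acts)
    critic_loss = sum(F.mse_loss(q, y) for q, y in zip(qs, targets))

    # Policy step: soft policy gradient through freshly sampled actions
    # (the paper additionally subtracts a counterfactual baseline here).
    pairs = [ag.policy(o) for ag, o in zip(agents, obs)]
    new_acts = [a for a, _ in pairs]
    logps = [lp for _, lp in pairs]
    new_qs = critic(obs, new_acts)
    policy_loss = sum((alpha * lp - q).mean() for lp, q in zip(logps, new_qs))

    return critic_loss, policy_loss  # backpropagate each with its own optimizer
```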

Experiments

Tested on the Cooperative Treasure Collection and Rover-Tower tasks and compared against MADDPG and COMA; only MAAC performs well on both tasks.

Notes

  • Compared with MADDPG, MAAC uses attention to extract the relevant information, which helps it scale to scenes with more agents, but it still faces the many-agent problem.