Introduction

Authors' affiliations: Peking University and Huawei Technologies

  • Insight: once each agent knows the states and actions of the other agents, the environment can be treated as stationary regardless of the changing policies.
  • Proposes two frameworks (AC-CNet and A-CCNet) to solve the multi-agent 'learning-to-communicate' problem.
  • Works on both continuous and discrete action spaces and is easy to scale up.
  • Tested on the Network Routing problem and the Traffic Junction task.
  • paper

Approach

AC-CNet

Setting: PO-MDP; agents need communication when executing.

Procedure:

init global message glo_m
for every agent i:
    encode local_obs_i into local message m_i
    glo_m.append(m_i)
glo_m = encode(glo_m)
for every agent i:
    state_i = [local_obs_i, glo_m]
then run the actor-critic algorithm on state_i
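
A minimal PyTorch sketch of the AC-CNet flow above. The module names, layer sizes, and mean-pooling aggregator are my own assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ACCNetAgentInput(nn.Module):
    """Builds each agent's actor input as [local_obs, global_message] (AC-CNet style)."""

    def __init__(self, obs_dim: int, msg_dim: int):
        super().__init__()
        self.local_encoder = nn.Linear(obs_dim, msg_dim)   # local_obs_i -> local message m_i
        self.global_encoder = nn.Linear(msg_dim, msg_dim)  # aggregated messages -> glo_m

    def forward(self, local_obs: torch.Tensor) -> torch.Tensor:
        # local_obs: (n_agents, obs_dim)
        m = torch.tanh(self.local_encoder(local_obs))           # per-agent messages m_i
        glo_m = torch.tanh(self.global_encoder(m.mean(dim=0)))  # aggregate (assumed: mean pooling)
        glo_m = glo_m.expand(local_obs.size(0), -1)             # broadcast glo_m to every agent
        return torch.cat([local_obs, glo_m], dim=-1)            # actor state_i = [local_obs_i, glo_m]

# usage: each agent's actor and critic then run standard actor-critic on this state
states = ACCNetAgentInput(obs_dim=8, msg_dim=16)(torch.randn(4, 8))
print(states.shape)  # torch.Size([4, 24])
```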

A-CCNet

Setting: PO-MDP; agents can execute without communication.

Procedure:

m is the global message
for executing:
    for every agent i:
        a_i = \pi(local_obs_i)
        s'_i = [local_obs_i, a_i]
        m_i = encode(s'_i)
        m.append(m_i)
    m = encode(m)
Critic 1:
    Q() takes m as its state-action input
Critic 2:
    Q() takes [m, m_i] as its state-action input
then learning proceeds as in actor-critic
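
A rough PyTorch sketch of the two-critic idea above. The encoders, mean pooling, and layer sizes are assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class ACCNetCritics(nn.Module):
    """A-CCNet-style critics: actors act from local obs only; critics score encoded messages."""

    def __init__(self, obs_dim: int, act_dim: int, msg_dim: int):
        super().__init__()
        self.msg_encoder = nn.Linear(obs_dim + act_dim, msg_dim)  # m_i = encode([local_obs_i, a_i])
        self.global_encoder = nn.Linear(msg_dim, msg_dim)         # m = encode(all m_i)
        self.critic1 = nn.Linear(msg_dim, 1)                      # Q(m): global message only
        self.critic2 = nn.Linear(2 * msg_dim, 1)                  # Q([m, m_i]): global + per-agent message

    def forward(self, local_obs: torch.Tensor, actions: torch.Tensor):
        # local_obs: (n_agents, obs_dim), actions: (n_agents, act_dim)
        m_i = torch.tanh(self.msg_encoder(torch.cat([local_obs, actions], dim=-1)))
        m = torch.tanh(self.global_encoder(m_i.mean(dim=0, keepdim=True)))  # assumed: mean pooling
        q1 = self.critic1(m)                                                # one global value
        q2 = self.critic2(torch.cat([m.expand_as(m_i), m_i], dim=-1))       # one value per agent
        return q1, q2

q1, q2 = ACCNetCritics(obs_dim=8, act_dim=2, msg_dim=16)(torch.randn(4, 8), torch.randn(4, 2))
print(q1.shape, q2.shape)  # torch.Size([1, 1]) torch.Size([4, 1])
```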

Comparisons

  • CommNet uses a single network with a communication channel at each layer, which is not easy to scale up.
  • DIAL's messages are delayed by one timestep, so the environment becomes non-stationary in multi-agent settings.
  • COMA focuses on credit assignment, is limited to discrete action spaces, and its critic takes the entire game screen as input.
  • MADDPG is limited to continuous action spaces.

Experiments

Results analysis

In the Traffic Junction task:

  • Messages for braking and accelerating (gassing) are naturally separated.
  • Within the same message type, messages can be further separated by different cars.
  • PCA is vital for analyzing the communication messages and makes them easier to interpret (see the sketch below).
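
A small sketch of this kind of PCA analysis using scikit-learn. The message array, its dimensions, and the brake/gas labels are placeholders, not the paper's actual data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical: message vectors recorded during evaluation, one row per (agent, timestep),
# with `labels` marking the action each message accompanied (0 = brake, 1 = gas).
messages = np.random.randn(500, 16)
labels = np.random.randint(0, 2, size=500)

# Project the high-dimensional messages onto 2 principal components for inspection.
pca = PCA(n_components=2)
points = pca.fit_transform(messages)

# If the learned protocol is meaningful, brake and gas messages should form separate clusters.
for name, value in (("brake", 0), ("gas", 1)):
    cluster = points[labels == value]
    print(name, "mean in PCA space:", cluster.mean(axis=0))
```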

Tricks for Implementation

  • Concurrent experience replay (CER): samples the experiences collected at the same timestep by all agents, since the agents are trained concurrently.
  • Current-episode experience replay (CEER): keeps all experiences of the current episode in a temporary buffer and combines them with experiences from the main replay buffer as training examples at the end of each episode (see the sketch below).
  • Disable experience replay for discrete action space environments: with discrete action spaces, any replay method would introduce severe non-stationarity.
  • Activation functions: sigmoid/ELU for continuous action spaces; ReLU for discrete action spaces.
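
A minimal sketch of the CEER trick described above. The buffer sizes, transition layout, and the half-and-half mixing ratio are my own assumptions:

```python
import random
from collections import deque

class CEERBuffer:
    """Current-episode experience replay: a temporary per-episode buffer merged with the main buffer."""

    def __init__(self, capacity: int = 100_000):
        self.main = deque(maxlen=capacity)  # long-term replay buffer
        self.episode = []                   # experiences of the current episode only

    def add(self, transition):
        # transition could be an (obs, action, reward, next_obs) tuple per agent
        self.episode.append(transition)

    def end_episode(self, batch_size: int = 64):
        """At episode end, build a training batch that mixes current-episode experiences
        with experiences sampled from the main buffer, then archive the episode."""
        from_main = random.sample(list(self.main), min(len(self.main), batch_size // 2))
        from_episode = random.sample(self.episode, min(len(self.episode), batch_size // 2))
        batch = from_main + from_episode
        self.main.extend(self.episode)
        self.episode = []
        return batch

# usage: call add() every step, then train on the batch returned by end_episode()
```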

Notes

  • The tricks mentioned above for MARL are useful when implementing these methods.
  • Encoding both action and state into the message during learning can produce a better policy.
  • PCA can be used to interpret the messages.