Introduction

Authors' affiliations: University of Oxford and DeepMind.

  • Agents must learn communication protocols to share information in partially observable environments and cooperate better.
  • The paper proposes Reinforced Inter-Agent Learning (RIAL) and Differentiable Inter-Agent Learning (DIAL).
  • Both use centralised learning with decentralised execution.
  • The communication channel is discrete and has limited bandwidth.
  • Tested on the Switch Riddle and MNIST games.
  • paper

Approach

RIAL

  • Treat communication as a normal action and use a Q-network to evaluate it.
  • The Q-network contains $Q_u(o_t^a,m_{t-1}^{a'},h_{t-1}^a,u_{t-1}^a,m_{t-1}^a,a,u_t^a)$ and $Q_m(\cdot)$, where $m_{t-1}^{a'}$ are the messages from the other agents.
  • Parameter sharing: only one network is learned and shared by all agents.
  • Disable experience replay, because replayed experience can be misleading when multiple agents learn concurrently.
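
As a rough illustration of treating communication as a normal action, the sketch below selects the environment action and the message from two separate Q-value lists. It assumes each one is chosen independently with epsilon-greedy; the Q-values and the helper names `epsilon_greedy` and `rial_select` are hypothetical and only serve the example.

```python
import random

def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random index with probability epsilon, otherwise the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])

def rial_select(q_u, q_m, epsilon, rng):
    """Select the environment action u and the message m independently."""
    u = epsilon_greedy(q_u, epsilon, rng)
    m = epsilon_greedy(q_m, epsilon, rng)
    return u, m

rng = random.Random(0)
q_u = [0.1, 0.7, 0.2]  # hypothetical Q_u values for 3 environment actions
q_m = [0.4, 0.9]       # hypothetical Q_m values for 2 possible messages

# With epsilon = 0 the selection is purely greedy.
u, m = rial_select(q_u, q_m, epsilon=0.0, rng=rng)
```

Selecting the two parts separately means the network needs one output per action plus one per message, rather than one output per (action, message) pair.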

DIAL

  • The C-Net takes $(o_t^a,m_{t-1}^{a'},h_{t-1}^a,u_{t-1}^a,a)$ as input and outputs the continuous message m and the Q-values.
  • The connection between the DRU and the C-Net is $\mu$.
  • Agents can give feedback on messages by backpropagating gradients of the Bellman error through the channel.
  • The DRU (discretise/regularise unit) adds noise during training as regularisation and discretises the message during execution.
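
The DRU's two modes can be sketched in a few lines. This is a minimal scalar version, assuming Gaussian noise followed by a logistic during training and a hard threshold at zero during execution; the function name `dru` and its parameters are illustrative, not taken from any released code.

```python
import math
import random

def logistic(x):
    """Standard logistic function, mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dru(m, sigma, training, rng=None):
    """Discretise/regularise unit for a single real-valued message m.

    Training:  add Gaussian noise with standard deviation sigma, then squash.
    Execution: hard-threshold the message to a single bit.
    """
    if training:
        noise = (rng or random).gauss(0.0, sigma)
        return logistic(m + noise)
    return 1.0 if m > 0 else 0.0

rng = random.Random(0)
train_out = dru(1.5, sigma=2.0, training=True, rng=rng)  # continuous value in (0, 1)
exec_out = dru(1.5, sigma=2.0, training=False)           # discrete bit
```

The training branch is differentiable with respect to m, which is what allows gradients to flow back to the sending agent; the execution branch is not, but it only runs once learning is finished.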

Comparison

  • RIAL cannot give feedback on messages, so it is end-to-end trainable only within an agent. DIAL can, which lets it learn faster and makes it end-to-end trainable across agents.
  • RIAL communicates through one-hot messages, whereas DIAL uses a binary encoding, which lets it scale to large discrete message spaces.
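
To make the scaling argument concrete, the sketch below compares how many network outputs each encoding needs for a message space of a given size. The helper names are hypothetical, and the count assumes one output unit per one-hot entry or per bit.

```python
def one_hot_width(num_messages):
    """One output unit per possible message."""
    return num_messages

def binary_width(num_messages):
    """Number of bits needed to index num_messages distinct messages."""
    return max(1, (num_messages - 1).bit_length())

def to_bits(index, width):
    """Binary encoding of a message index, most significant bit first."""
    return [(index >> i) & 1 for i in reversed(range(width))]

sizes = {n: (one_hot_width(n), binary_width(n)) for n in (2, 16, 1024)}
example = to_bits(5, binary_width(16))
```

The one-hot width grows linearly with the number of messages, while the binary width grows logarithmically.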

Experiments

Switch Riddle

  • DIAL learns faster than RIAL.
  • Parameter sharing accelerates learning.
  • RIAL without parameter sharing cannot beat the no-communication baseline in the n=4 environment.

About Noise

  • Without noise, regularisation is not possible (the training reward will differ greatly from the execution reward).
  • With noise, more information can be encoded into one signal.
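
One way to see why noise narrows the gap between training and execution is to measure how much the noisy channel output varies for messages near zero versus messages far from zero. The sketch below assumes the channel applies a logistic to the message plus Gaussian noise; the values of `sigma`, `m`, and the sample count are arbitrary choices for illustration.

```python
import math
import random
import statistics

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_spread(m, sigma, samples=2000, seed=0):
    """Standard deviation of logistic(m + noise) over repeated noise draws."""
    rng = random.Random(seed)
    outputs = [logistic(m + rng.gauss(0.0, sigma)) for _ in range(samples)]
    return statistics.pstdev(outputs)

spread_near_zero = channel_spread(m=0.5, sigma=2.0)  # message close to the threshold
spread_far = channel_spread(m=10.0, sigma=2.0)       # message far from the threshold
```

Under these assumptions a message near zero arrives as an unreliable signal, while a large-magnitude message is barely affected by the noise and already sits close to the bit it would become after thresholding.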

Notes

  • A discrete channel can use noise to represent more information.
  • Feeding back the impact of a message to its sender can lead to faster learning.
  • Messages are delayed by one time step, and the environment is non-stationary.