Introduction

Authors of the paper are from DeepMind, Massachusetts Institute of Technology, and Princeton University.

  • Motivation: rewarding an agent for having a causal influence on other agents may encourage cooperation, echoing the social motivation for influencing others seen even in newborn infants.
  • Tested on Sequential Social Dilemmas (SSDs).
  • Paper: Social Influence as Intrinsic Motivation for Multi-Agent Deep Reinforcement Learning (Jaques et al., ICML 2019).

Approach

Intrinsic reward

$z_t = \langle u_t^B, s_t^B \rangle$

  • The causal influence intrinsic reward for A is $I_t^A$ (defined below).
  • $I_t^A$ can be thought of as A's empowerment over B's actions.
  • This assumes A's action is taken before B's and that B's conditional policy can be queried; this requirement is later relaxed with the MOA (Model of Other Agents).
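
From the paper (up to notation), the influence reward is the KL divergence between B's policy conditioned on A's actual action and B's marginal policy, with A's action counterfactually averaged out:

$$I_t^A = D_{KL}\left[\, p(a_t^B \mid a_t^A, z_t) \;\Big\|\; \sum_{\tilde{a}_t^A} p(a_t^B \mid \tilde{a}_t^A, z_t)\, p(\tilde{a}_t^A \mid z_t) \,\right]$$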

Communication

  • Uses explicit communication: each agent emits a discrete symbol, and the vector $m_t = [m_t^0, m_t^1, \ldots, m_t^N]$ of all agents' symbols is visible to the other agents.
  • The environment-action heads $\pi_e, V_e$ are trained with the extrinsic reward only; the communication heads $\pi_c, V_c$ are additionally trained with the influence intrinsic reward (sketched after this list).
  • Hypothesis: communication can only be consistently influential if it is useful to the other agent; otherwise the listener learns to ignore it.
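
A rough sketch of this two-headed setup (the paper's actual model is a conv + LSTM A3C agent; this simplified MLP, and every name in it, is my own):

```python
import torch
import torch.nn as nn

class TwoHeadPolicy(nn.Module):
    """Shared encoder with separate environment-action and
    communication heads."""

    def __init__(self, obs_dim, msg_dim, n_env_actions, n_symbols,
                 hidden=64):
        super().__init__()
        # Other agents' previous symbols are appended to the observation.
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU())
        self.pi_e = nn.Linear(hidden, n_env_actions)  # env policy head
        self.v_e = nn.Linear(hidden, 1)               # env value head
        self.pi_c = nn.Linear(hidden, n_symbols)      # comm policy head
        self.v_c = nn.Linear(hidden, 1)               # comm value head

    def forward(self, obs, messages):
        h = self.encoder(torch.cat([obs, messages], dim=-1))
        return (torch.softmax(self.pi_e(h), dim=-1), self.v_e(h),
                torch.softmax(self.pi_c(h), dim=-1), self.v_c(h))
```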

Model of Other Agents (MOA)

  • Can be used to answer: 'how would other agents have reacted if I had chosen a different action?'
  • The influence reward is applied only to visible agents, because nearby agents' actions can actually be observed, so the estimate of $p(a_{t+1}^B|a_t^A,s_t^A)$ is more accurate; see the sketch below.
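
A minimal NumPy sketch of the counterfactual influence computation, assuming the MOA provides the distributions below (function and argument names are mine, not the paper's):

```python
import numpy as np

def influence_reward(p_b_given_a, p_b_given_each_a, p_a, eps=1e-8):
    """Causal influence of A on B from MOA-predicted distributions.

    p_b_given_a:      p(a^B | a_t^A, z_t), B's predicted action
                      distribution given A's actual action; shape (n_B,).
    p_b_given_each_a: p(a^B | a~^A, z_t) for every counterfactual action
                      a~^A of A; shape (n_A, n_B).
    p_a:              p(a~^A | z_t), A's own policy; shape (n_A,).
    """
    # Marginalize A's action out of B's predicted policy:
    #   p(a^B | z_t) = sum_a~ p(a^B | a~^A, z_t) * p(a~^A | z_t)
    p_b_marginal = p_a @ p_b_given_each_a  # shape (n_B,)
    # KL divergence between B's conditional and marginal policies.
    return float(np.sum(p_b_given_a *
                        (np.log(p_b_given_a + eps) -
                         np.log(p_b_marginal + eps))))

# Toy example: A has 2 actions, B has 3; A actually took action 0.
p_each = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.2, 0.7]])
print(influence_reward(p_each[0], p_each, np.array([0.5, 0.5])))
```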

Experiments

  • Figures (a), (d): without MOA or explicit communication, the influencer learns to use its environment actions as a form of emergent communication.
  • Figures (b), (e): with explicit communication, the influence-comm agent in Cleanup used no extrinsic reward for its communication policy, which suggests influence alone can train an efficient communication policy.
  • Figures (c), (f): with MOA, agents cooperate more in Cleanup.

Notes

What we can take away

  • Adding an intrinsic reward based on influence over other agents can encourage agents to explore cooperative behavior.
  • Using an MOA over nearby agents eases the centralization required for training and reduces complexity.
  • Communication can be explicit (transmitting symbols to others) or implicit (using environment actions to communicate).
  • Intrinsic rewards can be applied to multi-agent RL.

Thoughts

  • Rewarding influence may push training in the wrong direction (A's action can be influential on B while being harmful to B).
  • What would the results be when run on particle-world environments?
  • ...