Abstract

Authors of the paper: from University College London and Shanghai Jiao Tong University

  • Background: Existing MARL methods become intractable due to the curse of dimensionality and the exponential growth of agent interactions.
  • Key idea: approximate the interactions within the population of agents by those between a single agent and the average effect of the overall population or of its neighboring agents.
  • Proposes mean field Q-learning and mean field Actor-Critic algorithms that can be applied to many-agent environments.

Introduction

Background

Multi-agent reinforcement learning (MARL) is concerned with a set of autonomous agents that share a common environment. Independent Q-learning, which treats the other agents as part of the environment, often fails because the policy changes of one agent affect those of the others, making learning unstable.

Related works

For MARL part:

  • Studies show that, in many scenarios, an agent that learns the effect of joint actions performs better than one that does not.
  • Solutions to the non-stationarity issue in MARL include opponent modeling, policy parameter sharing, and centralized training with decentralized execution (e.g., MADDPG).

The above approaches mostly limit their studies to tens of agents, because:

  • the input space of Q grows exponentially as the number of agents grows;
  • the noise accumulated from the exploratory actions of the other agents makes Q-learning infeasible.

The work is also related to Mean Field Games (MFG):

  • MFG studies the population behavior that results from the aggregation of decisions made by individual agents.
  • Yang et al. combine MFG with RL to learn the reward function and the forward mean dynamics in an inverse RL setting.

This paper's goal, in contrast, is to derive a computable Q-learning algorithm for many-agent settings.

Approach

To address the issue that $Q^j(s, \boldsymbol{a})$ becomes infeasible when the number of agents grows large, the authors factorize it using only pairwise local interactions: $Q^j(s, \boldsymbol{a}) = \frac{1}{N^j} \sum_{k \in \mathcal{N}(j)} Q^j\left(s, a^j, a^k\right)$

and approximate each neighbor's action as $a^k = \overline{a}^j + \delta a^{j,k}$, where $\overline{a}^j = \frac{1}{N^j} \sum_{k} a^k$ is the mean action over agent $j$'s neighbors $\mathcal{N}(j)$.
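
Since the actions are discrete and one-hot encoded, $\overline{a}^j$ is just the empirical distribution of the neighbors' actions. A minimal sketch (my own helper, not the authors' code):

```python
import numpy as np

# A minimal sketch of the empirical mean action \bar{a}^j, assuming the discrete
# actions of agent j's neighbors are one-hot encoded and then averaged.
def mean_action(neighbor_actions, n_actions):
    """neighbor_actions: list of integer action indices a^k for k in N(j)."""
    one_hot = np.eye(n_actions)[neighbor_actions]   # shape (N^j, n_actions)
    return one_hot.mean(axis=0)                     # \bar{a}^j

# Example: 4 neighbors choosing actions 0, 2, 2, 4 out of 5 possible actions
print(mean_action([0, 2, 2, 4], n_actions=5))       # [0.25 0.   0.5  0.   0.25]
```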

Combining the above via a Taylor expansion of $Q^j$ around the mean action $\overline{a}^j$ reduces the pairwise interactions to the mean field Q-function $Q^j(s, a^j, \overline{a}^j)$, which is updated iteratively as:

$Q_{t+1}^{j}\left(s, a^{j}, \overline{a}^{j}\right) = (1 - \alpha)\, Q_{t}^{j}\left(s, a^{j}, \overline{a}^{j}\right) + \alpha\left[r^{j} + \gamma\, v_{t}^{j}\left(s^{\prime}\right)\right]$

$v_{t}^{j}\left(s^{\prime}\right) = \sum_{a^{j}} \pi_{t}^{j}\left(a^{j} \mid s^{\prime}, \overline{a}^{j}\right)\, \mathbb{E}_{\overline{a}^{j}(\boldsymbol{a}^{-j}) \sim \pi_{t}^{-j}}\left[Q_{t}^{j}\left(s^{\prime}, a^{j}, \overline{a}^{j}\right)\right]$
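
As a concrete illustration, the two equations above can be written in tabular form. The Boltzmann policy matches the exploration policy the paper derives from Q, while the dictionary-based Q table and the rounding of $\overline{a}^j$ are my own simplifications for the sketch:

```python
import numpy as np

# A hedged, tabular sketch of the mean field Q update and value above.
def boltzmann_policy(Q, s, bar_a, n_actions, beta=1.0):
    q = np.array([Q(s, a, bar_a) for a in range(n_actions)])
    p = np.exp(beta * (q - q.max()))
    return p / p.sum()                               # pi_t^j(. | s, \bar{a}^j)

def mf_value(Q, s_next, bar_a_next, n_actions, beta=1.0):
    # v_t^j(s'): expectation over own actions under the Boltzmann policy,
    # with the expectation over neighbors replaced by the sampled \bar{a}^j.
    pi = boltzmann_policy(Q, s_next, bar_a_next, n_actions, beta)
    q = np.array([Q(s_next, a, bar_a_next) for a in range(n_actions)])
    return float(pi @ q)

def mf_q_update(table, s, a_j, bar_a, r, s_next, bar_a_next,
                n_actions, alpha=0.1, gamma=0.95, beta=1.0):
    Q = lambda st, ac, ba: table.get((st, ac, tuple(np.round(ba, 2))), 0.0)
    target = r + gamma * mf_value(Q, s_next, bar_a_next, n_actions, beta)
    key = (s, a_j, tuple(np.round(bar_a, 2)))
    table[key] = (1 - alpha) * Q(s, a_j, bar_a) + alpha * target
```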

Loss function

For MFQ: $\mathscr{L}\left(\phi^{j}\right) = \left(y^{j} - Q_{\phi^{j}}\left(s, a^{j}, \overline{a}^{j}\right)\right)^{2}$

where $y^{j} = r^{j} + \gamma\, v_{\phi_{-}^{j}}^{\mathrm{MF}}\left(s^{\prime}\right)$ and $\phi_{-}^{j}$ denotes the parameters of the target network.
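
In the deep-learning case, this is an ordinary TD-style regression loss against a target network. A rough sketch, where the network architecture, sizes, and batch layout are my own assumptions rather than the paper's exact setup:

```python
import torch
import torch.nn as nn

# A rough sketch of the MFQ loss: a Q-network maps (state, mean action) to
# Q-values for every own action a^j, and regresses toward y^j.
class QNet(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))            # Q_phi^j(s, ., \bar{a}^j)

    def forward(self, state, mean_action):
        return self.net(torch.cat([state, mean_action], dim=-1))

def mfq_loss(q_net, target_v, batch, gamma=0.95):
    """target_v: v^MF_{phi_-}(s') precomputed with the target network."""
    q_all = q_net(batch["s"], batch["bar_a"])        # (B, n_actions)
    q_taken = q_all.gather(1, batch["a_j"].unsqueeze(1)).squeeze(1)
    y = batch["r"] + gamma * target_v                # y^j = r^j + gamma v^MF
    return ((y.detach() - q_taken) ** 2).mean()      # L(phi^j)
```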

Algorithms
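
The paper presents MF-Q and MF-AC. As a rough illustration (not the authors' implementation), here is a sketch of how an MF-Q training loop could be wired together, reusing the mean_action helper from above; the env/agent/buffer interfaces (env.step, env.neighbors, agent.act, agent.update, ...) are hypothetical:

```python
import numpy as np

# A rough sketch of an MF-Q style training loop under assumed interfaces.
def train_mfq(env, agents, buffer, n_episodes=2000, batch_size=64):
    for _ in range(n_episodes):
        obs = env.reset()
        bar_a = [np.zeros(env.n_actions) for _ in agents]   # initial mean actions
        done = False
        while not done:
            # each agent acts with a Boltzmann policy over Q(s, ., \bar{a}^j)
            actions = [ag.act(obs[j], bar_a[j]) for j, ag in enumerate(agents)]
            next_obs, rewards, done = env.step(actions)
            # recompute every agent's mean action from its neighbors' new actions
            next_bar_a = [mean_action(env.neighbors(j, actions), env.n_actions)
                          for j in range(len(agents))]
            buffer.add(obs, actions, bar_a, rewards, next_obs, next_bar_a)
            obs, bar_a = next_obs, next_bar_a
            # sample a minibatch and take a gradient step on the MFQ loss
            if len(buffer) >= batch_size:
                for j, ag in enumerate(agents):
                    ag.update(buffer.sample(batch_size, agent=j))
        for ag in agents:
            ag.sync_target()                                 # refresh target network
```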

Experiments

This paper evaluates the algorithms in three different scenarios; here I only show the battle game example, which is most related to my work.

Settings:

Two armies fight against each other in a grid world, and the goal of each army is to collect more rewards. The action space consists of moving or attacking nearby agents. The reward setting is: -0.005 for every move, 0.2 for attacking an enemy, 5 for killing an enemy, -0.1 for attacking an empty grid, and -0.1 for being attacked or killed.
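
For reference, the reward setting above translates directly into a small helper; the boolean event flags here are my own naming for illustration, not an actual environment API:

```python
# A direct transcription of the battle-game reward setting described above.
def battle_reward(moved, attacked_enemy, killed_enemy,
                  attacked_empty, was_attacked_or_killed):
    r = 0.0
    if moved:
        r -= 0.005
    if attacked_enemy:
        r += 0.2
    if killed_enemy:
        r += 5.0
    if attacked_empty:
        r -= 0.1
    if was_attacked_or_killed:
        r -= 0.1
    return r
```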

Results

Results are reported after 2000 rounds of self-play training.

  • IL performing better than AC and MF-AC implies the effectiveness of off-policy learning with a replay buffer.
  • The replay buffer and the maximum operator used in calculating Q-values may be the reasons why MFQ converges faster than MF-AC.

Conclusion

Transforming the many-body problem into a two-body problem using mean field theory enables scalability in MARL.