# Abstract

Authors' affiliations: University College London and Shanghai Jiao Tong University

• Background: Existing MARL methods become intractable due to the curse of dimensionality and the exponential growth of agent interactions.
• Key idea: approximate the interactions within a population of agents by the interaction between a single agent and the average effect of the overall population or of its neighboring agents.
• Contribution: propose mean field Q-learning (MFQ) and mean field Actor-Critic (MFAC) algorithms that can be applied to many-agent environments.
• paper

# Introduction

## Background

Multi-agent reinforcement learning (MARL) is concerned with a set of autonomous agents that share a common environment. Independent Q-learning, which treats other agents as part of the environment, often fails because changes in one agent's policy affect the others', making learning unstable.

## Related works

On the MARL side:

• Studies show that an agent that learns the effect of joint actions performs better in many scenarios than one that does not.
• Solutions to address the non-stationarity issue in MARL: opponent modeling, policy parameter sharing, and centralized training with decentralized execution (MADDPG).

The above approaches limit their studies mostly to tens of agents, because:

• the input space of Q grows exponentially as the number of agents grows;
• the noise accumulated from the exploratory actions of other agents makes Q-learning infeasible.

The work is also related to Mean Field Games (MFG):

• MFG studies population behaviors that result from the aggregation of decisions taken by individuals.
• Yang et al. combine MFG with RL to learn the reward function and the forward mean dynamics in an inverse RL setting.

This paper's goal, in contrast, is to derive a computable Q-learning algorithm.

# Approach

To address the issue that $Q^{j}(s, \boldsymbol{a})$ becomes infeasible when the number of agents grows large, the authors factorize it as follows: $Q^{j}(s, \boldsymbol{a}) = \frac{1}{N^{j}} \sum_{k \in \mathcal{N}(j)} Q^{j}(s, a^{j}, a^{k})$

and each neighbor's action is written as $a^{k} = \overline{a}^{j} + \delta a^{j,k}$, where $\overline{a}^{j} = \frac{1}{N^{j}} \sum_{k} a^{k}$ is the mean action of agent $j$'s neighbors.
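
The jump from the pairwise sum to a single mean field Q-function relies on a Taylor expansion around the mean action; a brief sketch of that step (following the paper, with the second-order remainder left implicit):

$Q^{j}(s, a^{j}, a^{k}) \approx Q^{j}(s, a^{j}, \overline{a}^{j}) + \nabla_{\overline{a}^{j}} Q^{j}(s, a^{j}, \overline{a}^{j}) \cdot \delta a^{j,k} + \frac{1}{2}\, \delta a^{j,k} \cdot \nabla^{2} Q^{j}(s, a^{j}, \tilde{a}^{j,k}) \cdot \delta a^{j,k}$

Averaging over the neighbors $k \in \mathcal{N}(j)$, the first-order term vanishes because $\sum_{k} \delta a^{j,k} = 0$, and the second-order remainder is bounded, so $Q^{j}(s, \boldsymbol{a}) \approx Q^{j}(s, a^{j}, \overline{a}^{j})$.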

Combining the above, the mean field Q-value for agent $j$ is updated as:

$Q_{t+1}^{j}(s, a^{j}, \overline{a}^{j}) = (1-\alpha)\, Q_{t}^{j}(s, a^{j}, \overline{a}^{j}) + \alpha \left[ r^{j} + \gamma\, v_{t}^{j}(s') \right]$

$v_{t}^{j}(s') = \sum_{a^{j}} \pi_{t}^{j}(a^{j} \mid s', \overline{a}^{j})\, \mathbb{E}_{\overline{a}^{j}(\boldsymbol{a}^{-j}) \sim \pi_{t}^{-j}} \left[ Q_{t}^{j}(s', a^{j}, \overline{a}^{j}) \right]$
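
A minimal tabular sketch of this update for a single agent (Python, assuming discrete one-hot actions and a Boltzmann policy over the mean field Q-values; the function names and the discretization of $\overline{a}^{j}$ are my own simplifications, not the authors' implementation):

```python
import numpy as np

def mean_action(neighbor_actions, n_actions):
    """Empirical mean of the neighbors' one-hot actions: the mean field a_bar^j."""
    onehots = np.eye(n_actions)[np.asarray(neighbor_actions)]  # (N_j, n_actions)
    return onehots.mean(axis=0)

def boltzmann_policy(q_row, beta=1.0):
    """pi^j(a | s, a_bar) proportional to exp(beta * Q^j(s, a, a_bar))."""
    z = np.exp(beta * (q_row - q_row.max()))
    return z / z.sum()

def mfq_update(Q, s, a, a_bar, r, s_next, a_bar_next, alpha=0.1, gamma=0.95):
    """One tabular mean field Q update for a single agent.

    Q is a table indexed by (state, own action, mean-action bin); discretizing
    a_bar to its argmax bin is a simplification to keep the table finite.
    """
    ab, ab_next = a_bar.argmax(), a_bar_next.argmax()
    pi = boltzmann_policy(Q[s_next, :, ab_next])
    v_next = float((pi * Q[s_next, :, ab_next]).sum())   # v^j(s') under the MF policy
    Q[s, a, ab] += alpha * (r + gamma * v_next - Q[s, a, ab])
    return Q
```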

## Loss function

For MFQ: $\mathscr{L}(\phi^{j}) = \left( y^{j} - Q_{\phi^{j}}(s, a^{j}, \overline{a}^{j}) \right)^{2}$

where the target $y^{j} = r^{j} + \gamma\, v_{\phi_{-}^{j}}^{\mathrm{MF}}(s')$ is computed with the target-network parameters $\phi_{-}^{j}$.
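
A minimal PyTorch-style sketch of this loss with a target network (the network architecture and variable names are assumptions for illustration, not the paper's code):

```python
import torch
import torch.nn as nn

class MFQNet(nn.Module):
    """Q_phi(s, a, a_bar): state and mean action in, one Q-value per own action out."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, a_bar):
        return self.net(torch.cat([state, a_bar], dim=-1))

def mfq_loss(q_net, target_net, batch, gamma=0.95, beta=1.0):
    """L(phi^j) = (y^j - Q_phi(s, a^j, a_bar^j))^2 with y^j = r^j + gamma * v^MF(s')."""
    s, a, a_bar, r, s_next, a_bar_next = batch
    q = q_net(s, a_bar).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s_next, a_bar_next)
        pi = torch.softmax(beta * q_next, dim=-1)      # Boltzmann policy over Q-values
        v_mf = (pi * q_next).sum(dim=-1)               # mean field value v^MF(s')
        y = r + gamma * v_mf
    return ((y - q) ** 2).mean()
```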

# Experiments

This paper evaluates the algorithms in three different scenarios; here I only show the battle game example, which is most related to my own work.

## Settings

Two armies fight against each other in a grid world. The goal of each army is to collect more rewards. The action space consists of moving or attacking nearby agents. The reward setting is: -0.005 for every move, 0.2 for attacking an enemy, 5 for killing an enemy, -0.1 for attacking an empty grid, and -0.1 for being attacked or killed.
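
The reward shaping above can be summarized in a small configuration snippet; the key names below are illustrative only (the actual battle environment defines rewards in its own config format):

```python
# Illustrative reward settings for the battle game; key names are assumptions.
BATTLE_REWARDS = {
    "move": -0.005,                 # small penalty for every move
    "attack_enemy": 0.2,            # reward for hitting an enemy
    "kill_enemy": 5.0,              # reward for killing an enemy
    "attack_empty_grid": -0.1,      # penalty for attacking an empty grid cell
    "attacked_or_killed": -0.1,     # penalty for being attacked or killed
}
```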

## Results

After 2000 rounds of self-play training:

• IL performing better than AC and MF-AC implies the effectiveness of off-policy learning with a replay buffer.
• The replay buffer and the maximum operator used in calculating Q-values may be the reason why MFQ converges faster than MFAC.

# Conclusion

Transforming the many-body problem into a two-body problem via mean field theory enables scalability in MARL.