Summary: ACCNet: Actor-Coordinator-Critic Net for “Learning-to-Communicate” with Deep Multi-agent Reinforcement Learning
Affiliations: Peking University and Huawei Technologies
- Insight: once each agent knows the states and actions of the other agents, the environment can be treated as stationary regardless of the other agents' changing policies.
- Proposes two frameworks (AC-CNet and A-CCNet, matching the two settings below) to solve the multi-agent 'learning to communicate' problem.
- Works on both continuous and discrete action spaces and is easy to scale up.
- Tested on the Network Routing problem and the Traffic Junction task.
Setting: PO-MDP; agents need communication when executing (AC-CNet)
for every agent i:
    encode local observation o_i into a local message m_i
the coordinator aggregates all local messages into a global message g
for every agent i:
    the actor acts on [o_i, g]
then run the standard actor-critic algorithm
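A minimal PyTorch sketch of this flow, with toy dimensions; the class names and sizes are my own illustration, not the paper's code:

```python
import torch
import torch.nn as nn

OBS_DIM, MSG_DIM, ACT_DIM, N_AGENTS = 8, 4, 2, 3  # toy sizes (assumptions)

class Actor(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(OBS_DIM, MSG_DIM)            # local obs -> local message
        self.policy = nn.Linear(OBS_DIM + MSG_DIM, ACT_DIM)   # [obs, global msg] -> action

    def message(self, obs):
        return torch.tanh(self.encoder(obs))

    def act(self, obs, global_msg):
        return torch.tanh(self.policy(torch.cat([obs, global_msg], dim=-1)))

class Coordinator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(N_AGENTS * MSG_DIM, MSG_DIM)    # all local msgs -> global msg

    def forward(self, local_msgs):
        return torch.tanh(self.fuse(torch.cat(local_msgs, dim=-1)))

actors = [Actor() for _ in range(N_AGENTS)]
coordinator = Coordinator()
obs = [torch.randn(1, OBS_DIM) for _ in range(N_AGENTS)]
g = coordinator([a.message(o) for a, o in zip(actors, obs)])  # communication step
actions = [a.act(o, g) for a, o in zip(actors, obs)]          # actors need g to act
```

Because each actor's input includes g, communication is required at execution time, which is exactly what distinguishes this setting from the next one.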
Setting: PO-MDP; agents can execute without communication (A-CCNet)
each agent encodes its local (observation, action) into a local message m_i
the coordinator aggregates all m_i into a global message m
variant 1: Q() takes m as its state-action input
variant 2: Q() takes [m, m_i] as its state-action input
learning proceeds as in standard actor-critic
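A sketch of the critic side under my own simplifications (toy sizes, a linear coordinator): communication happens only among critics during training, so actors can execute on local observations alone.

```python
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, MSG_DIM, N_AGENTS = 8, 2, 4, 3  # toy sizes (assumptions)

class Critic(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(OBS_DIM + ACT_DIM, MSG_DIM)  # (obs, action) -> local msg
        self.q = nn.Linear(2 * MSG_DIM, 1)                    # [m, m_i] -> Q value

    def message(self, obs, act):
        return torch.tanh(self.encoder(torch.cat([obs, act], dim=-1)))

    def value(self, m, m_i):
        return self.q(torch.cat([m, m_i], dim=-1))            # variant 2; variant 1 drops m_i

fuse = nn.Linear(N_AGENTS * MSG_DIM, MSG_DIM)                 # coordinator
critics = [Critic() for _ in range(N_AGENTS)]
obs = [torch.randn(1, OBS_DIM) for _ in range(N_AGENTS)]
acts = [torch.randn(1, ACT_DIM) for _ in range(N_AGENTS)]
msgs = [c.message(o, a) for c, o, a in zip(critics, obs, acts)]
m = torch.tanh(fuse(torch.cat(msgs, dim=-1)))                 # global message
q_vals = [c.value(m, m_i) for c, m_i in zip(critics, msgs)]   # per-agent Q estimates
```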
- CommNet uses a single network with a communication channel at each layer, which is not easy to scale up.
- DIAL's messages are delayed by one timestep, so the environment remains non-stationary in the multi-agent setting.
- COMA focuses on credit assignment; it is limited to discrete action spaces, and its critic takes the entire game screen as input.
- MADDPG is limited to continuous action spaces.
In the Traffic Junction task
- messages for braking and accelerating are naturally separated
- within the same message type, messages can further be separated by which car sent them
- PCA is vital for analyzing the communication messages and makes them easier to interpret (a sketch follows this list)
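Illustrative only, with synthetic data standing in for logged messages: project collected messages with 2-component PCA and look for clusters corresponding to braking vs. accelerating.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
brake_msgs = rng.normal(loc=-1.0, size=(200, 16))  # stand-in for logged brake messages
gas_msgs = rng.normal(loc=+1.0, size=(200, 16))    # stand-in for logged gas messages
msgs = np.vstack([brake_msgs, gas_msgs])

proj = PCA(n_components=2).fit_transform(msgs)     # 16-d messages -> 2-d
# Clusters in `proj` (e.g. plotted with matplotlib) reveal the message types.
print(proj[:3], proj[-3:])
```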
Tricks for Implementation
- Concurrent experience replay (CER): samples the experiences collected at the same timestep for all agents, since the agents are trained concurrently (a minimal sketch of both replay buffers appears at the end of these notes).
- Current-episode experience replay (CEER): keeps all experiences of the current episode in a temporary buffer and, at the end of each episode, combines them with experiences from the main replay buffer as training examples.
- Disable experience replay in discrete action space environments: with discrete actions, any replay method would cause severe non-stationarity.
- Activation functions: sigmoid/ELU for continuous action spaces; ReLU for discrete action spaces.
- The MARL tricks mentioned above are useful in practice when implementing these models.
- Encoding both action and state into the message during learning yields better policies.
- PCA can be used to interpret the messages.
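A rough sketch of the two replay variants referenced in the tricks list; the class names and data layout are my assumptions, not the paper's code. CER keeps the agents' experiences aligned per timestep, and CEER mixes the finished episode with replayed experiences.

```python
import random

class ConcurrentReplay:                        # CER
    def __init__(self, capacity=10000):
        self.steps = []                        # each entry: one timestep, all agents
        self.capacity = capacity

    def add(self, joint_experience):           # [(obs_i, act_i, rew_i, next_obs_i), ...]
        self.steps.append(joint_experience)
        self.steps = self.steps[-self.capacity:]

    def sample(self, batch_size):
        return random.sample(self.steps, min(batch_size, len(self.steps)))

class CurrentEpisodeReplay(ConcurrentReplay):  # CEER
    def __init__(self, capacity=10000):
        super().__init__(capacity)
        self.episode = []                      # temporary buffer for the current episode

    def add(self, joint_experience):
        self.episode.append(joint_experience)

    def end_episode(self, batch_size):
        # Train on current-episode experiences combined with replayed ones.
        batch = self.episode + super().sample(batch_size)
        for e in self.episode:                 # then fold the episode into the main buffer
            super().add(e)
        self.episode = []
        return batch
```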