Policy Gradient Methods for Multi-Agent Reinforcement Learning: A Comparative Study
DOI:
https://doi.org/10.54097/58j7ca95
Keywords:
Multi-Agent Reinforcement Learning, Counterfactual Multi-Agent Policy Gradient, Meta-Learning Policy Gradient, Status Quo Policy Gradient.
Abstract
Multi-Agent Reinforcement Learning (MARL) has proven to be a compelling tool for decision-making tasks in which multiple agents engage in complex interactions and collaborations. This study evaluates policy gradient techniques currently used in MARL, focusing on three recent methods: Counterfactual Multi-Agent Policy Gradient (COMA), Meta-Learning Policy Gradient (Meta-PG), and Status Quo Policy Gradient (SQPG). Each method is assessed on its convergence speed during training, its success across varied environments, and the stability of its variance. The experimental results indicate that Meta-PG converges fastest and achieves the highest performance in shared and teamwork-based tasks, making it the preferred choice in such settings. COMA, by contrast, exhibits strong stability and effectiveness in adversarial settings, owing to its use of counterfactual credit assignment to improve learning. SQPG offers balanced performance across all environments, but its equilibrium-seeking behaviour leads to slow convergence despite generally low variance. These results highlight the trade-offs among learning speed, stability, and adaptability in MARL: Meta-PG is well suited to fast learning, COMA to adversarial interactions, and SQPG is a general-purpose method that needs further refinement. Future work on hybrid models that combine the strengths of these approaches, improved variance-reduction techniques, and evaluation with larger populations of agents should enhance MARL's real-world applicability.
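To illustrate the counterfactual credit assignment that distinguishes COMA, the sketch below computes the per-agent counterfactual advantage described by Foerster et al. [1]: the centralised critic's value for the executed joint action minus a baseline that marginalises out the agent's own action while holding the other agents' actions fixed. This is a minimal illustration only; the function name, array layout, and example numbers are assumptions for exposition, not the authors' implementation.

import numpy as np

def counterfactual_advantage(q_values, policy_probs, taken_action):
    # q_values:     shape (n_actions,), centralised critic Q(s, (u^-a, u'^a)) for each
    #               alternative action u'^a of this agent, other agents' actions fixed.
    # policy_probs: shape (n_actions,), the agent's current policy pi^a(. | tau^a).
    # taken_action: index of the action the agent actually executed.
    baseline = np.dot(policy_probs, q_values)   # expected Q over this agent's own actions
    return q_values[taken_action] - baseline    # advantage used to weight the policy gradient

# Hypothetical example: 4 discrete actions, agent executed action 2
q = np.array([1.0, 0.5, 2.0, 0.0])
pi = np.array([0.25, 0.25, 0.25, 0.25])
adv = counterfactual_advantage(q, pi, taken_action=2)   # 2.0 - 0.875 = 1.125

Because the baseline depends only on the agent's own policy, subtracting it reduces the variance of the gradient estimate without changing its expectation, which is the mechanism behind COMA's stability in the experiments summarised above.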
References
[1] J. Foerster, G. Farquhar, T. Afouras, et al., Counterfactual multi-agent policy gradients, Proc. AAAI Conf. Artif. Intell. 32(1) (2018).
[2] D. K. Kim, M. Liu, M. D. Riemer, et al., A policy gradient algorithm for learning to learn in multiagent reinforcement learning, Proc. Int. Conf. Mach. Learn., 5541-5550 (2021).
[3] J. G. Kuba, M. Wen, L. Meng, et al., Settling the variance of multi-agent policy gradients, Adv. Neural Inf. Process. Syst. 34, 13458-13470 (2021).
[4] B. Vasilev, T. Gupta, B. Peng, et al., Semi-on-policy training for sample-efficient multi-agent policy gradients, arXiv preprint arXiv:2104.13446 (2021).
[5] P. Badjatiya, M. Sarkar, N. Puri, et al., Status-quo policy gradient in multi-agent reinforcement learning, arXiv preprint arXiv:2111.11692 (2021).
[6] W. Li, S. Huang, Z. Qiu, et al., GAILPG: Multi-agent policy gradient with generative adversarial imitation learning, IEEE Trans. Games (2024).
[7] J. Shi, X. Wang, M. Zhang, et al., A distributed adaptive policy gradient method based on momentum for multi-agent reinforcement learning, Complex Intell. Syst. 10(5), 7297-7310 (2024).
[8] J. Chen, J. Feng, W. Gao, et al., Decentralized natural policy gradient with variance reduction for collaborative multi-agent reinforcement learning, J. Mach. Learn. Res. 25(172), 1-49 (2024).
[9] C. Daskalakis, D. J. Foster, N. Golowich, Independent policy gradient methods for competitive reinforcement learning, Adv. Neural Inf. Process. Syst. 33, 5527-5540 (2020).
[10] X. Zhao, J. Lei, L. Li, et al., Distributed policy gradient with variance reduction in multi-agent reinforcement learning, arXiv preprint arXiv:2111.12961 (2021).
License
Copyright (c) 2025 Highlights in Science, Engineering and Technology

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.