
A Preliminary Exploration of Large Models for Multi-Agent Reinforcement Learning - Hao Xiaotian

A Preliminary Exploration of Large Models for Multi-Agent Reinforcement Learning
Hao Xiaotian, PhD student, Tianjin University
DataFunSummit#2023 | NOAH'S ARK LAB

CONTENTS
01 Challenges for large multi-agent decision-making models: why does reinforcement learning need large models, and what challenges do large multi-agent decision models face?
02 Action semantics network: [ICLR-2021] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems
03 Permutation invariance and permutation equivariance: [ICLR-2023] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks
04 Cross-task automatic curriculum learning: [AAMAS-2023] PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

01 Challenges for large multi-agent decision-making models

• What is a cooperative multi-agent system?
A large number of practical real-world problems can be modeled as cooperative control and optimization problems involving multiple participants, who jointly optimize one (or more) shared objective functions.
Examples: multi-"hero" cooperation in game AI (AlphaStar by DeepMind, Dota 2 by OpenAI Five, Honor of Kings by Tencent); multi-user, multi-item recommendation; multi-vehicle transport and delivery optimization; multi-vehicle coordination in smart warehouses; multi-resource scheduling and co-optimization (cloud computing, production scheduling); multi-vehicle dispatch at DiDi.

• Modeling cooperative multi-agent RL
obs: [[type, distance, relative x/y coordinates, health, armor], …]
action: [no-op, move up/down/left/right, attack a specific enemy unit]
Multiagent Markov Decision Process (MMDP): <N, S, A = A_1 × ⋯ × A_n, R, T, γ>
Decentralized Partially Observable MDP (Dec-POMDP): <N, S, A, R, T, γ, O, Z>
Joint policy π = ⟨π_1, …, π_n⟩, with π* = argmax_π E_π[ Σ_{t=0}^∞ γ^t R(s, a⃗) ].

• Why this is hard
Difficulty 1: the curse of dimensionality. The state/observation space grows exponentially with the number of entities, and the joint action space likewise explodes exponentially.
Difficulty 2: low sample efficiency.
Difficulty 3: poor generality and generalization.

• Goal: design models with good generalization, so that one model can solve multiple similar problems.
Same game, different scenarios (StarCraft): MMM2, 1c3s5z, 2m_vs_1z, 3s_vs_5z, 3s5z, 3s5z_vs_3s6z.
Different games, different scenarios: StarCraft, Dota 2, Honor of Kings.

• Large models have already achieved breakthrough results in natural language processing, computer vision, and other fields (ChatGPT-3.5 has roughly 175 billion parameters).
• In reinforcement learning: BBF (Bigger, Better, Faster) [1].
Measured by the environment samples needed to reach human-level performance on Atari (over 26 games), BBF combines a larger network, self-supervision, an increasing replay ratio, and parameter resets; on Atari-100k it reaches performance similar to the model-based EfficientZero with at least a 4x reduction in runtime.
[1] Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML 2023.
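The joint-policy objective above can be made concrete with a small numeric sketch (plain Python with illustrative names, not code from the talk): a joint policy is scored by its expected discounted return, computed here for a single episode's shared team rewards.

```python
# Sketch of the Dec-POMDP objective: the joint policy maximizes
# E[ sum_t gamma^t * R(s_t, a_t) ]. Here we just evaluate the inner
# discounted sum for one recorded episode of team rewards.

def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one episode's team rewards."""
    g = 0.0
    for t, r in enumerate(rewards):
        g += (gamma ** t) * r
    return g

# A 3-step episode with shared team rewards: 1 + 0 + 0.9^2 * 1 = 1.81.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.9))
```

In practice the expectation over trajectories is estimated by sampling episodes from the environment; this sketch shows only the per-episode return that gets averaged.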
• Challenges for a general multi-agent decision model
obs: [[type, distance, relative x/y coordinates, health, armor], …]
① Different entity numbers and types: scenarios differ in the number and kinds of agents (or entities).
② Different feature inputs: entities carry different features, so observations (obs) and states differ; the network's input dimension and meaning differ across tasks.
③ Different action spaces: the policy network's output dimension and meaning differ.
④ Different rewards: the value network's output scale differs.

• Align multi-agent systems with language
Language model: words from a vocabulary describe the objective world; a tokenizer and word2vec map words to word vectors, which a neural network (the model backbone) consumes to form sentences.
Multi-agent system: attribute tables, entity tables, and action tables (action semantics), much like a relational database, describe observations and states; an attribute tokenizer and a word2vec-style embedding map attributes to entity vectors, which a neural network (the model backbone) consumes to form state and obs.
Entity attributes: type, position, health, armor, …; actions: move up, move down, …, attack, ….
Entity-factored description of a multi-agent system vs. a language model.

• Roadmap
• Action semantics network
  [ICLR-2021] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems.
• Permutation invariance, permutation equivariance, variable-length model inputs
  [ICLR-2023] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks.
• Transfer learning and cross-task automatic curriculum learning
  [AAMAS-2023] PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning.

02 Action Semantics Network
[ICLR-2021] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems

• ASN considers different actions' influence on other agents and designs neural networks based on the action semantics, e.g., move or attack actions.

03 Permutation invariance and permutation equivariance
[ICLR-2023] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks

• Entity-factored modeling in MARL
A multi-agent environment typically consists of m entities, including n learning agents and m − n non-player objects.
Both the state s and each agent's observation o_i are usually composed of the features of the m entities: [x_0, …, x_m], each x_j ∈ X, e.g., s ∈ S ⊆ ℝ^{m×d_s} and o_i ∈ O ⊆ ℝ^{m×d_o}, where d_s and d_o are the per-entity feature dimensions of s and o_i.
The curse of dimensionality: the state and observation spaces grow exponentially with the number of entities.
• If the state s or the observation o_i is simply represented as a concatenation of the m entities' features in a fixed order, the state or observation space grows exponentially as the entity number m increases, which results in low sample efficiency and poor scalability of existing MARL methods.
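As a rough illustration of the entity-factored view (the analogy between entity tables and word tokens), the sketch below turns a list of per-entity attribute records into an m × d observation matrix, one row ("entity token") per entity. The attribute set and all names are hypothetical, not the talk's actual schema.

```python
# Hypothetical entity "tokenizer": each entity's attribute record becomes
# a fixed-length feature vector, and an observation is the list (set) of
# these vectors, analogous to a sentence being a sequence of word tokens.

ATTRIBUTE_KEYS = ["type", "distance", "rel_x", "rel_y", "health", "armor"]

def entity_tokens(entities):
    """Turn a list of attribute dicts into an m x d matrix of floats."""
    return [[float(e.get(k, 0.0)) for k in ATTRIBUTE_KEYS] for e in entities]

obs = entity_tokens([
    {"type": 1, "distance": 4.0, "rel_x": 3.0, "rel_y": -1.0, "health": 40, "armor": 5},
    {"type": 2, "distance": 2.0, "rel_x": 0.0, "rel_y": 2.0, "health": 80, "armor": 0},
])
# obs has one row per entity, so it naturally handles a varying entity count.
```

Because the observation is a set of rows rather than one flat vector, the same network backbone can consume scenarios with different entity counts, which is the property the alignment with language models relies on.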
• Main idea
The state of the system describes objective information about the set of entities; it does not change with the order in which the entities are presented as input. For example, with 6 homogeneous agents, 6! different input orderings all describe the same underlying state.
• There exist symmetry features in multi-agent systems containing homogeneous agents.
• Building Q_i / V / π_i functions that are insensitive to the entities' order can significantly reduce the state/observation space by a factor of 1/m!, i.e., from |X|^m (concatenating s or o_i in a fixed order) to |X|^m / m!, thus alleviating the curse of dimensionality.

• Entity-factored modeling in MARL
There are two types of actions in typical multi-agent environments: A_i ≜ A_equiv ∪ A_inv.
Entity-correlated actions A_equiv: e.g., attack which enemy entity or heal which ally entity (StarCraft), pass the ball to which teammate (Football).
Normal (entity-uncorrelated) actions A_inv: e.g., move in different directions.

• Design permutation-insensitive Q_i / π_i functions
To build Q_i / π_i functions insensitive to the order of the entities' features ([x_0, …, x_m]), we should take the type of the actions into consideration:
• For entity-correlated actions A_equiv, permuting the input entities' order should also permute the corresponding outputs' order (permutation equivariance).
• For normal (entity-uncorrelated) actions A_inv, permuting the input entities' order should not change the outputs (permutation invariance).
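The two symmetries can be demonstrated with plain Python functions standing in for the Q/π networks (both are toy stand-ins, not the paper's architecture): pooling over the entity set yields a permutation-invariant output for entity-uncorrelated actions, while applying one shared function per entity yields a permutation-equivariant output whose order tracks the input order, as needed for entity-correlated actions such as per-enemy attack scores.

```python
# Toy stand-ins for PI/PE network heads. Entities are feature vectors.

def invariant_value(entities):
    # Sum-pooling over the set: reordering the rows cannot change the result,
    # so this is suitable for entity-uncorrelated outputs (A_inv).
    return sum(sum(e) for e in entities)

def equivariant_scores(entities):
    # A shared per-entity function: permuting the input rows permutes the
    # output scores identically, as required for entity-correlated
    # outputs (A_equiv), e.g., one attack score per enemy entity.
    return [sum(e) * 0.5 for e in entities]

ents = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
perm = [ents[2], ents[0], ents[1]]  # a permutation of the same entity set

assert invariant_value(ents) == invariant_value(perm)                 # PI
assert equivariant_scores(perm) == [equivariant_scores(ents)[i] for i in (2, 0, 1)]  # PE
```

Real architectures replace the sum and the shared scalar function with learned set encoders and shared per-entity heads, but the permutation properties checked by the two assertions are exactly the ones the slide describes.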