Hao Xiaotian · Tianjin University · Ph.D. candidate
DataFunSummit#2023

CONTENTS
01 Challenges for multiagent decision-making foundation models
   Why does reinforcement learning need large models? What challenges do multiagent decision-making foundation models face?
02 Action semantics network
   [ICLR-21] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems
03 Permutation invariance and permutation equivariance
   [ICLR-23] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks
04 Cross-task automatic curriculum learning
   [AAMAS-23] PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

01 Challenges for Multiagent Decision-Making Foundation Models

Basic concepts
• What is a cooperative multiagent system?
A large number of real-world problems can be modeled as cooperative control and optimization problems involving multiple agents: several participants jointly optimize one (or more) shared objective functions.

Basic concepts
• Modeling cooperative multiagent reinforcement learning
obs: { [type, distance, relative x/y coordinates, health, armor], … }
action: [no-op, move up/down/left/right, attack a specific enemy unit]
Multiagent Markov Decision Process (MMDP): $\langle N, S, A = A_1 \times \cdots \times A_n, R, T, \gamma \rangle$
Decentralized Partially Observable MDP (Dec-POMDP): $\langle N, S, A, R, T, \gamma, O, Z \rangle$
Joint policy $\boldsymbol{\pi} = \langle \pi_1, \dots, \pi_n \rangle$, $\boldsymbol{\pi}^* = \arg\max_{\boldsymbol{\pi}} \mathbb{E}_{\boldsymbol{\pi}}\big[\sum_{t=0}^{T} \gamma^t R(s_t, \vec{a}_t)\big]$.
Difficulty 1: the curse of dimensionality — the joint action space $A = A_1 \times \cdots \times A_n$ grows exponentially with the number of agents $n$.

What is a multiagent reinforcement learning foundation model?
• A model designed for good generalization, so that a single model can solve many similar tasks.

What benefits can larger models bring to reinforcement learning?
• Large models have already achieved breakthrough results in natural language processing, computer vision, and other fields (ChatGPT-3.5 has about 175 billion parameters).
• In reinforcement learning: BBF (Bigger, Better, Faster) [1] on Atari-100k achieves performance similar to the model-based EfficientZero with at least a 4x reduction in runtime.
[1] Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML-2023.

What challenges do multiagent reinforcement learning foundation models face?
obs: { [type, distance, relative x/y coordinates, health, armor], … }
① Different entity numbers and types: the number and types of agents (or entities) differ across scenarios;
② Different feature inputs: entities have different features → observations (obs) and states differ;
③ Different action spaces: the action spaces differ;
④ Different rewards: the reward functions differ.
→ As a result, the networks' input dimensions and semantics differ, the policy networks' output dimensions and semantics differ, and the value networks' output scales differ.

Describing multiagent systems uniformly, in analogy to language models
• Align multiagent systems and languages: an entity-factored description of the multiagent system.

3 key design priors
• Action semantics network
  [ICLR-2021] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems.
• Permutation invariance, permutation equivariance, variable-length model inputs
  [ICLR-2023] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks.
• Transfer learning, cross-task automatic curriculum learning
  [AAMAS-2023] PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning.

02 Action Semantics Network
[ICLR-2021] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems

ASN (Action Semantics Network)
• ASN considers different actions' influence on other agents and designs neural networks based on the action semantics, e.g., move or attack actions: a move action only changes the agent's own situation, whereas an attack (or heal) action directly affects one specific other entity, as the sketch below illustrates.
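To make the two-branch idea concrete, here is a minimal, hypothetical PyTorch sketch of an ASN-style policy head. It is not the paper's released architecture; the class name `ASNPolicy`, the inner-product scoring, and all dimensions are illustrative assumptions. Entity-uncorrelated actions (no-op/move) are scored from the agent's own features, while each entity-correlated action ("attack entity j") is scored from the pairwise interaction between the agent and entity j.

```python
# Minimal ASN-style sketch (illustrative, not the paper's exact architecture).
import torch
import torch.nn as nn

class ASNPolicy(nn.Module):
    def __init__(self, self_dim, entity_dim, hidden=64, n_move=5):
        super().__init__()
        # Branch for entity-uncorrelated actions (no-op / move):
        # depends only on the agent's own features.
        self.self_net = nn.Sequential(nn.Linear(self_dim, hidden), nn.ReLU())
        self.move_head = nn.Linear(hidden, n_move)
        # Branch for entity-correlated actions (attack entity j): one shared
        # network applied to every entity, so each "attack j" logit comes from
        # the interaction between the agent and entity j.
        self.entity_net = nn.Sequential(nn.Linear(entity_dim, hidden), nn.ReLU())

    def forward(self, self_feats, entity_feats):
        # self_feats: [B, self_dim]; entity_feats: [B, m, entity_dim]
        h_self = self.self_net(self_feats)                 # [B, hidden]
        move_logits = self.move_head(h_self)               # [B, n_move]
        h_ent = self.entity_net(entity_feats)              # [B, m, hidden]
        # One logit per attackable entity: inner product <h_self, h_ent_j>.
        attack_logits = torch.einsum('bh,bmh->bm', h_self, h_ent)
        return torch.cat([move_logits, attack_logits], dim=-1)  # [B, n_move + m]

# Example: a batch of 2 agents, each observing 6 attackable entities.
policy = ASNPolicy(self_dim=10, entity_dim=8)
logits = policy(torch.randn(2, 10), torch.randn(2, 6, 8))  # shape [2, 5 + 6]
```

A side benefit of this factorization is that the attack branch emits one logit per entity, so the same parameters handle a variable number of attackable entities.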
03 Permutation Invariance and Permutation Equivariance
[ICLR-2023] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks

Motivation
• Entity-factored modeling in MARL
A multiagent environment typically consists of $m$ entities, including $n$ learning agents and $m-n$ non-player objects. Both the state $s$ and each agent's observation $o_i$ are usually composed of the features of the $m$ entities $[x_1, \dots, x_m]$, each $x_j \in \mathcal{X}$, e.g., $s \in \mathcal{S} \subseteq \mathbb{R}^{m \times d_s}$ and $o_i \in \mathcal{O} \subseteq \mathbb{R}^{m \times d_o}$, where $d_s$ and $d_o$ are the feature dimensions of each entity in $s$ and $o_i$.

The state/observation space grows exponentially with the number of entities
• If the state $s$ or the observation $o_i$ is simply represented as a concatenation of the $m$ entities' features in a fixed order, the state/observation space grows exponentially as the entity number $m$ increases, which results in low sample efficiency and poor scalability for existing MARL methods.
• The state of the system describes objective information about the set of entities; it should not change when the input order of the entities changes.

Main idea
• Multiagent systems containing homogeneous agents exhibit symmetry.
• Building $Q_i/V/\pi_i$ functions that are insensitive to the entities' order can significantly reduce the state/observation space by a factor of $\frac{1}{m!}$, i.e., from $|\mathcal{X}|^m$ (concatenating $s/o_i$ in a fixed order) to $\frac{|\mathcal{X}|^m}{m!}$, thus alleviating the curse of dimensionality.

Motivation
• Two types of actions in typical multiagent environments: $A_i \triangleq \langle A^{\text{equiv}}, A^{\text{inv}} \rangle$.
Entity-correlated actions $A^{\text{equiv}}$: e.g., attacking a particular enemy entity or healing a particular ally entity (StarCraft), passing the ball to a particular teammate (Football).
Normal (entity-uncorrelated) actions $A^{\text{inv}}$: e.g., moving in different directions.

Motivation
• Designing permutation-insensitive $Q_i/\pi_i$ functions
To build $Q_i/\pi_i$ functions that are insensitive to the order of the entities' features $[x_1, \dots, x_m]$, we should take the action types into consideration. Let $g \in G$ be an arbitrary permutation matrix operating on $[x_1, \dots, x_m]^T$:
• For entity-correlated actions $A^{\text{equiv}}$, permuting the input entities' order should also permute the corresponding outputs' order (permutation equivariance: $\pi(g \cdot o) = g \cdot \pi(o)$).
• For normal (entity-uncorrelated) actions $A^{\text{inv}}$, permuting the input entities' order should not change the outputs (permutation invariance: $\pi(g \cdot o) = \pi(o)$).

Method
• Designing permutation invariant and permutation equivariant policy networks; a minimal sketch of the two properties follows below.
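As a concrete illustration, here is a minimal Deep-Sets-style sketch of a policy network with both properties. This is not the exact architecture of the ICLR-23 paper; the class name `PIPEPolicy`, the mean pooling, and all dimensions are assumptions made for the example. A shared embedding is applied to every entity; symmetric pooling yields permutation-invariant logits for $A^{\text{inv}}$, while a per-entity head yields one logit per entity for $A^{\text{equiv}}$, so those outputs permute together with the inputs.

```python
# Minimal permutation-invariant / permutation-equivariant policy sketch
# (Deep Sets style; illustrative assumptions, not the paper's exact design).
import torch
import torch.nn as nn

class PIPEPolicy(nn.Module):
    def __init__(self, entity_dim, hidden=64, n_inv_actions=5):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(entity_dim, hidden), nn.ReLU())
        self.inv_head = nn.Linear(hidden, n_inv_actions)   # A_inv (e.g., move)
        self.equiv_head = nn.Linear(2 * hidden, 1)         # A_equiv (e.g., attack entity j)

    def forward(self, entities):
        # entities: [B, m, entity_dim]; the same phi embeds every entity.
        h = self.phi(entities)                             # [B, m, hidden]
        pooled = h.mean(dim=1)                             # symmetric pooling -> order-free
        inv_logits = self.inv_head(pooled)                 # permutation INVARIANT
        ctx = pooled.unsqueeze(1).expand_as(h)             # invariant context per entity
        equiv_logits = self.equiv_head(torch.cat([h, ctx], dim=-1)).squeeze(-1)
        return inv_logits, equiv_logits                    # equiv_logits: [B, m], EQUIVARIANT

# Sanity check: permuting the input entities permutes equiv_logits the same
# way and leaves inv_logits unchanged.
net = PIPEPolicy(entity_dim=8)
x = torch.randn(2, 6, 8)
perm = torch.randperm(6)
inv1, eq1 = net(x)
inv2, eq2 = net(x[:, perm])
assert torch.allclose(inv1, inv2, atol=1e-5)
assert torch.allclose(eq1[:, perm], eq2, atol=1e-5)
```

The two assertions check exactly the properties defined above: $\pi(g \cdot o) = \pi(o)$ for the invariant branch and $\pi(g \cdot o) = g \cdot \pi(o)$ for the equivariant branch. Because the entity embedding is shared, the network also accepts a variable number of entities $m$ without changing its parameters.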