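A First Look at Large Models for Multiagent Reinforcement Learning
Xiaotian Hao (郝晓田), Ph.D. candidate, Tianjin University
DataFunSummit #2023, Noah's Ark Lab

CONTENTS
01 Challenges for large multiagent decision-making models: Why does reinforcement learning need large models? What challenges do large multiagent decision-making models face?
02 Action semantics network: [ICLR-2021] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems
03 Permutation invariance and permutation equivariance: [ICLR-2023] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks
04 Cross-task automatic curriculum learning: [AAMAS-2023] PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning

01 Challenges for large multiagent decision-making models

• What is a cooperative multiagent system?
Many real-world problems can be modeled as cooperative control and optimization problems involving multiple agents, in which several participants cooperatively optimize one (or more) shared objective functions.
Examples: multi-"hero" cooperation in game AI, e.g., AlphaStar (DeepMind), Dota 2 (OpenAI Five), and Honor of Kings (Tencent); multi-user, multi-item recommendation; multi-vehicle transport and delivery optimization; multi-vehicle coordination in smart warehouses; multi-resource scheduling and cooperative optimization (cloud computing, production scheduling); cooperative multi-vehicle dispatch at DiDi.

• How cooperative multiagent reinforcement learning is modeled
obs: [[type, distance, relative x/y coordinates, health, armor], …]
action: [no-op, move up/down/left/right, attack a specific enemy unit]
Multiagent Markov Decision Processes (MMDP): $\langle N, S, A = A_1 \times \cdots \times A_n, R, T, \gamma \rangle$
Decentralized Partially Observable MDP (Dec-POMDP): $\langle N, S, A, R, T, \gamma, O, Z \rangle$
Joint policy: $\pi = \langle \pi_1, \ldots, \pi_n \rangle$
Objective: $\pi^* = \arg\max_{\pi} \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, \vec{a}_t)\right]$

• Difficulties of this formulation
Difficulty 1: the curse of dimensionality. The state/observation space grows exponentially with the number of entities ($|\mathcal{X}|^m$), and the joint action space explodes exponentially with the number of entities.
Difficulty 2: low sample efficiency.
Difficulty 3: poor generality and generalization.

• Goal: design models that generalize well, so that one model can solve multiple similar problems.
Same game, different scenarios (StarCraft): MMM2, 1c3s5z, 2m_vs_1z, 3s_vs_5z, 3s5z, 3s5z_vs_3s6z
Different games, different scenarios: StarCraft, Dota 2, Honor of Kings

• Large models have achieved breakthrough results in natural language processing, computer vision, and other fields (GPT-3, the model family underlying ChatGPT, has about 175 billion parameters).
• In reinforcement learning: BBF (Bigger, Better, Faster) [1]
[Figure: environment samples needed to reach human-level performance on Atari (over 26 games).]
Recipe: larger network + self-supervision + increasing replay ratio + parameter reset.
(Atari-100k) BBF reaches performance similar to the model-based EfficientZero with at least a 4x reduction in runtime.
[1] Bigger, Better, Faster: Human-level Atari with human-level efficiency, ICML-2023.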
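• What stands in the way of one model for many multiagent tasks?
obs: [[type, distance, relative x/y coordinates, health, armor], …]
① Different entity numbers and types: the number and types of agents (or entities) differ across scenarios.
② Different feature inputs: entity features differ, so observations (obs) and states differ; the network's input dimension and the meaning of each input differ.
③ Different action spaces: the policy network's output dimension and the meaning of each output differ.
④ Different rewards: reward functions differ, so the value network's output scale differs.

• Align multiagent systems and languages
[Figure: two parallel pipelines.
Language model: a vocabulary forms sentences, which describe the objective world; word → tokenizer → word2vec → word vector → neural network (model backbone).
Entity-factored description of a multiagent system: an attribute table (type, position, health, armor, …) forms an entity table, which is linked to an action table (action semantics: move up, move down, …, attack, …), similar to a relational database; together they describe the observation/state (obs, state); attribute → tokenizer → word2vec → entity vector → neural network (model backbone).]

The rest of the talk covers three pieces of work:
• Action semantics network
  [ICLR-2021] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems.
• Permutation invariance, permutation equivariance, and variable-length model inputs
  [ICLR-2023] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks.
• Transfer learning and cross-task automatic curriculum learning
  [AAMAS-2023] PORTAL: Automatic Curricula Generation for Multiagent Reinforcement Learning.

02 Action semantics network
[ICLR-2021] Action Semantics Network: Considering the Effects of Actions in Multiagent Systems

• ASN considers the influence of different actions on other agents and designs its neural networks around the action semantics, e.g., move actions versus attack actions.
[Figure panels: move actions; attack actions.]

03 Permutation invariance and permutation equivariance
[ICLR-2023] Boosting MARL via Permutation Invariant and Permutation Equivariant Networks.

• Entity-factored modeling in MARL
A multiagent environment typically consists of $m$ entities, including $n$ learning agents and $m - n$ non-player objects.
Both the state $s$ and each agent's observation $o_i$ are usually composed of the features of the $m$ entities, $[x_1, \ldots, x_m]$ with each $x_j \in \mathcal{X}$, e.g., $s \in \mathcal{S} \subseteq \mathbb{R}^{m \times d_s}$ and $o_i \in \mathcal{O} \subseteq \mathbb{R}^{m \times d_o}$, where $d_s$ and $d_o$ are the feature dimensions of each entity in $s$ and $o_i$.

• The curse of dimensionality: the state and observation spaces ($|\mathcal{X}|^m$) grow exponentially with the number of entities.
If the state $s$ or the observation $o_i$ is simply represented as a concatenation of the $m$ entities' features in a fixed order, the state space or observation space grows exponentially as the entity number $m$ increases, which results in low sample efficiency and poor scalability of existing MARL methods.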
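To make the cost of fixed-order concatenation concrete, here is a minimal sketch in Python with NumPy. The sizes (`m = 4` entities, `d = 3` features) and the helper `flatten_in_order` are hypothetical and chosen only for illustration; they do not come from the talk. The sketch takes one set of entities and counts how many different flat input vectors a network would see if the same entities were merely listed in a different order.

```python
import itertools
import math

import numpy as np


def flatten_in_order(entities, order):
    """Concatenate the entity feature rows in the given order into one flat vector."""
    return entities[list(order)].reshape(-1)


# Hypothetical toy observation: m entities with d features each.
m, d = 4, 3
rng = np.random.default_rng(0)
entities = rng.normal(size=(m, d))

# The same set of entities, listed in every possible order.
flat_inputs = {
    tuple(flatten_in_order(entities, order))
    for order in itertools.permutations(range(m))
}
num_distinct = len(flat_inputs)

print(f"distinct flat inputs for one entity set: {num_distinct}")
print(f"m! = {math.factorial(m)}")
```

Under these assumptions, a single physical situation maps to m! different concatenated inputs, each of which an order-sensitive network has to learn separately.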
• Main idea
The state of the system describes objective information about a set of entities; it does not change when the order in which the entities are input changes.
[Figure: 6 homogeneous agents, 6 states.]
Multiagent systems that contain homogeneous agents exhibit symmetries.
Building $Q_i$/$V$/$\pi_i$ functions that are insensitive to the entities' order can significantly reduce the state/observation space by a factor of $\frac{1}{m!}$, i.e., from $|\mathcal{X}|^m$ (concatenating $s$/$o_i$ in a fixed order) to $\frac{|\mathcal{X}|^m}{m!}$, thus alleviating the curse of dimensionality.

• Entity-factored modeling in MARL: two types of actions
In typical multiagent environments there are two types of actions: $A_i \triangleq (A_{\text{equiv}}, A_{\text{inv}})$.
Entity-correlated actions $A_{\text{equiv}}$: e.g., which enemy entity to attack or which ally entity to heal (StarCraft), or which teammate to pass the ball to (Football).
Normal (entity-uncorrelated) actions $A_{\text{inv}}$: e.g., moving in different directions.
[Figure panels: attack and heal in StarCraft; passing the ball in the Football game; moving in different directions.]

• Design permutation-insensitive $Q_i$/$\pi_i$ functions
To build $Q_i$/$\pi_i$ functions that are insensitive to the order of the entities' features ($[x_1, \ldots, x_m]$), we should take the type of the actions into consideration.
• For entity-correlated actions $A_{\text{equiv}}$, permuting the input entities' order should also permute the corresponding outputs' order.
• For normal (entity-uncorrelated) actions $A_{\text{inv}}$, permuting the input entities' order should not permute the outputs.
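The two requirements for $A_{\text{equiv}}$ and $A_{\text{inv}}$ can be checked numerically. The following is a minimal sketch in Python with NumPy. It is a generic shared-embedding-plus-sum-pooling construction, not the architecture proposed in the ICLR-2023 paper. All sizes, weight names (`W_embed`, `W_inv`, `w_equiv`), and the function `q_values` are hypothetical, and for simplicity it assumes one entity-correlated action per entity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: entities, features per entity, hidden width, move actions.
m, d, h, n_move = 5, 4, 8, 3

W_embed = rng.normal(size=(d, h))     # embedding weights shared by all entities
W_inv = rng.normal(size=(h, n_move))  # head for entity-uncorrelated actions (A_inv)
w_equiv = rng.normal(size=(2 * h,))   # per-entity head for entity-correlated actions (A_equiv)


def q_values(entities):
    """Return (q_inv, q_equiv) for an (m, d) array of entity features."""
    # The same weights are applied to every entity row.
    emb = np.maximum(entities @ W_embed, 0.0)
    # Sum pooling discards the order of the rows.
    pooled = emb.sum(axis=0)
    # One value per move action, computed only from the pooled summary.
    q_inv = pooled @ W_inv
    # One value per entity, computed from its own embedding plus the pooled context.
    context = np.tile(pooled, (len(emb), 1))
    q_equiv = np.concatenate([emb, context], axis=1) @ w_equiv
    return q_inv, q_equiv


entities = rng.normal(size=(m, d))
perm = rng.permutation(m)

q_inv, q_equiv = q_values(entities)
q_inv_p, q_equiv_p = q_values(entities[perm])

# Invariance: shuffling the entities leaves the move-action values unchanged.
inv_ok = np.allclose(q_inv, q_inv_p)
# Equivariance: the per-entity values are shuffled in the same way as the inputs.
equiv_ok = np.allclose(q_equiv[perm], q_equiv_p)

print(inv_ok, equiv_ok)
```

In this sketch the invariance comes entirely from the sum pooling, and the equivariance comes from applying the same per-entity head to every row; any other order-independent pooling (mean, max) would preserve both properties.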