arXiv:2401.03568v2 [cs.AI] 25 Jan 2024

AGENT AI: SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION

Zane Durante1†*, Qiuyuan Huang2‡*, Naoki Wake2*, Ran Gong3†, Jae Sung Park4†, Bidipta Sarkar1†, Rohan Taori1†, Yusuke Noda5, Demetri Terzopoulos3, Yejin Choi4, Katsushi Ikeuchi2, Hoi Vo5, Li Fei-Fei1, Jianfeng Gao2

1Stanford University; 2Microsoft Research, Redmond; 3University of California, Los Angeles; 4University of Washington; 5Microsoft Gaming

Figure 1: Overview of an Agent AI system that can perceive and act in different domains and applications. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI). Agent AI training has demonstrated the capacity for multi-modal understanding in the physical world. It provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Large foundation models trained for agent- and action-related tasks can be applied to physical and virtual worlds when trained on cross-reality data. We present a general overview of an Agent AI system that can perceive and act in many different domains and applications, possibly serving as a route towards AGI using an agent paradigm.

*Equal Contribution. ‡Project Lead. †Work done while interning at Microsoft Research, Redmond.
ABSTRACT

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next-embodied-action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
Contents

1 Introduction 5
  1.1 Motivation 5
  1.2 Background 5
  1.3 Overview 6
2 Agent AI Integration 7
  2.1 Infinite AI agent 7
  2.2 Agent AI with Large Foundation Models 8
    2.2.1 Hallucinations 8
    2.2.2 Biases and Inclusivity 9
    2.2.3 Data Privacy and Usage 10
    2.2.4 Interpretability and Explainability 11
    2.2.5 Inference Augmentation 12
    2.2.6 Regulation 13
  2.3 Agent AI for Emergent Abilities 14
3 Agent AI Paradigm 15
  3.1 LLMs and VLMs 15
  3.2 Agent Transformer Definition 15
  3.3 Agent Transformer Creation 16
4 Agent AI Learning 17
  4.1 Strategy and Mechanism 17
    4.1.1 Reinforcement Learning (RL) 17
    4.1.2 Imitation Learning (IL) 18
    4.1.3 Traditional RGB 18
    4.1.4 In-context Learning 18
    4.1.5 Optimization in the Agent System 18
  4.2 Agent Systems (zero-shot and few-shot level) 19
    4.2.1 Agent Modules 19
    4.2.2 Agent Infrastructure 19
  4.3 Agentic Foundation Models (pretraining and finetune level) 19
5 Agent AI Categorization 20
  5.1 Generalist Agent Areas 20
  5.2 Embodied Agents 20
    5.2.1 Action Agents 20
    5.2.2 Interactive Agents 21
  5.3 Simulation and Environments Agents 21
  5.4 Generative Agents 21
    5.4.1 AR/VR/mixed-reality Agents 22
  5.5 Knowledge and Logical Inference Agents 22
    5.5.1 Knowledge Agent 23
    5.5.2 Logic Agents 23
    5.5.3 Agents for Emotional Reasoning 23
    5.5.4 Neuro-Symbolic Agents 24
  5.6 LLMs and VLMs Agent 24
6 Agent AI Application Tasks 24
  6.1 Agents for Gaming 24
    6.1.1 NPC Behavior 24
    6.1.2 Human-NPC Interaction 25
    6.1.3 Agent-based Analysis of Gaming 25
    6.1.4 Scene Synthesis for Gaming 27
    6.1.5 Experiments and Results 27
  6.2 Robotics 28
    6.2.1 LLM/VLM Agent for Robotics 30
    6.2.2 Experiments and Results 31
  6.3 Healthcare 35
    6.3.1 Current Healthcare Capabilities 36
  6.4 Multimodal Agents 36
    6.4.1 Image-Language Understanding and Generation 36
    6.4.2 Video and Language Understanding and Generation 37
    6.4.3 Experiments and Results 39
  6.5 Video-language Experiments 41
  6.6 Agent for NLP 45
    6.6.1 LLM agent 45
    6.6.2 General LLM agent 45
    6.6.3 Instruction-following LLM agents 46
    6.6.4 Experiments and Results 46
7 Agent AI Across Modalities, Domains, and Realities 48
  7.1 Agents for Cross-modal Understanding 48
  7.2 Agents for Cross-domain Understanding 48
  7.3 Interactive agent for cross-modality and cross-reality 49
  7.4 Sim to Real Transfer 49
8 Continuous and Self-improvement for Agent AI 49
  8.1 Human-based Interaction Data 49
  8.2 Foundation Model Generated Data 50
9 Agent Dataset and Leaderboard 50
  9.1 "CuisineWorld" Dataset for Multi-agent Gaming 50
    9.1.1 Benchmark 51
    9.1.2 Task 51
    9.1.3 Metrics and Judging 51
    9.1.4 Evaluation 51
  9.2 Audio-Video-Language Pre-training Dataset 51
10 Broader Impact Statement 52
11 Ethical Considerations 53
12 Diversity