
AIGC-Driven 3D Scene Understanding and Medical Image Parsing

2023-07-15 · Zhen Li · The Chinese University of Hong Kong

Speaker Introduction

Dr. Zhen Li, Assistant Professor, School of Science and Engineering, The Chinese University of Hong Kong (Shenzhen); Assistant Dean, Future Network of Intelligence Institute (FNII); Presidential Young Fellow. Ph.D., The University of Hong Kong (advised by Prof. Yizhou Yu); visiting scholar, University of Chicago (advised by Prof. Jinbo Xu).

Outline

• AIGC-driven dense captioning and visual grounding for 3D indoor scenes
• AIGC-driven high-fidelity 3D talking-face animation and generation
• AIGC-driven colonoscopy image generation and parsing

Case Overview

With the rapid development of generative models such as AIGC and ChatGPT, we explore AIGC-driven 3D scene understanding and medical-scene analysis. Through a series of self-developed algorithms and tools, we study in depth the downstream applications assisted by AIGC algorithms, ranging from automatic dense captioning of 3D scenes and visual grounding in indoor scenes, to 3D-vision-driven high-fidelity talking-face generation, and further to AIGC-assisted parsing of medical scenes. In this talk we detail the architecture design and engineering practice of our solutions for 3D scene captioning and grounding, 3D talking-face generation, and generated-image-assisted colonoscopy image parsing; drawing on this experience, we also share our reflections on applying AIGC-driven 3D scene understanding and medical image understanding, together with an outlook on the future evolution of AIGC.

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring

Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1, Ruimao Zhang 1, Sheng Wang 2, Zhen Li 1,*, and Shuguang Cui 1
1 The Chinese University of Hong Kong (Shenzhen), Shenzhen Research Institute of Big Data; 2 CryoEM Center, Southern University of Science and Technology

Background: Visual Grounding

Visual grounding (VG) aims at localizing the desired objects or areas in an image or a 3D scene based on an object-related linguistic query.

Background: ScanRefer

1. Exploit object detection to generate proposal candidates;
2. Localize the described object by fusing language features into the candidates.

Background: ScanRefer — Cons

1. The object proposals in a large 3D scene are usually redundant;
2. The appearance and attribute information is not sufficiently captured;
3. The relations among proposals, and between proposals and the background, are not fully studied.

• ScanRefer generates 114 possible candidates after filtering proposals by their objectness scores;
• Each proposal's feature is produced by the detection framework;
• There is no relation reasoning among proposals.

Method: InstanceRefer

1. Instance-level candidate representation (a small number of candidates);
2. Multi-level contextual inference (attributes, objects' relations, and environment).

Method: InstanceRefer Architecture

• Language feature encoding (the same as in ScanRefer).
• Extracting instances through panoptic segmentation (predicting instances and semantics).
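The instance-extraction and filtering steps above can be sketched as follows. This is a minimal illustrative sketch, not the paper's released code: the `Instance` layout, function names, and the example scene are all my assumptions.

```python
# Hypothetical sketch of instance-level candidate filtering: panoptic
# segmentation yields instances with predicted semantic classes, and only
# those matching the target category inferred from the query are kept.
from dataclasses import dataclass
from typing import List

@dataclass
class Instance:
    instance_id: int
    semantic_label: str  # class predicted by panoptic segmentation

def filter_candidates(instances: List[Instance],
                      target_category: str) -> List[Instance]:
    """Eliminate instances whose predicted class does not match the
    target category inferred from the linguistic query."""
    return [ins for ins in instances if ins.semantic_label == target_category]

# Example: a query like "the brown chair next to the table" implies the
# target category "chair", so only chair instances survive as candidates.
scene = [Instance(0, "chair"), Instance(1, "table"), Instance(2, "chair")]
candidates = filter_candidates(scene, "chair")
```

This filtering is what keeps the candidate set small compared with detection-based pipelines that score over a hundred proposals.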
Method: InstanceRefer Architecture

• Eliminating irrelevant instances by the target category (inferred from the language query).
• Generating the visual feature of each candidate by multi-level referring (three novel modules are proposed).
• Scoring each candidate by matching the language and visual features (the candidate with the largest score is taken as the output).

Method: Specific Modules

(a) Attribute Perception (AP) Module.
• It constructs a four-layer Sparse Convolution (SparseConv) network as the feature extractor;
• After average pooling, the global attribute perception feature is obtained.

(b) Relation Perception (RP) Module.
• It uses k-nearest neighbors to construct a graph, where node features are the semantics obtained by panoptic segmentation and edge features consist of the semantics and relative positions;
• A dynamic graph convolutional network (DGCNN) is exploited to update the node features.

(c) Global Localization Perception (GLP) Module.
• It uses SparseConv layers with height-pooling to generate a 3×3 bird's-eye-view (BEV) plane;
• Combined with the language feature, it predicts which grid cell the target object is located in;
• It interpolates the probabilities and generates the global perception features by merging features from the AP module.

(d) Matching Module.
• A naive version using cosine similarity;
• An enhanced version using modular co-attention from MCAN [1].

(e) Contrastive Objective.
• A contrastive loss is applied, where Q+ and Q− denote the scores of positive and negative pairs (equation not reproduced in this transcript).

Results

• ScanRefer benchmark (result tables not reproduced in this transcript).
• Nr3D/Sr3D benchmarks (result tables not reproduced in this transcript).

Thanks for watching!
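The naive matching module above can be sketched as plain cosine similarity between the language feature and each candidate's visual feature, taking the highest-scoring candidate as the grounding output. A minimal sketch with illustrative vectors (the real features are high-dimensional network embeddings):

```python
# Sketch of the naive matching module: cosine similarity between the
# language feature and every candidate's visual feature; the candidate
# with the largest score is regarded as the output.
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def match(lang_feat, candidate_feats):
    """Score each candidate and return the best index plus all scores."""
    scores = [cosine(lang_feat, f) for f in candidate_feats]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

lang = [0.9, 0.1, 0.0]
feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
best, scores = match(lang, feats)  # best == 0
```

The enhanced version replaces this dot-product matching with modular co-attention (MCAN), and training replaces a plain ranking rule with the contrastive objective over positive and negative pairs.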
X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning

Zhihao Yuan 1,†, Xu Yan 1,†, Yinghong Liao 1, Yao Guo 2, Guanbin Li 3, Shuguang Cui 1, Zhen Li 1,*
1 The Chinese University of Hong Kong (Shenzhen), The Future Network of Intelligence Institute, Shenzhen Research Institute of Big Data; 2 Shanghai Jiao Tong University; 3 Sun Yat-sen University

Background: Task Description (3D Dense Captioning)

Background: Limitations

• The object representations in Scan2Cap are defective since they are learned solely from sparse 3D point clouds, and thus fail to provide the strong texture and color information available from 2D images.
• It requires extra 2D input in both the training and inference phases; however, the extra 2D information is usually computation-intensive and often unavailable during inference.

X-Trans2Cap: Motivation

• We propose a Cross-Modal Knowledge Transfer framework for the 3D dense captioning task.
• During the training phase, the teacher network exploits the auxiliary 2D modality and guides the student network, which takes only point clouds as input, through feature consistency constraints.
• A more faithful caption can be generated using only point clouds during inference.

X-Trans2Cap: 2D and 3D Inputs

• 3D proposals
• 2D proposals
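The teacher–student feature consistency constraint from the Motivation slide can be sketched as a simple L2 penalty pulling the 3D-only student's proposal features toward the 2D-assisted teacher's features. This is a hedged illustration of the general idea only; X-Trans2Cap's actual architecture and loss formulation may differ.

```python
# Illustrative sketch of a feature-consistency constraint for cross-modal
# knowledge transfer: the student (point clouds only) is trained to mimic
# the teacher (point clouds + auxiliary 2D images) at the feature level,
# so the 2D branch can be dropped entirely at inference time.

def consistency_loss(student_feats, teacher_feats):
    """Mean squared error between matched proposal feature vectors."""
    assert len(student_feats) == len(teacher_feats)
    total, count = 0.0, 0
    for s, t in zip(student_feats, teacher_feats):
        for a, b in zip(s, t):
            total += (a - b) ** 2
            count += 1
    return total / count
```

In a full training loop this term would be added to the captioning loss; at inference only the student runs, which is why captions can be generated from point clouds alone.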