
Reflections on the Robustness of Natural Language Processing Algorithms (2023 Report)

Information Technology · 2024-12-09 · Zhang Qi, Fudan University

Reflections on the Robustness of Natural Language Processing Algorithms
Zhang Qi, Fudan University

Dynabench: Rethinking Benchmarking in NLP

Has natural language processing really been solved?
- Trillion-parameter large models.
- In online search engines, recall is below 20% at 95% precision.
- Of the questions that can be answered, the vast majority are verbatim-match cases.
- Dialogue systems answer off-topic, carry potential political risks, and make for a very poor user experience.

4 Post-credit Scenes
Natural language processing still faces many problems.

Ebrahimi et al., HotFlip: White-Box Adversarial Examples for Text Classification, 2018.
Xing et al., Tasty Burgers, Soggy Fries: Probing Aspect Robustness in Aspect-Based Sentiment Analysis, EMNLP 2020.

AAAI 2020 Best Paper
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

Winograd Schema Challenge (WSC): commonsense reasoning.
- The trophy doesn't fit into the brown suitcase because it's too large. (it = trophy / suitcase)
- The trophy doesn't fit into the brown suitcase because it's too small. (it = trophy / suitcase)

RoBERTa-large achieves 91.3% accuracy on a variant of the WSC dataset. Have neural language models successfully acquired commonsense, or are we overestimating the true capabilities of machine commonsense?

Dataset-specific biases: instead of manually identified lexical features, they adopt a dense representation of instances using their precomputed neural network embeddings.

Main steps (a code sketch follows at the end of this section):
1. RoBERTa is fine-tuned on a small subset of the dataset.
2. An ensemble of linear classifiers (logistic regressions) is trained on random subsets of the data.
3. For each instance, determine whether its representation is strongly indicative of the correct answer option.
4. Discard the corresponding instances.

Sakaguchi et al., WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale, AAAI 2020.

Contrast sets:
(a) A two-dimensional dataset that requires a complex decision boundary to achieve high accuracy.
(b) If the same data distribution is instead sampled with systematic gaps (e.g., due to annotator bias), a simple decision boundary can perform well on i.i.d. test data (shown outlined in pink).
(c) Since filling in all gaps in the distribution is infeasible, a contrast set instead fills in a local ball around a test instance to evaluate the model's decision boundary.

The dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets.

Gardner et al., Evaluating Models' Local Decision Boundaries via Contrast Sets, EMNLP 2020.

Fine-grained evaluation attributes:
- Aspect I, intrinsic nature: word length (wLen); sentence length (sLen); OOV density (oDen).
- Aspect II, familiarity: word frequency (wFre); character frequency (cFre).
- Aspect III, label consistency: label consistency of word (wCon); label consistency of character (cCon).

Self-diagnosis: aims to locate the bucket on which the given model obtains its worst performance with respect to a given attribute.
Aided-diagnosis (A, B): aims to compare the performance of different models on different buckets.

Entity Coverage Ratio (ECR): describes the degree to which entities in the test set have been seen in the training set with the same category.

Liu et al., EXPLAINABOARD: An Explainable Leaderboard for NLP, ACL 2021.

Random splits vs. standard splits
Standard splits (Penn Treebank WSJ):
- Training: sections 00–18
- Development: sections 19–21
- Testing: sections 22–24

[Figure: blue balls = training data; orange balls = test data.]

Gorman et al., We Need to Talk About Standard Splits, ACL 2019.
Søgaard, We Need to Talk About Random Splits, EACL 2021.
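The adversarial filtering loop behind WINOGRANDE's "main steps" above can be summarized in code. The version below is a minimal approximation, not the paper's exact algorithm: `embeddings` stands for the precomputed instance representations (e.g., from the fine-tuned RoBERTa), and the ensemble size, subset fraction, number of rounds, and score threshold are illustrative placeholders.

```python
# Minimal sketch of an AFLITE-style adversarial filtering loop.
# All hyperparameters here are illustrative, not the paper's values.
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_filter(embeddings, labels, n_classifiers=64,
                       subset_frac=0.5, score_threshold=0.75, n_rounds=5):
    keep = np.arange(len(labels))          # indices of surviving instances
    for _ in range(n_rounds):
        X, y = embeddings[keep], labels[keep]
        n = len(keep)
        correct = np.zeros(n)              # held-out correct predictions
        counted = np.zeros(n)              # held-out appearances
        for _ in range(n_classifiers):
            train = np.random.rand(n) < subset_frac
            held = ~train
            clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
            correct[held] += clf.predict(X[held]) == y[held]
            counted[held] += 1
        # Predictability score: how often simple linear classifiers solve an
        # instance from its embedding alone. A high score suggests the
        # representation is strongly indicative of the answer (a bias).
        score = np.divide(correct, counted,
                          out=np.zeros(n), where=counted > 0)
        retained = score < score_threshold
        if retained.all():                 # nothing left to discard
            break
        keep = keep[retained]              # discard overly predictable ones
    return keep
```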
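To make the contrast-set protocol concrete, here is a minimal sketch of scoring a model on contrast sets: plain accuracy over all instances, plus contrast consistency, the fraction of sets on which the model gets every perturbed variant right. The `predict` function is a hypothetical stand-in for the model under evaluation.

```python
# Minimal sketch of contrast-set evaluation: accuracy plus consistency.
def contrast_set_metrics(contrast_sets, predict):
    """contrast_sets: list of lists of (text, gold_label) pairs, where each
    inner list holds one original test instance and its perturbations."""
    total = correct = consistent = 0
    for cset in contrast_sets:
        all_right = True
        for text, gold in cset:
            ok = predict(text) == gold
            correct += ok
            total += 1
            all_right = all_right and ok
        consistent += all_right
    accuracy = correct / total
    consistency = consistent / len(contrast_sets)
    return accuracy, consistency
```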
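The bucket-based self-diagnosis idea can likewise be sketched in a few lines, using sentence length (sLen) as the attribute. The quantile-based bucket edges are an assumption of this sketch, not necessarily ExplainaBoard's exact bucketing scheme.

```python
# Minimal sketch of attribute-bucket self-diagnosis over sentence length.
import numpy as np

def self_diagnose(sentences, golds, preds, n_buckets=4):
    slen = np.array([len(s.split()) for s in sentences])
    hit = np.array([g == p for g, p in zip(golds, preds)], dtype=float)
    # Quantile-based bucket edges so each bucket holds similar mass.
    edges = np.quantile(slen, np.linspace(0.0, 1.0, n_buckets + 1))
    report = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (slen >= lo) & (slen <= hi)
        if mask.any():
            report.append(((lo, hi), hit[mask].mean(), int(mask.sum())))
    worst = min(report, key=lambda r: r[1])   # the worst-performing bucket
    return report, worst
```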
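A plausible formalization of ECR, assuming the verbal definition above: for a test entity, the fraction of its training-set occurrences that carry the same category, and 0 for entities never seen in training. This is one reading of the description, not necessarily the paper's exact formula.

```python
# Hypothetical formalization of the Entity Coverage Ratio (ECR).
from collections import Counter

def entity_coverage_ratio(train_entities, test_entity, test_category):
    """train_entities: list of (entity_string, category) pairs."""
    occurrences = Counter(e for e, _ in train_entities)
    with_category = Counter(train_entities)
    n = occurrences[test_entity]
    if n == 0:
        return 0.0                      # entity never seen in training
    return with_category[(test_entity, test_category)] / n
```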
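The point of the standard-vs-random-splits debate can be probed with a simple experiment: train and evaluate two systems on several random re-splits and check whether their ranking stays stable. The sketch below assumes a hypothetical `train_and_score(system, train, test)` helper; it illustrates the methodology in outline, not the papers' exact significance-testing protocol.

```python
# Minimal sketch: does system A's win over B survive random re-splits?
import random

def compare_on_random_splits(examples, train_and_score,
                             n_splits=10, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_splits):
        data = examples[:]
        rng.shuffle(data)
        cut = int(len(data) * (1 - test_frac))
        train, test = data[:cut], data[cut:]
        diffs.append(train_and_score("A", train, test)
                     - train_and_score("B", train, test))
    # If the sign of the difference flips across re-splits, a win observed
    # on the single standard split is weak evidence of real superiority.
    return diffs
```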
!"M$NOPQRSTUVWXYZ[\]^'_` SeveralexamplesofcellswithinterpretableactivationsdiscoveredinLSTMtrainedwithLinuxKernel andWarandPeace. Karpathyetal.,VisualizingandUnderstandingRecurrentNetworks,2016 Theypresentedadetailedempiricalstudyofhowthechoiceofneuralarchitecture(e.g.LSTM,CNN,orselfattention)influencesbothendtaskaccuracyandqualitativepropertiesoftherepresentationsthatarelearned. BottomLSTMlayer TopLSTM layer Visualizationofcontextualsimilaritybetweenallwordpairsinasinglesentenceusingthe4-layerLSTM. Petersetal.,DissectingContextualWordEmbeddings:ArchitectureandRepresentation,2018 Petersetal.,DissectingContextualWordEmbeddings:ArchitectureandRepresentation,2018 32 Red--highattributionBlue--negativeattribution Gray--near-zeroattribution IntegratedGradients(IG)(Sundararajanetal.,2017)toisolatequestionwordsthatadeeplearningsystemusestoproduceananswer. Sundararajanetal.,Axiomaticattributionfordeepnetworks.2017Mudrakartaetal.DidtheModelUnderstandtheQuestion?ACL2018 Forimagenetworks,thebaselineinputx'couldbetheblackimage,whilefortextmodelsitcouldbethezeroembeddingvector. 33 基于Bert的用户检索词---文章语义匹配模型 用户查询:硫酸沙丁胺醇吸入气雾剂用法 34 AttentionheadsexhibitingpatternsAttentionheadscorresponding tolinguisticphenomenaThebestperformingattentionsheadsofBERTonWSJdependencyparsing BERT’sattentionheadsexhibitpatternssuchasattendingtodelimitertokens,specificpositionaloffsets,orbroadlyattendingoverthewholesentence,withheadsinthesamelayeroftenexhibitingsimilarbehaviors Certainatt