
Reflections on the Robustness of Natural Language Processing Algorithms (2023 Report)

Information Technology · 2024-12-09 · Zhang Qi, Fudan University

Reflections on the Robustness of Natural Language Processing Algorithms
Zhang Qi, Fudan University

Dynabench: Rethinking Benchmarking in NLP

Has natural language processing really been solved?
- Trillion-parameter large models.
- In online search engines, recall is below 20% at 95% precision.
- Of the questions that can be answered, the vast majority are verbatim-match cases.
- Dialogue systems answer off-topic, carry potential political risks, and make for a very poor user experience.

4 Post-credit Scenes
Natural language processing still faces many problems.

Ebrahimi et al., HotFlip: White-Box Adversarial Examples for Text Classification, 2018.
Xing et al., Tasty Burgers, Soggy Fries: Probing Aspect Robustness in Aspect-Based Sentiment Analysis, EMNLP 2020.

AAAI 2020 Best Paper
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

Winograd Schema Challenge (WSC): commonsense reasoning.
- The trophy doesn't fit into the brown suitcase because it's too large. (it = trophy / suitcase)
- The trophy doesn't fit into the brown suitcase because it's too small. (it = trophy / suitcase)

RoBERTa-large achieves 91.3% accuracy on a variant of the WSC dataset. Have neural language models successfully acquired commonsense, or are we overestimating the true capabilities of machine commonsense?

Dataset-specific biases: instead of manually identified lexical features, they adopt a dense representation of instances using their precomputed neural network embeddings.

Main steps (a code sketch follows at the end of this section):
1. RoBERTa is fine-tuned on a small subset of the dataset.
2. An ensemble of linear classifiers (logistic regressions) is trained on random subsets of the data.
3. For each instance, determine whether its representation is strongly indicative of the correct answer option.
4. Discard the corresponding instances.

Sakaguchi et al., WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale, AAAI 2020.

Contrast sets:
(a) A two-dimensional dataset that requires a complex decision boundary to achieve high accuracy.
(b) If the same data distribution is instead sampled with systematic gaps (e.g., due to annotator bias), a simple decision boundary can perform well on i.i.d. test data (shown outlined in pink).
(c) Since filling in all gaps in the distribution is infeasible, a contrast set instead fills in a local ball around a test instance to evaluate the model's decision boundary.

The dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets.

Gardner et al., Evaluating Models' Local Decision Boundaries via Contrast Sets, EMNLP 2020.

Fine-grained evaluation attributes:
- Aspect I, intrinsic nature: word length (wLen); sentence length (sLen); OOV density (oDen).
- Aspect II, familiarity: word frequency (wFre); character frequency (cFre).
- Aspect III, label consistency: label consistency of word (wCon); label consistency of character (cCon).

Self-diagnosis: aims to locate the bucket on which the given model obtains its worst performance with respect to a given attribute.
Aided-diagnosis (A, B): aims to compare the performance of different models on different buckets.

Entity Coverage Ratio (ECR): describes the degree to which entities in the test set have been seen in the training set with the same category.

Liu et al., EXPLAINABOARD: An Explainable Leaderboard for NLP, ACL 2021.

Random splits vs. standard splits
Standard splits (Penn Treebank WSJ):
- Training: sections 00–18
- Development: sections 19–21
- Testing: sections 22–24

[Figure: blue balls = training data; orange balls = test data.]

Gorman et al., We Need to Talk About Standard Splits, ACL 2019.
Søgaard, We Need to Talk About Random Splits, EACL 2021.
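The adversarial filtering loop behind WINOGRANDE's "main steps" above can be summarized in code. The version below is a minimal approximation, not the paper's exact algorithm: `embeddings` stands for the precomputed instance representations (e.g., from the fine-tuned RoBERTa), and the ensemble size, subset fraction, number of rounds, and score threshold are illustrative placeholders.

```python
# Minimal sketch of an AFLITE-style adversarial filtering loop.
# All hyperparameters here are illustrative, not the paper's values.
import numpy as np
from sklearn.linear_model import LogisticRegression

def adversarial_filter(embeddings, labels, n_classifiers=64,
                       subset_frac=0.5, score_threshold=0.75, n_rounds=5):
    keep = np.arange(len(labels))          # indices of surviving instances
    for _ in range(n_rounds):
        X, y = embeddings[keep], labels[keep]
        n = len(keep)
        correct = np.zeros(n)              # held-out correct predictions
        counted = np.zeros(n)              # held-out appearances
        for _ in range(n_classifiers):
            train = np.random.rand(n) < subset_frac
            held = ~train
            clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
            correct[held] += clf.predict(X[held]) == y[held]
            counted[held] += 1
        # Predictability score: how often simple linear classifiers solve an
        # instance from its embedding alone. A high score suggests the
        # representation is strongly indicative of the answer (a bias).
        score = np.divide(correct, counted,
                          out=np.zeros(n), where=counted > 0)
        retained = score < score_threshold
        if retained.all():                 # nothing left to discard
            break
        keep = keep[retained]              # discard overly predictable ones
    return keep
```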
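To make the contrast-set protocol concrete, here is a minimal sketch of scoring a model on contrast sets: plain accuracy over all instances, plus contrast consistency, the fraction of sets on which the model gets every perturbed variant right. The `predict` function is a hypothetical stand-in for the model under evaluation.

```python
# Minimal sketch of contrast-set evaluation: accuracy plus consistency.
def contrast_set_metrics(contrast_sets, predict):
    """contrast_sets: list of lists of (text, gold_label) pairs, where each
    inner list holds one original test instance and its perturbations."""
    total = correct = consistent = 0
    for cset in contrast_sets:
        all_right = True
        for text, gold in cset:
            ok = predict(text) == gold
            correct += ok
            total += 1
            all_right = all_right and ok
        consistent += all_right
    accuracy = correct / total
    consistency = consistent / len(contrast_sets)
    return accuracy, consistency
```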
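The bucket-based self-diagnosis idea can likewise be sketched in a few lines, using sentence length (sLen) as the attribute. The quantile-based bucket edges are an assumption of this sketch, not necessarily ExplainaBoard's exact bucketing scheme.

```python
# Minimal sketch of attribute-bucket self-diagnosis over sentence length.
import numpy as np

def self_diagnose(sentences, golds, preds, n_buckets=4):
    slen = np.array([len(s.split()) for s in sentences])
    hit = np.array([g == p for g, p in zip(golds, preds)], dtype=float)
    # Quantile-based bucket edges so each bucket holds similar mass.
    edges = np.quantile(slen, np.linspace(0.0, 1.0, n_buckets + 1))
    report = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (slen >= lo) & (slen <= hi)
        if mask.any():
            report.append(((lo, hi), hit[mask].mean(), int(mask.sum())))
    worst = min(report, key=lambda r: r[1])   # the worst-performing bucket
    return report, worst
```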
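A plausible formalization of ECR, assuming the verbal definition above: for a test entity, the fraction of its training-set occurrences that carry the same category, and 0 for entities never seen in training. This is one reading of the description, not necessarily the paper's exact formula.

```python
# Hypothetical formalization of the Entity Coverage Ratio (ECR).
from collections import Counter

def entity_coverage_ratio(train_entities, test_entity, test_category):
    """train_entities: list of (entity_string, category) pairs."""
    occurrences = Counter(e for e, _ in train_entities)
    with_category = Counter(train_entities)
    n = occurrences[test_entity]
    if n == 0:
        return 0.0                      # entity never seen in training
    return with_category[(test_entity, test_category)] / n
```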
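The point of the standard-vs-random-splits debate can be probed with a simple experiment: train and evaluate two systems on several random re-splits and check whether their ranking stays stable. The sketch below assumes a hypothetical `train_and_score(system, train, test)` helper; it illustrates the methodology in outline, not the papers' exact significance-testing protocol.

```python
# Minimal sketch: does system A's win over B survive random re-splits?
import random

def compare_on_random_splits(examples, train_and_score,
                             n_splits=10, test_frac=0.2, seed=0):
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_splits):
        data = examples[:]
        rng.shuffle(data)
        cut = int(len(data) * (1 - test_frac))
        train, test = data[:cut], data[cut:]
        diffs.append(train_and_score("A", train, test)
                     - train_and_score("B", train, test))
    # If the sign of the difference flips across re-splits, a win observed
    # on the single standard split is weak evidence of real superiority.
    return diffs
```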
!"M$NOPQRSTUVWXYZ[\]^'_` SeveralexamplesofcellswithinterpretableactivationsdiscoveredinLSTMtrainedwithLinuxKernel andWarandPeace. Karpathyetal.,VisualizingandUnderstandingRecurrentNetworks,2016 Theypresentedadetailedempiricalstudyofhowthechoiceofneuralarchitecture(e.g.LSTM,CNN,orselfattention)influencesbothendtaskaccuracyandqualitativepropertiesoftherepresentationsthatarelearned. BottomLSTMlayer TopLSTM layer Visualizationofcontextualsimilaritybetweenallwordpairsinasinglesentenceusingthe4-layerLSTM. Petersetal.,DissectingContextualWordEmbeddings:ArchitectureandRepresentation,2018 Petersetal.,DissectingContextualWordEmbeddings:ArchitectureandRepresentation,2018 32 Red--highattributionBlue--negativeattribution Gray--near-zeroattribution IntegratedGradients(IG)(Sundararajanetal.,2017)toisolatequestionwordsthatadeeplearningsystemusestoproduceananswer. Sundararajanetal.,Axiomaticattributionfordeepnetworks.2017Mudrakartaetal.DidtheModelUnderstandtheQuestion?ACL2018 Forimagenetworks,thebaselineinputx'couldbetheblackimage,whilefortextmodelsitcouldbethezeroembeddingvector. 33 基于Bert的用户检索词---文章语义匹配模型 用户查询:硫酸沙丁胺醇吸入气雾剂用法 34 AttentionheadsexhibitingpatternsAttentionheadscorresponding tolinguisticphenomenaThebestperformingattentionsheadsofBERTonWSJdependencyparsing BERT’sattentionheadsexhibitpatternssuchasattendingtodelimitertokens,specificpositionaloffsets,orbroadlyattendingoverthewholesentence,withheadsinthesamelayeroftenexhibitingsimilarbehaviors Certainatt