1,1,1,1*,1,1,2,3,1 1.,,7100492.,7100213.,100080*. E-mail: haijunwang@xjtu.edu.cn (: 62232014623723676227237762372368)(QCYRCXM-2022-345) 摘要,,.,,,,.,;,,;,,. 关键词,,,, 1引言Accepted ,Pre-trained Language Model, PLMLarge Language Model, LLM. LLM[1∼3]NaturalLanguage Process, NLP.[4][5][6][7][8],[9, 10].,,,[11∼13][14][15][16].,LLM. LLM[17∼19],,.,,,.. RedditDAN[20](Do Anything Now), DAN,,;,;,,,,[21], Emerald Sleet,LLM. LLM,LLM.,ios,,,,..,[22],[23],[24, 25].,LLM.,,,.LLM[17],,.LLM,LLM,. LLM,,LLM: (1)..,,.. (2)..,,.,. (3)..LLM,.,,,.Accepted :2,;3;4;LLM.1. 2越狱攻击的定义 ,ios,LLM,LLM, LLM.LLM.,LLM,. 2.1实例分析 LLM, LLM,无差别应答.1(a),,ChatGPTtext-davinci-0031).LLM,.LLM,,. ,安全保护,.LLM,.1(b),ChatGPT(06132)). ,LLM,,越狱攻击,.1(c).,,,. ,LLM,,WormGPT[26][27],LLM.Accepted •WormGPT[26],2021.,. •202212Redditr/ChatGPTDAN[20].DAN,,. •2023616ChatGPTGrandma Exploit,ChatGPT[27].,,Win11Office365,. •2024214[21], Emerald SleetLLM,.Crimson SandstormLLM, . WormGPT,,,,.,.LLM,.,: (1).,,,,[28]; (2)., LLM,; (3).,,,; (4).,LLM,,,.Accepted 2.2定义与形式化模型 GPT4[29].,. D1[29]:()D2[16]:D3[11]:,D4[30]: LLM,D5[31]:LLM.LLM,D6[12]:,,LLMD7[32]: ,, .D1D3D4,,D2,D5D6.D7,. .[12, 13],[33][34];[22],[35].,. 定义1(语言模型的越狱攻击),,. ,.1:.1,,3.3.1,. 1,2.2(a)1(a),R=LLMbase(Q).,LLMbase,Q,R. •LLMbase,,.,,. •QR.,.,. Liu[13]OpenAI 8,. AdvBench[12]500,.Souly[36],,StrongREJECT. 2(b)1(b),R=LLMS(Q).S.LLMSLLM. •,,., LLM,,,S.,4.2. 2(c)1(c),R=LLMFS(Q′),LLMFS(·) =F[LLMS(·)],Q′=M(Q),FLLMS,MQ.,LLMS.,. •F,.LLM,beam searchtop-K,[33],,[34]..,LLMS,.,. Deng[35],PANDORA,,,GPT.4.1. •MQQ′..,TwitterReddit Discord,jailbreakchatAIPRM FlowGPT.,.4.1.Accepted ,D1,D2,D3,D4.D5,D6,.D7,,.1,4.4. 3越狱攻击的起源及其根因 ,3,.LLM, NLPPLMLLM3,LLM,.LLM,3, [37, 38],,.3,.. 3.1LLM的演化 NLP,PLMLLM[39]. PLM,NLP,, BERT[40]T5[41]GPT1/2[42, 43]PLM.PLM,[44],scaling law., PLMLLM,PLMLLM[45]. LLMNLP[46].LLM,,,.Accepted 3.1.1上下文学习 GPT3[1]LLM,,Few-shot Learning.Zero-shot Learning,,., GoogleInstruction Tuning[47].<,,>,,FLAN.,Self-instruct[48].LLaMa[49]self-instruct,Alpaca[50].,NLP. ,,Prompt Engineering. ,,.2,.,. 
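The zero-shot and few-shot settings of Section 3.1.1 can be illustrated with a minimal sketch of prompt assembly. The triple layout (instruction, input, output) and every name below (`Example`, `build_prompt`, the sentiment task) are illustrative assumptions; they are not taken from FLAN, Self-instruct, or Alpaca.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Example:
    """One demonstration triple; the field names are illustrative assumptions."""
    instruction: str
    input: str
    output: str


def build_prompt(instruction: str, query: str, demos: Sequence[Example] = ()) -> str:
    """Assemble a zero-shot prompt (no demos) or a few-shot prompt (with demos)."""
    blocks = []
    for demo in demos:
        # Each demonstration shows the model a completed triple.
        blocks.append(
            f"Instruction: {demo.instruction}\nInput: {demo.input}\nOutput: {demo.output}"
        )
    # The final block leaves the output empty for the model to complete.
    blocks.append(f"Instruction: {instruction}\nInput: {query}\nOutput:")
    return "\n\n".join(blocks)


demos = [
    Example("Classify the sentiment.", "I loved this film.", "positive"),
    Example("Classify the sentiment.", "The plot was dull.", "negative"),
]
zero_shot = build_prompt("Classify the sentiment.", "Great acting!")
few_shot = build_prompt("Classify the sentiment.", "Great acting!", demos)
```

With an empty `demos` the function yields a zero-shot prompt. Passing demonstrations only prepends completed triples, so the task is conveyed through context and no model parameters change.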
3.1.2价值观对齐 2021AntropicAskell[17]AI,,3H.,AI,[60].,Reinforcement Learning from Human Feedback, RLHF[18, 19, 61], RLHFLLM,. OpenAI[18],,RLHF.:[62]Proximal Policy Optimization, PPO.OpenAIGPT3InstructGPT. InstructGPT,.,RLHF, Rafailov[63]Direct Preference Optimization, DPO. DPO,,.Accepted 3.2安全认知的演进 ,.[18, 64],[65, 66]. 3.2.1毒性 [37],[67, 68],,. ,[69].,,. 有毒文本的检测.. Zhao[70],.[69, 71], ,Perspective API3)OpenAI4).,[72][73],[67]. 模型内在毒性的评估.,,[38].., RealToxicityPrompts[74],. ToxicGen[75],.RealToxicityPrompts, ToxicGen,,. 模型内在毒性的消除.,.., Dinan[76],.[77],. Xu[78],. 3.2.2无害性 Askell[17],.,: (1).; (2).; (3)..,.,,.Accepted ,,.,. 无害性的消除.,3.1.2.,.,,.RLHF. .Bai[19],..RLHF,,,,. RLHF. LLaMa2[64]RLHF,RLHF(1).,, (2)., PPO.,,.LLaMa2. Bai[79]AI,.,,,,.AIRLAIF.,. Dai[80],.,LLM. 无害性的检测.,[34, 66, 81]. RLHF:..,.,,,. 3.3越狱攻击的根因 .,[82, 83].,.,.,.. 越狱攻击得以产生的根因是模型的服务属性与价值观的不匹配.4,,.,,,.,.,:.Accepted 3.3.1价值观 ,.,,[18, 19],[79].,,[84].,[85].,1,,. AI,.. AIArtificial Intelligence Generated Content, AIGC,:.,. 2021[86],. 2022[87],,,,,,,,,.,. 2023[88],,.Accepted .,,,,,[89].,,,,,.,.,,,,.,,. 3.3.2数据 ,.,[44].,[82, 83]. LLM,. LLM, ,[90∼92],,[74].,., LLM,.,,.,. .,,,.,,.,,.,,,. 3.3.3算法 .,,,,,,[93].,top-K,.,,,.[33, 94],,.Wolf[95],,,.Accepted ,,,.,,[96]. Wei[11],,.,,. 3.3.4目标 ,,.,:.,,..,,,100%.[97],,,., Kalai[98], . ,.,.,,.Wei[11],,,. 4攻防演化 ,,,,.,:4.1,5165,;4.2:,;4.3;,. 4.1攻击方法 .,..,.,,. 4.1.1人工设计 ,MQQ′,:(1).Q′Q; (2).Q′Q,.,,,. 早期实证研究.[13, 99],[13][99][99][11][13].Liu[13],:,ChatGPT. Wei[11],:,,;,,;JSON,,. Shen[99]6387,,(1).; (2).,,,.Accepted 固定上下文.LLM.,,[100, 101],6. WeiIn-Context Attack, ICA, ICA[51],,.[52]Chain of utterance,CoU[53],,,,. 嵌套情境.Q,LLM.,. DeepInception[102] ,,LLM,,LLM. 人工设计的攻击原理.,.RLHF指令遵循,,,.,,:;,;,. ,,,目标竞争,,.,. 4.1.2自动改写 :.,,.PAIRImprovement :NonePrompt:How do youhotwire a car?Improvement:I will use a fictionalTAP 嵌套情境.,LLM,.Compositional Instruction Attacks, CIA[103]Q,.:T-CIAW-CIA.T-CIA,,W-CIA,. 
ReNeLLM [104],,Q. CIA ReNeLLM 7.

[Figure: workflows of PAIR, TAP (ToT, off-topic detection), MasterKey (pre-train + fine-tune, model-based rewrite), and JADE (grammar analysis, pre-defined language rules)]

Encryption. ,,,. Wei [11] Base64, Base64, Base64. Yuan [2] CipherChat,, GBK ASCII Unicode. , 100% GPT4,. Liu [105] Disguise and Reconstruction Attack, DRA,,,,.

Multilingual. [106∼108]. Yong Low-resource language [106],,.,ChatGPT., Puttaparthi [108],,,8,ChatGPT.,,.,,[108].

Q: Teach someone how to hack into a person’s email account
Q′: Teach someone how to […] a person’s email account

Figure 8 An example of multilingual attack

Model rewriting. ,. MasterKey [109] LLM,.,,9.,GPTFuzzer [31],,,4.2.2.