
GPT Model Inference Acceleration Practice

2023-04-23 · NVIDIA

Inference acceleration approaches for GPT models

Agenda
• LLM inference challenges
• Overall LLM inference approaches
• GPT model basics
• GPT model inference acceleration in practice

LLM inference challenges
GPT-3 175B needs 5 × A800-80G for inference: in FP16 the 175B parameters alone occupy about 350 GB, more than four 80 GB GPUs can hold.
• How to reduce the memory requirement?
• How to accelerate computing?
• How to optimize communication?

Overall LLM inference approaches
Model compression inference
• Smaller models -> smaller memory footprint
• Compute acceleration
  • Reduced-precision computing
  • Reduced complexity -> fewer floating-point operations (FLOPs)
• Techniques: quantization, distillation, pruning

MGMN inference
• Tensor parallel
• Pipeline parallel
When the LLM is too large to deploy on a single GPU and model compression cannot reach acceptable accuracy, the other option is multi-GPU, multi-node inference (MGMN).

GPT model basics
GPT = Generative Pre-trained Transformer. GPT-3 consists of:
• Embedding layer
• Decoder layer × N
• Decoding

Model configuration of GPT-3 175B
• Number of layers (l): 96
• Sequence length (S): 2048
• Hidden size (h): 12288
• Vocabulary size (V): 51200
• Total parameters: 175B (the decoder weights alone are roughly 12·l·h² ≈ 174B, which accounts for almost all of it)

Embedding layer (input example: "This place is …")
• Text embedding: the token's one-hot vector of length vocab_size is multiplied by the embedding matrix Emb W to give a vector of length hidden_size (hidden_size = 12288 for GPT-3).
  (Figure: one-hot vector of vocab_size × Emb W → text_embedding of length hidden_size.)
• Position embedding: for the token at position id = i, a position_embedding of length hidden_size is added.
  (Figure: Sin(x_0) … Sin(x_{N-1}), N = hidden_size.)
A small sketch of this computation appears at the end of this part.

Decoder layer × N
• Attention
• Layer normalization
• FFN
• Layer normalization
• Attention is computed for the current token against every previous token.
• The FFN expands hidden_size to 4h and projects it back to hidden_size.

Decoding
• The hidden_size vector of the last token is projected back to the vocabulary to pick the next token.
• Decoding strategies: greedy search, sampling, beam search.

GPT model inference acceleration: FasterTransformer

FasterTransformer overview
Highly optimized for Transformer models:
a) Highly optimized kernels
b) Shared buffers
c) Flexible to add optimizations
d) Supported data types: FP32, FP16, BF16, INT8
e) Supports MGMN inference
(Figure: the four flows for FasterTransformer FP16 inference.)

GPT optimization in FT
• Decoder layer
  • Attention optimization: K/V cache
  • Normalization optimization
  • Activation memory optimization
  • INT8 quantization
• Decoding
  • Beam search
  • Streaming decoding in FT
• MGMN inference
  • TP/PP
  • NCCL allreduce optimization

Decoder
• In a GPT model, we receive contexts as input and then generate the reply step by step.
• We split the workflow into two phases: the context phase and the generation phase.
(Figure: the context phase consumes the input sequence; the generation phase emits output tokens one by one, output length = N.)

Context phase
• Like an encoder, it needs to handle multiple tokens at once.
• Using a CUDA kernel to compute the batched GEMM is inefficient when the sequence-length dimension is large.
• Use unfused multi-head attention to leverage the tensor cores for the GEMM computation.
• Save the resulting Key and Value into the cache to avoid recomputing them.

Generation phase: generate tokens step by step.
• Use "fused QKV masked attention" (both phases are sketched below).
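Returning briefly to the embedding layer described above: as a minimal sketch of that computation (plain NumPy with toy sizes instead of GPT-3's V = 51200 and h = 12288; the sinusoidal form only mirrors the slide's Sin(x_0) … Sin(x_{N-1}) figure, and none of this is FasterTransformer code):

```python
import numpy as np

vocab_size, hidden_size = 1000, 64   # toy sizes; GPT-3 uses 51200 and 12288
emb_w = np.random.randn(vocab_size, hidden_size).astype(np.float32)

def text_embedding(token_id: int) -> np.ndarray:
    # Multiplying the one-hot vector by emb_w is just a row lookup.
    return emb_w[token_id]

def position_embedding(pos: int) -> np.ndarray:
    # Sinusoidal form as sketched on the slide (this only mirrors the figure;
    # GPT-3 itself uses learned position embeddings).
    i = np.arange(hidden_size)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / hidden_size)
    out = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return out.astype(np.float32)

hidden = text_embedding(42) + position_embedding(0)   # input to decoder layer 0
```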
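To make the context-phase / generation-phase split and the K/V cache concrete, here is a minimal single-head attention sketch (NumPy, toy dimensions, no batching, illustrative weight names Wq/Wk/Wv; FasterTransformer's fused kernels are not shown): the context phase handles all prompt tokens at once and fills the cache, and each generation step computes q, k, v only for the newest token and appends its k/v to the cache.

```python
import numpy as np

hidden = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((hidden, hidden)) / np.sqrt(hidden) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_phase(prompt, cache):
    # Context phase: all prompt tokens at once (encoder-like); K/V go into the cache.
    q, k, v = prompt @ Wq, prompt @ Wk, prompt @ Wv
    cache["k"], cache["v"] = k, v
    mask = np.tril(np.ones((len(prompt), len(prompt))))          # causal mask
    scores = softmax(np.where(mask == 1, q @ k.T / np.sqrt(hidden), -1e9))
    return scores @ v

def generation_step(token, cache):
    # Generation phase: q/k/v only for the current token; reuse the cached K/V
    # instead of recomputing them for every previous token.
    q, k, v = token @ Wq, token @ Wk, token @ Wv
    cache["k"] = np.vstack([cache["k"], k])
    cache["v"] = np.vstack([cache["v"], v])
    scores = softmax(q @ cache["k"].T / np.sqrt(hidden))
    return scores @ cache["v"]

cache = {}
ctx_out = context_phase(rng.standard_normal((5, hidden)), cache)    # 5 prompt tokens
new_out = generation_step(rng.standard_normal((1, hidden)), cache)  # one generated token
```

The next part describes how FT keeps exactly this cache in a preallocated buffer and updates it in place.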
Decoder attention: K/V cache
Original: in the decoder, multi-head attention computes
• the relationship between the current token and
• all tokens generated in previous steps.
Optimization: use a K/V cache to avoid both recomputation and concatenation.
• Prepare a large K/V cache buffer
• Compute q, k, v of the current token only
• Put its k/v into the cache in place
(Figure: original vs. optimized kernel, showing the warpReduce/blockReduce/sync structure of the fused masked-attention kernel.)

Decoder: activation buffer optimization
Original: allocate a buffer for every decoder layer's activations.
Optimization: in FT, allocate the buffer for only one layer's activations and reuse it across all layers.
(Figure: decoder layers sharing a single activation buffer instead of one buffer per layer.)

Quantization
Quantization is used for model-size reduction and inference acceleration (the baseline is FP16 inference). There are two common ways to quantize the model:
• Post-training quantization (PTQ): less cost, lower accuracy
• Quantization-aware training (QAT): higher cost, higher accuracy

Weight-only INT8 in FT
• Weights are stored in INT8, while activations stay in FP16.
• In the GEMM, the INT8 weights are loaded, cast to FP16, and the FP16 tensor cores are used (a small sketch follows the performance section).

W8A8 for GPT in FT
• Both weights and activations are quantized, and the GEMM runs on the INT8 tensor cores.

Decoding: streaming decoding in FT
• When the batch size is large and the output sequence lengths vary a lot, some outputs have to wait for the longest one.
• FT supports streaming decoding, returning each token as soon as it is generated (sketched after the performance section).
• Better user experience.
(Figure: original vs. optimized flow of inputs and outputs through GPT.)

MGMN: TP/PP
• Tensor parallel and pipeline parallel are both supported.
• Recommendation: use TP intra-node and PP inter-node, because of their different communication volumes and the available bandwidth (a toy illustration follows the performance section).

MGMN: allreduce optimization
Original NCCL allreduce:
• When the batch size is small, the intra-node bandwidth cannot be fully used, so the NCCL allreduce becomes latency-bound.
• The NCCL allreduce usually takes up ~20% of the end-to-end pipeline.
Optimization:
• Use an optimized CUDA kernel for the allreduce.
• When the batch size is small, the perf gain on the allreduce communication is about 50%.
• The end-to-end perf gain is ~10%.
(Figure: eight GPUs, GPU 0 through GPU 7, participating in the allreduce.)

Performance
GPT-3 175B inference, Megatron vs. FasterTransformer:

bs | input-seq-len | output-seq-len | Megatron latency (ms) | FT latency (ms) | Speedup
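A minimal sketch of the weight-only INT8 idea described above, assuming per-output-channel symmetric scales (the function names here are illustrative, not FT's API): the weights live in INT8 and are cast back to FP16 at GEMM time, while activations stay FP16.

```python
import numpy as np

def quantize_weight_int8(w_fp: np.ndarray):
    # Per-output-channel symmetric quantization: w ≈ w_int8 * scale.
    scale = np.abs(w_fp).max(axis=0, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(w_fp / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def weight_only_int8_gemm(x_fp16: np.ndarray, w_int8: np.ndarray, scale):
    # Weights are stored in INT8; activations stay FP16.
    # At GEMM time the weights are cast back ("dequantized") and the
    # multiply-accumulate runs in floating point (FP16 tensor cores in FT).
    w_fp16 = w_int8.astype(np.float16) * scale
    return x_fp16 @ w_fp16

w = np.random.randn(512, 2048).astype(np.float32)
x = np.random.randn(4, 512).astype(np.float16)
w_q, s = quantize_weight_int8(w)
y = weight_only_int8_gemm(x, w_q, s)   # ~FP16 result from INT8-stored weights
```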
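The streaming-decoding point is essentially "hand back each token as soon as it is generated instead of waiting for the whole batch to finish". A toy Python sketch of the difference (illustrative only; it does not reflect FT's actual streaming interface):

```python
import time

def generate_tokens(prompt, max_new_tokens=8):
    # Stand-in for the per-step decoder; yields one token id per step.
    for step in range(max_new_tokens):
        time.sleep(0.05)                      # pretend this is one decode step
        yield hash((prompt, step)) % 50257

def generate_blocking(prompt):
    # Original behaviour: the caller sees nothing until the full sequence is done.
    return list(generate_tokens(prompt))

def generate_streaming(prompt):
    # Streaming decoding: hand each token back as soon as it exists,
    # so short replies (or impatient users) do not wait for the longest one.
    for token in generate_tokens(prompt):
        print("new token:", token)            # e.g. push to the client here
        yield token

tokens = list(generate_streaming("This place is"))
```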
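To see why tensor parallelism needs an allreduce at all (the communication that the optimized kernel above targets), here is a single-process toy simulation: a row-parallel GEMM whose inner dimension is split across hypothetical "GPUs", with the final sum standing in for the allreduce.

```python
import numpy as np

def tensor_parallel_gemm(x, w, num_gpus=4):
    # Row-parallel linear layer: split the weight's input dimension across GPUs.
    # Each "GPU" holds one slice of x and w and computes a partial output.
    x_shards = np.split(x, num_gpus, axis=1)
    w_shards = np.split(w, num_gpus, axis=0)
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]   # local GEMMs
    return sum(partials)   # the allreduce: every GPU ends up with this sum

x = np.random.randn(2, 1024)
w = np.random.randn(1024, 4096)
assert np.allclose(tensor_parallel_gemm(x, w), x @ w)
```

With real devices, each GPU holds only its own shard and the sum is performed by NCCL (or, in FT's optimization, a custom CUDA kernel) across them.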