
GPT Model Inference Acceleration Practice

2023-04-23 · NVIDIA

Inference acceleration approaches for GPT models

Agenda
• LLM inference challenges
• Overall LLM inference approaches
• GPT model basics
• GPT model inference acceleration in practice

LLM inference challenges
GPT-3 175B needs 5 × A800-80G for inference: in FP16 the 175B parameters alone occupy about 350 GB, more than four 80 GB GPUs can hold.
• How to reduce the memory requirement?
• How to accelerate computing?
• How to optimize communication?

Overall LLM inference approaches
Model compression inference
• Smaller models -> smaller memory footprint
• Compute acceleration
  • Reduced-precision computing
  • Reduced complexity -> fewer floating-point operations (FLOPs)
• Techniques: quantization, distillation, pruning

MGMN inference
• Tensor parallel
• Pipeline parallel
When the LLM is too large to deploy on a single GPU and model compression cannot reach acceptable accuracy, the other option is multi-GPU, multi-node inference (MGMN).

GPT model basics
GPT = Generative Pre-trained Transformer. GPT-3 consists of:
• Embedding layer
• Decoder layer × N
• Decoding

Model configuration of GPT-3 175B
• Number of layers (l): 96
• Sequence length (S): 2048
• Hidden size (h): 12288
• Vocabulary size (V): 51200
• Total parameters: 175B (the decoder weights alone are roughly 12·l·h² ≈ 174B, which accounts for almost all of it)

Embedding layer (input example: "This place is …")
• Text embedding: the token's one-hot vector of length vocab_size is multiplied by the embedding matrix Emb W to give a vector of length hidden_size (hidden_size = 12288 for GPT-3).
  (Figure: one-hot vector of vocab_size × Emb W → text_embedding of length hidden_size.)
• Position embedding: for the token at position id = i, a position_embedding of length hidden_size is added.
  (Figure: Sin(x_0) … Sin(x_{N-1}), N = hidden_size.)
A small sketch of this computation appears at the end of this part.

Decoder layer × N
• Attention
• Layer normalization
• FFN
• Layer normalization
• Attention is computed for the current token against every previous token.
• The FFN expands hidden_size to 4h and projects it back to hidden_size.

Decoding
• The hidden_size vector of the last token is projected back to the vocabulary to pick the next token.
• Decoding strategies: greedy search, sampling, beam search.

GPT model inference acceleration: FasterTransformer

FasterTransformer overview
Highly optimized for Transformer models:
a) Highly optimized kernels
b) Shared buffers
c) Flexible to add optimizations
d) Supported data types: FP32, FP16, BF16, INT8
e) Supports MGMN inference
(Figure: the four flows for FasterTransformer FP16 inference.)

GPT optimization in FT
• Decoder layer
  • Attention optimization: K/V cache
  • Normalization optimization
  • Activation memory optimization
  • INT8 quantization
• Decoding
  • Beam search
  • Streaming decoding in FT
• MGMN inference
  • TP/PP
  • NCCL allreduce optimization

Decoder
• In a GPT model, we receive contexts as input and then generate the reply step by step.
• We split the workflow into two phases: the context phase and the generation phase.
(Figure: the context phase consumes the input sequence; the generation phase emits output tokens one by one, output length = N.)

Context phase
• Like an encoder, it needs to handle multiple tokens at once.
• Using a CUDA kernel to compute the batched GEMM is inefficient when the sequence-length dimension is large.
• Use unfused multi-head attention to leverage the tensor cores for the GEMM computation.
• Save the resulting Key and Value into the cache to avoid recomputing them.

Generation phase: generate tokens step by step.
• Use "fused QKV masked attention" (both phases are sketched below).
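Returning briefly to the embedding layer described above: as a minimal sketch of that computation (plain NumPy with toy sizes instead of GPT-3's V = 51200 and h = 12288; the sinusoidal form only mirrors the slide's Sin(x_0) … Sin(x_{N-1}) figure, and none of this is FasterTransformer code):

```python
import numpy as np

vocab_size, hidden_size = 1000, 64   # toy sizes; GPT-3 uses 51200 and 12288
emb_w = np.random.randn(vocab_size, hidden_size).astype(np.float32)

def text_embedding(token_id: int) -> np.ndarray:
    # Multiplying the one-hot vector by emb_w is just a row lookup.
    return emb_w[token_id]

def position_embedding(pos: int) -> np.ndarray:
    # Sinusoidal form as sketched on the slide (this only mirrors the figure;
    # GPT-3 itself uses learned position embeddings).
    i = np.arange(hidden_size)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / hidden_size)
    out = np.where(i % 2 == 0, np.sin(angle), np.cos(angle))
    return out.astype(np.float32)

hidden = text_embedding(42) + position_embedding(0)   # input to decoder layer 0
```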
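To make the context-phase / generation-phase split and the K/V cache concrete, here is a minimal single-head attention sketch (NumPy, toy dimensions, no batching, illustrative weight names Wq/Wk/Wv; FasterTransformer's fused kernels are not shown): the context phase handles all prompt tokens at once and fills the cache, and each generation step computes q, k, v only for the newest token and appends its k/v to the cache.

```python
import numpy as np

hidden = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((hidden, hidden)) / np.sqrt(hidden) for _ in range(3))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_phase(prompt, cache):
    # Context phase: all prompt tokens at once (encoder-like); K/V go into the cache.
    q, k, v = prompt @ Wq, prompt @ Wk, prompt @ Wv
    cache["k"], cache["v"] = k, v
    mask = np.tril(np.ones((len(prompt), len(prompt))))          # causal mask
    scores = softmax(np.where(mask == 1, q @ k.T / np.sqrt(hidden), -1e9))
    return scores @ v

def generation_step(token, cache):
    # Generation phase: q/k/v only for the current token; reuse the cached K/V
    # instead of recomputing them for every previous token.
    q, k, v = token @ Wq, token @ Wk, token @ Wv
    cache["k"] = np.vstack([cache["k"], k])
    cache["v"] = np.vstack([cache["v"], v])
    scores = softmax(q @ cache["k"].T / np.sqrt(hidden))
    return scores @ cache["v"]

cache = {}
ctx_out = context_phase(rng.standard_normal((5, hidden)), cache)    # 5 prompt tokens
new_out = generation_step(rng.standard_normal((1, hidden)), cache)  # one generated token
```

The next part describes how FT keeps exactly this cache in a preallocated buffer and updates it in place.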
Decoder attention: K/V cache
Original: in the decoder, multi-head attention computes
• the relationship between the current token and
• all tokens generated in previous steps.
Optimization: use a K/V cache to avoid both recomputation and concatenation.
• Prepare a large K/V cache buffer
• Compute q, k, v of the current token only
• Put its k/v into the cache in place
(Figure: original vs. optimized kernel, showing the warpReduce/blockReduce/sync structure of the fused masked-attention kernel.)

Decoder: activation buffer optimization
Original: allocate a buffer for every decoder layer's activations.
Optimization: in FT, allocate the buffer for only one layer's activations and reuse it across all layers.
(Figure: decoder layers sharing a single activation buffer instead of one buffer per layer.)

Quantization
Quantization is used for model-size reduction and inference acceleration (the baseline is FP16 inference). There are two common ways to quantize the model:
• Post-training quantization (PTQ): less cost, lower accuracy
• Quantization-aware training (QAT): higher cost, higher accuracy

Weight-only INT8 in FT
• Weights are stored in INT8, while activations stay in FP16.
• In the GEMM, the INT8 weights are loaded, cast to FP16, and the FP16 tensor cores are used (a small sketch follows the performance section).

W8A8 for GPT in FT
• Both weights and activations are quantized, and the GEMM runs on the INT8 tensor cores.

Decoding: streaming decoding in FT
• When the batch size is large and the output sequence lengths vary a lot, some outputs have to wait for the longest one.
• FT supports streaming decoding, returning each token as soon as it is generated (sketched after the performance section).
• Better user experience.
(Figure: original vs. optimized flow of inputs and outputs through GPT.)

MGMN: TP/PP
• Tensor parallel and pipeline parallel are both supported.
• Recommendation: use TP intra-node and PP inter-node, because of their different communication volumes and the available bandwidth (a toy illustration follows the performance section).

MGMN: allreduce optimization
Original NCCL allreduce:
• When the batch size is small, the intra-node bandwidth cannot be fully used, so the NCCL allreduce becomes latency-bound.
• The NCCL allreduce usually takes up ~20% of the end-to-end pipeline.
Optimization:
• Use an optimized CUDA kernel for the allreduce.
• When the batch size is small, the perf gain on the allreduce communication is about 50%.
• The end-to-end perf gain is ~10%.
(Figure: eight GPUs, GPU 0 through GPU 7, participating in the allreduce.)

Performance
GPT-3 175B inference, Megatron vs. FasterTransformer:

bs | input-seq-len | output-seq-len | Megatron latency (ms) | FT latency (ms) | Speedup
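A minimal sketch of the weight-only INT8 idea described above, assuming per-output-channel symmetric scales (the function names here are illustrative, not FT's API): the weights live in INT8 and are cast back to FP16 at GEMM time, while activations stay FP16.

```python
import numpy as np

def quantize_weight_int8(w_fp: np.ndarray):
    # Per-output-channel symmetric quantization: w ≈ w_int8 * scale.
    scale = np.abs(w_fp).max(axis=0, keepdims=True) / 127.0
    w_int8 = np.clip(np.round(w_fp / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def weight_only_int8_gemm(x_fp16: np.ndarray, w_int8: np.ndarray, scale):
    # Weights are stored in INT8; activations stay FP16.
    # At GEMM time the weights are cast back ("dequantized") and the
    # multiply-accumulate runs in floating point (FP16 tensor cores in FT).
    w_fp16 = w_int8.astype(np.float16) * scale
    return x_fp16 @ w_fp16

w = np.random.randn(512, 2048).astype(np.float32)
x = np.random.randn(4, 512).astype(np.float16)
w_q, s = quantize_weight_int8(w)
y = weight_only_int8_gemm(x, w_q, s)   # ~FP16 result from INT8-stored weights
```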
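The streaming-decoding point is essentially "hand back each token as soon as it is generated instead of waiting for the whole batch to finish". A toy Python sketch of the difference (illustrative only; it does not reflect FT's actual streaming interface):

```python
import time

def generate_tokens(prompt, max_new_tokens=8):
    # Stand-in for the per-step decoder; yields one token id per step.
    for step in range(max_new_tokens):
        time.sleep(0.05)                      # pretend this is one decode step
        yield hash((prompt, step)) % 50257

def generate_blocking(prompt):
    # Original behaviour: the caller sees nothing until the full sequence is done.
    return list(generate_tokens(prompt))

def generate_streaming(prompt):
    # Streaming decoding: hand each token back as soon as it exists,
    # so short replies (or impatient users) do not wait for the longest one.
    for token in generate_tokens(prompt):
        print("new token:", token)            # e.g. push to the client here
        yield token

tokens = list(generate_streaming("This place is"))
```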
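To see why tensor parallelism needs an allreduce at all (the communication that the optimized kernel above targets), here is a single-process toy simulation: a row-parallel GEMM whose inner dimension is split across hypothetical "GPUs", with the final sum standing in for the allreduce.

```python
import numpy as np

def tensor_parallel_gemm(x, w, num_gpus=4):
    # Row-parallel linear layer: split the weight's input dimension across GPUs.
    # Each "GPU" holds one slice of x and w and computes a partial output.
    x_shards = np.split(x, num_gpus, axis=1)
    w_shards = np.split(w, num_gpus, axis=0)
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]   # local GEMMs
    return sum(partials)   # the allreduce: every GPU ends up with this sum

x = np.random.randn(2, 1024)
w = np.random.randn(1024, 4096)
assert np.allclose(tensor_parallel_gemm(x, w), x @ w)
```

With real devices, each GPU holds only its own shard and the sum is performed by NCCL (or, in FT's optimization, a custom CUDA kernel) across them.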