STARROCKS SUMMIT ASIA 2024: LAKEHOUSE IS ALL YOU NEED

StarRocks Cloud-Native Lakehouse Analytics: A Technical Deep Dive
杨关锁, DLA R&D Engineer, 镜舟科技
焦明烨, R&D Engineer, Alibaba Cloud (阿里云)

01 StarRocks Lakehouse basic architecture
02 StarRocks performance optimizations for Iceberg
03 StarRocks performance optimizations for Paimon

01 StarRocks Lakehouse Basic Architecture

StarRocks Lakehouse Architecture
[Architecture diagram. Sources: Hadoop HDFS, Amazon S3, Flink. Paths: "Load to StarRocks", "Load to Open format", "Query Data Lake Directly", "Write Back/ETL", "Data Lake Materialized view". Layers: Aggregated Data (MV/View), Denormalized Data (External MV), ODS (Hive/Iceberg/Hudi...). Open Data Lake formats: Hive, Iceberg, Hudi, Paimon..]
· StarRocks as Lakehouse: data can be written into StarRocks, providing ultra-fast analysis
· StarRocks as unified data lake query engine: directly query open-format data in the lake
· Use materialized views to accelerate queries on demand

02 StarRocks Performance Optimizations for Iceberg

StarRocks Querying Iceberg
Why is it not fast enough?
[Diagram labels: HMS/Glue; bottleneck "Execution plan is not optimal"]
1. Get metadata (tables, partitions, files, ...)
2. Slice files into scan ranges
4. Assign scan ranges to BE
5. Read data from remote storage (HDFS/OSS/S3)
[Diagram labels: FE slices data files into scan ranges; BE x 3; file formats Parquet, ORC, CSV. Bottlenecks: "Fetch and parse metadata is slow", "File reader is not efficient", "Remote IO is slow"]

Iceberg Metadata Cache
Performance pain points
· Fetching and parsing metadata files is slow
  > Parsing an 8 MB metadata file takes 1 s
· High access cost
  > A single query usually touches only a few of the data files listed in one manifest file
Metadata Cache
· Caches the parsed metadata
· Supports incremental refresh in the background

Iceberg Distributed Metadata Plan
Performance pain points
· The plan phase takes too long, mainly because metadata files are slow to parse
· Planning relies too heavily on the FE node's CPU and memory
· When a table's metadata is large, Iceberg job planning time increases significantly
Distributed Metadata Plan
· Implementation
  > Moves fetching, parsing, and filtering of metadata files from a single FE node to multiple BEs
· Results
  > Iceberg job planning is several times faster
  > FE memory and CPU overhead drops significantly
[Diagram: three BEs each "Parse and Filter" a manifest file; the FE runs "Rebuild ScanTask" on the results]

Test with 4 backends
                 v3.2 local plan   v3.3 distributed plan
one week data    60s               11s
two week data    72s               14s

Iceberg Metadata Cache & Distributed Metadata Plan
How do the metadata cache and the distributed metadata plan work together?
· The FE first checks whether its local cache is hit
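· On a cache miss, the FE chooses between two options based on certain conditions:
  > Fetch and parse the metadata files directly from remote storage
  > Run a distributed manifest plan job
[Diagram, FE side (MetaDataManager): 1. Check whether the target manifest file exists; 2. Check whether the target manifest file exists; 3.1. Fetch and parse the manifest file directly from remote storage; 3.2. Run a distributed manifest plan job. Caches shown: In-memory deserialized manifest cache; Local manifest file cache. BE side: 4. Fetch and parse the manifest file from remote storage]

Statistics Collection
Statistics: equi-depth histogram
BucketCount=4
DataSet=[1.6,1.9,1.9,2.0,2.4,2.6,2.7,2.7,2.8,2.9,3.4,3.5]
[Figure: equi-depth histogram of DataSet; x axis from 1.75 to 3.50, y axis from 0.0 to 3.0]
[Diagram: Parser -> AST -> Analyzer -> Relation -> Transformer -> LogicalPlan -> Optimizer (uses Statistics) -> PhysicalPlan -> PlanFragment Builder -> DistributedPlan]
· Statistics give the CBO optimizer the basis for its cost calculations
· The optimizer uses statistics to choose the best execution plan it can
· Statistics types: basic statistics, histogram statistics
· Collection types: full collection, sampled collection
· Collection methods: manual, automatic, and query-triggered

Statistics Collection: Query-Triggered Collection
[Diagram labels: Optimizer; "get table statistics"; ConnectorTableCacheStats; "add pending task"; PendingTaskQueue; "schedule task"; RunningTaskQueue; "running task"; "collect Statistics"; "insert"; "invalidate outdated cache"]
· The optimizer looks up the statistics cached on the FE and determines which tables and columns need collection
· The trigger information is wrapped into a collection task and added to the pending queue
· A scheduler thread periodically moves tasks from the pending queue to the running queue
· When a collection task runs, it collects and stores the statistics, then clears the corresponding outdated statistics from the FE cache

Incremental Scan Range Delivery
Full delivery
· All scan ranges are fetched first and then delivered in one go; BEs wait synchronously in the meantime
· Loading all scan ranges at once puts heavy memory pressure on FE/BE
· Unfriendly to short-circuit queries such as LIMIT
Incremental delivery
· Scan ranges are split into small chunks and delivered in batches, so BEs can start executing early
· Reduces FE/BE memory pressure
· For short-circuit queries such as LIMIT, avoids as much unnecessary scan range overhead as possible
[Diagram, full delivery: MetaData -> get files -> scan ranges; PhysicalPlan -> ScanNode -> Preprocess Fragment -> Deliver Fragment]
[Diagram, incremental delivery: the same flow with repeated Preprocess Fragment / Deliver Fragment rounds; "If More Files: More Round of delivery"]

Parquet Reader: Adaptive IO Merging
SELECT C1, C2, C3 FROM T WHERE C1 = 100
· Merges IO requests to reduce remote-access IOPS
· Tunes the merge range according to the filter conditions to reduce read amplification
[Diagram: no IO merging vs. adaptive IO merging, branching on the question "Is C1 highly selective?"]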
[Diagram, continued: NO and YES branches, each reading column chunks C1, C2, C3 from HDFS/S3]

Parquet Reader: Page Index Support
· Parquet Page Index technique
[Figure: Parquet file layout with ColumnIndex structs for row_group[i].columns[j] (i: 0..m, j: 0..n), followed by OffsetIndex structs for row_group[i].columns[j] (i: 0..m, j: 0..n), followed by FileMetaData]
· Query optimization based on the Page Index
[Figure: pages of columns A, B, and C annotated with per-page Min/Max statistics; example query: SELECT * FROM T WHERE A ≤ 3]

Parquet Reader: Page Index Support
· Page Index benchmark results

Query   No PageIndex                    With PageIndex
        Latency (ms)   ReadBytes (GB)   Latency (ms)   ReadBytes (GB)
Q01     1812           21.087           1389           7.932
Q02     4418           36.523           4002           25.573
Q03     271            1.235            238            1.000
Q04     138            0.687            78             0.105
Q05     218            1.634            80             0.150
Q06     221            1.908            107            0.235
Q07     2006           22.893           1487           8.796
Q08     1972           22.828           1470           8.740
Q09     3351           5.247            368            2.055
Q10     2071           15.464           647            3.559
SUM     16478          129.49           9866           58.13

Data
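To make the Page Index slides above more concrete, here is a minimal sketch of min/max page pruning for a predicate of the form A ≤ bound. It is an illustration only, not the StarRocks Parquet reader: the names PageStats, pages_matching_le, and row_ranges are hypothetical, and the page statistics are made-up values rather than the ones in the slide figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageStats:
    """Hypothetical per-page entry combining ColumnIndex and OffsetIndex data."""
    first_row: int   # index of the first row stored in the page
    row_count: int   # number of rows in the page
    min_value: int   # smallest value of the column in the page
    max_value: int   # largest value of the column in the page


def pages_matching_le(pages, bound):
    """Keep pages that may contain a value <= bound (page min <= bound)."""
    return [p for p in pages if p.min_value <= bound]


def row_ranges(pages):
    """Merge contiguous pages into half-open (start, end) row ranges."""
    ranges = []
    for p in pages:
        start, end = p.first_row, p.first_row + p.row_count
        if ranges and ranges[-1][1] == start:
            ranges[-1] = (ranges[-1][0], end)  # extend the previous range
        else:
            ranges.append((start, end))
    return ranges


# Made-up statistics for four pages of column A (assumption, not slide data).
column_a_pages = [
    PageStats(first_row=0, row_count=100, min_value=1, max_value=2),
    PageStats(first_row=100, row_count=100, min_value=2, max_value=4),
    PageStats(first_row=200, row_count=100, min_value=5, max_value=8),
    PageStats(first_row=300, row_count=100, min_value=9, max_value=12),
]

# WHERE A <= 3: only the first two pages can hold matching rows.
selected = pages_matching_le(column_a_pages, 3)
print(row_ranges(selected))  # [(0, 200)]
```

In this sketch the surviving row ranges would then be used to skip the corresponding pages of the other columns as well, which is one plausible reason ReadBytes falls in the benchmark table above.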