OPENING

Yu Li (李钰, alias 绝顶)
ASF Member; Apache Celeborn/Flink/HBase/Paimon PMC Member
Head of EMR, Alibaba Cloud Intelligence

AIGC further promotes the explosion of big data
• Data Volume: AI further drives a massive data explosion, far exceeding the data growth of the previous era
• Data Diversity: Multimodal data processing will become standard for future data processing, including storage, computation, and management
• Data Governance: One set of data serves different roles, including data engineers, data analysts, data scientists, and AI engineers
[Chart: share of data by type: Analytic Data 46%, Others 43%, Video 5%, Pictures 5%, AI Models 1%]

Data Warehouse, Data Lake, Data Lakehouse

Data Warehouse
[Diagram labels: Strengths; Weaknesses; Application; ETL Pipeline; Data Warehouse; Database]

Data Lake
[Diagram labels: ELT Model; Iterate; Application; Analyze; See Results; Database; Data Lake; Strengths; Weaknesses]

Data Lakehouse
[Diagram labels: Data Warehouse; Data Lake; Data Formats: Apache Paimon; Data Storage: Alibaba Cloud OSS; DevOps; Computing Engines; Governance Services; Management Services]

OpenLake
[Architecture diagram labels: Realtime Compute; E-MapReduce; MaxCompute; Hologres; Data Lake Formation; Apache Paimon (Lake Format); MetaStore; Authentication; Tiered Storage; OSS-HDFS (Lake Storage); Lineage; Authorization; Compaction; DataWorks; Data Quality; Data Governance; Workflow; Copilot; IDE; Application; Ingestion; Database]

BUILD OPEN SOURCE COMPATIBLE LAKEHOUSE ON ALIBABA CLOUD

[Pipeline diagram labels: RDBMS; Logs; binlog; Flink SQL (Streaming & Batch); Flink Table Store; Paimon; ODS; DWD; DWS; ADS; Hologres; Queries; Data Serving Systems]

Lake Storage, Lake Format, Lake Governance
[OpenLake architecture diagram, as above]

Serverless Spark transforms data management with one-stop, fully managed services for seamless development, scheduling, and maintenance.
100% compatible with open-source Spark, 3X faster with Fusion, an enterprise native engine.

Easy to Use
• One-stop data engineering support
• Visualized job and workflow monitoring
• Convenient resource and session management

Resilient
• Enterprise remote shuffle service (RSS) solution to support better elasticity
• On-demand and seamless rescaling
• Native integration with DLF and OSS

Fast
• Native engine supported, 3X faster than open-source Spark
• Enhanced RSS provides 1.5X throughput for IO-intensive apps

Flexible
• Rich OpenAPI provided for integration
• 100% compatible with open-source usage, at both the API and binary level
• Rich ecosystem support

[Architecture diagram]
App Scenario: Dashboard, Report, Operational Analytics, Data Discovery, Data Science
Other labels: Enterprise Cache Service; Accounting; Intelligent Maintenance; Version Control; Security and Auth (DLF); Scheduling; Meta Service; Data Engineer; Enterprise Remote Shuffle Service; Control Plane; Spark Native Engine; Remote Shuffle; Compute Plane; Data IO; Storage Layer; Lake Formats; Object Storage Service

[Product screenshots]
• Queue Management (Resource for ETL)
• Session Management (Resource for Interactive Query)
• Resource Usage Monitoring
• Connection Management
• SQL Editor
• Catalog View
• Version Control
• Artifacts Management
• Job List
• Intelligent Diagnose
• Metrics
• Logs
• Workflow List
• Canvas Editor
• Workflow Instance Monitor – Global View
• Workflow Instance Monitor – Single Execution View

Fusion is an enterprise native engine that is 3X faster than the open-source Spark Java engine

Vectorized Execution Engine
• Native Operator
• SIMD JSON Optimization

Fast Columnar Shuffle
• Enterprise RSS based on Apache Celeborn
• Data shuffle reduced by up to 40%

x86 (Intel/AMD) and ARM support, hardware-aware optimization
• SVE SIMD acceleration
• zstd-ptg compression acceleration

Native C++ Integration
• OSS-HDFS support
• Deep Parquet and ORC integration
• Paimon, Delta Lake and Iceberg support

Testing Environment
• 6 d3s.16xlarge ECS servers
• Alibaba Cloud Linux 3
• OpenJDK 1.8.0

RSS removes the dependency on local disk for shuffle data and enables 100% disaggregation of compute and storage
• Apache Top-Level Project, donated by Alibaba Cloud
• De facto RSS choice, used by Alibaba, LinkedIn, etc.
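As a rough illustration of what pointing a Spark job at a remote shuffle service can involve, the sketch below assembles Spark properties for open-source Apache Celeborn and renders them as spark-submit arguments. The property names follow the open-source Celeborn deployment documentation as best understood here and should be checked against the Celeborn version in use; the enterprise RSS in EMR Serverless Spark may be configured differently or automatically. The helper names and the host names are hypothetical.

```python
from typing import Dict, List


def celeborn_spark_conf(master_endpoints: List[str],
                        shuffle_partitions: int = 200) -> Dict[str, str]:
    """Build Spark properties for an open-source Celeborn cluster (illustrative only)."""
    if not master_endpoints:
        raise ValueError("at least one Celeborn master endpoint is required")
    return {
        # Replace Spark's built-in shuffle with the Celeborn client.
        "spark.shuffle.manager":
            "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        "spark.celeborn.master.endpoints": ",".join(master_endpoints),
        # Shuffle data lives on Celeborn workers, so the node-local
        # external shuffle service is not needed.
        "spark.shuffle.service.enabled": "false",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Keep adaptive query execution on, but disable the local shuffle
        # reader, which assumes shuffle files sit on the executor's own disk.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.localShuffleReader.enabled": "false",
        "spark.sql.shuffle.partitions": str(shuffle_partitions),
    }


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render properties as '--conf key=value' pairs in a stable order."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical host names; 9097 is assumed to be the master port.
    conf = celeborn_spark_conf(["celeborn-master-1:9097"], shuffle_partitions=8000)
    print(" ".join(to_spark_submit_args(conf)))
```

Keeping the properties in a plain dictionary makes it easy to review them or merge them into an existing job definition before submission.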
Multi-Tenancy
• Enterprise security assurance with data encryption
• Enhanced IO scheduling, flow control and quota management

Scalability
• Widely adopted in Alibaba, used by both Spark and Flink
• Successfully supports jobs with 600 TB+ of shuffle data

Performance
• 69% performance boost over YARN external shuffle
• Performance gain increases with shuffle data scale

Test Environment
• 8 d2s.10xlarge ECS servers
• Alibaba Cloud Linux 3
• OpenJDK 1.8.0
• Spark 3.3.1
• Shuffle Partition = 8000

Functionality
• Supports Spark DRA (Dynamic Resource Allocation)
• Supports Spark AQE (Adaptive Query Execution)

Alibaba Cloud Product Integration
OSS-HDFS, DLF, MaxCompute, DataWorks

OpenAPI
• Workspace
• Job Runs
• SQL Editor
• Workflows

Tools
• Spark-submit compatible job submission
• Notebook
• Git integration (planned)

Open Source Workflow Integration

Function                     Databricks    EMR Serverless Spark
Native Engine                YES           YES
SQL Editor                   YES           YES
Workflow Management          YES           YES
Debugging and Monitor        YES           YES
Intelligent Diagnose         NO            YES
Catalog and Authentication   YES           YES
Data & FS                    YES (DBFS)    YES (OSS-HDFS)
Auditing                     YES           YES
Notebook                     YES           YES
CI/CD with Git               YES           NO
Assistant/Copilot            YES           NO
ML & Vector Serving          YES           NO

Realtime Compute