Alibaba Cloud: 2024 Alibaba Cloud Open Source Big Data Workshop, Hangzhou

2024-07-27, Alibaba, Y***

OPENING

Li Yu (Jueding)
ASF Member; Apache Celeborn / Flink / HBase / Paimon PMC Member
Head of EMR, Alibaba Cloud Intelligence

AIGC further promotes the explosion of big data

• Data Volume: AI further drives a massive data explosion, far exceeding the data growth of the previous era
• Data Diversity: Multimodal data processing will become standard for future data processing, including storage, computation, and management
• Data Governance: One copy of the data serves different roles, including Data Engineers, Data Analysts, Data Scientists, and AI Engineers

[Chart: Analytic Data 46%, Others 43%, Video 5%, Pictures 5%, AI Models 1%]

[Diagram: Data Warehouse, Data Lake, and Data Lakehouse, each with Strengths and Weaknesses panels. Data Warehouse labels: Database, ETL Pipeline, Data Warehouse, Application. Data Lake labels: Database, Data Lake, ELT Model, Application, Iterate, Analyze, See Results.]

[Diagram: lakehouse stack on Alibaba Cloud. Labels: Data Formats (Apache Paimon), Data Storage (Alibaba Cloud OSS), Computing Engines, Governance Services, Management Services, DevOps.]

[Diagram: OpenLake. Labels: Application, Ingestion, Database; Realtime Compute, E-MapReduce, MaxCompute, Hologres; Data Lake Formation with Apache Paimon (Lake Format), OSS-HDFS (Lake Storage), MetaStore, Authentication, Authorization, Lineage, Tiered Storage, Compaction; DataWorks with Data Quality, Data Governance, Workflow, Copilot, IDE.]

BUILD OPEN SOURCE COMPATIBLE LAKEHOUSE ON ALIBABA CLOUD

Speaker: Li Yu (Jueding)

[Diagram: streaming lakehouse pipeline. Labels: RDBMS (binlog), Logs; Flink SQL (Streaming & Batch); Paimon (Flink Table Store) tables for the ODS, DWD, and DWS layers; ADS; Hologres; Queries; Data Serving Systems. Captions: Lake Storage, Lake Format, Lake Governance.]

Serverless Spark Transforms Data Management with One-Stop, Fully Managed Services for Seamless Development, Scheduling, and Maintenance.
100% Compatible with Open-source Spark, 3X Faster with Fusion, an Enterprise Native Engine.

Easy to Use
• One-stop data engineering support
• Visualized job and workflow monitoring
• Convenient resource and session management

Resilient
• Enterprise remote shuffle service (RSS) solution to support better elasticity
• On-demand and seamless rescaling
• Native integration with DLF and OSS

Fast
• Native engine supported, 3X faster than open source Spark
• Enhanced RSS supplies 1.5X throughput for IO-intensive apps

Flexible
• Rich OpenAPI supplied for integration
• 100% compatible with open source usage, at both the API and binary level
• Rich ecosystem support

[Architecture diagram. Labels: App Scenario (Dashboard, Report, Operational Analytics, Data Discovery, Data Science); Data Engineer; Control Plane (Scheduling, Meta Service, Security and Auth (DLF), Version Control, Intelligent Maintenance, Accounting); Compute Plane (Spark Native Engine, Remote Shuffle, Data IO); Enterprise Remote Shuffle Service; Enterprise Cache Service; Storage Layer (Lake Formats, Object Storage Service).]

[Console screenshots: Queue Management (Resource for ETL); Session Management (Resource for Interactive Query); Resource Usage Monitoring; Connection Management; SQL Editor; Catalog View; Version Control; Artifacts Management; Job List; Intelligent Diagnose; Metrics; Logs; Workflow List; Canvas Editor; Workflow Instance Monitor (Global View); Workflow Instance Monitor (Single Execution View).]

Fusion is an enterprise native engine that is 3X faster than the open source Spark Java engine

Vectorized Execution Engine
• Native Operator
• SIMD JSON Optimization

Fast Columnar Shuffle
• Enterprise RSS based on Apache Celeborn
• Data shuffle reduced by up to 40%

x86 (Intel/AMD) and ARM support: hardware-aware optimization
• SVE SIMD acceleration
• zstd-ptg compression acceleration

Native C++ Integration
• OSS-HDFS support
• Deep Parquet and ORC integration
• Paimon, Delta Lake, and Iceberg support

Testing Environment
• 6 d3s.16xlarge ECS servers
• Alibaba Cloud Linux 3
• OpenJDK 1.8.0

RSS removes the dependency on local disk for shuffle data and enables 100% disaggregation of compute and storage
• Apache Top Level Project, donated by Alibaba Cloud
• De facto RSS choice, used by Alibaba, LinkedIn, etc.
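As a rough illustration of what moving shuffle data off local disks looks like on open-source Spark, the sketch below assembles the client-side settings that the Apache Celeborn documentation describes for pointing Spark at a Celeborn cluster. It is only a sketch under assumptions: the endpoint `celeborn-master.example.internal:9097` and the helper names are hypothetical, and the deck does not say how EMR Serverless Spark configures its managed RSS internally.

```python
from typing import Dict, List


def celeborn_shuffle_conf(master_endpoints: str) -> Dict[str, str]:
    """Build Spark settings that route shuffle through Apache Celeborn.

    `master_endpoints` is a comma-separated host:port list for the
    Celeborn masters (a hypothetical value is used in the example below).
    """
    return {
        # Replace Spark's built-in sort shuffle with the Celeborn client.
        "spark.shuffle.manager":
            "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
        # Kryo is the serializer recommended by the Celeborn docs.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        # Where the Celeborn masters can be reached.
        "spark.celeborn.master.endpoints": master_endpoints,
        # The node-local external shuffle service is no longer needed.
        "spark.shuffle.service.enabled": "false",
    }


def to_spark_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the settings as `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical endpoint, for illustration only.
example_conf = celeborn_shuffle_conf("celeborn-master.example.internal:9097")
print(" ".join(to_spark_submit_args(example_conf)))
```

Keeping the settings in a plain dictionary makes it easy to reuse them either on a `spark-submit` command line or when building a session programmatically.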
Multi-Tenancy
• Enterprise security assurance with data encryption
• Enhanced IO scheduling, flow control, and quota management

Scalability
• Widely adopted in Alibaba, used by both Spark and Flink
• Successfully supports jobs with 600 TB+ of shuffle data

Performance
• 69% performance boost over YARN external shuffle
• Performance gain increases with shuffle data scale

Test Environment
• 8 d2s.10xlarge ECS servers
• Alibaba Cloud Linux 3
• OpenJDK 1.8.0
• Spark 3.3.1
• Shuffle Partition = 8000

Functionality
• Supports Spark DRA (dynamic resource allocation)
• Supports Spark AQE (adaptive query execution)

Alibaba Cloud Product Integration
• OSS-HDFS
• DLF
• MaxCompute
• DataWorks

OpenAPI
• Workspace
• Job Runs
• SQL Editor
• Workflows

Tools
• Spark-submit compatible job submission
• Notebook
• Git integration (planned)

Open Source Workflow Integration

Function                      Databricks    EMR Serverless Spark
Native Engine                 YES           YES
SQL Editor                    YES           YES
Workflow Management           YES           YES
Debugging and Monitor         YES           YES
Intelligent Diagnose          NO            YES
Catalog and Authentication    YES           YES
Data & FS                     YES (DBFS)    YES (OSS-HDFS)
Auditing                      YES           YES
Notebook                      YES           YES
CI/CD with Git                YES           NO
Assistant/Copilot             YES           NO
ML & Vector Serving           YES           NO

Realtime Compute