CNCC | How to build a new infrastructure and new paradigm for big data analysis in the era of big models?

CNCC2024

Brief introduction of the forum:

How to build a new infrastructure and new paradigm for big data analysis in the era of big models?

Time: 13:30-17:30, October 26

Venue: Autumn Garden - Classroom Area (East 6)

Note: If there is any change, please refer to the final information on the official website (https://ccf.org.cn/cncc2024).

The development of large models has driven the explosive growth of data storage, computing, and processing requirements. However, big data infrastructure faces challenges such as storage scalability, computing resource utilization, real-time processing capabilities, data security, and privacy protection. How to build an efficient, elastic, and intelligent big data analysis infrastructure to meet the complex needs of enterprise-level applications has become an important direction of current technological innovation.

This forum focuses on the new infrastructure and new paradigm for big data analysis in the era of large models. It discusses the deep integration of cloud-native data platforms, large models, and data intelligence, and aims to advance the intelligent development of big data platforms. The talks cover cutting-edge topics such as elastic cloud architectures, cleaning of pre-training corpora for large models, and intelligent data management and analysis. They also show the key role that vector engines, inference acceleration, generative SQL optimization, and network digital twins play in improving the efficiency of data processing and management, offering forward-looking ideas and practical experience for integrated Data+AI applications.

Forum Agenda

No.	Topic	Speaker	Affiliation
1	Data+AI-Driven Cloud-Native Data Platform: Opportunities and Challenges	Li Feifei	Alibaba
2	Cleaning Massive Pre-training Corpora for Large Models	Chen Wenguang	Tsinghua University
3	Big Data + Big Model: A New Path for Data Intelligence	Gao Yunjun	Zhejiang University
4	Digital Twin Exploration of Digital Networks	Tian Chen	Nanjing University
5	Optimization Methods and Challenges of Generative SQL Statements	Wang Zhaoguo	Shanghai Jiao Tong University
Panel session	All speakers at this forum

Forum Chairs and Speakers

Forum Chairs

Yuan Ye

CCF Distinguished Member, deputy director of the CCF Database Committee, and dean of the Research Institute of Beijing Institute of Technology

He is a professor and doctoral supervisor, and a recipient of both the National Science Fund for Distinguished Young Scholars and the Excellent Young Scientists Fund. He has led key projects of the National Natural Science Foundation of China and key R&D projects of the Ministry of Science and Technology. His honors include the second prize of the National Science and Technology Progress Award, the first prize of the Natural Science Award of the Chinese Institute of Electronics, the first prize of the Science and Technology Progress Award of both the Ministry of Education and Liaoning Province, a National Excellent Doctoral Dissertation nomination, and the CCF Outstanding Doctoral Dissertation Award. His main research interests are big data management and analysis. He has published more than 100 CCF-A papers in leading conferences and journals such as SIGMOD, VLDB, ICDE, VLDBJ, TKDE, and TPDS.

Zheng Bolong

Professor and doctoral supervisor of the School of Computer Science, Huazhong University of Science and Technology

He is a national high-level young talent whose main research direction is big data management and analysis. He has published more than 50 CCF-A papers in leading conferences and journals such as SIGMOD, VLDB, ICDE, VLDBJ, and TKDE. He has led National Natural Science Foundation of China projects (General, Sino-European Talents, and Youth programs) and sub-projects of the National Key R&D Program. His honors include a VLDB 2024 Best Paper Award nomination, VLDB 2020 and ICDE 2019 excellent paper recognitions, the ACM SIGSPATIAL China Rising Star Award, and the Huawei Spark Award. He is chair of the CCF YOCSEF Wuhan sub-forum (2024-2025) and an executive member of the CCF Database Committee and the CCF Information System Committee.

Forum Speakers

Li Feifei

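ACM/CCF/IEEE Fellow, Senior Vice President of Alibaba Cloud

He is a standing committee member of the CCF Big Data Expert Committee and the CCF Database Committee. He has received multiple best paper and achievement awards at top international academic and technical conferences in databases and big data systems (IEEE ICDE 2024 Industry and Application Best Paper Award, ACM SIGMOD 2024 Industry Track Best Paper Award, ACM SIGMOD 2023 Best Paper Award, EDBT 2022 10 Years Test of Time Award, IEEE ICDCS 2020 Best Paper Award, ACM SoCC Best Paper Award Runner Up, ACM SIGMOD 2016 Best Paper Award, ACM SIGMOD 2015 Best System Demonstration Award, IEEE ICDE 2014 10 Years Most Influential Paper Award, IEEE ICDE 2004 Best Paper Award). As the first contributor, he received the World Internet Conference 2019 Global Leading Scientific and Technological Achievement Award, the first prize of the Zhejiang Province Science and Technology Progress Award, and the first prize of the Chinese Institute of Electronics Science and Technology Progress Award, among others. He serves as an editorial board member or chair for a number of leading international and domestic academic journals and conferences. He led his team in developing the Alibaba Cloud Yaochi database product portfolio, built around the cloud-native database PolarDB, which has reached the largest database market share in China. Alibaba Cloud is the only Chinese database vendor to have been placed in the Leaders quadrant of Gartner's global Cloud DBMS market report four consecutive times.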

Title: Data+AI-Driven Cloud-Native Data Platform: Opportunities and Challenges

Abstract: Data and computing power drive the rapid development of artificial intelligence. Cloud computing provides a massive, easy-to-use pool of computing resources, and cloud-native data platforms, with their elastic scaling, high availability, and distributed design, advance data-driven AI models such as large language models (LLMs). To meet enterprise needs such as elastic scaling, elastic computing, on-demand usage, AI inference, and RAG construction, cloud computing platforms and cloud-native data platforms need to explore new architectures, such as achieving a shared-everything architecture and the separation of storage and compute through distributed shared storage (shared-storage), and need to support Data+AI inference and RAG applications built on vector engines and inference acceleration. At the same time, financial-grade high availability, geo-distributed multi-active deployment, and multi-source, heterogeneous, multi-model data management are key capabilities that a cloud-native data platform must provide. AI and data platforms are rapidly evolving along four directions: cloud-native, platform-based, integrated, and intelligent. Building on continued exploration and practice around these challenges, we have developed PolarDB, a cloud-native database system that provides enterprise-level cloud-native database capabilities, along with AnalyticDB (ADB), an enterprise-level cloud-native data warehouse, and Lindorm, a cloud-native multi-model database. On top of its computing and data platforms, Alibaba Cloud has developed the Tongyi large language model. These systems have withstood the world-class transaction peaks of Alibaba's 11.11 shopping festival and achieved commercial success on Alibaba Cloud. We deeply combine the latest technologies, such as machine learning and security encryption, to provide an intelligent, efficient, and secure one-stop Data+AI cloud-native data platform for next-generation enterprise applications.

Chen Wenguang

CCF Fellow, director of the CCF Academic Working Committee, honorary member of YOCSEF, and professor in the Department of Computer Science, Tsinghua University

His main research areas are operating systems, compilers, and parallel computing. He is currently vice chairman of the Beijing Computer Federation and executive director of the ACM China Council.

Title: Cleaning Massive Pre-training Corpora for Large Models

Abstract: The capability of large models depends on large volumes of high-quality corpora, and open-source models have already been trained on as many as 10 trillion tokens. Although the Internet offers far more than 10 trillion tokens, the raw data must go through complex cleaning before it yields a high-quality corpus that can be used for model training. Corpus processing for large models involves multiple steps, such as tokenization, language identification, deduplication, and quality assessment. It is a typical Data+AI workload and places high demands on the underlying data processing system. This talk introduces the Zhuge Nu ("Zhuge crossbow") big data system, which is compatible with the PySpark interface and optimizes the performance of Python UDFs, so it can effectively support the cleaning of pre-training corpora for large models.
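To make the steps named in this abstract concrete, here is a minimal single-machine sketch in plain Python. It is not the speaker's system: every function name, heuristic, and threshold below is a hypothetical stand-in chosen for illustration (production pipelines use trained language identifiers, fuzzy deduplication, and learned quality classifiers).

```python
import hashlib
import re
import unicodedata


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def detect_language(text: str) -> str:
    """Toy language ID (assumption): 'zh' if most letters are CJK, else 'en'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    cjk = sum(1 for ch in letters if "CJK" in unicodedata.name(ch, ""))
    return "zh" if cjk / len(letters) > 0.5 else "en"


def fingerprint(text: str) -> str:
    """Exact-duplicate key: hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def quality_ok(tokens: list[str], min_tokens: int = 5,
               min_unique_ratio: float = 0.5) -> bool:
    """Toy quality filter: long enough and not dominated by repeated tokens."""
    if len(tokens) < min_tokens:
        return False
    return len(set(tokens)) / len(tokens) >= min_unique_ratio


def clean_corpus(docs: list[str], keep_languages: tuple[str, ...] = ("en",)) -> list[str]:
    """Apply language filtering, deduplication, and quality filtering in order."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if detect_language(doc) not in keep_languages:
            continue  # language identification
        key = fingerprint(doc)
        if key in seen:
            continue  # deduplication
        seen.add(key)
        if not quality_ok(tokenize(doc)):
            continue  # tokenization + quality assessment
        kept.append(doc)
    return kept


sample_docs = [
    "Large models need clean and diverse training text.",
    "large models need   clean and diverse training text.",  # duplicate after normalization
    "buy buy buy buy buy buy buy buy",                       # repetitive, low quality
    "大模型的能力依赖大量高质量语料。",                          # filtered out by language
    "ok",                                                    # too short
]
cleaned = clean_corpus(sample_docs)
```

Each per-document function here has the shape of a Python UDF, which is the kind of code a PySpark-compatible engine would run in parallel over a much larger corpus.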

Gao Yunjun

Qiushi Distinguished Professor of Zhejiang University, doctoral supervisor

He has published more than 150 CCF-A papers and 4 monographs, and holds more than 20 granted patents and 4 registered software copyrights. He has won 6 best or excellent paper awards at ICDE and other conferences, and 3 special or first prizes for scientific and technological progress from provincial, ministerial, or national societies. He is currently vice chairman of the ACM China SIGSPATIAL Branch, director of the Provincial Key Laboratory of Big Data Intelligent Computing, and deputy dean of the School of Software at Zhejiang University. He serves as an editorial board member or associate editor of journals such as TKDE, JCST, FCS, and the Journal of Computer Research and Development. He has served as program committee, workshop, tutorial, publicity, publication, or local (co-)chair for more than 10 leading international conferences such as VLDB, SIGSPATIAL, and WISE, and as a (senior) program committee member of SIGMOD, VLDB, ICDE, SIGKDD, SIGIR, and others. Ph.D. and master's students he has supervised have won 8 outstanding doctoral or master's thesis awards from provincial, ministerial, or national societies, as well as the global championship of the KDD Cup 2022 Wind Power Forecast track.

Title: Big Data + Big Model: A New Path for Data Intelligence

Abstract: As training data has grown in scale, large models have developed strong generalization capabilities and shown emergent intelligence. The intelligence of a large model comes from data. In turn, that intelligence is feeding back into big data management and analysis, and has shown great potential in data governance and data analysis. The deep integration of big data and large models will open a new path for data intelligence. This talk focuses on the research frontier of that integration. It first introduces the background of big data and large models, then discusses how data management technology empowers large models (DB for LLMs) and how large model technology empowers data analysis (LLMs for Data Analytics), and finally reports on explorations in vector databases, retrieval-augmented generation, Text-to-SQL, data agents, and related topics.

Tian Chen

Professor, Nanjing University

He is a doctoral supervisor and a recipient of the National Science Fund for Distinguished Young Scholars. He has published more than 100 papers at top conferences and in well-known international journals in computer networks and distributed systems, including SIGCOMM, NSDI, OSDI, FAST, and SIGMOD. His work has been widely cited and followed by researchers in China and abroad; according to Google Scholar, his papers have been cited more than 5,000 times.

Personal homepage: https://cs.nju.edu.cn/tianchen.

Title: Digital Twin Exploration of Digital Networks

Abstract: A network digital twin is a digital reproduction of a physical network. Data from the physical network, including packets, configuration information, and node status, is gathered into a data warehouse through real-time or non-real-time collection. With the help of artificial intelligence, expert experience, big data analysis, and other technologies, this data then supports analysis, diagnosis, simulation, and decision-making across the full life cycle of the physical network. This talk presents the preliminary progress of the NASA research group at Nanjing University on network digital twins.

Wang Zhaoguo

Tenured associate professor and doctoral supervisor at Shanghai Jiao Tong University, and deputy dean of the School of Software

He is a winner of the Outstanding Young Scientist Fund and a principal investigator of key R&D projects. He works mainly on databases and distributed systems, and his results have been published at authoritative conferences in related fields, including OSDI, SIGMOD, VLDB, NSDI, PPoPP, and PODC. He has won the 2023 SIGMOD Research Highlight Award, the SIGMOD 2022 Best Paper Honorable Mention, the APSys 2017 Best Paper Award, the ACM ChinaSys Rising Star Award, the Huawei OlympusMons Pioneer Award, and two Huawei Spark Awards. He has been invited to serve on the program committees of international conferences such as EuroSys 2025, NSDI 2024, SoCC 2024, IEEE ICDCS 2019/2023, and IEEE Cluster 2021.

Title: Optimization Methods and Challenges of Generative SQL Statements

Abstract: SQL optimization is a core problem in data systems. In recent years, with the development of web frameworks and machine learning, SQL statements have gradually shifted from being handwritten by developers to being generated with system assistance. This shift breaks the assumptions that traditional database systems make about SQL optimization, so existing optimization rules and methods become difficult to apply. Meanwhile, existing research focuses mainly on the accuracy of generated SQL and pays less attention to its performance. This talk briefly reports our research results on performance optimization and rewriting rules for generated SQL, and shares some of the challenges and reflections that arose during the research.
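As a rough illustration of what a rewrite of generated SQL can look like, the sketch below compares a query in the style a framework might emit (a redundant derived table plus an `IN` subquery) with a hand-simplified join. This is not the speaker's method: the schema, both queries, and the helper names are hypothetical, and the rewrite is only valid under the stated assumption that `users.id` is unique.

```python
import sqlite3

# Framework-style query (hypothetical): redundant derived table and IN subquery.
GENERATED_SQL = """
SELECT t.id, t.user_id, t.amount
FROM (SELECT * FROM orders) AS t
WHERE t.user_id IN (SELECT u.id FROM users AS u WHERE u.active = 1)
ORDER BY t.id
"""

# Hand-written rewrite: equivalent only because users.id is a primary key,
# so the join cannot duplicate order rows.
REWRITTEN_SQL = """
SELECT o.id, o.user_id, o.amount
FROM orders AS o
JOIN users AS u ON u.id = o.user_id
WHERE u.active = 1
ORDER BY o.id
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a small in-memory database with made-up data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL,
            amount REAL NOT NULL
        );
        INSERT INTO users VALUES (1, 1), (2, 0), (3, 1);
        INSERT INTO orders VALUES (10, 1, 9.5), (11, 2, 20.0), (12, 3, 5.0), (13, 1, 7.25);
        """
    )
    return conn


def same_result(conn: sqlite3.Connection, sql_a: str, sql_b: str) -> bool:
    """Check that two queries return identical rows on this database."""
    return conn.execute(sql_a).fetchall() == conn.execute(sql_b).fetchall()


conn = build_demo_db()
rows = conn.execute(REWRITTEN_SQL).fetchall()
queries_agree = same_result(conn, GENERATED_SQL, REWRITTEN_SQL)
```

Agreement on one sample database is only a sanity check, not a proof of equivalence. A rewriting rule has to be justified from the schema and SQL semantics, which is part of why generated SQL is hard to optimize automatically.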
