
Software Engineering under Large Models: Opportunities and Challenges: Report of the 9th CCF Xiuhu Conference

From January 19 to 21, 2024, the 9th CCF Xiuhu Conference was held at the CCF Business Headquarters & Academic Exchange Center in Suzhou. During the three-day conference, more than 20 experts from academia and industry held in-depth exchanges and discussions on the theme "Software Engineering under Large Models: Opportunities and Challenges" and produced the following report.


Background and significance

Software engineering (SE) refers to the application of engineering principles and methods to the development, operation, maintenance, and management of software in order to achieve high quality, high efficiency, reliability, and maintainability. Large language models (LLMs) such as ChatGPT have recently made breakthroughs in many software engineering tasks, including requirements analysis, code generation, test generation, program repair, and code optimization, bringing new opportunities and challenges to software engineering research and practice. In the era of large models, it is urgent to explore how human-machine collaborative software development can evolve from digitalized and knowledge-based toward highly intelligent, and to build software engineering technology that integrates task-driven, data-driven, model-driven, and trustworthy approaches.

Against this background, the conference focused on the two perspectives of "Large-Model-Supported Software Engineering (LLM4SE)" and "Software Engineering Supporting Large Models (SE4LLM)", and invited more than 20 experts and scholars from academia and industry across fields such as software engineering, system software, and artificial intelligence to share views and exchange ideas around the theme "Software Engineering under Large Models: Opportunities and Challenges". The conference was organized around six topics. "Software Engineering under Large Models" invited experts to survey the current state and main problems of software engineering under large models, providing background for the subsequent discussions and highlighting the importance of requirements and design engineering; "Large Models and Code Intelligence" focused on one of the most important subfields of intelligent software engineering: code assistance and generation; "Large Models and Software Quality Assurance" focused on development trends in software testing, code analysis and review, and operations; "Data, Evaluation, and Verification of Code Models" focused on the data engineering, evaluation, and verification of code models; "Large Models and Open Source Engineering" focused on the supply chain ecosystem and the construction of an open source ecosystem for large models; and "The Future and Challenges of LLM4SE" focused on the prospects and opportunities of the cross-fertilization of large models and software engineering.

Software engineering under large models

Large models represented by ChatGPT have shown remarkable results in text understanding and generation, attracting widespread attention and discussion in many fields. Building specialized large models for the software engineering domain to solve software engineering problems has become a research hotspot.

Practice sharing

Software engineering was formally proposed in the 1960s, when scholars in various fields began to study program structure and programming languages and compiler systems came into wide use. In the early stages of software engineering, artificial intelligence and machine learning techniques were not yet widely applied, and the field focused mainly on structured programming, modular design, and data structures. With the rapid development of structured programming, object-oriented programming, and cloud computing, artificial intelligence and machine learning technologies matured, and intelligent software engineering (AI4SE) gradually emerged: intelligent techniques such as intelligent requirements analysis and design, intelligent code generation and repair, and intelligent project management were introduced into software development, testing, and maintenance to improve the efficiency and quality of software development.

In 2021, OpenAI released Codex, the first large model dedicated to code generation. Codex shares GPT-3's architecture and was pre-trained on 159 GB of publicly available code collected from GitHub. The Codex-based Copilot plug-in became a benchmark for code generation assistants, and the HumanEval dataset proposed in the Codex paper became one of the standard evaluation datasets for subsequent code generation work. As of January 2024, there were more than 20 large models targeting software engineering, published by OpenAI, DeepMind, Salesforce, Huawei, Microsoft, and others in industry, as well as the University of Illinois Urbana-Champaign and Tsinghua University in academia, and parameter counts have grown from hundreds of millions in early models to hundreds of billions.
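HumanEval results are usually reported with the pass@k metric. Below is a minimal sketch of the unbiased pass@k estimator defined in the Codex paper, where n samples are generated per problem and c of them pass all unit tests:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k from the Codex paper:
    1 - C(n-c, k) / C(n, k), computed as a numerically stable product."""
    if n - c < k:
        return 1.0  # too few failing samples to fill a draw of size k
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With 200 samples of which 20 pass: pass@1 = 0.1, pass@10 is roughly 0.66.
print(pass_at_k(200, 20, 1), pass_at_k(200, 20, 10))
```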

The rapid advance of artificial-intelligence-generated content (AIGC) technology in 2023 paved the way for innovation in software development models. A new software engineering system with large models at its core has become a focus of attention in academia and industry because of its great potential. The successful deployment of tools such as GitLab Duo, GitHub Copilot X, Tabnine, and Codota has further broadened the application of AI in software engineering. Gartner's Top 10 Strategic Technology Trends for 2024 include platform engineering, AI-augmented development, industry cloud platforms, and intelligent applications, highlighting both the prospects of combining software engineering with AI and the urgent need for new software development models.

Driven by large models, a software engineering innovation centered on the efficient use of AI is emerging. It calls for building a co-evolving, open, and shared software knowledge platform, so that large models can gain insight into the full picture of a complex system's implementation and its business and technical context, promoting the efficient circulation and application of software development knowledge. With the strong knowledge processing capability of large models, the digitalization of the R&D process can be advanced, knowledge redundancy and duplicated work reduced, and the value of documents and knowledge work enhanced. By strengthening data governance and digital project implementation to lay a solid data foundation, and by combining the digital knowledge accumulated in software development with the memory, comprehension, and correlation analysis strengths of large models, the efficiency and credibility of intelligent development can be improved. The active exploration of scholars and practitioners has not only opened new growth points for software engineering but also brought unprecedented challenges and opportunities.

Viewpoints and debate

Academician Lv Jian of Nanjing University emphasized that the catfish effect triggered by the success of large models has opened new exploration space for almost every field. He pointed out that artificial intelligence imitates and intersects with human intelligent behavior; there is no scenario in which AI completely replaces humans, and the era of large models still needs knowledge, reasoning, and causality. Under the premises of conditional constraints, scenario adaptation, and systematic adaptation, application fields should actively embrace the changes brought by large models and continuously exercise themselves to achieve gradual improvement. Lv Jian emphasized that we are currently in a "grand view" world in which large models link together all of the world's knowledge, and that we should actively seek a new scientific foundation and promote the organic combination of this "grand view" world with traditional science.

Academician Hu Shimin of Tsinghua University analyzed the application of large models in software engineering, especially the development and current challenges of automatic code generation technology. He pointed out that although automatic code generation is essential for improving software development efficiency, large code generation models still face many problems in data requirements, backbone network design, training and inference efficiency, and hallucination. He introduced the innovations of the Jittor deep learning framework in supporting large model training and inference, including the concept of meta-operators and the idea of a unified computation graph. The Fitten Code programming assistant developed on top of Jittor surpasses OpenAI Copilot and CodeGeeX in code generation speed and accuracy. Hu Shimin looked ahead to future directions for large models in software engineering, including increasing the scale and quality of training datasets, formulating specifications for AI code generation, improving version management mechanisms for automatically generated code, and strengthening software defect detection, in order to improve development efficiency and the ability to handle complex systems.

Li Xuandong, a professor at Nanjing University, pointed out that while large language models open new space for software development, they also pose challenges for judging credibility. Software development is a complex decision-making task that translates natural language requirements into program code. Large language models demonstrate powerful content generation capabilities in multiple stages of software development, such as requirements analysis, design, and coding, bringing new opportunities for software engineering. However, the code generated by these models lacks credibility guarantees and needs further human review and optimization. This has given rise to the human-machine collaborative programming model, in which developers guide large models to generate candidate code and then carefully analyze, understand, and modify it as necessary to ensure code quality. At the same time, trusted assurance activities such as static analysis, dynamic testing, and formal verification further improve software reliability. Li Xuandong pointed out that the rise of human-machine collaborative programming places higher professional demands on developers, including the ability to judge the credibility of content generated by large models. To effectively leverage large model tools and drive the future of software engineering, comprehensive improvements are needed in three key areas: technology development, educational innovation, and professional capacity building.

Xia Xin, director of the Software Engineering Application Technology Lab at Huawei's 2012 Laboratories, pointed out that large-model-driven software engineering has entered the 3.0 era, which not only reshapes the traditional development process but also brings new challenges and opportunities. With large models such as ChatGPT and Codex, the accuracy of automatic code generation and test case generation can be significantly improved, optimizing the efficiency of software testing and development. However, such models still have limitations in complex logical reasoning, support for niche programming languages, legacy system modernization, and test script generation and repair. Accuracy problems across the hardware and software stack, such as model training loss alignment and end-to-end convergence, and debugging problems in multi-machine clusters, are key technical obstacles to be solved. In addition, the construction of high-quality datasets, model evaluation systems, test requirements, and instruction tuning have become current research hotspots.

Requirements analysis and programming under large models

Requirements and design are critical phases of software development and remain among its most challenging activities. Programming plays a vital role in the development process by translating abstract design ideas, requirements specifications, and business logic into a form that machines can understand and execute. As large models are applied in software engineering, they will effectively assist in generating high-quality requirements specifications and code prototypes during requirements engineering, further improving the efficiency and quality of subsequent programming.

Practice sharing

Software requirements analysis is the cornerstone of the software development process: the process of collecting, analyzing, standardizing, and managing the requirements of a system, product, or service. It helps project teams and stakeholders clarify the goals, functionality, and performance requirements of the software, ensuring that the development direction is aligned with user expectations. Identifying and clarifying requirements at an early stage reduces the risk of rework due to later requirements changes, cutting development costs and project delays. Good requirements analysis helps build software that is modular, scalable, and easy to maintain, reducing the failure rate.

Intelligent programming technology is applied in a variety of scenarios, including code completion and code search. These applications are designed to improve software development efficiency, reduce human error, and help developers implement features quickly. In recent years, large language models have been widely used in code generation tasks. These models use a variety of architectures, including large code models such as CodeBERT, PLBART, and CodeGPT. They are pre-trained on code corpora to gain a deep understanding of the syntax, semantics, and idiomatic structure of code. To enhance a model's understanding of code structure, some approaches incorporate structured representations: GraphCodeBERT integrates a graph-based representation on top of CodeBERT, while CodeT5 combines the encoder-decoder paradigm with the structural nature of code.
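As a small illustration of how such pre-trained code models are consumed downstream, here is a minimal sketch of extracting a code embedding, assuming the Hugging Face transformers library and the microsoft/codebert-base checkpoint (assumptions about the reader's environment, not part of the report):

```python
from transformers import AutoModel, AutoTokenizer

# Load a pre-trained code model (assumption: the
# microsoft/codebert-base checkpoint from the Hugging Face hub).
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt", truncation=True)

# Mean-pooled final hidden states serve as a code representation
# that downstream tasks (search, clone detection) can build on.
embedding = model(**inputs).last_hidden_state.mean(dim=1)
print(embedding.shape)  # (1, 768) for this checkpoint
```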

The open-source community provides a wealth of resources for code generation research, including codebases, datasets, and tools. For example, platforms such as GitHub and Stack Overflow provide large numbers of code samples used to train and evaluate large models. Trained on large-scale data, Codex and CodeGen are models with billions of parameters that demonstrate state-of-the-art performance on code generation tasks, helping developers write code and improving programming efficiency. Their success has led to more models of this kind, such as StarCoder and Code Llama. Another approach uses closed-source models such as ChatGPT and GPT-4 to assist code generation and evaluates their effectiveness at generating functional code. For example, Microsoft's phi-1, a 1.3-billion-parameter model trained on high-quality data, achieves performance comparable to far larger code models.

Viewpoints and debate

Professor Jin Zhi of Peking University analyzed the prospects of software requirements engineering from the starting point of software design thinking. She believes that, from an engineering perspective, requirements engineering spirals through multiple stages and suffers from problems such as insufficient communication and insufficient knowledge. Existing methods can use large models to elicit requirements information in interview form, classify requirements, realize requirements tracing, analyze requirements quality, write specifications, and so on. Large models can supplement missing knowledge and express requirements in a structured manner; the challenge lies in giving the right prompt at the right time. In addition, large models have inherent problems such as difficulty distinguishing differences in tone and heavy dependence on prompts. Facing these problems, Jin Zhi proposed systematizing problem descriptions, modeling the collaboration goals of AI agents, and establishing agent cooperation teams.

Professor Li Ge of Peking University elaborated on how large model technology is shaping the future of software automation and discussed the challenges and opportunities on the road to advanced software automation. He pointed out that software automation is a fundamental way to improve productivity and quality. While large models excel at natural language processing, applying them to code generation and software automation still faces many challenges, including model accuracy, reliability, security, and dependence on private-domain data. Li Ge forecast the future development of large models in software development processes, DevOps practices, software test automation, and software maintenance. Using the "iron clamp model" as a metaphor, he emphasized the importance of automated coding and testing in the software development process and expressed optimism that large models will lead the future of software automation.

Peng Xin, a professor at Fudan University, emphasized that with the rise of large model technology, generative AI could dramatically change how programming and software development are done in the near future. AI has shown great potential in intelligent assistance such as code completion and code search, but it also faces challenges such as the complexity of scale, abstract thinking ability, hard-to-capture "dark knowledge", and long-term maintenance support. Peng Xin pointed out that the essential difficulty of software development lies in conceiving requirements and design; although large models excel at code generation, software design and maintenance still rely on good modular design and expertise. He also noted that by strengthening the accumulation of digitalized knowledge and using the memory, understanding, and association capabilities of large models, a knowledge-enhanced digital twin of code can be realized, improving development efficiency and ensuring software credibility. The main challenges for future software engineering tools include how to build a shared software development knowledge platform, so that large models can better understand the global information of complex software systems and their business and technical contexts, and how to share and use software development knowledge efficiently, so as to solve the efficiency and credibility challenges in software development.

Liu Hui, a professor at Beijing Institute of Technology, analyzed the key role of code optimization in the software development process and discussed the prospects and challenges of applying large model technology in this field. He emphasized that although code generation and code optimization are similar in problem space and solution space, they differ significantly in ambiguity, scope of change, and transformation rules, and the introduction of large models provides a new perspective for code optimization. The current challenges of large-model-based code optimization mainly include collecting training data, customizing and generalizing models, ensuring the reliability of optimization results, and breaking through the models' input and output limits. He suggested that advancing code intelligence to a higher level requires in-depth research on data collection, model optimization, credibility of results, and technology scaling.

Large model and software quality assurance

Software quality assurance is an indispensable part of ensuring software reliability. It includes software testing, code review, and software operations, and serves as an important safety net for the entire software development process. In recent years, the emergence of large language models has brought new opportunities to software quality assurance, improving R&D efficiency in software testing, code review, and operations, and improving the stability, security, and performance of software systems.

Practice sharing

For software testing, researchers have proposed a series of techniques to automate the generation of unit test code and reduce, to some extent, the cost of writing it by hand. LLMs have shown good results in code generation, and since unit test generation is, in essence, also source code generation, recent research has begun to extend code generation techniques to it. For example, A3Test (Assertion-Augmented Automated Test Case Generation) pre-trains an LLM on methods under test and assertion statements so that it acquires stronger basic knowledge of assertions; the pre-trained model is then fine-tuned for test code generation, increasing the number of correct tests by 147% over the state of the art while generating fewer tests overall.
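To make the idea concrete, here is a minimal sketch of prompting-based unit test generation (a generic illustration, not A3Test itself; the llm_generate helper is a hypothetical stand-in for whatever LLM endpoint is available):

```python
def build_test_prompt(focal_method: str, class_name: str) -> str:
    """Assemble a prompt asking an LLM to write a JUnit test,
    pairing the method under test with explicit instructions
    to include assertions for normal and edge cases."""
    return (
        f"Write a JUnit test for the following method of class "
        f"{class_name}. Include meaningful assertEquals/assertThrows "
        "assertions covering normal and edge cases.\n\n"
        f"{focal_method}\n"
    )

def llm_generate(prompt: str) -> str:
    # Hypothetical helper: plug in your model endpoint here.
    raise NotImplementedError

prompt = build_test_prompt(
    "public int add(int a, int b) { return a + b; }", "Calculator")
# test_code = llm_generate(prompt)
```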

Code review is an important part of the software development life cycle. It usually requires developers and other team members to review the quality of code changes, evaluating the logic, functionality, style, and other factors of the code under review on the basis of reading and understanding the changes. In practice, manual code review takes a great deal of time, so automating it with intelligent methods, especially large language models, has become a research hotspot in software engineering in recent years. Code review involves different tasks and scenarios, including automatically assessing the quality of code changes, automatically generating review comments, and automatically improving low-quality code. Large models are now applied across these scenarios, with the main technical approaches including large model pre-training, fine-tuning for specific code review tasks, prompt engineering, and in-context learning.
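A minimal sketch of the prompt engineering approach applied to review comment generation (the diff and helper below are illustrative assumptions, not a tool described at the conference):

```python
def build_review_prompt(diff: str) -> str:
    """Frame a code diff as a review task: the model is asked to
    flag defects, style issues, and risky changes."""
    return (
        "You are a senior code reviewer. For the diff below, list "
        "any bugs, style problems, or risky changes, one comment "
        "per line, citing the affected hunk.\n\n```diff\n"
        f"{diff}\n```"
    )

example_diff = """\
-    if user.is_admin:
+    if user.is_admin or True:   # temporary debug override
         grant_access(user)"""
print(build_review_prompt(example_diff))
```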

Software operations is the process of managing and maintaining software systems using automation, standardization, and best practices, covering deployment, monitoring, performance optimization, troubleshooting, and security management. Modern operations require more automated and intelligent tools to cope with growing complexity, improve efficiency, save costs, reduce human error, and enable rapid detection of and response to problems. The Alibaba Cloud team proposed using tool learning, combining code data and log data in cloud systems, to enable large models to perform root cause analysis automatically. Microsoft proposed a confidence evaluation framework to alleviate, to some extent, the untrustworthy outputs caused by large model hallucination. In addition, Microsoft applies large models in its failure analysis pipeline to improve the efficiency and accuracy of data annotation, to help understand common problems in cloud service monitoring, and to recommend monitoring strategies for specific services. The HUAWEI CLOUD team proposed a hierarchical representation method that uses a language model to automatically classify fault tickets, saving the cost of manual classification in fault prediction.

Viewpoints and debate

Zhang Chengzhi, a professor at the Hong Kong University of Science and Technology, believes that even if a given program contains errors, large models can still infer its intent, but they find it hard to accurately judge whether the program, the inferred expectation, and the actual output are right or wrong. He studies ways to generate fault-revealing test cases using large models. In theory, a large model can generate a reference program for differential testing and use its results as expected outputs, but in practice this suffers from incorrect reference programs and false positives. Zhang Chengzhi therefore proposed a differential prompting framework that uses the inferred expected results to carry out differential testing. Experiments show that large models have weak logical reasoning ability but are good at generating programs similar to the source code. To compensate for the weakness in logical reasoning, he suggested transforming inference problems into code generation problems and integrating large models with other methods. Experiments show that large models can generate almost exactly correct code for common tasks, but given their performance on code repair, current large models cannot yet serve as independent, reliable development assistants.
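A minimal sketch of the differential testing idea (assuming the reference implementation came from a large model; any divergence yields a candidate fault-revealing test, subject to false positives if the reference itself is wrong):

```python
import random

def differential_test(candidate, reference, gen_input, trials=1000):
    """Run both implementations on random inputs and collect the
    inputs on which they diverge; each one is a candidate
    fault-revealing test case."""
    failures = []
    for _ in range(trials):
        x = gen_input()
        try:
            if candidate(x) != reference(x):
                failures.append(x)
        except Exception:
            failures.append(x)
    return failures

# Toy usage: a buggy max against a trusted reference.
buggy_max = lambda xs: xs[0]  # ignores the rest of the list
failures = differential_test(
    buggy_max, max,
    lambda: [random.randint(0, 9) for _ in range(3)])
print(len(failures), "diverging inputs found")
```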

Lv Rongcong, a professor at the Chinese University of Hong Kong, believes that even though automated operations have many advantages, modern software operations practice still faces challenges in three main aspects: complexity, performance, and scalability. Large models bring new opportunities to address these challenges. Previous studies at home and abroad have shown that large language models can be used for automatic fault discovery, fault root cause localization, and post-incident analysis, bringing new possibilities to traditional operations. For example, the TimeGPT model is pre-trained on massive time series data for multi-node root cause localization tasks.

Xie Xiaoyuan, a professor at Wuhan University, discussed the significance of large models for software quality assurance and analyzed their application in software testing, defect localization, and automatic program repair. In software testing, the research focus for large models has shifted from early pre-training and fine-tuning to prompt design. In defect localization, large-model-based methods can use static information to localize defects and achieve better results than traditional methods. In automatic program repair, some methods combine large models with completion tools and use dialogue-driven repair to complete the task. Xie Xiaoyuan believes that large models have improved human-machine collaboration through natural interaction and spurred the creation of multiple code evaluation benchmarks. However, today's large models also have problems, such as overestimated capabilities, and there is still a gap between static model evaluation and real-time verification results. To solve these problems, she suggested expanding datasets to enable dynamic evolution of large models and refining verification of large model outputs. She pointed out that large models still have many limitations: input length and modality, localization ability, and output uncertainty all limit their application in defect localization.

Jiang He, a professor at Dalian University of Technology, noted that large-model-driven software engineering is leading software development into the 3.0 era, significantly changing the software engineering paradigm, including the reconstruction of processes, tools, metrics, and value systems. Taking Dalian Xinhuaxin as an example, the company has improved development efficiency and economics by optimizing team structure and technological innovation and by using large models for code and test automation. Specific applications of large models include cloud-based training with on-premises inference, security protection, and resource management, with results achieved in code generation, test case generation, and repair. However, large models still face problems such as complex logical reasoning, insufficient support for niche programming languages, legacy code modernization, and the high cost of all-in-one deployment. Academia and industry are actively working on remedies, such as improving the accuracy of test assertions through prompt engineering and exploring automated script generation and defect localization and repair strategies.

Data, evaluation, and validation of large code models

The development of large models has driven technological innovation in data processing and model optimization. According to scaling laws, as model size and parameter counts increase, the demand for high-quality data becomes more urgent. Large models integrate data resources from around the world, and the diversity, accuracy, and unbiasedness of data have become important challenges. The complexity and diversity of large models also make evaluation harder: traditional evaluation methods may not comprehensively assess large model performance, so new evaluation frameworks and metrics are needed.

Practice sharing

Data quality determines the upper limit of a model: the performance and capability of a large model depend mainly on the quality, quantity, and diversity of its data. The richer and cleaner the dataset, and the broader its domain coverage, the more comprehensive the knowledge and the more complex the patterns the model can learn. Dataset diversity ensures that large models are broadly applicable and versatile across fields. High-quality data also helps large models reduce overfitting and improves robustness and generalization. Ensuring data quality and diversity is therefore one of the key steps in successfully training a large model and the basis for good performance in real applications. With the development of techniques such as Chain-of-Thought (CoT), Few-Shot Learning (FSL), and Zero-Shot Learning (ZSL) in the era of large models, models can be guided to learn quickly without special training, making full use of the broad knowledge and skills a large model has already acquired so that it can adapt to specific tasks under limited data. Especially in scenarios where data is scarce or expensive, reducing the model's need for large amounts of labeled data can effectively improve data utilization and reduce dependence on data.
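A minimal sketch of how few-shot and zero-shot prompting differ in practice (the names and formatting below are illustrative assumptions):

```python
def build_prompt(instruction, examples, query):
    """Assemble a prompt: a task instruction, zero or more solved
    examples, then the new query. With examples=[] this is
    zero-shot prompting; with a few (input, output) pairs it
    becomes few-shot prompting -- no gradient update needed."""
    parts = [instruction]
    for inp, out in examples:
        parts.append(f"Input: {inp}\nOutput: {out}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

few_shot = build_prompt(
    "Classify the commit message as fix, feature, or docs.",
    [("correct off-by-one in pager", "fix"),
     ("add CSV export endpoint", "feature")],
    "update README badges")
print(few_shot)
```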

Systematic and comprehensive evaluation of large models is an important part of driving their iteration, affecting performance and feasibility in practical applications. In the model selection phase, evaluation helps compare and choose among models, providing a reference for further tuning and improvement. During training, the interpretability and continuous monitoring of the model are also evaluation priorities, helping to understand the model's decision process and improving its acceptance and trustworthiness in real applications. Once training is complete, performance on specific tasks can be measured with metrics such as accuracy, precision, recall, and F1 score, and generalization can be checked by evaluating on new datasets to guard against overfitting. During deployment, model performance should be continuously monitored to ensure the model keeps performing well in a changing environment. It is also important to evaluate reliability and robustness across environments and tasks, and to evaluate the ethics and fairness of the model, ensuring it is not biased or discriminatory toward different groups of people.
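For reference, a minimal sketch of the four metrics named above, computed from a binary confusion matrix:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision, recall, and F1 from confusion counts
    (tp = true positives, fp = false positives, etc.)."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

print(classification_metrics(tp=40, fp=10, fn=5, tn=45))
# accuracy 0.85, precision 0.8, recall ~0.889, f1 ~0.842
```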

Viewpoints and debate

Professor Lv Rongcong of the Chinese University of Hong Kong gave a detailed introduction to the paradigms of data generation and application for code and large models in the large model era. He pointed out that data is the foundation of large language models: large model capabilities emerge from data, and both pre-training data and fine-tuning data matter. By contrast, when traditional machine learning methods are used to solve a task, professionals need high-quality training data to train a specialized model that minimizes a loss function for that specific task.

Wang Yinan, a technical expert from Tencent, shared evaluation practices for code models and code intelligence products inside Tencent, including the application and evaluation of Tencent's self-developed Worker Bee Copilot tool across coding tasks such as code completion, code generation, knowledge Q&A, vulnerability detection and repair, and unit test generation. Evaluating model performance in the actual coding and development process effectively reflects how code models are applied in real industrial projects. The current difficulties and pain points in industrial model application and evaluation include, on the data engineering side, numerous data sources, inconvenient collection, poor data quality, low labeling efficiency, and slow delivery. In the model evaluation stage, it is difficult to meet the requirements of comprehensive, accurate, and fast evaluation: evaluation dimensions, methods, and metrics for code models are incomplete, and manually constructing annotated datasets is time-consuming and costly, slowing the model iteration cycle. How to form an effective closed loop between evaluation results and model training, improving the iteration speed and quality of intelligent code products, is also an important and urgent problem.

Liu Kui, a technical expert at Huawei, pointed out that the research community has produced a variety of evaluation datasets and metrics for large code models, but whether these datasets and metrics correlate strongly with real gains in development efficiency still needs verification. Although some data processing schemes can be inferred from open-source large models and research reports, they are not fully transparent and cannot ensure traceability of data sources. In the pre-training stage, the impact of combining and matching multi-language and multi-domain data is still in early exploration, and the industry's definition of a high-quality dataset remains an "unsolved mystery".

Xiong Yingfei, an associate professor at Peking University, pointed out that the more parameters a large code model has, the better its results but the higher its cost, so reducing parameter counts while keeping performance unchanged, or improving performance at the same parameter count, is a problem worth exploring. If a code model is trained the way text models are, knowledge such as syntax and types is ignored, which enlarges the model's generation space and increases learning difficulty. He proposed representing programs as sequences of grammar rules, generating programs via probabilities over rule choices, and realizing the Linglong framework with neural networks to satisfy semantics- and type-directed program search. Similar type-aware and grammar-aware rules can also be applied to large pre-trained models to improve performance. Exploiting knowledge of program syntax, semantics, and types can effectively improve model performance, and more general encodings combined with diverse knowledge hold great potential for program synthesis.
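A minimal sketch of the grammar-rule-sequence idea on a toy grammar (in this line of work a neural network would supply the rule probabilities, so every sampled program is syntactically valid by construction):

```python
import random

# Toy grammar: each nonterminal maps to its production rules.
GRAMMAR = {
    "expr": [["term", "+", "expr"], ["term"]],
    "term": [["(", "expr", ")"], ["NUM"]],
}

def derive(symbol, rng, depth=0, max_depth=4):
    """Expand a nonterminal by choosing one production rule.
    Replacing rng.choice with learned rule probabilities turns
    this into neural grammar-guided program generation."""
    if symbol == "NUM":
        return str(rng.randint(0, 9))
    if symbol not in GRAMMAR:
        return symbol  # terminal token such as '+' or '('
    rules = GRAMMAR[symbol]
    # Force the last (non-recursive) rule once the depth limit hits.
    rule = rules[-1] if depth >= max_depth else rng.choice(rules)
    return " ".join(derive(s, rng, depth + 1, max_depth) for s in rule)

print(derive("expr", random.Random(0)))  # a random, always-valid expression
```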

Li Shanshan, a professor at the National University of Defense Technology, elaborated on the development of code intelligence models and introduced research on code representation tasks and model generalization. For code representation, intermediate representations can address the difficulty that raw characters and abstract syntax trees (ASTs) have in accurately capturing the semantics of source code. Existing code representation models do not yet generalize well to other tasks, so generalization can be improved by sharing code representations through multi-task learning, and introducing curriculum learning helps a model achieve uniform improvement across different tasks. Combining parameter-efficient fine-tuning adapted to multiple programming languages, retrieval augmentation, and prompt templates can stimulate the code generation capabilities of large models on domain-specific tasks.

Large models and open source engineering

Amid the rapid development of information technology, open source software and artificial intelligence have become important forces driving technological innovation and industrial development. The emergence of large models has not only produced revolutionary progress in natural language processing but also had a profound impact on the open source ecosystem. The open source ecosystem, a complex system involving developers, users, and organizations around the world, is undergoing a transformation driven by large models.

Practice sharing

Open source software has become one of the mainstream models of software development thanks to its open code and free sharing. The globalization of open source software has never stopped: even when social and economic globalization meets resistance, open source software continues to advance information technology worldwide through its openness and capacity for innovation. The open source ecosystem brings together enterprises, developers, open source foundations, industry alliances, and governments, and with the spread and improvement of platforms such as GitHub and GitLab, software projects have grown at an unprecedented rate, forming a globally distributed model of collaborative innovation. As of January 2023, GitHub hosted more than 400 million projects and 100 million developers. In this context, artificial intelligence, especially large language models, with their powerful data processing and analysis capabilities, is changing how we interact with computers, improving software development efficiency, and even replacing humans in some complex tasks, becoming an important force driving the transformation of the software ecosystem.

With the spread and deepening of large model technology in the open source community, interdependence between projects has grown exponentially, building a crisscrossing, tightly intertwined, hard-to-trace software supply chain network. In this network, no software artifact operates in isolation; each is part of the ecosystem, forming close, dynamic interdependencies with other artifacts and sharing both resources and security risks. For example, as of April 28, 2024, 228,713 vulnerabilities had been registered on the Common Vulnerabilities and Exposures (CVE) website. In recent years, therefore, many scholars and practitioners have proposed techniques to perceive potential vulnerabilities and risks in open source software from both code and community signals, detecting vulnerabilities as early as possible and reducing the losses they cause. For example, the MemVul model includes an external memory module that introduces vulnerability knowledge from the Common Weakness Enumeration (CWE) to help open source software users perceive potential vulnerabilities in time and take mitigation measures.
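As an illustration of supply chain risk awareness, here is a minimal sketch that checks one dependency version against the OSV vulnerability database (assuming the requests library and OSV's public query endpoint; a real scanner would walk the full transitive dependency graph):

```python
import requests

def check_dependency(name: str, version: str, ecosystem: str = "PyPI"):
    """Query OSV for known vulnerabilities affecting one package
    version and return their advisory identifiers."""
    resp = requests.post(
        "https://api.osv.dev/v1/query",
        json={"version": version,
              "package": {"name": name, "ecosystem": ecosystem}},
        timeout=10)
    resp.raise_for_status()
    return [v["id"] for v in resp.json().get("vulns", [])]

print(check_dependency("requests", "2.19.0"))
# prints a list of advisory IDs if any are recorded for that version
```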

In addition, thanks to fast iteration, relatively low cost, powerful functionality, and easy customization, open-source large models have gradually narrowed the performance gap with closed-source models and even show better performance and flexibility in some respects. As a result, more and more enterprises and individual developers are investing in the R&D and application of open-source large models, forming a virtuous circle that accelerates the evolution of the entire open source ecosystem and promotes the collaborative development and continuous innovation of open source software and AI technologies worldwide. With the emergence of a new social form deeply integrating humans, machines, and things, the participants in the software ecosystem are becoming more diverse, covering traditional developers, emerging AI engineers, ordinary users, and even smart device manufacturers. This makes building and maintaining the open source ecosystem more challenging while creating a broad stage for its continuous evolution and innovation.

Viewpoints and debate

Zhou Minghui, a professor at Peking University, discussed the impact of large models on open source projects against the backdrop of open source globalization and the widespread use of ChatGPT. "Huddling together for warmth" is an important driving force of open source: open source can effectively resist monopoly and decoupling, gather collective wisdom, and achieve innovation. Current open source projects face many challenges: emerging projects challenge existing mature ecosystems yet find it extremely difficult to cultivate ecosystems of their own, and the complexity of the open source software supply chain has attracted attention worldwide. Open source projects involve many tasks, such as measuring open source communities and software supply chains, intelligently recommending newcomer tasks, and planning library migrations; because of their lack of explainability, large models are difficult to use directly for such recommendations. Zhou Minghui also discussed the future of measuring complex open source systems, arguing that systematic and refined capabilities are needed for supply chain management to achieve visibility, manageability, and control, and to adapt to continuous change in the technology ecosystem. She advocated deep, refined industry-university-research cooperation to build the open source ecosystem.

Hu Xing, an associate professor at Zhejiang University, emphasized the potential roles of large models in open source software vulnerability management, including early vulnerability awareness, code-level vulnerability detection, vulnerability dependency analysis, and locating the patches for vulnerabilities. However, in actual automated and collaborative applications, large models face problems such as insufficient data, diverse vulnerability types, and difficulty understanding silent fixes. In response, Hu Xing proposed that an interpretable silent-vulnerability-awareness algorithm can be realized with contrastive learning, that vulnerability identification and repair can be made more accurate and efficient through code execution paths and function call relationships, and that vulnerabilities can be mapped to patches by manually extracting features from code and vulnerability descriptions.
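A minimal sketch of the contrastive learning ingredient (an InfoNCE-style loss over commit embeddings; pairing suspected silent fixes with known vulnerability fixes is an illustrative assumption, not the exact method described above):

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss: pull the anchor embedding (e.g. a suspected
    silent-fix commit) toward its positive (a known vulnerability
    fix) and away from negatives (ordinary commits)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

rng = np.random.default_rng(0)
anchor, positive = rng.normal(size=64), rng.normal(size=64)
negatives = [rng.normal(size=64) for _ in range(8)]
print(float(info_nce_loss(anchor, positive, negatives)))
```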

Gao Cuiyun, an associate professor at Harbin Institute of Technology (Shenzhen), pointed out that open source data, as the core driving force of intelligent software development, plays a key role in the development of code models, and large models have gradually become the hub of data use and generation. Data scale and quality are critical to model performance, and the disorderly generation and spread of low-quality data will affect effective, compliant model training and may even pollute the human knowledge base. Gao Cuiyun believes the problem of low-quality open source data can be alleviated with high-quality label generation methods, and that a model's dependence on spurious features in the data can be reduced through counterfactual reasoning and metamorphic testing.

The Future and Challenges of LLM4SE

On how to cooperate and work together to realize the healthy development of large models in software engineering, the guests at the meeting reached the following consensus and issued corresponding initiatives.

Consensus

1. Use the powerful knowledge utilization ability of large models to strengthen the value of R&D digitalization and of all kinds of document knowledge, addressing the waste of knowledge and duplicated thinking in software development. Using the large model as a large-scale knowledge base to prompt and complete missing knowledge can effectively improve software development efficiency and knowledge reuse.

2. The flat information provided by code data alone is not enough for large models to attain higher-level intelligent development capability. Large models have limited reasoning ability, and most code generation resembles search-based synthesis: generation succeeds largely because similar code has been "seen".

3. On the one hand, there should be general-purpose large models that exhibit domain knowledge through fine-tuning or in-context ability; on the other hand, there should be domain models, and for what large models are not good at, specialized knowledge bases can be built to improve their effectiveness in applications.

4. Large models bring new opportunities for transforming software production. At a moment when large models are upending technology, we should actively rethink software production: turn software engineering problems into closed problems, reduce humanly introduced problems, verify results in verifiable ways with the help of large models, and explore automation that produces code directly from requirements, skipping over manual coding and testing stages.

5. Large models have opened new space for solving problems, but their lack of credibility judgment brings problems and challenges for software credibility assurance; academia and industry should actively consider how to respond to them.

Initiative

1. The catfish effect triggered by the success of large models has opened new exploration space for almost every field. Under the premises of conditional constraints, scenario adaptation, and systematic adaptation, all parties in industry, academia, and research should actively embrace the changes brought by large models and achieve gradual improvement through continuous self-training.

2. The era of large models still needs knowledge, reasoning, and causality. Applications of large models should play to their strengths, avoid their weaknesses, and target appropriate fields, so as to achieve the sustainable, healthy development of large models.

3. Large models have brought changes and opportunities to all walks of life. In task-driven, data-driven, and model-driven contexts, the practical problems and challenges encountered by industry should continue to catalyze cutting-edge exploration of software engineering technology in academia.

4. Solve the data foundation problem through data governance and digital projects; on the basis of digitalized and knowledge-based accumulation in software development, use the powerful memory, understanding, and association capabilities of large models to strengthen intelligent development capabilities, improving efficiency and guaranteeing credibility.

5. Develop the ability to control AI: be able to use AI, use it well, and use it correctly. Make credibility judgments on the content generated by large models, and attain control over AI by comprehensively improving people's professional capabilities.

Compiled by: Jin Zhi, Xia Xin, Gao Cuiyun

Attached: List of participating experts

Special Guests:

Lv Jian, Hu Shimin, Lv Rongcong

Attendees (in alphabetical order of surname):

Hu Xing, Jiang He, Li Ge, Li Shanshan, Li Xuandong

Liu Hui, Liu Kui, Ma Xiaoxing, Peng Xin, Wang Lijie

Wang Yinan, Xie Xiaoyuan, Xiong Yingfei, Zhang Chengzhi, Zhang Yuhui

Zhou Minghui

Conference Organization:

Jin Zhi, Xia Xin

Secretary of the Conference:

Gao Shan, Gao Cuiyun
