Reading guide: The birth of ChatGPT and GPT-4, with their powerful semantic generalization and zero-shot/few-shot capabilities, has further lowered the barrier to using AI and brought new possibilities to every industry. This article shares Hainan Digital Manufacturing Technology's thinking and practice on combining DataOps with large models to drive data engineering innovation.
The main content includes the following five parts:
1. Challenges of traditional data management
2. The combination of DataOps and large models drives data engineering innovation
3. The implementation of DataOps + large model products
4. Future outlook
5. Q&A session
Speaker|Yang Minghao, Senior Big Data Technology Expert of Hainan Digital Manufacturing Technology Co., Ltd
Edited and organized|Peng Shuang
Content proofreading|Li Yao
Produced by DataFun
01
Challenges of traditional data management
In aligning digital strategy with business strategy, enterprises carry out data collection, management, and analysis; this is the embodiment of enterprise digital transformation.
The digital transformation of enterprises presents three strategic trends:
- The first is the democratization of data analytics. Data analysis is no longer the work of a few specialists; every role looks at reports, and statistical algorithms are used to support business decisions in real time.
- The second is the diversification of data technologies. Information systems are now highly varied, and data developers face a wide range of heterogeneous data. A complex data environment requires diverse technical components to support it, such as compute and storage engines like Flink and Spark; in many financial anti-fraud scenarios, for example, knowledge-graph modeling and analysis components are used. As a whole, this has become a very large and complex systematic undertaking.
- The third is lean business value. Business units want data to be monetized quickly.
Under these trends, we face a supply-demand imbalance. Markets change rapidly and business teams must respond quickly, so there is a constant stream of ad-hoc, time-sensitive data requirements. On the development side, development and test environments are inconsistent, deployment and integration require heavy manual intervention, and data silos exist between systems. As a result, much time is spent investigating data calibers, and a data requirement often takes more than a week, or even longer, from development to launch. There are also problems such as missing data semantics, which make the data hard for the business to use.
Data development, like software engineering, should not be a waterfall process in which everything must be planned and designed before development begins; it should support agile delivery and respond to requirements in near real time. Today, the R&D pace lags behind market demand.
02
The combination of DataOps and large models drives data engineering innovation
Under the trend of digital transformation, and driven by the above-mentioned pain points, enterprises urgently need a new data research and development paradigm, which needs to meet the following requirements:
- Form an agile data R&D pipeline;
- Build an efficient cross-domain collaboration mechanism;
- Create a self-service data usage experience;
- Establish a refined operation system.
This is the only way to deliver data quickly.
In this context, we turned to DataOps. DataOps grew out of DevOps. Its essence is that data workflow orchestration, data development, testing, deployment, release, and regression should achieve continuous integration, follow a standardized system, and be deployed automatically, optimizing the whole data release process and its resources to achieve agile development and delivery.
Since 2018, Gartner has included DataOps in its technology hype cycle, where its position has risen year by year. In 2022, Digital Manufacturing Technology and the China Academy of Information and Communications Technology (CAICT) jointly established an expert working group to develop capability standards for DataOps.
Now that we've covered the background of DataOps, let's return to AI. From the development of deep learning represented by AlphaGo to the emergence of ChatGPT-style large language models last year, the barrier to using AI keeps falling, and our information systems are becoming more and more intelligent.
In just one year, many AI tools based on large language models have appeared. Our team adopted an open-source plugin called Bito that automatically generates and reviews code, which has raised productivity by more than 20%.
The above diagram illustrates the standardized process of our DataOps, which uses large models to support code generation, code interpretation, and code review in the process of data development and release, making the whole process more intelligent.
The figure above compares the traditional data development model with the DataOps + large model approach. The traditional model, from data modeling and data development to testing, deployment, and regression, requires a lot of manpower. In the past we had development, test, and production environments, each maintained by different developers with their own parameter profiles; at deployment time the configuration files had to be changed by hand, which often introduced risk.
With DataOps, a standard system for agile data development and release, automation becomes possible. Configuration files are managed centrally and automatically, so one set of data development and data modeling scripts can run across multiple environments, with configuration parameters substituted automatically. There is also a data sandbox: because test data may not reflect production conditions, production data can be used to validate scripts safely. On top of this automated system, large-model capabilities are added to generate data instructions, such as automatically generating SQL or adding comments, and to perform automatic review. "DataOps + large model" therefore layers automation and intelligence onto the standardized DataOps process to make data development and delivery more efficient.
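As an illustration of the multi-environment parameter substitution described above, here is a minimal sketch in Python; the config layout and the names ENV_CONFIGS and render_sql are hypothetical, not our product's actual implementation:

```python
# A minimal sketch, assuming a hypothetical config layout, of
# environment-aware parameter substitution: one script is promoted across
# environments and placeholders are filled from centrally managed configs.
from string import Template

ENV_CONFIGS = {
    "dev":  {"db": "sales_dev",  "date_partition": "2023-01-01"},
    "test": {"db": "sales_test", "date_partition": "2023-01-01"},
    "prod": {"db": "sales_prod", "date_partition": "${bizdate}"},  # resolved later by the scheduler
}

# One modeling script, written once, used in every environment.
SQL_TEMPLATE = Template("SELECT * FROM ${db}.orders WHERE dt = '${date_partition}'")

def render_sql(env: str) -> str:
    """Render the script for one target environment; no manual file edits."""
    return SQL_TEMPLATE.safe_substitute(ENV_CONFIGS[env])

print(render_sql("dev"))   # SELECT * FROM sales_dev.orders WHERE dt = '2023-01-01'
print(render_sql("prod"))  # scheduler variable left in place for runtime resolution
```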
03
Practical exploration of DataOps + large model products
Large models have many interesting application scenarios, such as copywriting generation, digital humans, retrieval-augmented knowledge search, and so on. Another popular scenario is Text2SQL, which has large models generate SQL instructions; below we introduce our exploration and practice in this direction.
Text2SQL takes a question expressed in natural language and, combined with table metadata (table names, column names, and the relationships between tables), generates an accurate SQL statement. In 2022, before the advent of large models, Text2SQL based on pre-trained models reached an accuracy of about 75-77%. These figures come from the Spider leaderboard, the most authoritative cross-domain benchmark in the Text2SQL field. Since large models emerged, the leaderboard has been refreshed roughly every two months, and the accuracy of GPT-4-based Text2SQL has jumped from 78% to 91%.
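To make the task concrete, here is a minimal, invented example of a Text2SQL input and output (the schema, question, and SQL are purely illustrative):

```python
# A minimal illustration of the Text2SQL task: a natural-language question
# plus table metadata in, an executable SQL statement out.
question = "What is the average salary of each department?"

schema = {
    "department": ["dept_id", "dept_name"],
    "employee":   ["emp_id", "dept_id", "salary"],  # employee.dept_id -> department.dept_id
}

expected_sql = (
    "SELECT d.dept_name, AVG(e.salary) "
    "FROM employee e JOIN department d ON e.dept_id = d.dept_id "
    "GROUP BY d.dept_name"
)
```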
Generating correct SQL delivers only half the value of a Text2SQL task; for data developers, the other half is finding the data they want quickly. In real production environments there are huge numbers of databases and tables, and the accurate tables and columns, that is, the correct data caliber, must be found from the natural-language question. We therefore believe the definition of Text2SQL should also include producing correct SQL across different schemas.
Before the advent of large models, Seq2Seq models were used: the question and the table metadata were embedded, and SQL statements were generated from the embedded vectors. The literature calls the resulting problem a mismatch. It is like machine translation, Chinese to English or Chinese to German: the words differ and the grammatical structures differ, but the semantics may be the same. SQL keywords such as GROUP BY, INTERSECT, and UNION never appear in natural language; nobody says, "Help me check this quarter's total sales, using a GROUP BY statement." Hence the problem of semantic asymmetry.
To address this problem, many optimization strategies target the model architecture, i.e., the encoder-decoder. Before large models, we commonly used Google's T5-3B, ELECTRA, or RoBERTa as the base model and kept optimizing the encoder. To better associate the question with the tables and columns it needs, more complex structural modeling was introduced: the question and the metadata are modeled jointly, with elaborate strategies for finding the tables and columns relevant to the question.
In recent years, a popular approach has been knowledge-graph-based modeling. Initially, GNN (graph neural network) methods were used, but they mostly model individual nodes and cannot generalize to information more than two hops deep. The RAT-SQL and RASAT models were therefore proposed: they essentially define meta-paths on the knowledge graph, capturing the association relationships among the metadata, and apply relation-aware self-attention in the encoder, embedding the relation vectors so as to strengthen the association between the question and the relevant metadata.
With the encoder optimization strategies settled, the decoder comes next: how to generate SQL statements from the encoding. At first, Sketch-based methods were used: the target SQL is decomposed into parts (generate the SELECT, generate the WHERE, generate the FROM), word slots are defined per clause, and the table and column names identified by the encoder are filled into those slots. In practice, however, the results were poor, with many errors and SQL that did not conform to the grammar. So Generation-based methods were adopted. The most classic builds on an AST (abstract syntax tree), constraining the model to generate SQL along a grammar tree; PICARD, during beam search, continually checks against a memory space whether the SQL generated so far is consistent with the encoder's schema-linking output and syntactically valid. RESDSQL is another optimization strategy along these lines that generates more accurate SQL.
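To make the Sketch-based idea concrete, here is a toy slot-filling schematic; real sketch-based systems predict each slot with dedicated sub-networks, so this only shows the shape of the idea:

```python
# A toy schematic of Sketch-based decoding: the SQL is decomposed into
# clause-level word slots, and the table/column names identified by the
# encoder are filled in.
SKETCH = "SELECT {agg}({col}) FROM {table} WHERE {cond_col} {op} {value}"

def fill_sketch(agg, col, table, cond_col, op, value):
    # Works only when the query's shape matches the fixed template, which is
    # exactly why such methods break down on more complex SQL.
    return SKETCH.format(agg=agg, col=col, table=table,
                         cond_col=cond_col, op=op, value=value)

# Slots as an encoder might predict them for "average salary of engineers":
print(fill_sketch("AVG", "salary", "employee", "title", "=", "'engineer'"))
```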
Even with these methods, accuracy was still insufficient. Analysis showed the semantic mismatch problem persisted, so an Intermediate-representation architecture was added: intermediate languages between natural language and SQL, such as NatSQL and SemQL, were invented, and SQL is generated from the intermediate language. This improved accuracy further.
After the advent of large models, SQL generation, especially GPT-4's strategy-driven generation, has become more and more accurate. At present two main methods are used with large models: one is prompt engineering, and the other is instruction-based supervised fine-tuning, so that the large model generates the corresponding SQL.
One of this year's papers, DIN-SQL, illustrates what has become a common way for large models to handle SQL generation tasks: a chain-of-thought approach. In the traditional Seq2Seq method, the natural language and schema information were fed in and the model generated SQL end to end; that proved ineffective, which is what gave rise to the encoder-decoder optimization strategies. The same holds for large models. At first, the natural language and schema were simply fed in through the prompt, and accuracy suffered, so the large model's task was decomposed into subtasks: a schema-linking prompt template first finds the associated tables and columns; then, based on those tables and columns, the question is classified by complexity, from simple single-table SQL to joins to SQL with complex subqueries on top of joins. Why classify? Because generating a very simple single-table query with a complex prompt template actually degrades the result, so a classification prompt is added. The paper also adds a self-correction prompt that tells the large model to optimize the SQL it just produced. With this series of subtasks and chain-of-thought engineering, GPT-4 can finally generate much more accurate SQL statements.
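A schematic of this decomposed, chain-of-thought pipeline might look as follows; the prompt wording is invented and call_llm is a placeholder for whatever model API is in use, so this is a sketch rather than the paper's actual prompts:

```python
# A schematic of the decomposed chain-of-thought pipeline:
# schema linking -> complexity classification -> generation -> self-correction.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model API here")

def text2sql(question: str, schema: str) -> str:
    # 1. Schema linking: ask which tables/columns the question needs.
    links = call_llm(f"Schema:\n{schema}\nQuestion: {question}\n"
                     "List the tables and columns needed.")
    # 2. Classify the question: single table, join, or nested subquery.
    difficulty = call_llm(f"Question: {question}\nLinked schema: {links}\n"
                          "Classify as EASY / JOIN / NESTED.")
    # 3. Generate with a template matched to the class; a complex template on
    #    a trivially simple query actually reduces quality.
    sql = call_llm(f"[{difficulty} template]\nSchema: {links}\n"
                   f"Question: {question}\nWrite the SQL.")
    # 4. Self-correction: ask the model to check and fix its own output.
    return call_llm(f"Check this SQL and fix it if needed:\n{sql}")
```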
In practice, both traditional pre-trained models and large models face many challenges. A traditional pre-trained model's semantic generalization is clearly weaker than a large model's, and its generation ability is not necessarily stronger either: it is prone to mismatch problems and weak at generating complex query statements. It also lacks zero-shot and few-shot capability, so annotated corpora must be prepared, which is a heavy workload. And the GPU resources consumed, for example by a T5-3B model, are no less than those of a large model.
Large models face challenges too. Recent Text2SQL papers have focused almost entirely on GPT-4; there is little research on localized, privately deployed large models. Good prompting practice relies on chain of thought and on in-context learning, i.e., giving the large model sample data to guide it toward more accurate SQL, but complex schemas exceed the model's token limit. One workaround is to pre-process the tables into a wide table first, so that the generated SQL becomes a simple query over the wide table with no join operations; but generating the wide table itself runs into the same problem, a very large schema that easily exceeds the token limit. Fine-tuning on privatized large models is also rarely studied.
We believe both traditional pre-trained models and large models face a common challenge: in actual use, users do not know which database the data lives in, and finding the right columns and tables among numerous databases, tables, and thousands of rows of schema, that is, completing schema linking efficiently, is very hard. Many current Text2SQL evaluation tasks assume the database is given, and that database often has very few tables and columns, so the problem of finding data in a complex schema never arises.
We propose some practices and methods. First, build a semantic graph of the metadata; the semantics are the question plus the table names plus the column names, which is all the semantics needed to generate SQL. When generating the semantic graph, richer label information and indicator descriptions can be added to these semantics, making prompt generation more accurate. Building a semantic graph requires a complete set of data governance tools. When we model data, we have a set of automated data-standard tools, and these standards are pushed down into the data model, the process of implementing standards in the model. Once the data model is built on the data standards, the logical-layer and conceptual-layer metadata, together with the management metadata, are placed in a metadata catalog, and the semantic graph is generated from that catalog; this is the process of producing metadata lineage.
In addition, when building a data warehouse, some customers copy fact tables and dimension tables directly, modeling the entire warehouse straight from the logical layer. Once modeling is done, data developers do not know which business caliber a table corresponds to, and the business does not know which tables and workflows a business indicator depends on, which again creates semantic asymmetry. Where do the richer business-caliber semantics live? On the conceptual layer, and conceptual-layer semantics are now commonly modeled with knowledge graphs, so this lineage graph is very important for the Text2SQL task. Much of the data preprocessing in the papers is semantic transformation: for example, a table named department may be abbreviated to DEPT, and DEPT is a meaningless name, so how can the model recognize it? Some up-front data governance is therefore required.
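As a sketch of what such a metadata semantic graph could look like in code (using networkx purely for illustration; the node names, attributes, and relations are invented):

```python
# A sketch of a metadata semantic graph. Governance output (e.g. mapping the
# abbreviation DEPT to "department") is attached to nodes so that prompts can
# carry real business semantics instead of opaque physical names.
import networkx as nx

g = nx.DiGraph()
g.add_node("table:DEPT", kind="table", business_name="department")
g.add_node("col:DEPT.DNAME", kind="column", business_name="department name")
g.add_node("indicator:quarterly_sales", kind="indicator",
           description="total sales per quarter")
g.add_edge("table:DEPT", "col:DEPT.DNAME", rel="has_column")
g.add_edge("indicator:quarterly_sales", "table:DEPT", rel="derived_from")

def display_name(node: str) -> str:
    """When building a prompt, swap opaque physical names for governed semantics."""
    return g.nodes[node].get("business_name", node)

print(display_name("table:DEPT"))  # -> department
```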
At present, the industry does Text2SQL mainly on GPT-4. We have formed ecosystem partnerships with several domestic large models and connected their interfaces for testing. All of our tests are based on specific scenarios, specific data, and specific privatized large models, so they are for reference only and may not be broadly representative. We found that instruction fine-tuning of a privatized large model has limited effect on the Text2SQL task, though we believe this will be solved soon. In addition, generating prompts from metadata selected by schema linking improves the Text2SQL task on privatized large models by 5% to 10%: first use a schema-linking model to find the table and column names relevant to the question, then build the prompt templates from those names, rather than feeding all the metadata directly to the large model; this improves accuracy. Schema linking based on pre-trained models works better than schema linking done by large models. In-context learning has the most pronounced effect on privatized models, about a 10% improvement. The intermediate-language strategy from our earlier Text2SQL methods, generating a language between natural language and SQL, showed no benefit on some large models, and even hurt. Self-correction brings a slight improvement on privatized models. Compared with privatized models, GPT-4 is still far ahead.
This is the method we propose: schema linking uses a pre-trained model to find the table and column names relevant to the question, and the prompt for the large model is then built on top of them. Many current evaluations, especially the leaderboard data, use very few table and column names, so we deliberately made schema linking harder, giving it many more databases, tables, and column names, and its AUC did not drop significantly.
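A minimal sketch of this two-stage pipeline follows; score_relevance stands in for the RoBERTa-based ranker and is not a real library call:

```python
# A minimal sketch of the two-stage approach: a cross-encoder scores every
# schema element against the question, and only the top-ranked elements go
# into the LLM prompt.
def score_relevance(question: str, schema_item: str) -> float:
    raise NotImplementedError("RoBERTa-style cross-encoder goes here")

def build_prompt(question: str, schema_items: list[str], top_k: int = 8) -> str:
    ranked = sorted(schema_items,
                    key=lambda item: score_relevance(question, item),
                    reverse=True)
    linked = ranked[:top_k]  # feed only the linked subset, not the whole schema
    return (f"Tables and columns: {', '.join(linked)}\n"
            f"Question: {question}\nSQL:")
```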
Our schema linking is based on RESDSQL, from a paper published this year by Renmin University of China. The bottom layer is RoBERTa, on top of which a Pooling Module maps the sub-word tokens back to complete table or column names; when applying attention, attention over the column names embeds column information into the table-name representation, improving the ranking of the tables and columns relevant to the question. We do schema linking based on this model.
We also drew on Alibaba's DAIL-SQL paper, which proposes a standard method for prompt templates; with these templates plus GPT-4 it reaches an accuracy of 85-86%. It first uses Code Representation to represent the schema, then Intermediate Representation, and most importantly in-context learning: a similar question together with its SQL is assembled into the prompt template as an example, prompting the large model to generate correct SQL. What is missing there, in our view, is schema-structure similarity: in many cases the structure of a SQL statement depends not only on the question but also on the structure of the underlying database schema.
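A rough imitation of this prompt style is sketched below, with the schema as DDL ("code representation") and a similar question/SQL pair as the in-context example; the tables and questions are invented, and DAIL-SQL's example-selection step (choosing similar examples) is omitted:

```python
# A rough imitation of a code-representation prompt with one in-context
# example; everything here is illustrative.
PROMPT = """/* Schema */
CREATE TABLE employee (emp_id INT, dept_id INT, salary FLOAT);
CREATE TABLE department (dept_id INT, dept_name TEXT);

/* In-context example */
-- Question: How many employees are there?
SELECT COUNT(*) FROM employee;

/* Task */
-- Question: What is the average salary per department?
SELECT"""
```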
Some successful prompt practices are listed in the diagram above. For example, delimiters and structural design let the large model know which part is the question, which part is the schema, and which part is in-context learning. We also add output prefixes to make the model's answers more stable. This touches on the robustness and format compliance of large models: during testing, we had the large model attempt schema linking and output the question-relevant column and table names in JSON format, and perhaps 1 call in 10 contained an error, for example a missing closing bracket in the JSON, so robustness remains a problem.
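One defensive pattern for this JSON robustness issue is to combine delimiters, an output prefix, and a retry on parse failure; a hedged sketch, with call_llm as a placeholder for whatever model API is in use and the prompt wording invented:

```python
# A defensive sketch for malformed JSON output: parse, and on failure ask again.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a model API here")

def schema_link(question: str, schema: str, retries: int = 2) -> dict:
    prompt = (
        f"### Schema\n{schema}\n### Question\n{question}\n"
        '### Output\nReturn ONLY JSON like {"tables": [], "columns": []}\n'
        "{"  # output prefix: nudges the model to start with valid JSON
    )
    for _ in range(retries + 1):
        raw = "{" + call_llm(prompt)  # re-attach the prefix to the completion
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # e.g. a missing closing bracket; retry
    raise ValueError("model never produced valid JSON")
```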
Next, we will introduce how Digital Manufacturing Technology's products realize the integration of DataOps and large models.
The figure above shows Digital Manufacturing Technology's product architecture. The entire DataOps process, including data integration and development, is given standardized processes and methods; different privatized large models are adapted on top of the data lakehouse, and code generation and code interpretation based on those models help DataOps improve efficiency. At the same time, there is a unified metadata service: for the Text2SQL task, the semantics live in the table names, column names, and questions, so richer semantic information must be provided and a metadata lineage graph built.
The integrated data pipeline covers the management domain, the development domain, and the governance domain. The governance domain includes a set of automated data standards and tools that help enforce standards when modeling data and forbid the creation of meaningless table and column names, which would seriously degrade Text2SQL tasks. The development domain performs code generation, code parsing, and commenting based on the large model.
Finally, the product provides continuous integration and release capabilities, including creating and managing multiple environments so that one set of code runs across all of them.
The image above shows some of our product's functional interfaces. One of the biggest problems with open-source data lineage tools such as Atlas is that a table can have dozens of columns; when the graph unfolds, it explodes into a mass of nodes that is very hard to read. We have therefore done a lot of optimization work on data lineage: not only managing the lineage of the underlying metadata, but also optimizing the visualization so that data developers can genuinely explore the entire lineage through the visual interface.
The above figure shows the functions of automatic data marking and workflow orchestration.
This is the first version of our Text2SQL intelligent assistant: from the input question it finds the relevant metadata; if it misses, the relevant subject areas can also be corrected manually according to the data governance standards established earlier, and the result is then fed to the large model to generate the SQL.
The generated SQL is not just for reading: a convenient rich-text editor lets you select the SQL, run it directly in the editor, and view the report.
Finally, there are the features of continuous integration and continuous release.
04
Future outlook
Finally, share some prospects for the future.
We build a semantic graph of the metadata, especially the indicator layer, which carries more business semantics and business calibers. How to embed the output of each automated data-standard tool into the model is work we still need to do: given richer semantics, we want to see how much the model improves. Today, humans still have to supplement the semantics of the Spider data by hand.
There are many methods for retrieval-augmented Q&A based on subgraph retrieval over a knowledge graph. We are considering whether this subgraph-retrieval enhancement can be used to embed subgraph information and thereby improve schema linking; this is also a direction we want to explore in the future.
In addition, we have seen large models split the whole SQL task into different subtasks and guide correct SQL generation with per-subtask prompt templates. So could we build an agent on top of the large model and use different sub-models to complete the SQL generation at each stage? That is our current working hypothesis.
We have seen many long-context models emerge in recent months: mainstream large models offer perhaps 2k or 4k tokens, while some long-text models have been pushed to 200k. The biggest problem in Text2SQL is how to feed the large volume of production-environment metadata to the model, so we wonder whether a long-context model could take all the metadata directly, with no more metadata sharding, and still generate SQL. We are still waiting to see how these models perform: long-context models attend well to the head and tail of the context, and information in the middle may be lost, so this is a point we are watching.
Focusing on data governance, DataOps, and data development, Digital Manufacturing Technology is a provider of a new-generation agile data management platform with big data enablement capabilities.
Our product is a one-stop platform for data development and governance, covering data governance, lineage graph construction, tools for automated data-standard governance, and industry big data solutions.
05
Q&A session
Q1: In our current implementation, we first search for the data: finding the table involves separate requests to some models, and sometimes table fields are translated to determine which tables are available, before the SQL is generated. Because this step involves recall, the top-ranked tables are sometimes not the tables we want. How do you handle this?
A1: If the recall is based on translated table names, it still comes down to the capability of the base model.
Q2: When evaluating the model, SQL complexity is rated by the number of GROUP BYs or columns. We measured low-complexity cases, such as single tables with no joins, and the success rate is high; but once complexity rises, some SQL cannot execute, and some executes but returns numbers that are not what we want to see. How do you solve this kind of problem?
A2: We ran into the same problem. Single-table queries, as in WikiSQL, work really well, but for SQL with joins and nested subqueries the results are unsatisfactory, even after supervised fine-tuning of large models; we encountered many such problems especially when testing on the Spider data.
Q3: Has the Text2SQL you mentioned been put to use on the business side?
A3: The business side is still trialing it, because the accuracy is limited.
Q4: We ran into a problem in actual deployment: because a probabilistic model sits behind it, the recall may differ every time. Second, the metadata may be of poor quality and require a lot of data governance; if we face an analyst or a development engineer, they may feel that using this chat capability is very expensive. How do you balance this? If they do not use it, there is no way to keep tuning the model, because we have no real data to train it; and if it stays a pre-trained model, there will be problems in real deployment scenarios.
A4: Yes, this is a classic question. For example, with cold starts there is not much relevant corpus in the pre-training data or the evaluation data. In addition, a lot of semantic information, including table names and column names, needs data governance. As a data governance vendor, we believe that without data governance there will be many problems not only for Text2SQL but for all downstream business systems and models. So for us this is not a matter of choice: data governance is a must.
But that raises a problem: governance does not monetize quickly, while the business side wants asset management to yield production value faster, and the cycle is long. We do not have a good answer to this yet.
Q5: How is the metadata information fed to the large model?
A5: What we input to the large model is a natural-language expression: our prompt is natural language, not a vectorized form; the embedding layer inside the model does the vectorization.
Q6: In what form is the metadata fed to the large model, and what does it include?
A6: The metadata prompt template depends on the strategy. Some assemble everything into a sequence (table name: column names), with parentheses marking a column as a primary key or foreign key. Others put DDL statements directly into the prompt. For large models, schema in DDL form may work better.
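For illustration, the two encodings of one invented table might look like this:

```python
# The two schema encodings mentioned above, side by side, for one toy table.
# Which works better depends on the model; DDL often suits models that saw a
# lot of SQL during pretraining.
sequence_form = "employee: emp_id (primary key), dept_id (foreign key), salary"

ddl_form = """CREATE TABLE employee (
    emp_id INT PRIMARY KEY,
    dept_id INT REFERENCES department(dept_id),
    salary FLOAT
);"""
```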
Q7: Regarding prompt settings for the SQL-writing process, are there pre-built settings in the intelligent assistant, or does the user have to write them? For example, in one scenario we have enumerated fields: the gender field is stored as 0 and 1 in the data table, and the business question is "count the play time of all the men in this table." How do you make GPT understand that gender 0 represents male, and reflect that in the SQL?
A7: There is already a set of standardized templates behind our intelligent assistant. For each model we connect, such as GPT-4 or a privately deployed large model, we define a default template. We are also considering templates that users may want to explore themselves, and may open up a prompt-editing function; for now the templates are built into the system.
As for enumerated values: in your question, if you just tell the large model to filter for gender equal to male, the model may not be able to resolve that information and may produce an error.
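One way to handle such enumerated codes is to inject their value mappings into the prompt; a hedged sketch, under the assumption that the mappings come from governed metadata, with ENUM_MAPPINGS and enum_hints as illustrative names:

```python
# A hypothetical sketch of injecting enumerated-value mappings into the
# prompt so the model can translate "male" into the stored code.
ENUM_MAPPINGS = {"player.gender": {"0": "male", "1": "female"}}

def enum_hints(columns: list[str]) -> str:
    lines = []
    for col in columns:
        mapping = ENUM_MAPPINGS.get(col)
        if mapping:
            pairs = ", ".join(f"{code} = {label}" for code, label in mapping.items())
            lines.append(f"-- {col} values: {pairs}")
    return "\n".join(lines)

print(enum_hints(["player.gender"]))
# -- player.gender values: 0 = male, 1 = female
```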
Q8: You said there is a prompt template in the background, so the question a user enters in the dialog box is actually combined with that template, sent to GPT, and the SQL is then generated?
A8: Yes, that's right.
That's all for this sharing, thank you.