
No manual annotation needed: AutoAlign fully automates knowledge graph alignment with large language models

AIxiv is a column in which Machine Heart publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world and effectively promoting academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us. Submission mailboxes: [email protected]; [email protected]

This work was jointly completed by a team of scholars from Tsinghua University, the University of Melbourne, The Chinese University of Hong Kong, and the University of Chinese Academy of Sciences, including Rui Zhang, Yixin Su, Bayu Distiawan Trisedya, Xiaoyan Zhao, Min Yang, Hong Cheng, and Jianzhong Qi. The team focuses on research in large models, knowledge graphs, recommendation and search, natural language processing, and big data.

As an important carrier of structured knowledge, knowledge graphs are widely used in fields such as information retrieval, e-commerce, and decision reasoning. However, knowledge graphs built by different institutions or methods differ in representation and coverage. Effectively integrating different knowledge graphs into a more comprehensive and richer body of knowledge is therefore key to improving coverage and accuracy, and it is the core challenge addressed by the knowledge graph alignment task.

Traditional knowledge graph alignment methods rely on manually annotated seed pairs of aligned entities and predicates. Such methods are expensive, inefficient, and yield poor alignment quality. Scholars from Tsinghua University, the University of Melbourne, The Chinese University of Hong Kong, and the University of Chinese Academy of Sciences have jointly proposed AutoAlign, a fully automatic knowledge graph alignment method based on large language models. AutoAlign needs no manually labeled seed entity or predicate pairs; it aligns the two graphs entirely through the algorithm's understanding of entity semantics and structure, significantly improving both efficiency and accuracy.


Paper: AutoAlign: Fully Automatic and Effective Knowledge Graph Alignment Enabled by Large Language Models. TKDE 36(6), 2024

Link to paper: https://arxiv.org/abs/2307.11772

Code link: https://github.com/ruizhang-ai/AutoAlign

Model introduction

AutoAlign consists of two main parts:

A Predicate Embedding Module, which aligns predicates across the two graphs.

An entity embedding learning part, which aligns entities and consists of two modules: an Attribute Embedding Module and a Structure Embedding Module.

The overall process is shown in the following figure:

(Figure: the overall AutoAlign workflow)

Predicate embedding module: this module aligns predicates that express the same meaning in the two knowledge graphs, for example is_in and located_in. To achieve this, the research team constructs a predicate proximity graph, which merges the two knowledge graphs into a single graph and replaces each entity with its corresponding entity types. The approach rests on the assumption that the same (or similar) predicates should connect similar entity types (for example, the target entity type of both "is_in" and "located_in" is most likely location or city). A large language model's semantic understanding of these types is then used to align them further, which improves the accuracy of triple learning. Finally, the predicate proximity graph is encoded with a graph embedding method (such as TransE) so that the same (or similar) predicates obtain similar embeddings, thereby aligning the predicates.

In terms of implementation, the research team first constructs the predicate proximity graph, a graph that describes the relationships between entity types. An entity type represents a broad category of entities and can be attached to entities automatically. Even when predicates have different surface forms (e.g., "LGD:is_in" and "DBP:located_in"), their similarity can be effectively identified by learning over the predicate proximity graph. The graph is built in the following steps:

Entity type extraction: the research team extracts entity types by collecting the values of each entity's rdf:type predicate in the knowledge graph. An entity typically has multiple types; a Germany entity, for example, may carry the types "thing", "place", "location", and "country". In the predicate proximity graph, the head and tail entities of each triple are replaced with their sets of entity types.
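
As a concrete illustration, here is a minimal Python sketch of this step, assuming triples are stored as (head, predicate, tail) string tuples; the function names, the "thing" fallback type, and the toy data are illustrative, not the paper's code.

    # Build predicate-proximity triples by replacing each entity with its
    # set of rdf:type values. All names and the toy data are illustrative.
    from collections import defaultdict

    def extract_entity_types(triples):
        """Collect the rdf:type values observed for every entity."""
        types = defaultdict(set)
        for head, predicate, tail in triples:
            if predicate.endswith("type"):       # e.g. rdf:type
                types[head].add(tail)
        return types

    def build_proximity_triples(triples, types):
        """Replace head/tail entities with their entity-type sets."""
        proximity = []
        for head, predicate, tail in triples:
            if predicate.endswith("type"):
                continue                         # type assertions were consumed above
            proximity.append((frozenset(types.get(head, {"thing"})),  # assumed fallback
                              predicate,
                              frozenset(types.get(tail, {"thing"}))))
        return proximity

    kg = [("Germany", "rdf:type", "country"),
          ("Germany", "rdf:type", "place"),
          ("Berlin", "rdf:type", "city"),
          ("Berlin", "LGD:is_in", "Germany")]
    # -> one proximity triple: {'city'} --LGD:is_in--> {'country', 'place'}
    print(build_proximity_triples(kg, extract_entity_types(kg)))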

Type alignment: entity types in different knowledge graphs may use different surface forms (e.g., "person" and "people"), so these types must be aligned. To do this, the research team leverages recent large language models, such as ChatGPT and Claude, to align the types automatically. For example, Claude 2 can be used to identify similar type pairs across the two knowledge graphs, after which all similar types are merged into a unified representation. The team designed a set of automated prompts that adapt to different knowledge graphs and align the type words without human intervention.
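
The paper's exact prompts are not reproduced here; the sketch below only shows the general shape such an LLM call might take. `call_llm` is a hypothetical stand-in for a real chat-completion client, and the prompt wording and "a = b" output convention are assumptions.

    # Hypothetical sketch of LLM-driven type alignment. `call_llm` is a
    # placeholder for a real chat-completion client; the prompt wording and
    # the "a = b" output convention are assumptions, not the paper's prompts.
    def align_types(types_kg1, types_kg2, call_llm):
        prompt = (
            "Below are entity types from two knowledge graphs.\n"
            f"KG1 types: {sorted(types_kg1)}\n"
            f"KG2 types: {sorted(types_kg2)}\n"
            "List every pair that denotes the same concept, one 'a = b' per line."
        )
        mapping = {}
        for line in call_llm(prompt).splitlines():
            if "=" in line:
                a, b = (part.strip() for part in line.split("=", 1))
                mapping[b] = a    # rewrite KG2 types into KG1's vocabulary
        return mapping

    # e.g. align_types({"person", "location"}, {"people", "place"}, my_client)
    # might return {"people": "person", "place": "location"}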

To capture predicate similarity, an entity's multiple types must be aggregated. The research team proposed two aggregation functions, a weighted function and an attention-based function, and found in experiments that the attention-based function works better. Concretely, an attention weight is computed for each entity type, and the final pseudo-type embedding is obtained as the weighted sum of the type embeddings. The predicate embeddings are then trained by minimizing the objective function so that similar predicates obtain similar vector representations.
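
A minimal PyTorch sketch of the attention-based aggregation follows; the embedding dimension and the linear attention scorer are one plausible parameterization, not necessarily the paper's exact formulation.

    # Attention-weighted aggregation of an entity's type embeddings into a
    # single pseudo-type embedding. The parameterization is illustrative.
    import torch
    import torch.nn as nn

    class PseudoTypeEmbedding(nn.Module):
        def __init__(self, num_types, dim):
            super().__init__()
            self.type_emb = nn.Embedding(num_types, dim)
            self.attn = nn.Linear(dim, 1)            # scores each type embedding

        def forward(self, type_ids):                 # (n,) type ids of one entity
            embs = self.type_emb(type_ids)           # (n, dim)
            weights = torch.softmax(self.attn(embs).squeeze(-1), dim=0)
            return (weights.unsqueeze(-1) * embs).sum(dim=0)  # weighted sum

    module = PseudoTypeEmbedding(num_types=1000, dim=64)
    pseudo = module(torch.tensor([3, 17, 42]))       # an entity with three types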

Attribute embedding module and structure embedding module: both modules are used for entity alignment. Their idea is similar to that of predicate embedding: for the same (or similar) entity, the predicate and the other entity in its triples should also be similar. Therefore, once predicates are aligned (via the predicate embedding module) and attributes are aligned (via the attribute character embedding method), similar entities can learn similar embeddings through TransE. Specifically:
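
For reference, the standard TransE score and margin loss look roughly as follows; the margin value and the negative-sampling scheme are assumptions left schematic here.

    # Standard TransE scoring and margin loss (the graph-encoding step named
    # above). Negative triples h_neg/t_neg come from corrupting positives.
    import torch
    import torch.nn.functional as F

    def transe_score(h, r, t):
        """Smaller ||h + r - t|| means a more plausible triple."""
        return (h + r - t).norm(p=2, dim=-1)

    def transe_margin_loss(h, r, t, h_neg, t_neg, margin=1.0):
        pos = transe_score(h, r, t)
        neg = transe_score(h_neg, r, t_neg)
        # push positives at least `margin` closer than corrupted triples
        return F.relu(margin + pos - neg).mean()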

Attribute embedding learning: the attribute embedding module relates a head entity to an attribute value by encoding the attribute value's character sequence. The research team proposed three compositional functions for this encoding: a summation function, an LSTM-based function, and an N-gram-based function. These functions capture the similarity between attribute values, allowing entity attributes in the two knowledge graphs to be aligned.
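
The sketch below shows one plausible reading of the three compositional functions over a character sequence; the tensor shapes and the maximum n-gram length are assumptions.

    # Three ways to compose a (seq_len, dim) matrix of character embeddings
    # into one attribute-value embedding. Shapes and max_n are assumptions.
    import torch

    def sum_compose(char_embs):
        return char_embs.sum(dim=0)              # order-insensitive summation

    def lstm_compose(char_embs, lstm):
        # assumes an nn.LSTM(dim, hidden, batch_first=True)
        out, _ = lstm(char_embs.unsqueeze(0))
        return out[0, -1]                        # final hidden state

    def ngram_compose(char_embs, max_n=3):
        # average the mean-pooled embeddings of all n-grams up to length max_n
        parts = []
        for n in range(1, max_n + 1):
            for i in range(char_embs.size(0) - n + 1):
                parts.append(char_embs[i:i + n].mean(dim=0))
        return torch.stack(parts).mean(dim=0)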

Structure embedding learning: the structure embedding module improves on TransE by assigning different weights to different neighbors when learning entity embeddings. Aligned and implicitly aligned predicates receive higher weights, while unaligned predicates are treated as noise. In this way, the module learns more effectively from the aligned triples.
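
One way such weighting might enter the loss is sketched below; how the per-predicate weights are computed is not shown here, so `predicate_weight` is a hypothetical input.

    # Hypothetical sketch: scale each triple's margin loss by a per-predicate
    # weight (near 1 for aligned predicates, near 0 for noisy ones).
    import torch
    import torch.nn.functional as F

    def weighted_structure_loss(h, r, t, h_neg, t_neg, predicate_weight,
                                margin=1.0):
        pos = (h + r - t).norm(p=2, dim=-1)
        neg = (h_neg + r - t_neg).norm(p=2, dim=-1)
        return (predicate_weight * F.relu(margin + pos - neg)).mean()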

Joint training: the predicate embedding, attribute embedding, and structure embedding modules are trained alternately, influencing one another through this alternating scheme so that the embeddings converge toward a representation that is optimal overall. After training, the research team obtains embedded representations of entities, predicates, attributes, and types. Finally, entity similarity (e.g., cosine similarity) is computed across the two knowledge graphs, and entity pairs whose similarity exceeds a threshold are taken as alignments.
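
The final matching step can be sketched as below, assuming two matrices of entity embeddings; the 0.9 threshold is an illustrative value, not the paper's.

    # Match entities across two KGs by cosine similarity; keep each KG1
    # entity's best KG2 candidate if it clears a threshold (0.9 is assumed).
    import torch
    import torch.nn.functional as F

    def match_entities(emb_kg1, emb_kg2, threshold=0.9):
        sims = F.normalize(emb_kg1, dim=1) @ F.normalize(emb_kg2, dim=1).T
        best_sim, best_idx = sims.max(dim=1)
        return [(i, int(j)) for i, (s, j) in enumerate(zip(best_sim, best_idx))
                if s >= threshold]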

Experimental results

The research team conducted experiments on the recent benchmark dataset DWY-NB (Zhang et al., 2022); the main results are shown in the table below.

(Table: main experimental results on the DWY-NB benchmark)

AutoAlign delivers significant improvements in knowledge graph alignment, especially when no manual seeds are available. Without human annotation, existing models can hardly align at all, whereas AutoAlign performs strongly under exactly these conditions. On both datasets, AutoAlign without any manual seeds substantially outperforms the best existing baselines, even when those baselines are given manual annotations. These results show that AutoAlign not only surpasses existing methods in alignment accuracy but also holds a decisive advantage in fully automatic alignment tasks.

Bibliography:

Rui Zhang, Bayu D. Trisedya, Miao Li, Yong Jiang, and Jianzhong Qi (2022). A Benchmark and Comprehensive Survey on Knowledge Graph Entity Alignment via Representation Learning. VLDB Journal, 31(5), 1143–1168.
