
AgreementMaker: Efficient Matching for Large Real-World Schemas and Ontologies (Translation)

Before the Main Text

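This paper is a Previous Work cited by a Previous Work of the framework-based ontology matching paper I read a few days ago. It is admittedly a little weak, but it is still a useful reference, so I downloaded the source code and plan to read the accompanying paper. I personally prefer reading Chinese and English side by side: I skim the Chinese to skip the unimportant parts and read the original at the key points. That is why I keep producing these translated versions.
Citation: Cruz I F, Antonelli F P, Stroe C. AgreementMaker: efficient matching for large real-world schemas and ontologies[J]. Proceedings of the VLDB Endowment, 2009, 2(2): 1586-1589.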

Main Text

Abstract

摘要

We present the AgreementMaker system for matching real world schemas and ontologies, which may consist of hundreds or even thousands of concepts. The end users of the system are sophisticated domain experts whose needs have driven the design and implementation of the system: they require a responsive, powerful, and extensible framework to perform, evaluate, and compare matching methods. The system comprises a wide range of matching methods addressing different levels of granularity of the components being matched (conceptual vs. structural), the amount of user intervention that they require (manual vs. automatic), their usage (stand-alone vs. composed), and the types of components to consider (schema only or schema and instances). Performance measurements (recall, precision, and runtime) are supported by the system, along with the weighted combination of the results provided by those methods. The AgreementMaker has been used and tested in practical applications and in the Ontology Alignment Evaluation Initiative (OAEI) competition. We report here on some of its most advanced features, including its extensible architecture that facilitates the integration and performance tuning of a variety of matching methods, its capability to evaluate, compare, and combine matching results, and its user interface with a control panel that drives all the matching methods and evaluation strategies.

我們提出了AgreementMaker系統,用于比對真實世界模式和本體,可能包含數百甚至數千個概念。系統的最終使用者是複雜的領域專家,他們的需求推動了系統的設計和實作:他們需要一個響應迅速,功能強大且可擴充的架構來執行,評估和比較比對方法。該系統包含多種比對方法,可以解決比對的元件(概念與結構)的不同粒度級别,他們需要的使用者幹預量(手動與自動),它們的使用(獨立與組合),以及要考慮的元件類型(僅架構或架構和執行個體)。系統支援性能測量(召回率,準确率和運作時性能),以及這些方法提供的結果的權重組合。 AgreementMaker已在實際應用和Ontology Alignment Evaluation Initiative(OAEI)競賽中使用和測試。我們在此報告其一些最先進的功能,包括其可擴充的體系結構,有助于各種比對方法的內建和性能調整,評估,比較群組合比對結果的能力,以及控制所有比對方法和評估政策的使用者界面和控制台。

1. Introduction

1. 介紹

The issue of schema matching in databases [11], which has been investigated since the early 80’s, is fundamental to data integration, as is the closely-related issue of ontology alignment or matching [12]. The matching problem consists of defining mappings among schema or ontology elements that are semantically related. Such mappings are typically defined between two schemas or two ontologies at a time, one being called the source and the other being called the target.

自80年代早期以來一直在研究的資料庫[11]中的模式比對問題是資料內建的基礎,與本體對齊或比對密切相關的問題也是如此[12]。比對問題包括定義在語義上相關的 模式或本體元素之間 的映射。這種映射通常在兩個模式或兩個本體之間定義,一個被稱為源本體,另一個被稱為目标本體。

We have been developing the AgreementMaker matching system, whose name takes after agreement, the encoding of a mapping. The capabilities of our system have been driven by the real-world problems of end users who are sophisticated domain experts. We have considered a variety of domains and applications, including: geospatial [2], environmental [4], and biomedical [13]. The conceptual information for these applications is stored in the form of ontologies. However, as demonstrated by others, the same approach can be used for schema matching [1, 10]. To validate our approach, we competed against seven other systems in the biomedical track of the 2007 Ontology Alignment Evaluation Initiative (OAEI), to match ontologies describing the mouse adult anatomy of the Mouse Gene Expression Database Project (2744 classes) and the human anatomy of the National Cancer Institute (3304 classes). We came in third in terms of accuracy (F-measure) [5].

我們一直在開發AgreementMaker比對系統,其名稱取決于協定(映射的編碼)。我們系統的功能受到最終使用者的現實問題的驅動,這些最終使用者是非常複雜的領域專家。我們已經考慮了各種領域和應用,包括:地理空間[2],環境[4]和生物醫學[13]。這些應用程式的概念資訊以本體的形式存儲。但是,正如其他人所證明的那樣,相同的方法可以用于模式比對[1,10]。為了驗證我們的方法,我們與2007年本體校準評估計劃(OAEI)的生物醫學行業中的其他七個系統進行了競争,以比對描述小鼠基因表達資料庫項目(2744類)的成年小鼠解剖學的本體和國家癌症研究所(3304類)的人體解剖學分類本體。我們在準确性方面排名第三(F-measure)[5]。

The AgreementMaker, which is currently in its third version, has been evolving to accommodate: (1) user requirements, as expressed by domain experts; (2) a wide range of input (ontology) and output (agreement file) formats; (3) a large choice of matching methods depending on the different granularity of the set of components being matched (local vs. global), on different features considered in the comparison (conceptual vs. structural), on the amount of intervention that they require from users (manual vs. automatic), on usage (stand-alone vs. composed), and on the types of components to consider (schema only or schema and instances); (4) improved performance, that is, accuracy (precision, recall, F-measure) and efficiency (execution time) for the automatic methods; (5) an extensible architecture to incorporate new methods easily and to tune their performance; (6) the capability to evaluate, compare, and combine different strategies and matching results; (7) a comprehensive user interface supporting both advanced visualization techniques and a control panel that drives all the matching methods and evaluation strategies.

目前處于第三版的AgreementMaker正在不斷發展以适應:(1)領域專家表達的使用者需求; (2)廣泛的輸入(本體)和輸出(協定檔案)格式; (3)根據不同粒度的元件集的比對選項(本地與全局),在比較中考慮的不同特征(概念與結構),他們需要的來自使用者的幹預量(手動與自動),使用(獨立與組合),以及要考慮的元件類型(僅架構或架構和執行個體); (4)改進性能,即自動方法的準确度(精确度,召回率,F測量值)和效率(執行時間); (5)可擴充的架構,可以輕松地整合新方法并調整其性能; (6)評估,比較群組合不同政策和比對結果的能力; (7)全面的使用者界面,支援進階可視化技術和控制台,驅動所有比對方法和評估政策。

In this demo paper, we focus on the most recent developments of the system, which has been almost completely redesigned in the last year. In particular, we describe: (1) the user interface with particular emphasis on the control panel and improved visualization and interaction capabilities; (2) the automatic matching methods and execution capabilities; and (3) the evaluation strategies for determining the efficiency of the matching methods and for performing the combination of results.

在本示範文章中,我們将重點介紹該系統的最新發展,該系統在去年幾乎完全重新設計。特别是,我們描述:(1)使用者界面,特别強調控制台和改進的可視化和互動功能; (2)自動比對方法和執行能力; (3)用于确定比對方法的效率和執行結果組合的評估政策。

2. RELATED WORK

2.相關工作

There are several notable systems related to ours, including Clio [6], COMA++ [1], Falcon-AO [7], and RiMOM [14] (just to mention a few). Clio stands apart because of its single focus on database-specific constraints and operators (e.g., foreign keys, joins) to infer the mappings whereas constraints in ontologies (as implemented in the other three systems and in AgreementMaker) are of a different nature [12]. This different emphasis also permeates the remaining components of the various systems, as those that also support ontology matching implement a rich tool box of string-similarity and structural-based techniques and focus on performance. Consequently, some of these systems do not focus on user interaction: for example, Falcon-AO and RiMOM provide simple interfaces that offer limited user interaction (e.g., no manual manipulation of the ontologies). However, what separates AgreementMaker from these other systems (including from COMA++, which has a more sophisticated user interface than the other two) is the degree to which it integrates the evaluation of the quality of the obtained mappings with the graphical user interface and therefore with the iterative matching process. This tight integration emerged from our work with domain experts, who required that the evaluation be an integral part of the matching process, not an “add on” capability.

有幾個與我們相關的着名系統,包括Clio [6],COMA ++ [1],Falcon-AO [7]和Ri MOM [14](僅舉幾例)。 Clio之是以與衆不同,是因為它專注于特定于資料庫的限制和運算符(例如,外鍵,連接配接)來推斷映射,而本體中的限制(在其他三個系統和AgreementMaker中實作)具有不同的性質[12 ]。這種不同的重點也滲透到各種系統的其餘元件中,因為那些支援本體比對的元件實作了豐富的相似性和基于結構的技術工具箱,并專注于性能。是以,這些系統中的一些不關注使用者互動:例如,Falcon-AO和Ri MOM提供了限制使用者互動的簡單接口(例如,沒有對本體的手動操縱)。然而,将AgreementMaker與其他系統(包括COMA ++,其具有比其他兩個更複雜的使用者界面)差別開來的是它将獲得的映射的品質評估與圖形使用者界面內建的程度,是以疊代比對過程(大意是可以直接看到評估結果的改進?)。這種緊密內建源于我們與領域專家的合作,他們要求評估是比對過程中不可或缺的一部分,而不是“附加”功能。

3. ARCHITECTURE

3.架構

The AgreementMaker supports a wide variety of methods or matchers. Our architecture (see Figure 1) allows for serial and parallel composition where, respectively, the output of one or more methods can be used as input to another one, or several methods can be used on the same input and then combined. A set of mappings may therefore be the result of a sequence of steps, called layers.

AgreementMaker支援各種方法或比對器。我們的體系結構(參見圖1)允許串行和并行組合,其中一個或多個方法的輸出可以分别用作另一個方法的輸入,或者可以在同一輸入上使用多個方法然後組合。是以,一組映射可能是一系列步驟的結果,稱為層。

The matching process of a generic matcher (see Figure 2) can be divided into two main modules: (1) similarity computation in which each concept of the source ontology is compared with all the concepts of the target ontology, thus producing two similarity matrices (one for classes and the other one for properties), which contain a value for each pair of concepts; (2) mappings selection in which the matrix is scanned to select only the best mappings according to a given threshold and to the cardinality of the correspondences, for example, 1-1, 1-N, N-1, M-N.

通用比對器的比對過程(見圖2)可以分為兩個主要子產品:(1)相似度計算,其中源本體的每個概念與目标本體的所有概念進行比較,進而産生兩個相似性矩陣(一個用于類,另一個用于屬性),其中包含每對概念的值; (2)映射選擇,掃描矩陣以根據給定門檻值和對應關系的基數僅選擇最佳映射,例如1-1,1-N,N-1,M-N
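To make the two modules concrete, here is a minimal Python sketch. It is not the AgreementMaker code: the function names are hypothetical, the similarity is simply `difflib.SequenceMatcher` on lowercased labels, and the 1-1 selection is a plain greedy pass, all chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity_matrix(source, target):
    """Module 1: compare every source concept with every target concept."""
    return [[SequenceMatcher(None, s.lower(), t.lower()).ratio() for t in target]
            for s in source]


def select_mappings(matrix, threshold, one_to_one=True):
    """Module 2: scan the matrix and keep only pairs at or above the threshold.

    With one_to_one=True the pairs are taken greedily by descending score so
    that each row and each column is used at most once (1-1 cardinality).
    Otherwise every pair above the threshold is kept (M-N cardinality).
    """
    candidates = sorted(
        ((score, i, j)
         for i, row in enumerate(matrix)
         for j, score in enumerate(row)
         if score >= threshold),
        key=lambda c: (-c[0], c[1], c[2]),
    )
    if not one_to_one:
        return [(i, j, score) for score, i, j in candidates]
    used_rows, used_cols, chosen = set(), set(), []
    for score, i, j in candidates:
        if i not in used_rows and j not in used_cols:
            chosen.append((i, j, score))
            used_rows.add(i)
            used_cols.add(j)
    return chosen


source = ["heart", "left lung", "kidney"]
target = ["Kidney", "Heart", "Lung"]
matrix = similarity_matrix(source, target)
mappings = select_mappings(matrix, threshold=0.6)
for i, j, score in mappings:
    print(source[i], "->", target[j], round(score, 2))
```

Keeping the two modules as separate functions mirrors the split described above: the same matrix can be re-scanned with a different threshold or cardinality without recomputing any similarity.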

To enable extensibility, we adopted the object-oriented template pattern by defining the skeleton of the matching process in a generic matcher, which defers only a few operations to the concrete matcher extensions (see Figure 3). This abstraction minimizes development effort by completely decoupling the structure of a single method from the architecture of the whole system, thus allowing reuse or any possible composition of matching modules.

為了實作可擴充性,我們通過在通用比對器中定義比對過程的架構來實作面向對象的模闆模式(???不懂),該模式僅将少數操作推遲到具體的比對器擴充(參見圖3)。這種抽象通過将單個方法的結構與整個系統的體系結構完全解耦來最小化開發效率,進而允許重用或任何可能的比對子產品組合。
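For readers unfamiliar with the template pattern mentioned here, the following Python sketch shows the general idea under my own assumptions: the class and method names (`AbstractMatcher`, `similarity`, `select`) are hypothetical and do not come from the AgreementMaker source. The generic matcher fixes the order of the steps, and each concrete matcher only supplies the similarity function.

```python
from abc import ABC, abstractmethod


class AbstractMatcher(ABC):
    """Generic matcher: defines the skeleton of the matching process."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def match(self, source, target):
        # Template method: the sequence of steps is fixed here.
        matrix = [[self.similarity(s, t) for t in target] for s in source]
        return self.select(source, target, matrix)

    @abstractmethod
    def similarity(self, s, t):
        """The operation deferred to concrete matcher extensions."""

    def select(self, source, target, matrix):
        # Default selection shared by all matchers: threshold only (M-N).
        return [(s, t)
                for i, s in enumerate(source)
                for j, t in enumerate(target)
                if matrix[i][j] >= self.threshold]


class ExactLabelMatcher(AbstractMatcher):
    def similarity(self, s, t):
        return 1.0 if s.lower() == t.lower() else 0.0


class TokenOverlapMatcher(AbstractMatcher):
    def similarity(self, s, t):
        # Jaccard overlap of the whitespace-separated tokens.
        a, b = set(s.lower().split()), set(t.lower().split())
        return len(a & b) / len(a | b) if a | b else 0.0


exact = ExactLabelMatcher().match(["Heart", "Lung"], ["heart", "Liver"])
overlap = TokenOverlapMatcher(0.5).match(["left lung"], ["lung", "left lung lobe"])
print(exact)
print(overlap)
```

Adding a new matching method then means writing one small subclass, which is the decoupling the paragraph above is describing.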

A first layer matcher produces the similarity matrices, while the second and third layer matchers extend the first layer matchers. In particular, a second layer matcher improves on the results of a first layer matcher using conceptual or structural information, depending on whether it considers one concept alone or a concept and its neighbors. Finally, a third layer matcher combines the results of two or more matchers from the previous layers, in order to obtain a final matching or alignment, that is, a set of mappings.

第一層比對器産生相似性矩陣,而第二和第三層比對器擴充第一層比對器。特别地,第二層比對器使用概念或結構資訊改進第一層比對器的結果,這取決于它是單獨考慮一個概念還是概念及其鄰居。最後,第三層比對器組合來自先前層的兩個或更多個比對器的結果,以便獲得最終比對或對齊,即一組映射。

4. USER INTERFACE

4.使用者界面

The source and target ontologies (in XML, RDFS, OWL, or N3) are visualized side by side using the familiar outline tree paradigm (see Figure 4). Agreements can be exported in different formats (e.g., XML, Excel). Because all the matching operations and their results are managed by this interface, we gave special consideration to its design [4]. We describe next two new features of the interface: the control panel and the visualization of non-hierarchical ontologies (e.g., due to multiple inheritance in OWL). The latter feature allows for specific subtrees to be visually duplicated. Because we adopt the Model-View-Control pattern, this duplication does not affect the underlying data structures. The control panel (see Figure 5) allows users to run and manage matching methods and their results. Users can select parameters common to all methods (such as threshold and cardinality) and method-specific parameters. When a method has run, a new row is dynamically added to the table that is part of the control panel at the same time that lines depicting the mappings between the concepts are added (see Figure 4). Each row is color coded and allows for its selection so that the corresponding mappings (of the same color) can be compared visually. Each row also displays the performance values for the associated methods, thus allowing for the comparison with those of other rows. In addition, users can modify at runtime the method parameters by changing directly their values in the table or by selecting previously calculated matchings as input to the methods to be applied next. Multiple matchings can also be combined manually or with an automatic combination matcher.

源和目标本體(在XML,RDFS,OWL或N3中)使用熟悉的大綱樹範例并排顯示(參見圖4)。比對結果可以以不同的格式導出(例如,XML,Excel)。由于所有比對操作及其結果均由此接口管理,是以我們特别考慮了其設計[4]。我們将介紹接口的下兩個新功能:控制台和非分層結構的可視化(例如,由于OWL中的多重繼承)。後一特征允許在視覺上複制特定的子樹。因為我們采用模型-視圖-控制模式,是以這種應用不會影響基礎資料結構。控制台(參見圖5)允許使用者運作和管理比對方法及其結果。使用者可以選擇所有方法共有的參數(例如門檻值和基數)和特定于方法的參數。當一個方法運作時,一個新行被動态地添加到作為控制台一部分的表中,同時添加了描述概念之間映射的行(參見圖4)。每行都是彩色編碼的,并允許其選擇,以便可以在視覺上比較相應的映射(相同顔色)。每行還顯示相關方法的性能值,進而允許與其他行的性能值進行比較。此外,使用者可以在運作時通過直接更改表中的值或通過選擇先前計算的比對結果作為下一個要應用的方法的輸入來修改這個方法的參數。多個比對也可以手動組合或與自動組合比對器組合。

5. MATCHING METHODS

5.比對方法

First layer matchers compare concept features (e.g., label, comments, annotations, and instances) and use a variety of methods including syntactic and lexical comparison algorithms as well as the use of a lexicon like WordNet. Of those methods some were proposed by others (e.g., edit distance, Jaro-Winkler) and some devised by us, including a substring-based comparison that favors the length of the common substrings and a concept document-based comparison containing a wide range of features. Those features are represented as TF-IDF vectors and use a cosine similarity metric (see Figure 6).

第一層比對器比較概念特征(例如,标簽,注釋,注釋和執行個體)并使用各種方法,包括句法和詞彙比較算法以及Word Net等詞典的使用。其中一些方法是由其他人提出的(例如,編輯距離,Jaro-Winkler)和我們設計的一些方法,包括基于子串的比較,這有利于公共子串的長度和基于檔案的概念等方面進行廣泛特征上的比較。這些特征表示為TF-IDF向量并使用餘弦相似性度量(參見圖6)。
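The TF-IDF plus cosine idea can be sketched in a few lines of plain Python. This is only an illustration: the paper does not say which TF-IDF weighting it uses, so the smoothed IDF below is my assumption, and the "concept documents" are made-up strings standing in for a concept's label and comments.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Turn whitespace-tokenized documents into sparse TF-IDF dicts."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed IDF (assumed variant).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


concept_docs = [
    "heart muscular organ that pumps blood",
    "cardiac muscle organ pumps blood",
    "femur long bone of the leg",
]
vectors = tfidf_vectors(concept_docs)
print(cosine(vectors[0], vectors[1]))  # shares several terms
print(cosine(vectors[0], vectors[2]))  # shares no terms
```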

Second layer matchers use structural properties of the ontologies. Our own methods include the Descendant’s Similarity Inheritance (DSI) and the Sibling’s Similarity Contribution (SSC) matchers [3].

第二層比對器使用本體的結構屬性。我們自己的方法包括後代的相似性遺傳(DSI)和兄弟姐妹的相似性貢獻(SSC)比對[3]。

Finally, third layer matchers combine the results of two or more matchers so as to obtain a unique final matching in two steps. In the first step, a similarity matrix is built for each pair of concepts, using our Linear Weighted Combination (LWC) matcher, which processes the weighted average for the different similarity results (see Figure 7). Weights can be assigned manually or automatically, the latter assignment being determined using our evaluation methods. The second step uses that similarity matrix and takes into account a threshold value and the desired cardinality. When the cardinality is 1-1, we adopt the Shortest Augmenting Path algorithm [9] to find the optimal solution for this optimization problem (namely the assignment problem reduced to the maximum weight matching in a bipartite graph) in polynomial time.

最後,第三層比對器組合兩個或更多比對器的結果,以便在兩個步驟中獲得唯一的最終比對。在第一步中,使用我們的線性權重組合(LWC)比對器為每對概念建立相似性矩陣,該比對器處理不同相似性結果的權重平均值(參見圖7)。可以手動或自動配置設定權重,後者配置設定使用我們的評估方法确定。第二步使用該相似性矩陣并考慮門檻值和期望的基數。當基數為1-1時,我們采用最短增廣路徑算法[9],在多項式時間内找到該優化問題的最優解(即,将配置設定問題降級到二分圖中的最大權重比對)。
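The two steps can be sketched as follows. The weighted average follows the description above, but the second step is deliberately not the Shortest Augmenting Path algorithm: to keep the example short I use an exhaustive search over permutations, which finds the same optimal 1-1 assignment on tiny matrices but takes factorial time. Function names and the sample matrices are hypothetical.

```python
from itertools import permutations


def linear_weighted_combination(matrices, weights):
    """Step 1: cell-by-cell weighted average of several similarity matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
             for j in range(cols)]
            for i in range(rows)]


def best_one_to_one(matrix, threshold):
    """Step 2: 1-1 assignment with maximum total similarity, then threshold.

    Exhaustive search, only suitable for very small inputs.
    Assumes the matrix has no more rows than columns.
    """
    rows, cols = len(matrix), len(matrix[0])
    best = max(permutations(range(cols), rows),
               key=lambda p: sum(matrix[i][p[i]] for i in range(rows)))
    return [(i, j) for i, j in enumerate(best) if matrix[i][j] >= threshold]


label_sim = [[0.9, 0.8], [0.8, 0.1]]
struct_sim = [[0.7, 0.6], [0.6, 0.3]]
combined = linear_weighted_combination([label_sim, struct_sim], [0.5, 0.5])
alignment = best_one_to_one(combined, threshold=0.5)
print(combined)
print(alignment)
```

In this example a greedy choice would take the top-left cell first and be left with a weak second pair, while the optimal assignment pairs row 0 with column 1 and row 1 with column 0. That difference is the reason an optimal assignment algorithm is worth using for the 1-1 case.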

6. EVALUATION

6.評估

The design of optimal methods to find correct and complete mappings between real-world ontologies is a hard task for several reasons. First of all, an algorithm may be effective for a given scenario, but not for others. Even within the same scenario, the use of different parameters can change significantly the outcome. Moreover, in interviewing domain experts in the geospatial domain, we discovered that they do not trust automatic methods unless quality metrics are associated with the matching results. These observations have motivated a variety of evaluation techniques, that determine runtime and accuracy (precision, recall, and F-measure).

由于幾個原因,設計在現實世界本體之間找到正确和完整映射的最佳方法是一項艱巨的任務。首先,算法可能對給定場景有效,但對其他場景則無效。即使在相同的情況下,使用不同的參數也可以顯着改變結果。此外,在通路地理空間域中的域專家時,我們發現他們不信任自動方法,除非品質度量與比對結果相關聯。這些觀察結果激發了各種評估技術,這些技術決定了運作時間和準确性(精确度,召回率和F測量值)。

The most effective evaluation technique compares the mappings found by the system between the two ontologies with a reference matching or “gold standard,” which is a set of correct and complete mappings as built by domain experts. When a reference matching is available, the AgreementMaker can determine the quality of the found matching analytically or visually. A reference matching can also be used to tune algorithms by using a feedback mechanism provided by a succession of runs.

最有效的評估技術将系統在兩個本體之間發現的映射與參考比對或“黃金标準”進行比較,後者是由領域專家建構的一組正确和完整的映射。當參考比對可用時,AgreementMaker可以分析或直覺地确定找到的比對的品質。參考比對也可以用于通過使用由一系列運作提供的回報機制來調整算法。
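Comparing a found matching with a reference matching comes down to the standard precision, recall, and F-measure definitions. A minimal sketch, with a hypothetical function name and made-up mappings:

```python
def evaluate(found, reference):
    """Return (precision, recall, f_measure) of found mappings vs. a reference."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


found = {("heart", "Heart"), ("lung", "Lung"), ("liver", "Kidney")}
reference = {("heart", "Heart"), ("lung", "Lung"),
             ("kidney", "Kidney"), ("liver", "Liver")}
precision, recall, f_measure = evaluate(found, reference)
print(round(precision, 3), round(recall, 3), round(f_measure, 3))
```

Here two of the three found mappings are correct (precision 2/3) and two of the four reference mappings were found (recall 1/2), giving an F-measure of 4/7.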

When a gold standard is not available, “inherent” quality measures need to be considered. Quality measures can be defined at two levels as associated with the two main modules of a matcher (see Figure 2): similarity or selection level. We can consider local quality as associated with a correspondence at the similarity level (or mapping at the selection level) or global quality as associated with all the correspondences at the similarity level (or with all possible mappings at the selection level). We have incorporated in our system a global-selection quality measure proposed by others [8] and a local-similarity quality measure that we have devised. Experiments have shown that our quality measure is usually effective in defining weights for the LWC matcher.

如果沒有黃金标準,則需要考慮“固有的”品質措施。品質測量可以在兩個級别定義,與比對器的兩個主要子產品相關聯(參見圖2):相似性或選擇級别。我們可以将與相似性級别(或選擇級别的映射)的對應關聯的本地品質或與相似性級别(或選擇級别的所有可能映射)的所有對應關聯的全局品質相關聯【PS這什麼鬼!!!】。我們已經在我們的系統中納入了其他人提出的全球選擇品質測量[8]以及我們設計的局部相似性品質測量。實驗表明,我們的品質測量通常在定義LWC比對器的權重方面是有效的。

7. DEMONSTRATION

7.示範

Our demo focuses on the matching methods and evaluation strategies for determining the efficiency of ontology matching methods. Due to the tight integration of the evaluation strategies with the graphical user interface, a unique feature of our system, all the steps will be performed through the interface. Users will start by uploading their own ontologies, load our own, or download ontologies from the web, thus taking advantage of the several standard formats supported. Users can then explore the interface freely or follow a walk-through, consisting of browsing the ontologies, expanding and contracting nodes, and customizing the display. They have access to the information associated with each concept to be aligned, including descriptions, annotations, and (context) relations, and they can use them to visually detect mappings.

我們的示範側重于确定本體比對方法的效率的比對方法和評估政策。由于評估政策與圖形使用者界面(我們系統的獨特功能)的緊密內建,所有步驟都将通過界面執行。使用者将首先上傳他們自己的本體(加載我們提供的本體,或從網上下載下傳的本體)進而利用支援的幾種标準格式。然後,使用者可以自由地浏覽界面或按照演練進行浏覽,包括浏覽本體,擴充和收縮節點以及自定義顯示。他們可以通路與要對齊的每個概念相關的資訊,包括描述,注釋和(上下文)關系,他們可以使用它們來直覺地檢測映射。

After the Main Text

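The first draft was produced by running CAJViewer's text recognition directly, cleaning the output with Python, translating it directly with Google Docs, and then merging everything together. So it is probably not very reader-friendly; I will correct it bit by bit once I have finished reading the paper.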