Lexicography innovation in the context of technological innovation

Swan Tapu

Lexicography is experiencing an "identity crisis." The challenge is threefold. First, the emergence and application of new technologies are affecting almost every aspect of lexicography. Second, the business model of dictionary publishing is gradually becoming obsolete: the traditional dictionary publishing model, which grew out of printing technology, cannot adapt to the developments of the digital age, and lexicography as a cultural practice faces severe challenges. Third, competition from other information sources, such as search engines, machine translation programs, and other applications, is increasing. This competition can be read as a silent criticism of the quality, relevance, and accessibility of paper dictionaries in the digital age. Lexicography urgently needs to be "reinvented" in order to avoid a deeper crisis. Here "innovation" refers not only to improved methods and technologies but also to a transformation of the entire field. From the perspective of technology-enabled lexicography, "technological innovation" means three things: adopting new methods and technologies to improve the compilation process, presenting dictionaries to target users in newer and friendlier ways, and developing new dictionary products. Drawing on a digital dictionary project I took part in, I would like to share my understanding with my Chinese counterparts under three headings: "concept", "practice", and "reflection".

First, in terms of concept, what is "lexicography modernization"? In my view, it means that editors should shift their focus from compiling dictionaries to building and using dictionary databases. In other words, the primary product of a lexicographer should no longer be the dictionary text but the dictionary data stored in the database. This data can be presented to users as a paper dictionary or embedded in digital learning tools such as e-readers, writing assistants, and translation software. This requires that the storage, organization, and presentation of the language data used in lexicography break free of the traditional framework and that application programming interfaces (APIs) be used to realize the goal of "one database, multiple digital tools". Take writing (both native-language and second-language writing) as an example. On the one hand, more and more people write on electronic devices; on the other hand, the misuse and contamination of language are everywhere. Paper dictionaries are no longer favored, and writing tools that only support passive lookup are not much help. In the face of this reality, how can lexicography help? In my opinion, the development of an embeddable "writing assistant" deserves attention: it can turn writing tools into products that offer proactive language services and give real-time language guidance as they interact with users.
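
The idea of "one database, multiple digital tools" can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Entry` fields, the in-memory `DictionaryDB` class, and the two rendering functions are hypothetical and do not describe any actual dictionary database or API. The point is only that the same stored record can feed a printed entry and an embedded writing assistant.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One structured record in a hypothetical dictionary database."""
    headword: str
    pos: str
    definition: str
    examples: list[str] = field(default_factory=list)


class DictionaryDB:
    """In-memory stand-in for the database; a real one would sit behind an API."""

    def __init__(self, entries):
        self._by_headword = {e.headword: e for e in entries}

    def lookup(self, word):
        # Case-insensitive lookup; returns None when the word is unknown.
        return self._by_headword.get(word.lower())


def render_print_entry(entry):
    """Presentation 1: a line of text for a paper dictionary."""
    return f"{entry.headword} ({entry.pos}) {entry.definition}"


def render_assistant_hint(entry):
    """Presentation 2: a short payload for an embedded writing assistant."""
    return {"word": entry.headword, "hint": entry.definition, "example": entry.examples[:1]}


db = DictionaryDB([Entry("casa", "noun", "building where people live", ["Vivo en una casa."])])
entry = db.lookup("Casa")
print(render_print_entry(entry))
print(render_assistant_hint(entry))
```

Keeping the data separate from its presentation is what lets a new tool be added later without recompiling the dictionary itself.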

Next, I would like to share my practical experience, taking as an example one of the digital dictionary projects I took part in (the Spanish version of the "Writing Assistant"). Last year, I had the pleasure of working for two months at Ordbogen, a well-known Danish digital dictionary company that focuses on the research and development of language services, digital teaching materials, online dictionaries, and writing assistants. I was invited to join the "Writing Assistant" R&D team, which brings together experts from disciplines such as information science and lexicography to explore how the AI-powered language model GECToR (Grammatical Error Correction: Tag, Not Rewrite) can be used to develop new digital products that serve the needs of writing. This work is still ongoing. The model, which is based on a neural network, was trained on an English corpus and is freely available on the web. As a lexicographer, I am mainly involved in three areas: corpus training, functional design, and user interaction.
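
The "Tag, Not Rewrite" part of the model's name refers to predicting an edit tag for each token instead of regenerating the whole sentence. The sketch below is a simplified illustration of that idea only: it does not call the GECToR library, the tags are written by hand rather than predicted by a model, and it covers just the basic `$KEEP`, `$DELETE`, `$REPLACE_x`, and `$APPEND_x` tags, leaving out the grammatical transformation tags and the repeated correction passes. The function name `apply_edit_tags` and the Spanish sentence are assumptions made for the example.

```python
def apply_edit_tags(tokens, tags):
    """Apply one token-level edit tag per source token and return the corrected tokens."""
    if len(tokens) != len(tags):
        raise ValueError("expected exactly one tag per token")
    out = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)
        elif tag == "$DELETE":
            continue  # drop the token
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])  # substitute the token
        elif tag.startswith("$APPEND_"):
            out.extend([token, tag[len("$APPEND_"):]])  # keep the token, add one after it
        else:
            raise ValueError(f"unknown tag: {tag}")
    return out


tokens = ["Ella", "tiene", "muchos", "amigas", "en", "en", "Madrid"]
tags = ["$KEEP", "$KEEP", "$REPLACE_muchas", "$KEEP", "$KEEP", "$DELETE", "$KEEP"]
print(" ".join(apply_edit_tags(tokens, tags)))  # Ella tiene muchas amigas en Madrid
```

Because each tag is tied to a specific token, a writing assistant can point at the exact word it wants to change, which is convenient for attaching explanations.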

First, the corpus training of GECToR is divided into four stages: (1) training on a Spanish corpus (completed); (2) adding synthetic data from the dictionary database (completed); (3) adding semi-synthetic data (in progress); (4) adding natural language data (not yet started). My main task is to help make the product more user-friendly, which includes: (1) writing "text snippets" in Spanish that explain problems and give suggestions; (2) writing "additional text" that offers knowledge about vocabulary, grammar, style, and so on; (3) translating the Spanish texts into English, Danish, Italian, and Chinese.
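
How synthetic data is derived from the dictionary database is not described above, so the following is only one plausible sketch, not the project's method. It assumes that correct example sentences are available and that errors are injected by swapping a word for a confusable form, producing (erroneous, correct) training pairs. The `CONFUSIONS` table, the function names, and the sentences are all hypothetical.

```python
import random

# Hypothetical confusion sets; a real project would derive these from the dictionary database.
CONFUSIONS = {
    "muchas": ["muchos"],
    "la": ["el"],
    "está": ["es"],
}


def corrupt(sentence, rng):
    """Return (erroneous, correct) by swapping one confusable word, or None if none applies."""
    tokens = sentence.split()
    candidates = [i for i, t in enumerate(tokens) if t in CONFUSIONS]
    if not candidates:
        return None
    i = rng.choice(candidates)
    noisy = tokens.copy()
    noisy[i] = rng.choice(CONFUSIONS[tokens[i]])
    return " ".join(noisy), sentence


def build_pairs(sentences, seed=0):
    """Build training pairs from correct sentences, skipping those with nothing to corrupt."""
    rng = random.Random(seed)
    pairs = [corrupt(s, rng) for s in sentences]
    return [p for p in pairs if p is not None]


examples = ["Ella tiene muchas amigas", "Vivo en Madrid"]
for wrong, right in build_pairs(examples):
    print(wrong, "->", right)
```

A fixed seed keeps the generated pairs reproducible, which makes it easier to compare training runs.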

Second, based on my study of the "writing assistants" already on the market in Europe and the United States (such as Grammarly, LanguageTool, and ProWritingAid), I summarize the functional design of a writing assistant under six headings: (1) an identification function, which finds problems that may arise in writing; (2) an error correction function, which offers users alternatives; (3) a prediction function, which completes the spelling of words and predicts the words likely to come next; (4) a transformation function, which optimizes syntax, adjusts style, and so on; (5) a translation function, which provides equivalent words in the target language; (6) a check function, which provides a retrieval interface to the dictionary database. The "Writing Assistant" (Spanish version) that we are currently working on has already implemented three of these functions ("prediction", "translation", and "check"), and the other three are under development. Lexicography is essential to achieving these functions: on the one hand, the structure and form of the metadata in the dictionary database must meet the writing assistant's needs for data extraction and fusion; on the other hand, dictionary definitions need to be fully structured.
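
As a minimal sketch of the word-completion half of the prediction function, the example below looks up a typed prefix in a sorted list of headwords using binary search. It is an assumption made for illustration: the text above does not say how the "Writing Assistant" implements prediction, and the `complete` function and the headword list are hypothetical. Predicting the next word would need a language model and is not covered here.

```python
from bisect import bisect_left


def complete(prefix, sorted_headwords, limit=3):
    """Return up to `limit` headwords that start with `prefix`."""
    # Binary search finds the first headword that is not smaller than the prefix.
    start = bisect_left(sorted_headwords, prefix)
    matches = []
    for word in sorted_headwords[start:]:
        if not word.startswith(prefix):
            break  # sorted order means no later word can match
        matches.append(word)
        if len(matches) == limit:
            break
    return matches


headwords = sorted(["casa", "casar", "casero", "caso", "castillo", "perro"])
print(complete("cas", headwords))  # ['casa', 'casar', 'casero']
```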

Third, after trying out and comparing the existing "writing assistants", I grouped the ways they communicate with users into five levels: (1) automatic correction, applied whether or not the user accepts it; (2) suggestions without explanation; (3) suggestions with a brief explanation; (4) suggestions with additional explanation; (5) expanded explanation. I call the first two levels "non-friendly communication" and the middle two "friendly communication"; the last level supports in-depth, user-oriented learning. In addition, for the "Writing Assistant" (Spanish version) currently under development, we plan to run more tests to validate and improve the user-friendliness of the way text data is presented.
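
The five levels and their grouping can be written down as a small data type, for instance to label the behaviour of different tools during a comparison. This is only a sketch of the classification described above; the names `Level` and `friendliness` and the label strings are assumptions made for the example.

```python
from enum import IntEnum


class Level(IntEnum):
    """The five communication levels, ordered from least to most explanatory."""
    AUTO_CORRECTION = 1         # applied whether or not the user accepts it
    BARE_SUGGESTION = 2         # suggestion without explanation
    BRIEF_EXPLANATION = 3       # suggestion with a brief explanation
    ADDITIONAL_EXPLANATION = 4  # suggestion with additional explanation
    EXPANDED_EXPLANATION = 5    # expanded explanation


def friendliness(level):
    """Map a level to the three groups: levels 1-2, levels 3-4, and level 5."""
    if level <= Level.BARE_SUGGESTION:
        return "non-friendly communication"
    if level <= Level.ADDITIONAL_EXPLANATION:
        return "friendly communication"
    return "in-depth learning"


for level in Level:
    print(level.value, level.name, "->", friendliness(level))
```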

Finally, I would like to share my thoughts on the future of digital dictionaries. On the one hand, we need to rethink the role of "lexicography": (1) providing synthetic dictionary data; (2) training language models; (3) supporting lookup of background dictionary data; (4) interacting with users. On the other hand, we also need to rethink the role of "dictionary editors": (1) taking part in language model training; (2) building the "dictionary database"; (3) developing a "communicative database" that supplies short texts and gives users more suggestions on language use. Here I would like to highlight the difference between the two: a "dictionary database" is organized around dictionary entries and provides dictionary metadata resources, whereas a "communicative database" is organized around problems and provides scenario-based language services. In addition, the impact of AI technology on the concepts and technical innovation of lexicography is already being felt, but the current use of AI to develop "dictionary-like" language tools has its limits. For example, it needs unambiguous data to make language services more efficient. Lexicographers need to keep pace with the times and take practical steps toward interdisciplinary cooperation, which is the inevitable path for the innovation and development of lexicography.
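
The contrast between the two databases can be made concrete with a small sketch. The record layouts, the problem key `gender_agreement`, and the texts are hypothetical and chosen only for illustration: the dictionary database is keyed by headword and holds metadata, while the communicative database is keyed by the kind of problem and holds short texts for the user.

```python
# Hypothetical records; field names and texts are illustrative only.
DICTIONARY_DB = {
    "amiga": {"pos": "noun", "gender": "feminine", "definition": "female friend"},
}

COMMUNICATIVE_DB = {
    "gender_agreement": {
        "brief": "The word modifying the noun must agree with it in gender.",
        "additional": "Feminine nouns such as 'amigas' take 'muchas', not 'muchos'.",
    },
}


def explain(problem, detail="brief"):
    """Fetch the short text for a detected problem, or None if it is not covered."""
    record = COMMUNICATIVE_DB.get(problem)
    return record.get(detail) if record else None


print(DICTIONARY_DB["amiga"]["gender"])
print(explain("gender_agreement"))
print(explain("gender_agreement", detail="additional"))
```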

(This article was first published in Language Strategy Research, Issue 3, 2024)