Overview

Opal is a one-stop machine learning platform developed by iQIYI's big data team. It aims to improve the efficiency of feature iteration and model training and to help businesses increase revenue. The platform covers the key stages of the machine learning life cycle: feature production, sample construction, model exploration, model training, and model deployment. Features are the cornerstone of model training, so one of the key questions for the platform is how to help business teams iterate on features faster and reach their goals. In Opal, feature production, storage, and access together form the core functions of the feature center, which is the focus of this article. For more information about Opal, see: Opal Machine Learning Platform: iQIYI's Digital Intelligence Integration Practice

What is the feature center

To put it simply, a feature center is a tool platform for producing, sharing, and managing the features of machine learning models. Algorithm engineers and data analysts can easily create and share features on the platform, and the platform helps solve the problems that arise while producing and using features, which improves feature iteration efficiency.

The feature center suits essentially any scenario that requires features, such as recommendation, advertising, and risk control. Once a feature table is registered in the feature center, the platform automatically builds the corresponding online and offline tables, which keeps online and offline data consistent. Because only one copy of a feature table exists and it can be shared by multiple users, resource costs are reduced. The feature center also saves time: operations that used to require complex SQL, such as exporting training tables and loading data into tables, can now be completed through simple drag-and-drop configuration in the feature center's Web UI.

Problems solved by feature centers

An algorithm model is essentially a mapping function: its input is a numeric vector, and its output is used to rank a candidate set against a given objective. At iQIYI, during offline training, algorithm engineers extract features from raw logs and build training samples from them. During online serving, the raw features are looked up by the user ID and video ID sent from the device, converted into model features according to the DSL configuration, and then passed to the prediction service to obtain the predicted value.
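The online serving path described above can be sketched roughly as follows. This is a minimal illustration, not Opal's implementation: the in-memory stores, the feature names, the transform table standing in for the DSL configuration, and the fixed-weight model are all hypothetical.

```python
import math
from typing import Callable, Dict, List

# Hypothetical in-memory stand-ins for the online feature stores.
USER_FEATURES = {"u1": {"watch_count_7d": 99.0}}
VIDEO_FEATURES = {"v1": {"duration_sec": 1200.0}}

# Hypothetical stand-in for a DSL configuration:
# model feature name -> function of the raw feature dict.
TRANSFORMS: Dict[str, Callable[[Dict[str, float]], float]] = {
    "duration_min": lambda raw: raw["duration_sec"] / 60.0,
    "log_watch_count_7d": lambda raw: math.log1p(raw["watch_count_7d"]),
}


def fetch_raw_features(user_id: str, video_id: str) -> Dict[str, float]:
    """Look up raw features by user ID and video ID."""
    raw: Dict[str, float] = {}
    raw.update(USER_FEATURES.get(user_id, {}))
    raw.update(VIDEO_FEATURES.get(video_id, {}))
    return raw


def to_model_features(raw: Dict[str, float]) -> List[float]:
    """Apply the configured transforms in a fixed (sorted-name) order."""
    return [TRANSFORMS[name](raw) for name in sorted(TRANSFORMS)]


def predict(vector: List[float]) -> float:
    """Placeholder model: a logistic function with made-up weights."""
    weights = [0.01, 0.5]
    z = sum(w * x for w, x in zip(weights, vector))
    return 1.0 / (1.0 + math.exp(-z))


def score(user_id: str, video_id: str) -> float:
    """Query raw features, transform them, then call the model."""
    return predict(to_model_features(fetch_raw_features(user_id, video_id)))
```

Calling `score("u1", "v1")` runs the three steps in order. In a real deployment the lookups would hit an online cache and the prediction would be a remote call.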

Generally speaking, there are two ways to improve model performance:

Model-side optimization: a tuning strategy that focuses on optimizing the model structure and tuning the model hyperparameters.
Data-side optimization: the counterpart of model-side optimization, a tuning strategy that improves model performance by improving the quality of the dataset. In practice, people tend to blame a poorly performing model on the model itself and overlook the large impact the dataset has on model performance.

There is a consensus in the industry that "data (features) determine the upper limit of the model, and the model structure and parameter tuning only approach this upper limit", so the importance of data-side optimization is clear. How, then, can engineers optimize efficiently on the data side? The answer is the feature center. The platform needs to solve the challenges encountered during data-side optimization:

How to handle a large volume of user requests? iQIYI has a large number of viewers, and feature access is a very high-frequency operation, so handling high-QPS requests is one of the challenges the feature center faces.
How to meet real-time requirements for features? In advertising, recommendation, and risk control scenarios, the real-time requirements for features keep rising in order to ensure the quality of the algorithm model's output.
How to improve the scalability and flexibility of features? Business scenarios are becoming more complex, and feature requirements change often. From basic features to statistical and sequence feature groups, and from simple statistics over offline features to window calculations and cross features over real-time features, the business side needs a feature middle platform that can support newly derived feature types and requirements.
How to meet the business need for rapid iteration? The service-oriented DSL provided by the feature middle platform needs to cover enough scenarios, and the feature production pipeline should let the business write as little code as possible. The underlying computing and storage engines should be completely transparent to the business, relieving it of computing, storage selection, and tuning work, so that real-time basic features can be produced at scale and feature productivity keeps improving.

Specifically, the functions of the feature center need to cover at least the following aspects:

Feature input: how to manage the data sources of each line of business, including text files, Parquet files, Hive tables, etc.;
Feature calculation: how to express the feature calculation logic and efficiently extract the required features from the raw logs;
Feature storage: what type of system stores the calculated features, which involves a trade-off between storage cost and access efficiency;
Feature conversion: how to convert raw features into model features, including parsing and applying the various DSLs.

Diagram of the overall architecture of the feature center

iQIYI Opal Machine Learning Platform: Feature Center Construction Practice

According to their roles in the overall pipeline, the functions of the feature center fall into two categories: feature production and feature use.

Feature production: addresses how to compute features from various big data sources and how to store the results. The platform provides a set of efficient feature calculation operators and a drag-and-drop web page to host and operate users' feature production tasks and help them manage their features efficiently.
Feature use: addresses how to access features online. The feature center introduces the concept of a feature view to support feature reuse and custom transformations, and provides the Opal Feat View SDK, which offers dynamic feature serialization, automatic awareness of multiple data centers, and nearest-site access. The SDK hides the underlying storage details from the business side so that it can focus on business logic.

Introduction to the feature center function

As described in the previous section, the feature center's functions divide into feature production and feature use. Feature production is further divided, by the production latency of the target features, into offline feature groups and real-time feature groups. Each is described below.

Offline feature groups

When training a machine learning model, we usually preprocess the data first, transforming raw data into feature vectors or feature sets that the model can use more effectively. For example, we might count a user's purchases over the past week and use the result as a feature. To improve processing speed and efficiency, such features are often computed in advance and stored for later use. We call such a set of pre-computed, stored features an offline feature group.
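As a small illustration of the weekly purchase example, the sketch below pre-computes a per-user purchase count over a seven-day window. The event layout `(user_id, event_type, timestamp)` and the function name are assumptions made for this example. In production this would typically be a batch job over a Hive or Iceberg table rather than a Python loop.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, datetime]  # (user_id, event_type, timestamp), hypothetical layout


def purchases_last_7_days(events: Iterable[Event], as_of: datetime) -> Dict[str, int]:
    """Count purchase events per user in the window [as_of - 7 days, as_of)."""
    start = as_of - timedelta(days=7)
    counts: Counter = Counter()
    for user_id, event_type, ts in events:
        if event_type == "purchase" and start <= ts < as_of:
            counts[user_id] += 1
    return dict(counts)


events = [
    ("u1", "purchase", datetime(2024, 1, 2, 10)),
    ("u1", "purchase", datetime(2024, 1, 6, 9)),
    ("u1", "click", datetime(2024, 1, 6, 9, 5)),      # not a purchase
    ("u2", "purchase", datetime(2023, 12, 20, 8)),    # outside the window
    ("u2", "purchase", datetime(2024, 1, 7, 23)),
]

# Pre-compute once, store, and reuse at training or serving time.
offline_features = purchases_last_7_days(events, as_of=datetime(2024, 1, 8))
print(offline_features)  # {'u1': 2, 'u2': 1}
```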

As shown in the figure below, an Opal offline feature group runs the DAG that the user builds by drag and drop. It reads data from the Source node, which can be a Hive table, an Iceberg table, or TFRecord or Parquet files, transforms the data through the intermediate computing nodes that Opal abstracts, and finally writes the resulting features to the storage system described by the Target node. As the figure shows, Opal currently supports a variety of common feature storage formats.
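The Source, compute, Target flow can be pictured with a toy DAG runner like the one below. It is only a sketch: the node names, the row layout, and the in-memory `store` standing in for a feature store are hypothetical, and Opal's actual operators and execution engine are not described in this article.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List


def run_dag(nodes: Dict[str, Callable[[List[Any]], Any]],
            upstream: Dict[str, List[str]]) -> Dict[str, Any]:
    """Run each node after its upstream nodes, passing their outputs in order."""
    outputs: Dict[str, Any] = {}
    for name in TopologicalSorter(upstream).static_order():
        inputs = [outputs[up] for up in upstream.get(name, [])]
        outputs[name] = nodes[name](inputs)
    return outputs


store: Dict[str, int] = {}  # hypothetical stand-in for the target storage system


def source(_inputs: List[Any]) -> List[Dict[str, str]]:
    # A real Source node would read a Hive/Iceberg table or Parquet files.
    return [
        {"user": "u1", "event": "play"},
        {"user": "u1", "event": "play"},
        {"user": "u2", "event": "pause"},
        {"user": "u2", "event": "play"},
    ]


def filter_plays(inputs: List[Any]) -> List[Dict[str, str]]:
    return [row for row in inputs[0] if row["event"] == "play"]


def count_by_user(inputs: List[Any]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for row in inputs[0]:
        counts[row["user"]] = counts.get(row["user"], 0) + 1
    return counts


def target(inputs: List[Any]) -> int:
    store.update(inputs[0])  # write the features to the "storage system"
    return len(inputs[0])


nodes = {"source": source, "filter_plays": filter_plays,
         "count_by_user": count_by_user, "target": target}
upstream = {"source": [], "filter_plays": ["source"],
            "count_by_user": ["filter_plays"], "target": ["count_by_user"]}

result = run_dag(nodes, upstream)
print(store)  # {'u1': 2, 'u2': 1}
```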

Feature metadata management, feature production DAG configuration

Users combine operators by dragging them from the left side of the platform to obtain the production task configuration for a DAG, and the platform supports managing feature metadata and feature schemas around that DAG.

SQL syntax parsing and schema inference

The platform provides syntax parsing and validation, so users can see the output fields of each operator in real time while configuring a task. Errors are caught before the task is submitted, which reduces debugging time.

Feature quality verification and early warning

The platform supports various checks on the produced features, such as zero-value rate, null-value rate, quantiles, and maximum and minimum values, and provides a visual page for users to view feature quality.
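The checks listed above could be computed along the following lines. This is a minimal sketch: the report layout, the alert thresholds, and the choice to express both rates as a share of all rows are assumptions for the example, not the platform's definitions.

```python
import statistics
from typing import Dict, List, Optional


def feature_quality_report(values: List[Optional[float]]) -> Dict[str, object]:
    """Compute null rate, zero rate, min, max, and quartiles for one feature column."""
    total = len(values)
    present = [v for v in values if v is not None]
    return {
        # Both rates are expressed as a share of all rows (an assumption).
        "null_rate": (total - len(present)) / total if total else 0.0,
        "zero_rate": sum(1 for v in present if v == 0) / total if total else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        # Quartile cut points need at least two non-null values.
        "quartiles": (statistics.quantiles(present, n=4, method="inclusive")
                      if len(present) >= 2 else None),
    }


def quality_alerts(report: Dict[str, object],
                   max_null_rate: float = 0.3,
                   max_zero_rate: float = 0.5) -> List[str]:
    """Return the names of the checks that exceed their (hypothetical) thresholds."""
    alerts = []
    if report["null_rate"] > max_null_rate:
        alerts.append("null_rate")
    if report["zero_rate"] > max_zero_rate:
        alerts.append("zero_rate")
    return alerts


values = [0.0, None, 1.0, 0.0, 3.0, None, 4.0, None]
report = feature_quality_report(values)
alerts = quality_alerts(report)
```

For this sample column, three of eight values are null, so the null-rate check exceeds its 0.3 threshold and would trigger a warning.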

Task reruns and abnormal alarms

At present, multiple teams, including the recommendation middle platform, advertising algorithm, and business risk control teams, use the offline feature production module, and more than 300 feature groups have been produced on the platform.

Real-time feature groups

Real-time feature groups are different from offline feature groups. Offline feature groups typically include features that can be pre-computed and stored, while real-time feature groups contain features that are generated in real time or near real-time. The acquisition and calculation of real-time features usually require a strong data infrastructure and real-time data processing capabilities.

Real-time feature groups are important in many real-time decision-making and forecasting systems, such as recommender systems, fraud detection, financial transactions, and more. Managing real-time feature groups and ensuring the accuracy and timely acquisition of real-time feature values plays an important role in the performance and practicability of these systems.

Similar to offline feature groups, Opal provides a DAG-based real-time feature processing flow that can ingest data from Kafka, Iceberg, or MySQL, compute features with the platform's built-in operators, and write the output to Kafka and other storage media.

Other functions available in the offline feature group module are also being added to the real-time module.

Real-time feature production DAG configuration

The platform provides feature metadata and schema management, and the configuration of a real-time feature can be completed through the combination of various operators.

An example of a sliding window operator

Window transformations can be set up through simple configuration, so users do not need to write complex SQL statements, which greatly reduces development cost for business teams.
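To make the semantics of a sliding window concrete, the sketch below counts events per window given a window size and a sliding step. It assumes integer timestamps and half-open windows `[end - size, end)` whose ends are multiples of the step. These conventions are choices made for the example, and a streaming engine would compute this incrementally rather than over a list.

```python
from collections import defaultdict
from typing import Dict, Iterable


def sliding_window_counts(timestamps: Iterable[int], size: int, slide: int) -> Dict[int, int]:
    """Map each window end to the number of events in [end - size, end)."""
    counts: Dict[int, int] = defaultdict(int)
    for t in timestamps:
        end = (t // slide + 1) * slide  # first window end strictly after t
        while end - size <= t:          # window [end - size, end) still contains t
            counts[end] += 1
            end += slide
    return dict(sorted(counts.items()))


# 30-second windows that slide every 10 seconds.
window_counts = sliding_window_counts([5, 12, 31], size=30, slide=10)
print(window_counts)  # {10: 1, 20: 2, 30: 2, 40: 2, 50: 1, 60: 1}
```

Each event lands in `size / slide` overlapping windows, which is why windowed state grows quickly when the step is small relative to the window.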

Window feature merging and state reuse

Opal supports merging feature calculation tasks that have different window lengths but the same sliding step, so that their calculations share the same windows. This greatly reduces state size and task resource usage.
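One common way to share state across windows with the same sliding step is to keep one partial aggregate per step-sized "pane" and assemble every window length from those panes. The sketch below illustrates that idea for counts. Whether Opal uses this exact technique is not stated in the article, so treat the pane approach and all names here as assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def merged_window_counts(timestamps: Iterable[int],
                         sizes: List[int],
                         slide: int) -> Dict[int, Dict[int, int]]:
    """Compute counts for several window sizes from one shared set of panes.

    Returns {window_size: {window_end: count}} for windows [end - size, end).
    """
    # Shared state: one count per pane [p * slide, (p + 1) * slide).
    panes = Counter(t // slide for t in timestamps)
    if not panes:
        return {size: {} for size in sizes}

    first_pane, last_pane = min(panes), max(panes)
    results: Dict[int, Dict[int, int]] = {}
    for size in sizes:
        if size % slide != 0:
            raise ValueError("window size must be a multiple of the sliding step")
        k = size // slide  # number of panes per window
        results[size] = {
            (p + 1) * slide: sum(panes.get(q, 0) for q in range(p - k + 1, p + 1))
            for p in range(first_pane, last_pane + 1)
        }
    return results


# Two window lengths (20 s and 30 s) that share a 10 s sliding step.
merged = merged_window_counts([5, 12, 31], sizes=[20, 30], slide=10)
print(merged[20])  # {10: 1, 20: 2, 30: 1, 40: 1}
print(merged[30])  # {10: 1, 20: 2, 30: 2, 40: 2}
```

Only the pane counts are kept as state, so adding another window length with the same step costs extra reads but no extra stored state.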

Feature view

Unified end-to-end access to online feature groups

Offline and real-time feature groups solve the problems of feature configuration, processing, and storage, while the feature view solves the problem of reading features. As shown below, the features generated by offline and real-time feature groups are stored in a variety of offline storage media, which the business side's online engine services cannot use directly. A loading service is required to import the feature groups from offline storage into an online cache (Couchbase, Redis, HBase, etc.). Opal implements this loading through the feature view and provides a unified client for reading features, so users do not need to deal with the loading details.
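The division of labor described here, loading from offline storage into an online cache and reading through one client, might look like the sketch below. The class, its methods, and the dictionary standing in for Couchbase, Redis, or HBase are hypothetical and do not reflect the Opal Feat View SDK's API.

```python
from typing import Any, Dict, Iterable, Optional


class FeatureView:
    """Hypothetical feature view: selects a subset of features and serves them online."""

    def __init__(self, name: str, feature_names: Iterable[str],
                 online_cache: Dict[str, Dict[str, Any]]):
        self.name = name
        self.feature_names = list(feature_names)
        self.online_cache = online_cache  # stand-in for Couchbase / Redis / HBase

    def _cache_key(self, entity_id: str) -> str:
        return f"{self.name}:{entity_id}"

    def load(self, offline_rows: Dict[str, Dict[str, Any]]) -> int:
        """Copy only the view's features from offline rows into the online cache."""
        for entity_id, row in offline_rows.items():
            self.online_cache[self._cache_key(entity_id)] = {
                f: row[f] for f in self.feature_names if f in row
            }
        return len(offline_rows)

    def get(self, entity_id: str) -> Optional[Dict[str, Any]]:
        """Unified read path: callers never touch the cache directly."""
        return self.online_cache.get(self._cache_key(entity_id))


offline_table = {
    "u1": {"city": "Beijing", "watch_count_7d": 12, "debug_col": "x"},
    "u2": {"city": "Shanghai", "watch_count_7d": 3, "debug_col": "y"},
}
cache: Dict[str, Dict[str, Any]] = {}
view = FeatureView("user_profile_v1", ["city", "watch_count_7d"], cache)
loaded = view.load(offline_table)
print(view.get("u1"))  # {'city': 'Beijing', 'watch_count_7d': 12}
```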

Feature transformation and feature derivation

In some cases, users need to transform an existing feature before returning it downstream, for example by taking the logarithm of the original feature or applying arithmetic operations to it. To support such needs, Opal provides a flexible and efficient DSL of feature transformation expressions. Its basic format is as follows:

Syntax notes:

The function name is a keyword predefined by the platform
Parameters can be feature variables or various types of literal constants

Feature variables: any valid feature name in the view, delimited by backticks, e.g. `city`
Numeric constants: any valid number, e.g. 123
String constants: delimited by single quotes, e.g. 'hello world'
Set constants: delimited by square brackets
Numeric set constants, e.g. [1, 2, 4]
String set constants, e.g. ['aaa', 'bbb']
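The parameter kinds listed above can be told apart by their delimiters. The sketch below classifies a single parameter token that way. It is an illustrative helper, not Opal's parser: the function name and the returned `(kind, value)` tuples are assumptions, and it does not cover the DSL's function-call syntax.

```python
import ast
from typing import Any, Tuple


def parse_parameter(token: str) -> Tuple[str, Any]:
    """Classify one DSL parameter token as a feature, set, string, or number."""
    token = token.strip()
    if len(token) >= 2 and token[0] == "`" and token[-1] == "`":
        return ("feature", token[1:-1])          # feature variable in backticks
    if token.startswith("[") and token.endswith("]"):
        values = ast.literal_eval(token)         # safe literal parsing only
        if not isinstance(values, list):
            raise ValueError(f"not a set constant: {token}")
        return ("set", values)
    if len(token) >= 2 and token[0] == "'" and token[-1] == "'":
        return ("string", token[1:-1])           # string constant in single quotes
    try:
        return ("number", int(token))
    except ValueError:
        return ("number", float(token))          # raises ValueError if not numeric


print(parse_parameter("`city`"))          # ('feature', 'city')
print(parse_parameter("123"))             # ('number', 123)
print(parse_parameter("'hello world'"))   # ('string', 'hello world')
print(parse_parameter("[1, 2, 4]"))       # ('set', [1, 2, 4])
```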

Conversion examples:

Java client access simplifies the feature acquisition process

As shown in the figure, the SDK hides the underlying storage resources, so users do not need to integrate with various complex caches and can access features simply by importing the SDK.

O&M monitoring dashboard

After the client is connected, the SDK automatically reports metrics to the metrics service, and you can check whether the service is running properly on the Grafana monitoring dashboard.

Service access

At present, advertising, recommendation, and risk control services have been connected to the feature center of the Opal platform to varying degrees and have carried out corresponding upgrades. Feature iteration efficiency for each business has increased by 0.4 to 3 times, the latency of fetching features on the service engine side has dropped by about 50%, and the backlog of feature requirements has been cleared. For details of the architecture changes before and after adoption and the benefits to the business, see the article Opal Machine Learning Platform: Summary of Business Practices in iQIYI's Digital Intelligence Integration Practice. Later articles in this series will also share experience, written by colleagues from the business teams, of Opal-based feature assessment architecture transformation.

Planning for the future

In the future, the feature center of the Opal platform will be enhanced in the following ways to help businesses achieve better results:

Feature sharing: as more and more features are managed on the platform, duplicate feature computation becomes inevitable, so the platform needs to support feature sharing to avoid repeated production by users;
Quality verification of real-time features: the platform already has a relatively complete verification module that ensures the quality of offline features, and it needs a corresponding quality monitoring service for real-time features;
Feature heat calculation: the heat of each feature is calculated from how often it is accessed online, and can later help the business side evaluate the importance of the feature.

Author: Big Data Team

Source-WeChat public account: iQiyi technical product team

Source: https://mp.weixin.qq.com/s/68x3hr1WlnziVIE93Sia4g
