
IBM has launched an innovative framework to evaluate the output of large models in a "black box" manner

Author: Not Bald Programmer

Compared with benchmark performance and leaderboard rankings, the accuracy, safety, and interpretability of a large model's output matter more; without them, commercialization is off the table.

IBM researchers have developed a black-box framework that evaluates a large model's outputs, confidence, and more without access to its internal structure, parameters, or training data.

Paper: https://arxiv.org/abs/2406.04370


To elicit variability in the model's outputs, the researchers propose six prompt perturbation strategies (a code sketch follows the list): 1) Stochastic decoding: generate multiple outputs using different decoding techniques, such as greedy search, beam search, and nucleus sampling, so that the spread of responses reflects the model's uncertainty about the prompt.

2) Paraphrasing: reword the prompt's context, for example via back-translation (translating the text into another language and back again), and observe how the output changes. If the output for the paraphrased prompt is semantically consistent with the original output, the model is fairly confident in its answer.

3) Sentence permutation: test the consistency of the model's output by changing the order of named entities in the input. If the model is confident in its output, the output should remain the same even when the entity order changes.


4) Entity frequency amplification: repeat sentences containing named entities to test whether the model changes its output when information is duplicated.

5) Stop-word removal: delete common stop words to see whether these words, usually considered low in information, affect the model's response.

6) Split-response consistency: randomly split the model's output into two parts, and use an NLI (natural language inference) model to measure the semantic consistency between them.
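The Python sketch below illustrates how several of these perturbations could be implemented. It is a minimal sketch, not the paper's code: `generate` is a hypothetical stand-in for any black-box LLM call, `nli_score` for any NLI model, and the decoding parameters and stop-word list are illustrative.

```python
import random

# Illustrative stop-word list for strategy 5
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "are"}

def stochastic_decoding(generate, prompt, n=5):
    """Strategy 1: collect several outputs under varied decoding settings."""
    outputs = [generate(prompt, do_sample=False)]              # greedy search
    for _ in range(n - 1):
        outputs.append(generate(prompt, do_sample=True,
                                top_p=0.9, temperature=0.7))   # nucleus sampling
    return outputs

def permute_entities(prompt, entities):
    """Strategy 3: shuffle the order of named entities in the prompt.
    `entities` would normally come from an NER tagger; here it is given."""
    shuffled = entities[:]
    random.shuffle(shuffled)
    # two-pass replacement so swapped names do not overwrite each other
    for i, old in enumerate(entities):
        prompt = prompt.replace(old, f"\x00{i}\x00")
    for i, new in enumerate(shuffled):
        prompt = prompt.replace(f"\x00{i}\x00", new)
    return prompt

def amplify_entity(prompt, entity_sentence, times=2):
    """Strategy 4: repeat a sentence containing a named entity."""
    return prompt + (" " + entity_sentence) * times

def remove_stop_words(prompt):
    """Strategy 5: drop common stop words from the prompt."""
    return " ".join(w for w in prompt.split() if w.lower() not in STOP_WORDS)

def split_response_consistency(response, nli_score):
    """Strategy 6: cut the output in two and score semantic agreement.
    `nli_score` is a hypothetical callable wrapping any NLI model."""
    words = response.split()
    cut = random.randint(1, max(1, len(words) - 1))
    return nli_score(" ".join(words[:cut]), " ".join(words[cut:]))
```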

Building on these strategies, the researchers construct two families of features, semantic and syntactic, to train the confidence model. Semantic features focus on the number of semantic-equivalence sets among the outputs: if a model's outputs split into many distinct equivalence sets, the model is not confident in its answer.

Syntactic features are computed from the syntactic similarity between outputs: the higher the similarity, the more confident the model is in its output.
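A rough sketch of how these two feature families could be computed, assuming the perturbed outputs are already collected. `is_equivalent` is a hypothetical callable (for example, bidirectional NLI entailment) that decides whether two outputs mean the same thing, and `SequenceMatcher` stands in for whatever syntactic-similarity measure the paper uses.

```python
from difflib import SequenceMatcher
from itertools import combinations

def num_semantic_equivalence_sets(outputs, is_equivalent):
    """Greedily cluster outputs into equivalence sets.
    More clusters -> the model is less confident."""
    clusters = []
    for out in outputs:
        for cluster in clusters:
            if is_equivalent(cluster[0], out):
                cluster.append(out)
                break
        else:
            clusters.append([out])
    return len(clusters)

def mean_syntactic_similarity(outputs):
    """Average pairwise string similarity.
    Higher similarity -> the model is more confident."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```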


For training, the researchers use a standard supervised learning procedure, pairing features with labels (generated according to how well the output matches the reference answers) to fit the confidence model's parameters.

Label creation follows a concise rule: if the ROUGE-L score of the model's output against the reference answer exceeds a threshold (e.g., 0.3), the answer is considered correct (label 1); otherwise, it is considered wrong (label 0). This rule is simple and efficient, and effectively separates the model's performance across questions.
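In code, the labelling rule and the supervised step could look like the sketch below. The 0.3 threshold follows the rule above; the `rouge-score` and `scikit-learn` packages and the logistic-regression classifier are assumptions, standing in for whatever scorer and classifier the paper actually uses.

```python
from rouge_score import rouge_scorer
from sklearn.linear_model import LogisticRegression

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def make_label(model_output, reference, threshold=0.3):
    """Label 1 if the output matches the reference closely enough."""
    f1 = scorer.score(reference, model_output)["rougeL"].fmeasure
    return int(f1 > threshold)

def train_confidence_model(features, outputs, references):
    """`features` is one row per question, e.g.
    [n_semantic_equivalence_sets, mean_syntactic_similarity, ...]."""
    labels = [make_label(o, r) for o, r in zip(outputs, references)]
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf
```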

To evaluate the framework's performance, the researchers ran experiments on the TriviaQA, SQuAD, CoQA, and Natural Questions datasets, using three well-known open-source large models: Flan-ul2, Llama-13b, and Mistral-7b.

The results show that the framework not only significantly outperforms existing black-box confidence-estimation methods on multiple datasets, but also improves AUROC by more than 10%.
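As a hedged illustration, AUROC for such a confidence model can be computed with scikit-learn as follows; `clf` is the classifier from the previous sketch, and the held-out features and labels are assumed to come from the same labelling rule.

```python
from sklearn.metrics import roc_auc_score

def evaluate_auroc(clf, test_features, test_labels):
    """Confidence = predicted probability that the answer is correct."""
    confidence = clf.predict_proba(test_features)[:, 1]
    return roc_auc_score(test_labels, confidence)
```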


According to the researchers, the framework is highly extensible and broadly applicable: new perturbation strategies can be added at any time to probe and adapt to different types of large models. Moreover, the confidence model only needs to be trained on one large model and can, in most cases, be applied to similar models.
