
Can't wait for OpenAI's Q*? Huawei Noah's Ark's secret weapon for exploring LLM reasoning, MindStar, arrives first

Author: Heart of the Machine Pro

AIxiv is a column in which Heart of the Machine publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please feel free to submit or contact us. Submission mailbox: [email protected]; [email protected]

The authors of this paper are Kang Jikun, Li Xinze, Chen Xi, Amirreza Kazemi, and Chen Boxing from Huawei's Noah's Ark Lab in Montreal.

Artificial intelligence (AI) has made remarkable progress over the past decade, especially in natural language processing and computer vision. However, improving the cognitive and reasoning capabilities of AI remains a major challenge.

Recently, a paper titled "MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time" proposed MindStar [1], a tree-search-based method for improving reasoning ability at inference time. On the open-source models LLaMA-2-13B and Mistral-7B, it achieves math reasoning performance approaching that of the closed-source large models GPT-3.5 and Grok-1.


Paper title: MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time

Paper link: https://arxiv.org/abs/2405.16265v2

MindStar's performance on math problems:


Figure 1: Math accuracy of different large language models. With MindStar, LLaMA-2-13B matches the math performance of GPT-3.5 (4-shot) while using roughly 200 times fewer computational resources.

1. Introduction

With the rapid growth in model size, Transformer-based large language models (LLMs) have demonstrated impressive results in areas such as instruction following [1,2], coding assistance [3,4], and creative writing [5]. However, unlocking the ability of LLMs to solve complex reasoning tasks remains a challenge. Some recent studies [6,7] attempt to address this with supervised fine-tuning (SFT): by mixing new reasoning data samples into the original dataset, the LLM learns the underlying distribution of these samples and tries to imitate the learned logic on unseen reasoning tasks. Although this approach yields performance gains, it relies heavily on extensive training and additional data preparation [8,9].

The Llama-3 report [10] highlights an important observation: when faced with a challenging reasoning problem, the model sometimes generates the correct reasoning trajectory. This suggests that the model knows how to produce the right answer but struggles to select it. Based on this finding, we ask a simple question: can we enhance LLMs' reasoning by helping them choose the correct output? To explore this, we ran an experiment that uses different reward models to select among LLM outputs. The results show that step-level selection significantly outperforms traditional CoT methods.
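As a hedged illustration of the selection experiment just described (not the paper's exact setup), an output-level variant simply samples several complete answers and keeps the one a reward model ranks highest; MindStar instead applies this idea at every reasoning step. `generate_answer` and `score_answer` below are hypothetical callables standing in for an LLM sampling call and a reward model.

```python
# Hypothetical output-level selection sketch: sample k complete answers and keep
# the one the reward model scores highest. `generate_answer` and `score_answer`
# are placeholders, not functions from the paper.

def best_of_k(question, generate_answer, score_answer, k=32):
    candidates = [generate_answer(question) for _ in range(k)]
    scores = [score_answer(question, answer) for answer in candidates]
    return candidates[scores.index(max(scores))]
```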

2. The MindStar Method


Figure 2: Diagram of the MindStar algorithm architecture

We introduce a new inference-time search framework, MindStar (M*), which treats the reasoning task as a search problem and uses a process-supervised reward model (PRM) to navigate the reasoning tree efficiently and identify an approximately optimal path. Instantiating the search with beam search (BS) and Levin tree search (LevinTS) further improves search efficiency and guarantees that the best reasoning path is found within a bounded computational budget.
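As a rough, hypothetical illustration of the overall loop (not the authors' implementation), the sketch below treats reasoning as best-first search over partial paths: the most promising path is repeatedly popped, expanded into candidate next steps, and each candidate is scored by the PRM. The helpers `expand_steps` and `score_step` are placeholders, sketched in the subsections that follow.

```python
# Hypothetical best-first search over partial reasoning paths (a sketch, not the
# authors' code). Paths are ranked by cumulative PRM reward; the search stops
# when the best path ends with a final-answer step.

import heapq

def mindstar_search(question, expand_steps, score_step, max_iters=100):
    # Min-heap of (negative cumulative reward, path), so the best path pops first.
    frontier = [(0.0, [])]
    for _ in range(max_iters):
        if not frontier:
            break
        neg_reward, path = heapq.heappop(frontier)
        # Placeholder convention: a path is complete once it states the answer.
        if path and path[-1].startswith("The answer is"):
            return path
        for step in expand_steps(question, path):
            reward = score_step(question, path, step)
            heapq.heappush(frontier, (neg_reward - reward, path + [step]))
    return None  # no complete solution found within the iteration budget
```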

2.1 Process-Supervised Reward Model

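As described in Section 2.3, the PRM takes the current reasoning path together with a candidate step and returns a reward value indicating how promising that step is. Below is a minimal Python sketch of such a scoring interface, offered only as a hedged illustration: the checkpoint name and the binary classification head are assumptions, not the paper's actual reward model.

```python
# Hypothetical PRM scoring interface (a sketch, not the paper's model). The
# checkpoint name is a placeholder, and a binary classification head is assumed:
# the reward is the probability that the candidate step is correct.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

PRM_NAME = "my-org/math-prm"  # placeholder checkpoint name
prm_tokenizer = AutoTokenizer.from_pretrained(PRM_NAME)
prm_model = AutoModelForSequenceClassification.from_pretrained(PRM_NAME)
prm_model.eval()

def score_step(question: str, path: list[str], step: str) -> float:
    """Return the PRM reward for appending `step` to the current reasoning path."""
    context = question + "\n" + "\n".join(path)
    inputs = prm_tokenizer(context, step, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = prm_model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```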

2.2 Reasoning Path Expansion

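At each iteration, the current reasoning path is expanded by sampling several candidate next steps from the base LLM, conditioned on the question and the steps generated so far (Table 1 and Figure 3 use 16 candidates per step). The sketch below shows what such an expansion could look like with a Hugging Face causal LM; the model name, prompt format, and sampling parameters are illustrative assumptions, not the authors' setup.

```python
# Hypothetical reasoning-path expansion (a sketch, not the authors' code): sample
# several candidate next steps from the base LLM given the question and the path
# so far. Model name, prompt format, and sampling parameters are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_NAME = "meta-llama/Llama-2-13b-hf"  # illustrative base model
base_tokenizer = AutoTokenizer.from_pretrained(BASE_NAME)
base_model = AutoModelForCausalLM.from_pretrained(BASE_NAME, torch_dtype=torch.float16)

def expand_steps(question: str, path: list[str], n_candidates: int = 16) -> list[str]:
    """Sample candidate next steps; each sample is truncated to a single line (one step)."""
    prompt = "\n".join([question] + path) + "\n"
    inputs = base_tokenizer(prompt, return_tensors="pt")
    outputs = base_model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        max_new_tokens=128,
        num_return_sequences=n_candidates,
        pad_token_id=base_tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    steps = []
    for seq in outputs:
        text = base_tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
        steps.append(text.split("\n")[0].strip())  # keep only the first new line as one step
    return steps
```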

2.3 Reasoning Path Selection

After expanding the reasoning tree, we use a pre-trained process-supervised reward model (PRM) to evaluate each newly generated step. As mentioned above, the PRM takes a path and a step and returns the corresponding reward value. After evaluation, a tree-search algorithm is needed to select the next node to expand. Our framework does not depend on a specific search algorithm; in this work we instantiate two best-first search methods, beam search and Levin tree search.
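As a hedged illustration of the step-level beam search variant (BS@16 in Table 1), the sketch below keeps only the highest-scoring paths at each level according to the PRM reward; the helpers mirror the earlier sketches and the stopping convention is a placeholder, not the authors' code. Levin tree search differs in that it also weighs path length (cost) when ordering expansions, which is what underlies the bounded-computation guarantee mentioned above.

```python
# Hypothetical step-level beam search driven by PRM rewards (a sketch, not the
# authors' code). `expand_steps` and `score_step` mirror the earlier sketches.

def beam_search(question, expand_steps, score_step, beam_width=16, max_depth=10):
    beams = [([], 0.0)]  # list of (path, cumulative reward)
    for _ in range(max_depth):
        candidates = []
        for path, total in beams:
            # Completed paths are carried over unchanged (placeholder convention).
            if path and path[-1].startswith("The answer is"):
                candidates.append((path, total))
                continue
            for step in expand_steps(question, path):
                reward = score_step(question, path, step)
                candidates.append((path + [step], total + reward))
        # Keep only the beam_width highest-reward paths for the next level.
        beams = sorted(candidates, key=lambda pr: pr[1], reverse=True)[:beam_width]
        if beams and all(p and p[-1].startswith("The answer is") for p, _ in beams):
            break
    return beams[0][0] if beams else None
```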

3. Results and Discussion

Extensive evaluation on the GSM8K and MATH datasets shows that M* significantly improves the reasoning capabilities of open-source models such as LLaMA-2, reaching performance comparable to much larger closed-source models such as GPT-3.5 and Grok-1 while substantially reducing model size and computational cost. These findings highlight the potential of shifting computing resources from fine-tuning to inference-time search, opening a new avenue for future research on efficient reasoning-enhancement techniques.


Table 1 compares the various schemes on the GSM8K and MATH reasoning benchmarks. Each entry reports the percentage of problems solved. SC@32 denotes self-consistency over 32 candidate results, while n-shot denotes few-shot prompting results. CoT-SC@16 refers to self-consistency over 16 chain-of-thought (CoT) candidates. BS@16 denotes the beam search method with 16 candidates at each step level, and LevinTS@16 denotes the Levin tree search method with the same number of candidates. Notably, the most recent GPT-4 result on the MATH dataset is from GPT-4-turbo-0409, which we highlight as the best performance within the GPT-4 family.


Figure 3 shows how M* performance changes with the number of step-level candidates. We use Llama-2-13B as the base model and beam search (BS) as the search algorithm.


Figure 4: Scaling laws of the Llama-2 and Llama-3 model families on the MATH dataset. All results are taken from their original sources. The fitted curves are computed with SciPy using a logarithmic function.
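As a hedged illustration of the curve fitting mentioned in the caption, the sketch below fits a logarithmic function with SciPy; the model sizes and accuracies are placeholder values, not the figure's actual data points.

```python
# Hypothetical sketch of the logarithmic fit mentioned in the caption; the model
# sizes and MATH accuracies below are placeholder values, not the figure's data.

import numpy as np
from scipy.optimize import curve_fit

sizes = np.array([7.0, 13.0, 70.0])    # model size in billions of parameters (placeholder)
accuracy = np.array([5.0, 8.0, 14.0])  # MATH accuracy in percent (placeholder)

def log_curve(x, a, b):
    # accuracy modeled as a * ln(size) + b
    return a * np.log(x) + b

(a, b), _ = curve_fit(log_curve, sizes, accuracy)
print(f"fitted curve: accuracy = {a:.2f} * ln(size) + {b:.2f}")
```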


Table 2: Average number of tokens generated by different methods when answering questions

4. Conclusion

This article introduces MindStar (M*), a novel search-based inference-time framework for enhancing the reasoning capabilities of pre-trained large language models. By treating the reasoning task as a search problem and leveraging a process-supervised reward model, M* navigates the reasoning tree efficiently to identify approximately optimal paths. Combining beam search and Levin tree search further improves search efficiency and guarantees that the best reasoning path is found within a bounded computational budget. Extensive experimental results show that M* significantly improves the reasoning capabilities of open-source models, achieving performance comparable to much larger closed-source models while substantially reducing model size and computational cost.

These results suggest great potential in shifting computing resources from fine-tuning to inference-time search, opening a new direction for future research on efficient reasoning-enhancement techniques.

References:

[1] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.

[2] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.

[3] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.

[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[5] Carlos Gómez-Rodríguez and Paul Williams. A confederacy of models: A comprehensive evaluation of llms on creative writing. arXiv preprint arXiv:2310.08433, 2023.

[6] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.

[7] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[8] Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, and Jimmy Ba. Openwebmath: An open dataset of high-quality mathematical web text. arXiv preprint arXiv:2310.06786, 2023.

[9] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. CoRR, abs/2312.08935, 2023.

[10] Meta AI. Introducing meta llama 3: The most capable openly available llm to date, April 2024. URL https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-04-30.
