
ICML 2024 | How Do Transformers Reason: Case-Based or Rule-Based?

Author: Institute of Computer Vision

  • Address: https://arxiv.org/abs/2402.17709
  • Project homepage: https://github.com/GraphPKU/Case_or_Rule
  • Paper title: Case-Based or Rule-Based: How Do Transformers Do the Math?


While large language models (LLMs) such as ChatGPT have demonstrated impressive performance on a variety of complex tasks, they still struggle with some mathematical reasoning problems that are simple for humans, such as adding long integers. Humans can easily learn the basic rules of addition, such as column addition, and apply them to new addition problems of arbitrary length, but LLMs have difficulty doing so. Instead, they may rely on similar examples seen in the training corpus. An ICML 2024 paper from Muhan Zhang's team at Peking University examines this phenomenon in depth. The researchers define these two reasoning mechanisms as "rule-based reasoning" and "case-based reasoning." Figure 1 illustrates the different patterns the two mechanisms exhibit when confronted with the same addition problem.


Figure 1: Illustration of case-based vs. rule-based reasoning.

Since rule-based reasoning is essential for systematic generalization, the authors explore which reasoning mechanism transformers actually use on mathematical problems such as addition. To test whether the model relies on specific examples to solve a problem, the authors use the Leave-Square-Out method. The main idea is to first locate the training samples the model might depend on, then remove them from the training set and check whether test performance is affected. For mathematical reasoning, the authors hypothesize that when solving a test sample, transformers rely on training samples that are "close" to it. They therefore carve a square out of the two-dimensional operand space of the samples and use it as the test set. Under this hypothesis, if the model performs case-based reasoning, relying on training samples close to the test sample, it will fail on test samples near the center of the square, because it has seen no close example in the training set.
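As a concrete illustration of the split, here is a minimal sketch in Python. The operand range and square geometry are illustrative choices, not the paper's exact settings:

```python
import random

def leave_square_out_split(max_val=200, square_corner=(100, 100), side=40):
    """Split all (a, b) addition pairs into train/test sets by holding
    out a square region of the 2D operand space as the test set.

    If the model reasons by retrieving nearby training cases, accuracy
    should collapse near the square's center, where no close neighbors
    were seen during training.
    """
    cx, cy = square_corner
    train, test = [], []
    for a in range(max_val):
        for b in range(max_val):
            sample = (f"{a}+{b}=", str(a + b))
            in_square = cx <= a < cx + side and cy <= b < cy + side
            (test if in_square else train).append(sample)
    random.shuffle(train)
    return train, test

train_set, test_set = leave_square_out_split()
print(len(train_set), len(test_set))  # 38400 1600
```

Holding out a contiguous square, rather than random points, is what makes the test diagnostic: a rule-based learner is unaffected by where the held-out region sits, while a case-based learner fails wherever its nearby neighbors are missing.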


Figure 2: Accuracy of GPT-2 over the whole dataset after fine-tuning with the Leave-Square-Out method on addition, modular addition, base-9 addition, and linear regression. The square region in the red box is the test set; the rest is the training set.

Across experiments on five mathematical tasks (addition, modular addition, base-9 addition, linear regression, and the classic chicken-and-rabbit problem), transformers consistently exhibited case-based reasoning. The authors fine-tuned GPT-2 with the Leave-Square-Out method; the resulting performance is shown in Figure 2. In the test square, accuracy drops rapidly from the boundary toward the center, leaving a "hole": once the similar cases surrounding the hole are moved out of the training set, the model can no longer answer the test samples inside it, which shows that the model relies on similar cases for inference. To ensure the conclusion is fair, the authors also split the dataset with a random train/test split and observed that the model easily reaches nearly 100% test accuracy. This indicates that the number of training examples in the Leave-Square-Out experiments is sufficient for the task, and it again confirms that transformers perform case-based reasoning (under a random split, every test sample has similar training samples nearby).

Does a scratchpad change the model's reasoning behavior?
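To visualize where the model fails inside the square, one can record per-cell correctness and plot it as a heatmap. A minimal sketch, assuming the `(prompt, answer)` sample format from the split above and a hypothetical `model_predict` callable that returns the fine-tuned model's completion for a prompt:

```python
def accuracy_grid(model_predict, test_set, square_corner=(100, 100), side=40):
    """Record per-cell correctness inside the held-out square.

    Plotting the grid as a heatmap shows whether errors concentrate at
    the center of the square, as case-based reasoning predicts, or are
    absent altogether, as rule-based reasoning predicts.
    """
    cx, cy = square_corner
    grid = [[0] * side for _ in range(side)]
    for prompt, answer in test_set:
        a, b = map(int, prompt.rstrip("=").split("+"))
        grid[a - cx][b - cy] = int(model_predict(prompt).strip() == answer)
    return grid
```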


Figure 3: Accuracy inside the test square after fine-tuning GPT-2 on addition with a scratchpad.

The authors further explore whether case-based reasoning can be eliminated by adding a scratchpad, i.e., instructing the model to add the numbers digit by digit in its output, so that it might shift to rule-based reasoning (see Figure 4 for the scratchpad format). Figure 3 shows the accuracy within the test square after fine-tuning GPT-2 on addition with a scratchpad. On the one hand, there are still regions of the test square that the model cannot solve, indicating that it is still doing case-based reasoning. On the other hand, the model's dependence on training examples clearly changes: instead of a single continuous hole in the test square, the incorrect regions now appear as triangles whose hypotenuses align with the "carry boundaries" of the units and tens digits. For example, in the second image from the left in Figure 3, there are two triangular regions where the model's accuracy is almost zero. The small triangle shows that the model cannot solve problems such as 47+48, because the training set contains no step that carries into the tens digit (all "forties plus forties" examples are in the test set). For test samples involving no such carry, such as 42+43, the model succeeds, because intermediate steps like 4+4 appear in plenty of other training data. For the large triangle, the model fails on problems such as 57+58 because no case in the training set requires a carry into the hundreds. The shape and location of these dark regions suggest that the model succeeds only when every intermediate step of a test case has been seen in the training set; otherwise it fails. More importantly, this shows that even with the help of step-by-step reasoning, transformers struggle to learn rule-based reasoning: the model mechanically memorizes the individual steps it has seen, but does not learn the rule behind them.
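To make the scratchpad format concrete, the sketch below generates a digit-by-digit addition trace with explicit carries. The template wording is an assumption for illustration; the paper's exact format may differ:

```python
def scratchpad_target(a: int, b: int) -> str:
    """Generate a digit-by-digit addition trace, least-significant
    digit first, tracking the carry at each step. One plausible
    scratchpad template; the paper's exact format may differ."""
    da, db = str(a)[::-1], str(b)[::-1]
    carry, steps, digits = 0, [], []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        new_carry, d = divmod(x + y + carry, 10)
        steps.append(f"{x}+{y}+{carry}={x + y + carry}, write {d}, carry {new_carry}")
        carry = new_carry
        digits.append(str(d))
    if carry:
        digits.append(str(carry))
    return "\n".join(steps) + f"\nanswer: {''.join(reversed(digits))}"

print(scratchpad_target(47, 48))
# 7+8+0=15, write 5, carry 1
# 4+4+1=9, write 9, carry 0
# answer: 95
```

Note how the trace for 47+48 contains the step "7+8+0=15, write 5, carry 1": if no training example ever produced a carry in that position, a case-based learner has nothing to retrieve, which is exactly the triangular failure pattern described above.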

Beyond this, the authors ran extensive ablations over the location and size of the test square, the model size (GPT-2-Medium as well as the larger Llama-2-7B and GPT-3.5-Turbo), and the dataset size. The conclusion that the models perform case-based reasoning holds uniformly across all of these settings; details can be found in the paper.

Rule-Following Fine-Tuning (RFFT)

The intervention experiments above show that transformers tend to use case-based reasoning in mathematical reasoning. However, case-based reasoning severely limits generalization: for the model to answer a new test sample correctly, it must have seen similar samples during training, and it is nearly impossible for the training set to cover all samples similar to unknown inference problems (especially for problems requiring length generalization).

Figure 4: Input-output sequences for direct answer, scratchpad, and rule-following.

To alleviate this problem, the authors propose Rule-Following Fine-Tuning (RFFT), a technique that teaches transformers rule-based reasoning. As shown in Figure 4, RFFT provides explicit rules in the input and then instructs the transformer to recall and execute the rules line by line. In the experiments, the authors fine-tune Llama-2-7B and GPT-3.5-turbo with the three methods shown in Figure 4 on 1-5 digit addition, and test them on out-of-distribution addition with 6-9 and 6-15 digits, respectively.
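The sketch below builds one rule-following training pair in the spirit of Figure 4: the explicit rule is placed in the input, and the target recalls and executes it step by step. The rule wording and step format are simplified stand-ins for the paper's actual template:

```python
ADDITION_RULE = """\
Rule for addition:
1. Reverse both numbers.
2. Add digit by digit with a carry, least significant digit first.
3. Append any final carry, then reverse the digits."""

def rfft_example(a: int, b: int) -> dict:
    """Build one rule-following training pair: the rule text is part of
    the input, and the target executes the rule line by line. A
    simplified stand-in for the paper's actual template."""
    da, db = str(a)[::-1], str(b)[::-1]
    lines = [f"Step 1: reverse -> {da}, {db}"]
    carry, digits = 0, []
    for i in range(max(len(da), len(db))):
        x = int(da[i]) if i < len(da) else 0
        y = int(db[i]) if i < len(db) else 0
        carry, d = divmod(x + y + carry, 10)
        digits.append(str(d))
        lines.append(f"Step 2.{i + 1}: {x}+{y} with carry -> digit {d}, new carry {carry}")
    if carry:
        digits.append(str(carry))
    lines.append(f"Step 3: reverse digits -> {''.join(reversed(digits))}")
    return {"input": f"{ADDITION_RULE}\nCompute {a}+{b}.",
            "target": "\n".join(lines)}

print(rfft_example(47, 48)["target"])
# Step 1: reverse -> 74, 84
# Step 2.1: 7+8 with carry -> digit 5, new carry 1
# Step 2.2: 4+4 with carry -> digit 9, new carry 0
# Step 3: reverse digits -> 95
```

The key design difference from the scratchpad is that the rule itself appears verbatim in the input, so the model is trained to ground each output step in a stated rule rather than in memorized step patterns.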


Figure 5: Length generalization of Llama-2-7B and GPT-3.5-turbo under the three fine-tuning methods.

As Figure 5 shows, RFFT significantly outperforms direct-answer and scratchpad fine-tuning in length generalization. With RFFT, Llama-2-7B maintains 91.1% accuracy on 9-digit addition, whereas the scratchpad fine-tuned model achieves less than 40% accuracy on the same task. For GPT-3.5-turbo, whose base capability is stronger, RFFT generalizes strikingly well to additions of up to 12 digits: despite being trained on only 100 samples of 1-5 digit addition, the model still maintains over 95% accuracy on 12-digit addition, again far exceeding scratchpad and direct answer. These results highlight the effectiveness of RFFT in steering transformers toward rule-based reasoning and its potential to improve length generalization. Notably, the authors found that Llama-2-7B needs 150,000 training samples to generalize to 9 digits, while GPT-3.5 masters the rules and generalizes to 12 digits with only 100 samples. Rule following may therefore be a meta-learning ability: it can be strengthened by training on diverse rule-following data and transferred more easily to domains unseen during training, and the stronger the base model, the more easily it understands and learns new rules. This also matches how humans learn new rules: experienced learners tend to learn faster.

In summary, this paper investigates whether transformers use case-based or rule-based reasoning on mathematical problems, and proposes rule-following fine-tuning to explicitly teach transformers rule-based reasoning. RFFT demonstrates strong length generalization and has the potential to broadly improve the reasoning capabilities of LLMs.
