
International experience and implications for the development of test score reports

Author: Yungta English



Sun Hang, Jin Yan

Abstract: The score report is the presentation of test results. As the communication medium between test developers and test users, it largely determines whether a test can achieve its intended purpose. Over the past decade and more, research and practice on score reporting in education and psychometrics have produced fruitful results. This paper focuses on internationally leading score report development processes and evaluation methods, analyzes cases of large-scale foreign language test score reports at home and abroad, and proposes that the reform and innovation of score reports for educational tests in mainland China can proceed along three lines: overall planning, theoretical construction, and strengthened application of information technology, so as to provide a reference for theoretical and practical innovation in score reporting on the mainland.

Keywords: score reporting; educational evaluation reform; R&D framework; evaluation system; international experience

As the form in which test results are presented, the score report is an important part of a test's function. According to Zapata-Rivera et al., score reporting is the bridge between test information and test users' decisions or actions [1]. No matter how scientifically sound a test's conception, design, and delivery are, all the upfront effort is wasted if test users do not understand and use the score report correctly [2]. As the external face of a test, the score report is the most intuitive material through which the public learns about the test, and it directly shapes public perception of the test [3]. Historically, test developers have focused most of their effort on developing exams that can technically withstand the scrutiny of the public and professionals, with relatively little research on how test results are organized, reported, and used [4]. For a long time, the score reports of most exams contained little more than an overall score and information of limited relevance to the test user, creating the negative stereotype that a test reduces each test taker to a number or a pile of incomprehensible information [5]585. Studies have shown that educational policymakers, educators, and the general public all face considerable difficulty in understanding and using test results [6-8].

For more than a decade, the public and the education sector have been rethinking the washback of examinations, prompting the fields of education and psychometrics to recognize the importance of score reporting. Research on score reporting has grown rapidly and become an important, independent area of study. At the same time, its scope is no longer limited to analyzing the psychometric characteristics of test scores; it now extends to how to account for the needs and characteristics of specific audiences when designing score reports, how to use different graphics and supporting materials to improve users' understanding of reports, and how to promote the fair use of test information [9].

At present, large-scale educational examinations in mainland China are strictly and rigorously administered and their test items are of high quality, but score reporting receives insufficient attention: related research is scarce, and no independent theoretical system or practical model has yet formed. Studies in China have found that existing score reports are limited in content and form and offer weak guidance for teaching and learning, and candidates believe the richness of current large-scale test score reports needs improvement [10-11]. At the same time, more and more Chinese scholars recognize the importance of mining and using test data and of reforming and innovating score reporting [12-14]. Accordingly, this paper reviews the basic characteristics, development steps, and evaluation methods of score reporting in international educational testing research, summarizes the current practice of large-scale test score reporting, and explores the content and form of scientific and effective score reports, in order to advance both theoretical research and practical exploration of score reporting in mainland education.


1. Basic characteristics of score reporting

Because tests differ markedly in purpose and in the audiences their score reports serve, there is no one-size-fits-all score reporting model. However, there is much in common in the factors that should be considered when developing score reports and in the components reports contain. On the basis of a large number of existing score reports, researchers have constructed a summary table of the basic characteristics of score reports [15], as shown in Table 1.

Table 1 Basic characteristics of score reports

The framework divides the characteristics of score reporting into eight basic elements:
1) Reporting audience: the users of the report, including students, parents, teachers, education administrators, and so on; the audience largely determines the content of the report and how information is presented.
2) Score scale: the form in which scores are presented, including raw scores, standard scores, percentiles, and other forms, each with its own advantages and disadvantages.
3) Score reference: whether the score indicates a student's position within a norm group at the school, regional, or national level (norm-referenced), or whether the student's mastery of certain knowledge meets a standard (criterion-referenced).
4) Evaluation unit: item scores, subscores, and total scores. The total score is the most common evaluation unit, while subscores in specific knowledge or skill areas can provide more instructive information for teaching and learning, such as diagnostic feedback.
5) Reporting unit: the level at which reports are provided, such as individual, class, school, or region, each with its own characteristics.
6) Measurement error: how the test's measurement error is presented and interpreted in the report, for example by reporting the error alongside each subscore.
7) Presentation method: the three presentation modes contained in a report, namely numbers, charts, and text descriptions.
8) Report medium: the means of dissemination; the paper version is the traditional medium, and online versions (both static and interactive) are becoming increasingly common with the development of information technology.
Among these eight elements, the reporting audience and reporting unit define the report's target users; the score scale, score reference, evaluation unit, and measurement error mainly concern how the test results are reported; and the presentation method and report medium concern how the report's content is presented and distributed. When developing a score report, test developers should first consider and address these basic characteristics.
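To make the "score scale" element concrete, the sketch below converts a small set of raw scores into z-scores, T-scores, and percentile ranks, three common reporting scales. This is an illustrative computation only, not any particular test's scoring procedure; the function name and sample scores are invented for the example.

```python
import statistics

def to_standard_scores(raw_scores):
    """Convert raw scores to z-scores, T-scores, and percentile ranks.

    Illustrates the 'score scale' element: the same raw score can be
    reported on several scales, each with interpretive trade-offs.
    """
    mean = statistics.mean(raw_scores)
    sd = statistics.pstdev(raw_scores)  # SD of the norm group
    results = []
    for x in raw_scores:
        z = (x - mean) / sd
        t = 50 + 10 * z  # T-score: mean 50, SD 10
        # Percentile rank: share of the norm group scoring at or below x
        pct = 100 * sum(1 for y in raw_scores if y <= x) / len(raw_scores)
        results.append({"raw": x, "z": round(z, 2), "t": round(t, 1),
                        "percentile": round(pct, 1)})
    return results

# Tiny invented norm group for illustration
for row in to_standard_scores([52, 61, 70, 75, 83]):
    print(row)
```

A raw score of 70 here sits near the group mean, so its z-score is close to 0 and its T-score close to 50, while the percentile column locates each candidate within the norm group, which is exactly the norm-referenced interpretation described in element 3.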

Zenisky and Hambleton point out that the analysis of the basic elements of a score report should fully consider the report's content (descriptions and data), its method of distribution (paper or electronic), and its target audience (individuals or groups) [5]586-591. The descriptive part of the report content offers basic explanation and framing, including the test name/logo, test date, report title, report purpose, test purpose, an introductory overview, a header identifying the individual or group tested, external links to additional resources (such as course materials and interpretive guides), guidance on score use, a glossary, suggested next steps, and similar information. The data section is the core of the score report and includes content such as summative results, performance level descriptions, individual performance, item-level results, norm-referenced results, formative or diagnostic information, progress predictions, and item mapping. In addition, according to the 2014 edition of the Standards for Educational and Psychological Testing, testing bodies should help report users correctly understand the meaning of test scores [16]119. Score reports should therefore be clear and easy to understand and should provide information relevant to score interpretation, such as explanations of the scoring method and of score precision. It should also be recognized that the most important information to include in a score report, and the best way to present it, depend on the target audience, the purpose of the test, and the psychometric properties of the test scores [17].
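The description/data split above can be pictured as a simple data structure. The following Python sketch models a score report along those lines; all class names, fields, and sample values are illustrative inventions, not any testing body's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class DescriptionSection:
    """Explanatory material framing the scores (fields are illustrative)."""
    test_name: str
    test_date: str
    report_purpose: str
    score_use_guidance: str = ""
    external_resources: list = field(default_factory=list)  # e.g. guide URLs

@dataclass
class DataSection:
    """The score data itself; optional fields cover richer reports."""
    summative_score: float
    performance_level: str = ""               # e.g. "proficient"
    subscores: dict = field(default_factory=dict)        # skill -> score
    norm_referenced: dict = field(default_factory=dict)  # e.g. {"percentile": 84}
    diagnostic_notes: list = field(default_factory=list)

@dataclass
class ScoreReport:
    description: DescriptionSection
    data: DataSection

# Invented example report
report = ScoreReport(
    DescriptionSection("Sample English Test", "2024-06-01",
                       "Report individual listening/reading results"),
    DataSection(summative_score=540.0,
                performance_level="proficient",
                subscores={"listening": 265.0, "reading": 275.0}),
)
print(report.data.subscores["reading"])
```

Structuring the report this way makes the point of the framework explicit: the descriptive fields carry interpretation support, while the data fields carry the results, and a report lacking either half is incomplete.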


2. R&D framework and process of score reporting

In recent years, a number of studies in education and psychometrics have examined the steps for developing score reports. These studies provide test developers with scientifically sound, research-based frameworks for gathering validity evidence. Among them, the three frameworks developed by Zapata-Rivera [18], Hambleton et al. [19], and Slater et al. [20] are widely used. Each framework is named after its principal investigator, and the main stages and examples of the specific content of each framework are shown in Table 2.

Table 2 Main stages of three score report development frameworks

Comparing the three frameworks shows that the Zapata-Rivera framework corresponds to the first three phases of Hambleton et al.'s, while the latter adds a fourth phase emphasizing continuous adjustment and maintenance of the score report after release. The Slater et al. framework adds a development schedule for the score report (Phase 2) and stresses the importance of collecting feedback from the test client (Phase 4) and from users (Phase 5). Overall, the three frameworks share four main phases: preparatory groundwork, sample report development, feedback collection and revision, and report release and maintenance. The following takes the Hambleton et al. framework as an example to introduce in detail the actions test developers should take, and the points they should attend to, at each stage.

(1) Phase 1: Lay the foundation for R&D

Hambleton and Zenisky divide the groundwork for score report development into four steps. The first step is to clarify the factors the score report must reflect throughout test design: for example, what competencies or skills does the exam measure, and what information should the score report provide? The primary focus of this step is to ensure that the score report accurately reflects what the exam is intended to achieve, and to clarify the relationship between the test, the scores, and the score report at the earliest stage of test development. The second step is to identify the report audience, i.e., the stakeholders who will make decisions or take action based on the score report. For example, who are the main consumers of score reports, and what decisions will they make based on them? Different groups have different needs: teachers can adjust their teaching according to score reports, students can use reports for targeted self-directed learning, and education administrators can use reports as an important basis for evaluation and selection. The third step is to conduct a needs analysis of the report audience. In communicating with report users, attention should be paid to what they want to know about test performance, what information is useful to them, and how they obtain and use score reports [5]593. Analyzing the audience involves not only identifying their needs but also understanding their assessment literacy [1]: groups with less assessment knowledge need more explanatory and supporting information to help them understand the report correctly. The fourth step is to review the relevant literature, including industry codes of conduct and empirical research [21], which provide a scientifically reliable reference for test developers. At present, many test providers have made their score report formats public, and reviewing these examples allows developers to learn from best practices and avoid repeating others' mistakes [2].

(2) Phase 2: Develop a sample score report

Based on the information gathered in Phase 1, the test developer designs one or more sample score reports. The sample score report, also known as the prospective score report (PSR), is a model of the content and presentation of the final score report [22]. Ideally, developers design the PSR at the very beginning of test development and revise it as the test evolves. In reality, however, many score reports are not put on the agenda until the final stages of test development, leaving very limited time and resources for design and revision [5]591, so important information may be omitted. In addition, because the PSR must match the test objectives and the needs of a specific audience while also achieving a coherent overall design and clear, accurate presentation of information, it requires collaboration among experts in different fields. Depending on the form and function of the score report, the team may include subject-matter experts, psychometricians, information technology specialists, graphic designers, and others [19].

(3) Phase 3: Obtain feedback and make corrections

This phase aims to obtain feedback on the PSR and revise the report accordingly; it is an indispensable step in score report development. First, an internal review is conducted until a satisfactory result is reached, which requires the test developer to carry out several rounds of review and revision of the PSR based on the information gathered. Second, an external review is conducted using a variety of research methods, such as questionnaires, focus groups, interviews, think-aloud protocols, direct observation, and eye-tracking experiments, to collect report users' attitudes and reactions and to explore whether they correctly understand the report's content. Test developers can also present users with different versions of the score report to learn which features of which version they prefer [20]. Finally, the data should be analyzed carefully and used as the basis for revising the report. This is an iterative process: the score report typically goes through multiple rounds of revision before the final version is officially put into use.

(4) Phase 4: Evaluate and maintain the report

After the score report is officially released, the test developer must continue to maintain it. The data collection methods described in Phase 3 can also be applied here. Evaluating score reports requires collecting feedback from report users on a large scale, concerning both the report's content and format (e.g., readability of information, visualization, and preferences among different content and presentations) and users' understanding and use of the report. For example, can report users correctly describe the meaning of the scores? What decisions do they make based on the report? At this stage, attention should focus on the extent to which users understand (or misunderstand) and use (or misuse) the score report, and on the resulting positive and negative impacts.

In conclusion, these R&D frameworks provide a scientific and effective template for score report design, grounding test development in rules and evidence. The frameworks are flexible enough to be applied to different examination scenarios. Moreover, score report development has evolved into an iterative design methodology, in which the report is repeatedly modified as needed in light of information gathered at later stages [17,19]; this is reflected in Zapata-Rivera's score report development process, as shown in Figure 1.

Figure 1 Zapata-Rivera's score report development process

3. The evaluation system of the score report

In order to ensure the validity of the score report, it is necessary to objectively evaluate the score report itself and the R&D process.

Corresponding to the score report development framework, Zenisky and Hambleton designed an evaluation form for assessing each stage of the score report development process, as shown in Table 3 [5]595. The form consists of open-ended questions and is designed to encourage test developers to document the details of the development process clearly. Zenisky and Hambleton argue that by explicitly documenting and explaining the development process, testing bodies can accumulate evidence of the validity of score report development to support the fair use of the report [5]597.

Table 3 Evaluation form for the score report development process

In addition, Zenisky and Hambleton proposed 37 guiding questions for evaluating score reports, covering eight areas. One representative question for each area is listed here:
1) Overall: does the score report reflect the information needs of key stakeholders?
2) Report introduction and description: does the score report state the purpose of the test?
3) Test scores and performance levels: does the score report describe in detail the performance levels or labels used, such as pass/fail, basic, or proficient?
4) Indicators of test performance: does the score report inform the user of the precision of the scores?
5) Other report content: does the score report provide contact information, such as a phone number and website, so that users with questions can make inquiries?
6) Language: does the score report avoid statistical or other technical terms and symbols that users may find difficult to understand?
7) Design: is the report clearly and logically divided into sections to improve readability?
8) Interpretive guides and other supporting materials: do interpretive guides exist, and do they provide clear and useful information?
Taken together, these guiding questions distill existing score reporting practice and research, and they can support a comprehensive evaluation of score reports.
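Such guiding questions can be operationalized as a review checklist. The sketch below tallies, per area, how many questions a draft report satisfies; the areas and wording are abbreviated paraphrases for illustration, not the full 37-question instrument.

```python
# Abbreviated, paraphrased checklist; the full instrument has 37 questions
# across eight areas.
CHECKLIST = {
    "overall": ["Does the report reflect key stakeholders' information needs?"],
    "introduction": ["Does the report state the purpose of the test?"],
    "scores and levels": ["Are the performance levels used described in detail?"],
    "precision": ["Is the user told how accurate the scores are?"],
    "language": ["Are hard-to-understand technical terms avoided or explained?"],
}

def coverage(answers):
    """answers maps question -> bool; returns per-area 'passed/total' tallies."""
    return {
        area: f"{sum(answers.get(q, False) for q in qs)}/{len(qs)}"
        for area, qs in CHECKLIST.items()
    }

# Example review of a hypothetical draft report
draft_review = {
    "Does the report state the purpose of the test?": True,
    "Is the user told how accurate the scores are?": False,
}
print(coverage(draft_review))
```

A tally like this makes gaps visible at a glance (here, the draft states the test purpose but omits score precision), which is the practical use the guiding questions are meant to serve.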

From the perspective of score interpretation and use within the concept of test validity, O'Leary proposed an outcomes-focused evaluation of score reporting built on two principles, clarity and practicality, comprising seven elements in total [23]. Clarity requires that the score report be easy to understand and consists of four elements: 1) design features, i.e., the design must draw on existing best practice, including the best contemporary reporting examples, codes of conduct, and recommendations from the literature; 2) interpretive guidance, meaning that score reports must be as self-contained as possible, minimizing the extra work needed to fully understand the information they contain; 3) presentation, meaning that the report must integrate multiple ways of presenting data; and 4) language, meaning that the report's language must be easy to understand. Practicality requires that the purpose of the score report, its intended interpretations, and the intended actions and consequences be clear.

In summary, the three evaluation systems above differ in emphasis. The first form designed by Zenisky and Hambleton evaluates the score report development process and guides test developers in examining their own development steps; the second focuses on the score report itself, examining its components through guiding questions; and O'Leary's system originates from the requirements of score reporting validity and evaluates the report from the perspective of how its results are used. A test development agency can choose the evaluation system that suits its needs.


4. Analysis and comparison of cases of large-scale foreign language test score reports at home and abroad

Promoting and deepening the reform and innovation of score reporting should be grounded in an understanding of existing practice. Therefore, drawing on Ryan's framework of the basic characteristics of score reports and on Zenisky and Hambleton's work, this study examines the score reports of seven representative and influential language tests with large numbers of test takers at home and abroad; the results are shown in Table 4.

Table 4 Content and presentation of seven large-scale language test score reports

The seven language tests are the TOEFL (TOEFL iBT), IELTS, the Cambridge Certificate in Advanced English (CAE), the Pearson Test of English Academic (PTE), the Duolingo English Test (Duolingo), the CET-4, and the Chinese Proficiency Test (HSK). Table 4 shows the types of information currently contained in the seven score reports and how that information is presented. Since all seven reports contain basic information (candidate and exam details), this is not repeated in the table. The focus of this study is the score report itself, so other information appearing on each test's official website is not included in the table.

As Table 4 shows, the existing large-scale foreign language tests share commonalities in the content and form of their score reports but also differ considerably. First, the total score is the information candidates care about most. The full marks of the seven tests range from 9 to 710 points, indicating that the scoring systems of different reports vary greatly. Zenisky and Hambleton argue that tests adopt different scoring schemes partly because examining bodies want to distinguish their scores from those of other tests and prevent misinterpretation [5]590. However, the variety and complexity of scoring systems can make it difficult for non-experts, such as test takers, parents, and teachers, to understand what the scores really mean. To help test users better understand score meaning, some tests map scores onto a proficiency scale. For example, IELTS and CAE report the correspondence between scores and the Common European Framework of Reference for Languages (CEFR) directly in the score report; TOEFL and Duolingo do not show this in the report itself, but the corresponding information is available on their official websites, and Duolingo additionally provides score comparisons with TOEFL and IELTS to help readers gauge their ability level. Second, in terms of richness of information, although the tests classify language skills somewhat differently, all provide subscores for individual skills. In addition, TOEFL provides test takers with their historical best scores, the HSK reports percentiles to help test takers locate their scores within the norm group, and CET-4 provides norm information, subscores, percentile tables for total scores, and other information in the score interpretation on its official website.
Third, regarding diagnostic information, PTE provides a candidate's personal skills profile with skill definitions and personalized recommendations to help candidates further understand their strengths and weaknesses. Finally, regarding language proficiency descriptions, Duolingo describes test takers' overall ability, CET-4 reports three levels of proficiency for its oral test, and tests such as TOEFL describe different skills and levels on their official websites without reflecting them in the score report itself. Moreover, most of these exams place detailed interpretive guidance on their official websites, but whether candidates can quickly find this information depends on whether its location is indicated on the score report; among the seven, TOEFL, CAE, Duolingo, and CET-4 indicate the location of this information directly on the report.
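Score-to-CEFR mappings like those mentioned above are essentially threshold lookups. The sketch below illustrates the idea with IELTS-band cut-offs; the thresholds are rough illustrative values, and the official IELTS/CEFR alignment tables should be consulted for authoritative ones.

```python
# Illustrative only: approximate IELTS band -> CEFR mapping. Consult the
# official IELTS/CEFR alignment tables for authoritative cut-offs.
IELTS_TO_CEFR = [
    (8.5, "C2"),
    (7.0, "C1"),
    (5.5, "B2"),
    (4.0, "B1"),
]

def cefr_level(ielts_band):
    """Return the CEFR level whose (illustrative) cut-off the band meets."""
    for cutoff, level in IELTS_TO_CEFR:
        if ielts_band >= cutoff:
            return level
    return "below B1"

print(cefr_level(7.5))  # C1 under these illustrative cut-offs
```

The design point is that the mapping is ordered from highest cut-off down, so the first threshold a score meets determines its level; this is the logic behind the CEFR correspondence rows that IELTS and CAE print on their reports.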

The following is an analysis of the score reports of the Duolingo and PTE exams, as shown in Figure 2 and Figure 3.

Figure 2 Duolingo English Test score report; Figure 3 PTE Academic score report

As Figure 2 shows, the Duolingo score report can be divided into three areas. The first is basic information about the candidate and the exam. The second and third form the body of the report: the candidate's total score and subscores, respectively. In the second area, the report highlights the overall score, which users care about most, with a larger font in orange, and provides a brief bulleted description of the candidate's overall language ability to help users understand the tasks the candidate can complete in English. Besides numbers and text, the report also shows graphically where the candidate's score falls on the scale. The third area reports four subscores, each combining two skills (reading and writing, reading and listening, listening and speaking, and writing and speaking), also in bright orange, supplemented by concise text descriptions and graphics. Notably, the Duolingo report displays the score range of the candidate's total score and subscores within the score graphics, which relates to score precision. The 2014 edition of the Standards for Educational and Psychological Testing explicitly states that test developers should provide report users with information about score precision [16]119, and it has been suggested that such information can help prevent users from over-interpreting scores [7]. Duolingo's graphical display of score ranges follows good score reporting practice and is a useful exploration of how to report score precision. However, since the report does not explain what the score range means, whether users understand this information remains to be studied. Finally, at the bottom of the report, a "Learn more" note in orange font gives the URL where details of the test's scoring can be found.

Duolingo's score report is concise, clear, and readable, without excessive accumulation of information; its spatial organization, use of color, and combination of charts, text, and numbers are scientific and reasonable, in line with the basic principles of effective score reporting, and providing test takers with a score range is a highlight. However, studies have found that report users often have difficulty understanding information about score precision, such as standard errors and confidence intervals [7], so the score range might be more effective if supplemented with an appropriate explanation.
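The score ranges discussed above can be constructed, in classical test theory, from the standard error of measurement. The sketch below shows that standard computation; the numbers are invented for illustration, and Duolingo's actual procedure may differ.

```python
import math

def score_band(observed, sd, reliability, z=1.96):
    """95% band around an observed score using the standard error of
    measurement, SEM = SD * sqrt(1 - reliability).

    Classical-test-theory construction behind graphical score ranges;
    the inputs here are invented illustration values.
    """
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

lo, hi = score_band(observed=110, sd=15, reliability=0.91)
print(f"{lo:.1f} - {hi:.1f}")
```

Note how the band narrows as reliability rises: with reliability 1.0 the SEM is zero and the band collapses to the observed score, which is why reporting the band alongside the score guards against over-interpreting small score differences.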

Figure 3 shows the PTE score report, which contains four sections. The first presents the candidate's test number and registration information, with the total score on the right, its importance highlighted through graphics and background color. The second gives subscores for the four communicative skills of listening, reading, speaking, and writing, also emphasized with graphics and color. The third is divided into two parts: the left uses a bar chart to present the skill subscores and the total score again, showing the comparisons between scores more intuitively, while the right presents candidate information. The fourth, at the bottom of the report, gives test center information. When candidates view the PTE report online, they are also shown a personal skills profile of eight sub-skills, including speaking and writing, open-ended responses, and short writing, together with skill overviews and personalized recommendations. In the profile, the skills involved are visualized with icons such as headphones, books, speech bubbles, and pens; proficiency in each area is shown in a bar chart; and the recommendations are given as bulleted text. Such detailed diagnostic information can help test takers understand the strengths and weaknesses of their language ability and the direction of future learning, supporting targeted independent study.

In content, the PTE score report is rich: in addition to candidate and test information, the total score, and subscores, it provides detailed diagnostic information and suggestions for future learning, enhancing the test's learning promotion function. In presentation, the report effectively combines numbers, text, and graphics, but the layout of the candidate information on the right of the third section could be further optimized; merging it into the first section might be clearer.

Overall, these reports illustrate the style and content of good score reports: they use information in different forms (text, numbers, and charts), highlight important information, partition content reasonably by importance, and provide interpretive guidance for scores. Some issues remain, however, such as the lack of personalized feedback in some reports and the absence of specific pointers to other resources.


5. Implications for the reform of score reporting in mainland education tests

Educational examinations in mainland China are diverse, large in scale, and far-reaching in social impact, and they play an important role in promoting educational equity and social stability [26]. Given the enormous impact of test scores on society, examination bodies should fully understand the importance of score reporting and actively explore its reform. Specifically, drawing on advanced international experience and practice, future research and practice on the mainland can proceed along the following three lines.

First, plan the test project holistically, designing and considering the score report from the earliest stage of development. Four aspects need to be planned in advance: 1) determine the nature and purpose of the examination, taking full account of the information needs and assessment literacy of teachers, students, schools, and other stakeholders; 2) incorporate the development of supporting materials, such as score interpretation guidelines, into the design plan; for example, when developing and evaluating score report prototypes, use a range of empirical methods (such as think-aloud protocols, questionnaires, interviews, and eye-tracking experiments) to investigate the attitudes, preferences, and comprehension of different report users (such as students, teachers, and educational administrators), and revise the report accordingly based on their feedback; 3) after the score report is released, use follow-up studies, case studies, ethnographic research, and other methods to continuously investigate the decisions and actions report users take, paying particular attention to how the score report guides students' learning and teachers' teaching; 4) actively draw on leading international score report development frameworks and exemplary cases, document and evaluate the development process in detail, and collect validity evidence for score report development.

Second, construct and innovate theory to form localized codes of conduct and guidelines for score report development and evaluation. Theory construction covers the basic characteristics of score reporting, development steps, evaluation methods, and validity verification. Codes of conduct and guidelines define the principles and standards that excellent score reports should follow; references include the score reporting provisions of the 2014 edition of the Standards for Educational and Psychological Testing [16]119-144 and the International Test Commission's 2014 guidelines on quality standards for score reporting [27]. Establishing localized codes of conduct and guidelines will help standardize score reporting practice in mainland education and improve the quality of score reports.

Third, actively explore information technology in the design, development, and application of score reports, especially online interactive reports assisted by artificial intelligence. Online interactive score reports let users select and rank the information presented, drill into deeper layers of information, and change how the information is displayed, making targeted, personalized, multidimensional score reports a reality. How the steps and principles for developing an interactive report resemble and differ from those for traditional written reports, and how technological breakthroughs can be achieved, warrant further research and exploration. Developing online interactive score reports requires interdisciplinary collaboration, and the applications and roles of cognitive science, information design, aesthetics, user interface research, and other fields in score report design and development should be fully explored.
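The "select and rank" interaction described above can be sketched in a few lines. This is a minimal illustration of the idea, not a description of any real system: the function name, the score values, and the skill names are all hypothetical, and a real interactive report would sit behind a web interface rather than a function call.

```python
def interactive_view(scores, selected=None, sort_by="score", descending=True):
    """Return the (skill, score) pairs the user asked for, in the order
    they requested: a toy model of an interactive report's filter/sort step."""
    # Filter: keep only the skills the user selected (None means all).
    rows = [(k, v) for k, v in scores.items()
            if selected is None or k in selected]
    # Rank: order by score or alphabetically by skill name.
    key = (lambda r: r[1]) if sort_by == "score" else (lambda r: r[0])
    return sorted(rows, key=key, reverse=descending)

# Hypothetical sub-scores for one candidate.
scores = {"Listening": 75, "Reading": 70, "Speaking": 68, "Writing": 74}

# Show only the productive skills, weakest first, to guide practice.
view = interactive_view(scores, selected={"Speaking", "Writing"},
                        sort_by="score", descending=False)
# view == [("Speaking", 68), ("Writing", 74)]
```

The point of the sketch is that interactivity shifts control of filtering and ordering from the report designer to the report user, which is exactly what raises the new design questions the paragraph above identifies.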

The interpretation of test scores begins when people read score reports, so the design and distribution of score reports directly affect the validity of the test [17]. A good score report provides test stakeholders with the information they need to take sound action, in a form they can understand [2]. In the context of deepening educational evaluation reform in the new era, relevant fields in mainland China should change and innovate the design concepts of score reports, mine test data to provide multidimensional and effective reports, and feed rich information back into teaching and learning. By helping and guiding the public to correctly understand and use test results, we can build and promote a scientific teaching-learning-assessment alignment system and thereby improve the overall quality of educational examinations.

(References omitted)

(This article was first published in China Examinations, Issue 6, 2024)
