The Futures and Promises of International Educational Assessment

By Bryan Maddox and Camilla Addey, Laboratory of International Assessment Studies

This blog post reports findings from the Laboratory of International Assessment Studies ESRC Seminar Series, The potentials, politics and practices of international educational assessment. The series aimed to create a space for innovative and in-depth debate on international assessment methodology and the key policy issues surrounding international testing, and to develop new critical perspectives on their role in educational practice and policy. It acted as a vehicle for academic enquiry into international educational assessments: to develop an agenda for international assessment studies as a field of academic research and publication, and to build an international network of researchers and practitioners around a set of common interests and concerns.

This post synthesises the main seminar themes to stimulate debate, and suggests some themes and recommendations ahead of our final seminar at Humboldt University in Berlin (September 2016). The views presented here are not necessarily those of the seminar presenters, nor does the post capture all of the rich discussions and presentations that have taken place in the series. The full set of seminar programmes and video recordings of keynote speakers are available on the Laboratory of International Assessment Studies website.

We focus here on the technical-social interface in the production, reception and use of international assessment data. It is widely argued that ILSAs operate as an ‘expert’ realm of technical activity and governance (Grek 2013). However, drawing on Science and Technology Studies, we argue that many of the operational challenges and opportunities in the field involve a combination of social and technical concerns and dynamics. We view the limited understanding of the socio-technical dynamics involved in the production, reception and use of international assessments as an area of risk: neglect of these dynamics in large-scale educational testing can lead to negative impacts, data under-utilisation, and declining public and political support for educational testing.

Good Assessment?

An overarching theme in the series is the view that the practices of international assessment should be guided both by an understanding of the potentials and limitations of large-scale assessment, and by a concern with and responsibility for the consequences of testing, which should involve a consideration of a host of local and contextual factors. This involves a broad understanding of the socio-cultural dynamics of particular locations and the particular educational and assessment needs within these contexts. This has implications for responsibilities in the production, reception and use of assessment data.

As Hattie (2014) argues, ‘Poor interpretations of reports are harmful even if they are based on carefully designed tests’ (p. 34). We can extend this point to consider how testing agencies anticipate and take responsibility for unintended and ‘off label’ uses of assessment data (see the video of Bruno Zumbo’s seminar presentation). This means anticipating and considering the social dynamics of how tests are developed and used, and how they impact. In our view, ‘good’ assessment should support informed policy making that is sensitive to the characteristics of diverse societies and understands those characteristics as complex and multi-dimensional. It should support informed, equitable and nuanced policy-making processes, and should not create simplistic understandings or unwanted and ill-informed policy shocks.

Improving Assessment, Accountability and Governance

A second overarching theme concerns the impacts of ILSAs, assessment technologies and data on systems of educational governance, accountability and decision making. Assessment practices and data change the way that we understand and represent educational achievement internationally, nationally and locally, at the level of schools and classrooms, for example through dashboard data and topological cultures (see Lewis, Sellar and Lingard, 2016). One contemporary challenge is to engage with how ILSA practices and data are redistributing authority and responsibility, and to consider how far the effects of assessments can be considered beneficial and just (Zumbo and Hubley, 2016). This involves a focus on how assessment practices and data are understood and incorporated into democratic (and undemocratic) institutions of decision making and accountability. This extends from how to use data to questions about what kinds of ‘data needs’ are expressed, what data are produced and valued, and why.

The Challenges of Difference

Concerns about test validity and fairness may reduce confidence in the use of large-scale, standardised assessment data in contexts of high socio-economic, cultural and linguistic diversity (see the video of Ron Hambleton’s presentation). ILSAs are not sufficiently sensitive and responsive to issues of diversity, which are sometimes treated merely as sources of noise and bias. There is a growing perception that assessment design and the associated procedures for researching validity should be more carefully and purposefully aligned to these concerns. What psychometric models and systems of representation would support better understandings of inclusive assessment? What assumptions about the treatment of difference would support those understandings? Conventional (routine) statistical procedures of differential item functioning (DIF) analysis are not necessarily adequate for making sense of assessment performance in contexts of diversity, or for reassuring the users of assessment data about test fairness and reliability. ILSAs should therefore consider ways to improve how they capture, recognise and understand the characteristics and performance of diverse populations.

Improving Public Reception

What new investments would support improved public understanding of and engagement with international large-scale testing? The field of ILSAs is characterised by complexity and technical virtuosity. It highlights the potential of assessment techniques and data to transform educational policy making and comparison. However, the relationships between the technical, social and political dynamics of data reception are complex and not yet properly understood. This involves risks of over-simplification and unintended negative consequences for policy making, such as political ‘wash-back’ effects, resistance to large-scale testing, and data misinterpretation and under-utilisation. Furthermore, the particular dynamics of how ‘reference societies’ are compared, and how data are selected and presented, often respond to, and become part of, complex socio-cultural and political concerns (see the seminar video presentations by Pizmony-Levy and by Waldow). Governments and international testing agencies could therefore increase the resources directed to improving the public understanding, social uses and reception of assessment data.

Print media often have insufficient time and expertise to analyse assessment results, and as a result are prone to publishing simplistic, scandalising and misleading headlines (see the seminar presentation by Megan Knight). Media organisations often rely on academic specialists for advice, but those academics do not always have sufficient prior access to embargoed data. Poor-quality public reception of assessment findings can create unanticipated negative impacts on educational and political systems (including movements like ‘opt out’ against large-scale testing). Data under-utilisation is a risk if educationalists and politicians fear data release and analysis (i.e. if it is associated with negative headlines of ‘shock’ and ‘scandal’). Media embargo practices and timescales should be reconsidered to maximise opportunities for accurate and considered data reception.

‘Data shocks’ are not necessarily beneficial. They are sometimes associated with unwelcome ‘wash-back’ effects on national politics. In some cases this may undermine confidence in educational systems and encourage rapid policy borrowing. These shocks and responses can fuel movements against large-scale testing. The influence of large-scale assessments, and their potential to transform educational systems, depends on positive public (including teachers and parents) and political opinion about their value. Testing agencies should therefore reconsider the politics and virtue of ‘shock’, and consider how assessments can be integrated more effectively into democratic systems of decision-making, transparency and accountability. Is there a more effective way for testing agencies to create informed policy change?

As a concluding comment, we have welcomed the collaboration in the ESRC series between academics, policy makers and testing institutions. As Pizmony-Levy (2013) has argued, large-scale assessments initially involved academic researchers in the design and running of assessment programmes. More recently, academics have found themselves in more responsive and peripheral roles, such as secondary data analysis and critical policy studies. The exchanges between testing agencies and academics have therefore been especially valuable.

Photograph: Participants at the ESRC Seminar in New York.


References Cited

Hattie, J. (2014). The Last of the 20th-Century Test Standards. Educational Measurement: Issues and Practice, 33(4), 34–35.

Grek, S. (2013). Expert moves: International Comparative Testing and the Rise of Expertocracy. Journal of Education Policy, 28(5), 695–709.

Lewis, S., Sellar, S. & Lingard, B. (2016). ‘PISA for Schools: Topological Rationality and New Spaces of the OECD’s Global Education Governance’. Comparative Education Review, 60(1), 27–57.

Pizmony-Levy, O. (2013). Testing for all: the emergence and development of international assessments of student achievement, 1958–2012. Ph.D. dissertation, Indiana University, Bloomington, IN.

Pizmony-Levy, O. (2015). Concluding Remarks. ESRC Seminar on International convergence of educational reforms, Columbia Teachers College, New York, 5th and 6th March. Video available on the Laboratory of International Assessment Studies website.

Waldow, F. (2015). ‘Projecting Images of the “Good” and the “Bad School”: Using Top Scorers in Large-Scale Assessments as Reference Societies’. ESRC Seminar on International convergence of educational reforms, Columbia Teachers College, New York, 5th and 6th March. Video available on the Laboratory of International Assessment Studies website.

Zumbo, B.D. (2015, November). Tides, Rips, and Eerie Calm at the Confluence of Data Uses, Consequences, and Validity. Plenary address, ‘The Production of Data in International Assessments’, ESRC Research Seminar organised by the Laboratory of International Assessment Studies.

Zumbo, B. D. and Hubley, A.M. (2016). Bringing consequences and side effects of testing to the foreground. Assessment in Education: Principles, Policy and Practice, 23 (2), 299-303.