
Recent developments in language assessment and the case of four large-scale tests of ESOL ability

Published online by Cambridge University Press: 01 January 2009

Stephen Stoynoff
Affiliation: Minnesota State University, Mankato, USA
stephen.stoynoff@mnsu.edu

Abstract

This review article surveys recent developments and validation activities related to four large-scale tests of L2 English ability: the iBT TOEFL, the IELTS, the FCE, and the TOEIC. In addition to describing recent changes to these tests, the paper reports on validation activities that were conducted on the measures. The results of this research constitute some of the evidence available to support claims that these tests are suitable for their intended purposes. The discussion is organized on the basis of a framework that considers test purpose, selected test method characteristics, and important aspects of test usefulness.

Type: State-of-the-Art Article
Copyright: © Cambridge University Press 2008

