Estimating the latent number of types in growing corpora with reduced cost–accuracy trade-off*

Published online by Cambridge University Press:  24 February 2015

SHOHEI HIDAKA*
Affiliation:
School of Knowledge Science, Japan Advanced Institute of Science and Technology
* Address for correspondence: Shohei Hidaka, 1-1 Asahidai, Nomi, Ishikawa, 923-1292, Japan. tel: +81-761-51-1717; fax: +81-761-51-1775; e-mail: shhidaka@jaist.ac.jp

Abstract

The number of unique words in children's speech is one of the most basic statistics indicating their language development. It is difficult, however, to evaluate accurately the number of unique words in a child's growing corpus over time when the sample size is limited. This study proposes a novel technique for estimating the latent number of words from a series of words uttered by children. The technique exploits statistical properties of the number of types as a function of the number of sampled tokens. We tested the practical effectiveness of the proposed method in empirical analyses of cross-sectional and longitudinal samples. The converging empirical evidence indicates that the proposed estimator improves the accuracy of vocabulary size estimation over a set of existing estimators. Using this efficient estimator, we propose a new sampling scheme for vocabulary assessment that has lower cost and higher accuracy than existing methods.

Type
Articles
Copyright
Copyright © Cambridge University Press 2015 

Footnotes

* This work was supported by grants from the Japan Society for the Promotion of Science (JSPS): Grant-in-Aid for Scientific Research B No. 23300099 and Grant-in-Aid for Challenging Exploratory Research No. 25560297. I am grateful to two anonymous reviewers for their careful reading and fruitful suggestions.
