a1 Research Foundation – Flanders (FWO), Egmontstraat 5, 1000 Brussels, Belgium
a2 Quantitative Lexicology and Variational Linguistics (QLVL), University of Leuven, Blijde-Inkomststraat 21, P.O. Box 3308, 3000 Leuven, Belgium
Languages are not uniform. Speakers of different language varieties use certain words differently: more or less frequently, or with different meanings. We argue that distributional semantics is the ideal framework for investigating such lexical variation. We address two research questions and present our analysis of the lexical variation between Belgian Dutch and Netherlandic Dutch. The first question involves a classic application of distributional models: the automatic retrieval of synonyms. We use corpora of the two language varieties to identify the Netherlandic Dutch synonyms for a set of typically Belgian words. The second question concerns the automatic identification of words that are typical of a given lect, either because of their high frequency or because of their divergent meaning. Overall, we show that distributional models identify more lectal markers than traditional keyword methods, and that they are biased towards a different type of variation than those methods. In summary, our results demonstrate how distributional semantics can support research in variational linguistics, with possible future applications in lexicography and terminology extraction.
(Received July 15 2009)
(Revised February 25 2010)
(Accepted May 14 2010)