
Using shallow linguistic analysis to improve search on Danish compounds

Published online by Cambridge University Press: 09 June 2006

BOLETTE SANDFORD PEDERSEN
Affiliation:
Center for Sprogteknologi, University of Copenhagen, Njalsgade 80, DK-2300 S, Denmark e-mail: bolette@cst.dk

Abstract

In this paper we focus on a specific query expansion topic, namely search on Danish compounds and expansion to some of their synonymous phrases. Compounds constitute a specific issue in search, in particular in languages where they are written as one word, as is the case for Danish and the other Scandinavian languages. For such languages, expanding the query compound into separate lemmas is a way of finding the alternative, and often frequent, synonymous phrases in which the content of a compound can also be expressed. However, this expansion strategy generally yields a very high number of irrelevant hits. The aim of this paper is therefore to examine how we can obtain better search results on split compounds, partly by looking at the internal structure of the original compound, and partly by analyzing the context in which the split compound occurs. We pursue two hypotheses: (1) that some categories of compounds are more likely than others to have synonymous ‘split’ counterparts; and (2) that search results in which both search words (obtained by splitting the compound) occur in the same noun phrase are more likely to contain a phrase synonymous with the original compound query. The search results from 410 enhanced compound queries are used as a test bed for our experiments. On these search results, we perform a shallow linguistic analysis and introduce a new, linguistically based threshold for retrieved hits. The results demonstrate that compound splitting, combined with a shallow linguistic analysis focusing on the argument structure of the compound head and on the recognition of noun phrases (NPs), can improve search by substantially reducing the number of irrelevant hits.
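
As an illustration of hypothesis (2) only, the following minimal Python sketch keeps a hit when both lemmas obtained by splitting the compound fall inside the same noun phrase. It is not the paper's implementation. It assumes that each hit has already been lemmatized and chunked into noun phrases by some external tool, and the names `same_np_hit` and `filter_hits`, the data layout, and the example lemmas are all hypothetical.

```python
from typing import List

# Assumption: a hit is represented as a list of noun phrases,
# and each noun phrase is a list of lemmas (pre-chunked, pre-lemmatized).
NounPhrase = List[str]
Hit = List[NounPhrase]


def same_np_hit(hit: Hit, modifier: str, head: str) -> bool:
    """Return True if some noun phrase in the hit contains both lemmas."""
    return any(modifier in np and head in np for np in hit)


def filter_hits(hits: List[Hit], modifier: str, head: str) -> List[Hit]:
    """Keep only hits where the split lemmas co-occur in one noun phrase."""
    return [hit for hit in hits if same_np_hit(hit, modifier, head)]


# Illustrative data: a compound split into the lemmas "bil" and "salg".
hits = [
    [["salg", "af", "bil"]],           # both lemmas in the same noun phrase
    [["bil"], ["salg", "af", "hus"]],  # lemmas in different noun phrases
]
kept = filter_hits(hits, "bil", "salg")
```

The sketch covers neither hypothesis (1) nor the argument structure of the compound head, both of which the abstract names as part of the analysis.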

Type
Papers
Copyright
2006 Cambridge University Press
