Mostly-unsupervised statistical segmentation of Japanese kanji sequences

RIE KUBOTA ANDO; LILLIAN LEE

doi:10.1017/S1351324902002954

Published online by Cambridge University Press: 04 August 2003

RIE KUBOTA ANDO: IBM Thomas J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA. E-mail: rie1@us.ibm.com
LILLIAN LEE: Department of Computer Science, Cornell University, Ithaca, NY 14853-7501, USA. E-mail: llee@cs.cornell.edu

Abstract

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese. Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously.

Type: Research Article
Information: Natural Language Engineering, Volume 9, Issue 2, June 2003, pp. 127-149
