
Splitting-merging model of Chinese word tokenization and segmentation

Published online by Cambridge University Press:  01 December 1998

YUAN YAO
Affiliation:
Department of Information Systems & Computer Science, National University of Singapore, Lower Kent Ridge Road, Singapore 119260, e-mail: yaoyuan@iscs.nw.edu.sg
KIM TEN LUA
Affiliation:
Department of Information Systems & Computer Science, National University of Singapore, Lower Kent Ridge Road, Singapore 119260, e-mail: luakt@iscs.nw.edu.sg

Abstract

Word tokenization and segmentation remain active research problems in natural language processing, especially for languages such as Chinese that use no white space to delimit words. Three major problems must be addressed: (1) tokenizing direction and efficiency; (2) incomplete tokenization dictionaries and new (unknown) words; and (3) ambiguity in tokenization and segmentation. Most existing tokenization and segmentation methods do not deal with these three problems together. To tackle all three at once, this paper presents a novel dictionary-based method, the Splitting-Merging Model (SMM), for Chinese word tokenization and segmentation. It uses the mutual information of Chinese characters to find the boundaries and non-boundaries of Chinese words, and it produces a word segmentation by resolving ambiguities and detecting new words.
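To illustrate the core idea of using character mutual information to decide where to split and where to merge, the following Python sketch estimates pointwise mutual information (PMI) for adjacent character pairs from a corpus and places word boundaries where the PMI is low. The corpus, smoothing, and threshold choices are assumptions for illustration; this is not the paper's SMM implementation.

```python
import math
from collections import Counter

def train_pmi(corpus_sentences):
    """Count character unigrams and adjacent-character bigrams, then
    return a function giving the PMI of a character pair.
    Corpus and handling of unseen pairs are illustrative assumptions."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        p_ab = bigrams.get((a, b), 0) / n_bi if n_bi else 0.0
        p_a = unigrams.get(a, 0) / n_uni
        p_b = unigrams.get(b, 0) / n_uni
        if p_ab == 0 or p_a == 0 or p_b == 0:
            return float("-inf")  # unseen pair: treat as a likely boundary
        return math.log2(p_ab / (p_a * p_b))

    return pmi

def segment(sentence, pmi, threshold=1.0):
    """Split between two characters when their PMI falls below the
    threshold; otherwise merge them into the same word.
    The threshold value is an assumed parameter."""
    words, current = [], sentence[0]
    for a, b in zip(sentence, sentence[1:]):
        if pmi(a, b) < threshold:
            words.append(current)
            current = b
        else:
            current += b
    words.append(current)
    return words
```

In use, `train_pmi` would be fitted on a large segmented or raw character corpus, and `segment` would then be applied to unsegmented input; the SMM described in the paper additionally combines such statistics with a dictionary to resolve ambiguities and detect new words.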

Type
Research Article
Copyright
© 1998 Cambridge University Press
