Pattern-based unsupervised parsing method

Published online by Cambridge University Press:  04 June 2014

JESÚS SANTAMARÍA
Affiliation:
Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain email: jsant@lsi.uned.es, lurdes@lsi.uned.es
LOURDES ARAUJO
Affiliation:
Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain email: jsant@lsi.uned.es, lurdes@lsi.uned.es

Abstract

We have developed a heuristic method for unsupervised parsing of unrestricted text. Our method relies on detecting certain patterns of part-of-speech tag sequences of words in sentences. This detection is based on statistical data obtained from the corpus and allows us to classify part-of-speech tags into classes that play specific roles in the parse trees. These classes are then used to construct the parse tree of new sentences via a set of deterministic rules. Aiming to assess the viability of the method on different languages, we have tested it on English, Spanish, Italian, Hebrew, German, and Chinese. We have obtained a significant improvement over other unsupervised approaches for some languages, including English, and provided, as far as we know, the first results of this kind for others.

Type
Articles
Copyright
Copyright © Cambridge University Press 2014 
