Pattern-based unsupervised parsing method

JESÚS SANTAMARÍA; LOURDES ARAUJO

doi:10.1017/S1351324914000072

Published online by Cambridge University Press: 04 June 2014

Affiliation (both authors): Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Madrid 28040, Spain
Email: jsant@lsi.uned.es, lurdes@lsi.uned.es

Abstract

We have developed a heuristic method for unsupervised parsing of unrestricted text. Our method relies on detecting certain patterns of part-of-speech tag sequences of words in sentences. This detection is based on statistical data obtained from the corpus and allows us to classify part-of-speech tags into classes that play specific roles in the parse trees. These classes are then used to construct the parse tree of new sentences via a set of deterministic rules. Aiming to assess the viability of the method on different languages, we have tested it on English, Spanish, Italian, Hebrew, German, and Chinese. We have obtained a significant improvement over other unsupervised approaches for some languages, including English, and provided, as far as we know, the first results of this kind for others.
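As a rough illustration of the kind of corpus statistics the abstract refers to, the sketch below counts contiguous part-of-speech tag sequences in a tagged corpus. It is not the authors' algorithm: the function name `tag_ngram_counts`, the toy sentences, and the Penn-style tags are all assumptions made for this example, and the abstract does not say which statistics or pattern definitions the method actually uses.

```python
from collections import Counter


def tag_ngram_counts(tagged_sentences, n=2):
    """Count every contiguous run of n POS tags in a tagged corpus.

    tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns a Counter mapping tag tuples to their corpus frequency.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        # Slide a window of width n over the tag sequence.
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


# Hypothetical toy corpus (invented sentences, illustrative tags only).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("the", "DT"), ("old", "JJ"), ("dog", "NN"), ("sleeps", "VBZ")],
]

bigrams = tag_ngram_counts(corpus, n=2)
trigrams = tag_ngram_counts(corpus, n=3)

# Frequent tag sequences are candidates for recurring patterns.
for pattern, freq in bigrams.most_common(2):
    print(pattern, freq)
```

Frequency counts of this kind are only a starting point; how the counts would be turned into tag classes and deterministic parsing rules is described in the full article, not in the abstract.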

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 3, May 2016, pp. 397–422

DOI: https://doi.org/10.1017/S1351324914000072
Copyright: © Cambridge University Press 2014
