A tree-based learning approach for document structure analysis and its application to web search

F. CANAN PEMBE; TUNGA GÜNGÖR

doi:10.1017/S1351324914000023

A tree-based learning approach for document structure analysis and its application to web search

Published online by Cambridge University Press: 24 February 2014

F. CANAN PEMBE and

TUNGA GÜNGÖR

Show author details

F. CANAN PEMBE: Affiliation:
TÜBİTAK BİLGEM, 41470, Gebze, Kocaeli, Turkey e-mail: canan.pembe@tubitak.gov.tr
TUNGA GÜNGÖR: Affiliation:
Department of Computer Engineering, Boğaziçi University, İstanbul, 34342, Turkey e-mail: gungort@boun.edu.tr

Article contents

Abstract
References

Get access

Rights & Permissions

Abstract

In this paper, we study the problem of structural analysis of Web documents aiming at extracting the sectional hierarchy of a document. In general, a document can be represented as a hierarchy of sections and subsections with corresponding headings and subheadings. We developed two machine learning models: heading extraction model and hierarchy extraction model. Heading extraction was formulated as a classification problem whereas a tree-based learning approach was employed in hierarchy extraction. For this purpose, we developed an incremental learning algorithm based on support vector machines and perceptrons. The models were evaluated in detail with respect to the performance of the heading and hierarchy extraction tasks. For comparison, a baseline rule-based approach was used that relies on heuristics and HTML document object model tree processing. The machine learning approach, which is a fully automatic approach, outperformed the rule-based approach. We also analyzed the effect of document structuring on automatic summarization in the context of Web search. The results of the task-based evaluation on TREC queries showed that structured summaries are superior to unstructured summaries both in terms of accuracy and user ratings, and enable the users to determine the relevancy of search results more accurately than search engine snippets.

Type: Articles
Information: Natural Language Engineering , Volume 21 , Issue 4 , August 2015 , pp. 569 - 605

DOI: https://doi.org/10.1017/S1351324914000023 [Opens in a new window]
Copyright: Copyright © Cambridge University Press 2014

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Alpaydın, E., 2004. Introduction to Machine Learning. Cambridge, UK: MIT Press.Google Scholar

Baeza-Yates, R., and Ribeiro-Neto, B., 1999. Modern Information Retrieval. New York: Addison-Wesley.Google Scholar

Branavan, S. R. K., Deshpande, P., and Barzilay, R. 2007. Generating a table-of-contents. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 176–183, Prague, Czech Republic.Google Scholar

Brugger, R., Zramdini, A., and Ingold, R., 1997. Modeling documents for structure recognition using generalized n-grams. In Proceedings of International Conference on Document Analysis and Recognition, Ulm, Germany, pp. 56–60.Google Scholar

Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., 2001. Accordion summarization for end-game browsing on PDAs and cellular phones. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, USA, pp. 213–220.CrossRef Google Scholar

Buyukkokten, O., Kaljuvee, O., Garcia-Molina, H., Paepcke, A., and Winograd, T., 2002. Efficient web browsing on handheld devices using page and form summarization. ACM Transactions on Information Systems 20 (1): 82–115.Google Scholar

Chaudhuri, B. B., 2006. Digital Document Processing: Major Directions and Recent Advances. London: Springer.Google Scholar

Chawla, N. V. 2005. Data mining for imbalanced datasets: an overview. In Maimon, O., and Rokach, L. (eds.), Data Mining and Knowledge Discovery Handbook, pp. 853–67. New York: Springer.CrossRef Google Scholar

Chen, Y., Ma, W., and Zhang, H., 2003. Detecting web page structure for adaptive viewing on small form factor devices. In Proceedings of the 12th International World Wide Web Conference, New York, NY, USA, pp. 225–33.CrossRef Google Scholar

Cobra: Java HTML Renderer and Parser 2010. http://lobobrowser.org/cobra.jsp.Google Scholar

Collins, M., and Roark, B., 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 111–8.Google Scholar

Covington, M. A., 2001. A fundamental algorithm for dependency parsing. In Proceedings of ACM Southeast Conference, Athens, GA, USA, pp. 95–102.Google Scholar

Curran, J. R., and Wong, R. K., 1999. Transformation-based learning for automatic translation from HTML to XML. In Proceedings of the 4th Australasian Document Computing Symposium, Coffs Harbour, NSW, Australia, pp. 55–62.Google Scholar

Desmarais, F.-X., Gagnon, M., and Zouaq, A., 2012. Comparing a rule-based and a machine learning approach for semantic analysis. In Proceedings of the 6th International Conference on Advances in Semantic Processing, Barcelona, Spain, pp. 103–8.Google Scholar

Document Object Model 2012. http://www.w3.org/dom.Google Scholar

Feng, J., Haffner, P., and Gilbert, M., 2005. A learning approach to discovering web page semantic structures. In Proceedings of the 8th International Conference on Document Analysis and Recognition, Washington, DC, pp. 1055–9.Google Scholar

Ganganwar, V., 2012. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering 2 (4): 42–7.Google Scholar

Gupta, S., Kaiser, G., Neistadt, D., and Grimm, P., 2003. DOM-based content extraction of HTML documents. In Proceedings of the 12th International Conference on World Wide Web, New York, NY, USA, pp. 207–14.CrossRef Google Scholar

Gupta, S., Kaiser, G., and Stolfo, S., 2005. Extracting context to improve accuracy for HTML content extraction. In Proceedings of the 14th International Conference on World Wide Web, New York, NY, USA, pp. 1114–5.Google Scholar

Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 3rd ed.New York: Springer.CrossRef Google Scholar

Hobson, S. P., Dorr, B. J., Monz, C., and Schwartz, R., 2007. Task-based evaluation of text summarization using relevance prediction. Information Processing and Management 43 (6): 1482–99.CrossRef Google Scholar

HTML 4.01 Specification 1999. http://www.w3.org/TR/html401/.Google Scholar

Indurkhya, N., and Damerau, F. J. 2010. Handbook of Natural Language Processing, 2nd ed.Boca Raton, FL: CRC Press.Google Scholar

Ingwersen, P., and Jarvelin, K., 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Dordrecht: Springer.Google Scholar

Irmak, U., and Kraft, R., 2010. A scalable machine-learning approach for semi-structured named entity recognition. In Proceedings of the International World Wide Web Conference, New York, NY, USA, pp. 461–70.CrossRef Google Scholar

Jensen, S. H., Madsen, M., and Moller, A., 2011. Modeling the HTML DOM and browser API in static analysis of JavaScript web applications. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, Szeged, Hungary, pp. 59–69.CrossRef Google Scholar

Joachims, T. 1999. Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A. (eds.), Advances in Kernel Methods - Support Vector Learning. Cambridge, UK: MIT Press.Google Scholar

Joachims, T., 2002. Learning to Classify Text Using Support Vector Machines. Boston: Kluwer Academic Publishers.CrossRef Google Scholar

Kao, H.-Y., Ho, J.-M., and Chen, M.-S., 2004. DOMISA: DOM-based information space adsorption for web information hierarchy mining. In Proceedings of the 4th SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, pp. 312–20.Google Scholar

Klink, S., Dengel, A., and Kieninger, T., 2000. Document structure analysis based on layout and textual features. In Proceedings of the International Workshop on Document Analysis Systems, Kaiserslautern, Germany, pp. 99–111.Google Scholar

Le, D. X., and Thoma, G. R., 2003. Automated document labeling for Web-based online medical journals. In Proceedings of the 7th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, Florida, USA, pp. 411–15.Google Scholar

Li, Y., Wang, L., Wang, J., Yue, J., and Zhao, M., 2013. An approach of web page information extraction. In Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering, Hangzhou, China, pp. 2217–9.Google Scholar

Liu, Y., Wang, Q., and Wang, QX., 2006. A heuristic approach for topical information extraction from news pages. In Proceedings of the 7th International Conference on Web Information Systems Engineering, Wuhan, China, pp. 357–62.Google Scholar

Mandhani, B., and Meila, M., 2009. Tractable search for learning exponential models of rankings. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater, Florida, USA, pp. 392–9.Google Scholar

Mani, I., 2001. Automatic Summarisation. Amsterdam: John Benjamins.CrossRef Google Scholar

Mani, I., Klein, G., House, D., Hirschman, L., Firmin, T., and Sundheim, B., 2002. SUMMAC: a text summarization evaluation. Natural Language Engineering 8 (1): 43–68.CrossRef Google Scholar

Mao, S., Rosenfeld, A., and Kanungo, T., 2003. Document structure analysis algorithms: a literature survey. In Proceedings of SPIE Electronic Imaging, Santa Clara, California, USA, pp. 197–207.Google Scholar

Markey, K., 2007. Twenty-five years of end-user searching, Part 1: research findings. Journal of the American Society for Information Science and Technology 58 (8): 1071–81.CrossRef Google Scholar

Mayfield, J., McNamee, P., Piatko, C., and Pearce, C., 2003. Lattice-based tagging using support vector machines. In Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, Los Angeles, USA, pp. 303–8.Google Scholar

Mukherjee, S., Yang, G., Tan, W., and Ramakrishnan, I. V., 2003. Automatic discovery of semantic structures in HTML documents. In Proceedings of the 7th International Conference on Document Analysis and Recognition, Edinburgh, UK, pp. 245–9.Google Scholar

Niyogi, D., and Srihari, S. N., 1995. Knowledge-based derivation of document logical structure. In Proceedings of the International Conference on Document Analysis and Recognition, Montreal, Canada, pp. 472–5.CrossRef Google Scholar

Pembe, F. C., and Güngör, T., 2009. Structure-preserving and query-biased document summarisation for web searching. Online Information Review 33 (4): 696–719.CrossRef Google Scholar

Pinto, D., McCallum, A., Wei, X., and Croft, W. B., 2003. Table extraction using conditional random fields. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 235–42.Google Scholar

Platt, J. C. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A., Bartlett, P., Scholkopf, B., and Schuurmans, D. (eds.), Advances in Large Margin Classifiers, pp. 61–74. Cambridge, UK: MIT Press.Google Scholar

Rahman, A. F. R., Alam, H., and Hartono, R. 2001. Content extraction from HTML documents. In Proceedings of the 1st International Workshop on Web Document Analysis. Seattle, Washington, USA.Google Scholar

Shilman, M., Liang, P., and Viola, P., 2005. Learning non-generative grammatical models for document analysis. In Proceedings of the 10th IEEE International Conference on Computer Vision, Beijing, China, pp. 962–9.Google Scholar

Text Retrieval Conference 2010. http://trec.nist.gov.Google Scholar

Tombros, A., and Sanderson, M., 1998. Advantages of query biased summaries in information retrieval. In Proceedings of the 21th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 2–10.Google Scholar

Weiss, G. M., McCarthy, K., and Zabar, B. 2007. Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In Proceedings of the International Conference on Data Mining, Las Vegas, Nevada, USA, pp. 35–41.Google Scholar

White, R. W., Jose, J. M., and Ruthven, I., 2003. A task-oriented study on the influencing effects of query-biased summarization in web searching. Information Processing and Management 39 (5): 707–33.CrossRef Google Scholar

Xiao, X., Luo, Q., Hong, D., Fu, H., Xie, X., and Ma, W-Y. 2009. Browsing on small displays by transforming web pages into hierarchically structured subpages. ACM Transactions on the Web 3 (1): 4:1–4:36.CrossRef Google Scholar

Xue, Y., Hu, Y., Xin, G., Song, R., Shi, S., Cao, Y., Lin, C. Y., and Li, H., 2007. Web page title extraction and its application. Information Processing and Management 43 (5): 1332–47.CrossRef Google Scholar

Yang, C. C., and Wang, F. L., 2008. Hierarchical summarization of large documents. Journal of the American Society for Information Science and Technology 59 (6): 887–902.CrossRef Google Scholar

Yang, Y., and Zhang, H. J., 2001. HTML page analysis based on visual cues. In Proceedings of the 6th International Conference on Document Analysis and Recognition, Washington, USA, p. 859.Google Scholar

Article contents

A tree-based learning approach for document structure analysis and its application to web search

Abstract

Access options

References

Save article to Kindle

Save article to Dropbox

Save article to Google Drive

Reply to: Submit a response

Your details

You have entered the maximum number of contributors

Conflicting interests