Hostname: page-component-8448b6f56d-xtgtn Total loading time: 0 Render date: 2024-04-24T10:09:17.743Z Has data issue: false hasContentIssue false

A tree-based learning approach for document structure analysis and its application to web search

Published online by Cambridge University Press:  24 February 2014

F. CANAN PEMBE
Affiliation:
TÜBİTAK BİLGEM, 41470, Gebze, Kocaeli, Turkey e-mail: canan.pembe@tubitak.gov.tr
TUNGA GÜNGÖR
Affiliation:
Department of Computer Engineering, Boğaziçi University, İstanbul, 34342, Turkey e-mail: gungort@boun.edu.tr

Abstract

In this paper, we study the problem of structural analysis of Web documents aiming at extracting the sectional hierarchy of a document. In general, a document can be represented as a hierarchy of sections and subsections with corresponding headings and subheadings. We developed two machine learning models: heading extraction model and hierarchy extraction model. Heading extraction was formulated as a classification problem whereas a tree-based learning approach was employed in hierarchy extraction. For this purpose, we developed an incremental learning algorithm based on support vector machines and perceptrons. The models were evaluated in detail with respect to the performance of the heading and hierarchy extraction tasks. For comparison, a baseline rule-based approach was used that relies on heuristics and HTML document object model tree processing. The machine learning approach, which is a fully automatic approach, outperformed the rule-based approach. We also analyzed the effect of document structuring on automatic summarization in the context of Web search. The results of the task-based evaluation on TREC queries showed that structured summaries are superior to unstructured summaries both in terms of accuracy and user ratings, and enable the users to determine the relevancy of search results more accurately than search engine snippets.

Type
Articles
Copyright
Copyright © Cambridge University Press 2014 

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Alpaydın, E., 2004. Introduction to Machine Learning. Cambridge, UK: MIT Press.Google Scholar
Baeza-Yates, R., and Ribeiro-Neto, B., 1999. Modern Information Retrieval. New York: Addison-Wesley.Google Scholar
Branavan, S. R. K., Deshpande, P., and Barzilay, R. 2007. Generating a table-of-contents. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 176183, Prague, Czech Republic.Google Scholar
Brugger, R., Zramdini, A., and Ingold, R., 1997. Modeling documents for structure recognition using generalized n-grams. In Proceedings of International Conference on Document Analysis and Recognition, Ulm, Germany, pp. 5660.Google Scholar
Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., 2001. Accordion summarization for end-game browsing on PDAs and cellular phones. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, New York, NY, USA, pp. 213220.CrossRefGoogle Scholar
Buyukkokten, O., Kaljuvee, O., Garcia-Molina, H., Paepcke, A., and Winograd, T., 2002. Efficient web browsing on handheld devices using page and form summarization. ACM Transactions on Information Systems 20 (1): 82115.Google Scholar
Chaudhuri, B. B., 2006. Digital Document Processing: Major Directions and Recent Advances. London: Springer.Google Scholar
Chawla, N. V. 2005. Data mining for imbalanced datasets: an overview. In Maimon, O., and Rokach, L. (eds.), Data Mining and Knowledge Discovery Handbook, pp. 853–67. New York: Springer.CrossRefGoogle Scholar
Chen, Y., Ma, W., and Zhang, H., 2003. Detecting web page structure for adaptive viewing on small form factor devices. In Proceedings of the 12th International World Wide Web Conference, New York, NY, USA, pp. 225–33.CrossRefGoogle Scholar
Cobra: Java HTML Renderer and Parser 2010. http://lobobrowser.org/cobra.jsp.Google Scholar
Collins, M., and Roark, B., 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 111–8.Google Scholar
Covington, M. A., 2001. A fundamental algorithm for dependency parsing. In Proceedings of ACM Southeast Conference, Athens, GA, USA, pp. 95102.Google Scholar
Curran, J. R., and Wong, R. K., 1999. Transformation-based learning for automatic translation from HTML to XML. In Proceedings of the 4th Australasian Document Computing Symposium, Coffs Harbour, NSW, Australia, pp. 5562.Google Scholar
Desmarais, F.-X., Gagnon, M., and Zouaq, A., 2012. Comparing a rule-based and a machine learning approach for semantic analysis. In Proceedings of the 6th International Conference on Advances in Semantic Processing, Barcelona, Spain, pp. 103–8.Google Scholar
Document Object Model 2012. http://www.w3.org/dom.Google Scholar
Feng, J., Haffner, P., and Gilbert, M., 2005. A learning approach to discovering web page semantic structures. In Proceedings of the 8th International Conference on Document Analysis and Recognition, Washington, DC, pp. 1055–9.Google Scholar
Ganganwar, V., 2012. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering 2 (4): 42–7.Google Scholar
Gupta, S., Kaiser, G., Neistadt, D., and Grimm, P., 2003. DOM-based content extraction of HTML documents. In Proceedings of the 12th International Conference on World Wide Web, New York, NY, USA, pp. 207–14.CrossRefGoogle Scholar
Gupta, S., Kaiser, G., and Stolfo, S., 2005. Extracting context to improve accuracy for HTML content extraction. In Proceedings of the 14th International Conference on World Wide Web, New York, NY, USA, pp. 1114–5.Google Scholar
Hastie, T., Tibshirani, R., and Friedman, J. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 3rd ed.New York: Springer.CrossRefGoogle Scholar
Hobson, S. P., Dorr, B. J., Monz, C., and Schwartz, R., 2007. Task-based evaluation of text summarization using relevance prediction. Information Processing and Management 43 (6): 1482–99.CrossRefGoogle Scholar
HTML 4.01 Specification 1999. http://www.w3.org/TR/html401/.Google Scholar
Indurkhya, N., and Damerau, F. J. 2010. Handbook of Natural Language Processing, 2nd ed.Boca Raton, FL: CRC Press.Google Scholar
Ingwersen, P., and Jarvelin, K., 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Dordrecht: Springer.Google Scholar
Irmak, U., and Kraft, R., 2010. A scalable machine-learning approach for semi-structured named entity recognition. In Proceedings of the International World Wide Web Conference, New York, NY, USA, pp. 461–70.CrossRefGoogle Scholar
Jensen, S. H., Madsen, M., and Moller, A., 2011. Modeling the HTML DOM and browser API in static analysis of JavaScript web applications. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, Szeged, Hungary, pp. 5969.CrossRefGoogle Scholar
Joachims, T. 1999. Making large-scale SVM learning practical. In Schölkopf, B., Burges, C., and Smola, A. (eds.), Advances in Kernel Methods - Support Vector Learning. Cambridge, UK: MIT Press.Google Scholar
Joachims, T., 2002. Learning to Classify Text Using Support Vector Machines. Boston: Kluwer Academic Publishers.CrossRefGoogle Scholar
Kao, H.-Y., Ho, J.-M., and Chen, M.-S., 2004. DOMISA: DOM-based information space adsorption for web information hierarchy mining. In Proceedings of the 4th SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, pp. 312–20.Google Scholar
Klink, S., Dengel, A., and Kieninger, T., 2000. Document structure analysis based on layout and textual features. In Proceedings of the International Workshop on Document Analysis Systems, Kaiserslautern, Germany, pp. 99111.Google Scholar
Le, D. X., and Thoma, G. R., 2003. Automated document labeling for Web-based online medical journals. In Proceedings of the 7th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, Florida, USA, pp. 411–15.Google Scholar
Li, Y., Wang, L., Wang, J., Yue, J., and Zhao, M., 2013. An approach of web page information extraction. In Proceedings of the 2nd International Conference on Computer Science and Electronics Engineering, Hangzhou, China, pp. 2217–9.Google Scholar
Liu, Y., Wang, Q., and Wang, QX., 2006. A heuristic approach for topical information extraction from news pages. In Proceedings of the 7th International Conference on Web Information Systems Engineering, Wuhan, China, pp. 357–62.Google Scholar
Mandhani, B., and Meila, M., 2009. Tractable search for learning exponential models of rankings. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, Clearwater, Florida, USA, pp. 392–9.Google Scholar
Mani, I., 2001. Automatic Summarisation. Amsterdam: John Benjamins.CrossRefGoogle Scholar
Mani, I., Klein, G., House, D., Hirschman, L., Firmin, T., and Sundheim, B., 2002. SUMMAC: a text summarization evaluation. Natural Language Engineering 8 (1): 4368.CrossRefGoogle Scholar
Mao, S., Rosenfeld, A., and Kanungo, T., 2003. Document structure analysis algorithms: a literature survey. In Proceedings of SPIE Electronic Imaging, Santa Clara, California, USA, pp. 197207.Google Scholar
Markey, K., 2007. Twenty-five years of end-user searching, Part 1: research findings. Journal of the American Society for Information Science and Technology 58 (8): 1071–81.CrossRefGoogle Scholar
Mayfield, J., McNamee, P., Piatko, C., and Pearce, C., 2003. Lattice-based tagging using support vector machines. In Proceedings of the 12th International Conference on Information and Knowledge Management, New Orleans, Los Angeles, USA, pp. 303–8.Google Scholar
Mukherjee, S., Yang, G., Tan, W., and Ramakrishnan, I. V., 2003. Automatic discovery of semantic structures in HTML documents. In Proceedings of the 7th International Conference on Document Analysis and Recognition, Edinburgh, UK, pp. 245–9.Google Scholar
Niyogi, D., and Srihari, S. N., 1995. Knowledge-based derivation of document logical structure. In Proceedings of the International Conference on Document Analysis and Recognition, Montreal, Canada, pp. 472–5.CrossRefGoogle Scholar
Pembe, F. C., and Güngör, T., 2009. Structure-preserving and query-biased document summarisation for web searching. Online Information Review 33 (4): 696719.CrossRefGoogle Scholar
Pinto, D., McCallum, A., Wei, X., and Croft, W. B., 2003. Table extraction using conditional random fields. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, pp. 235–42.Google Scholar
Platt, J. C. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Smola, A., Bartlett, P., Scholkopf, B., and Schuurmans, D. (eds.), Advances in Large Margin Classifiers, pp. 6174. Cambridge, UK: MIT Press.Google Scholar
Rahman, A. F. R., Alam, H., and Hartono, R. 2001. Content extraction from HTML documents. In Proceedings of the 1st International Workshop on Web Document Analysis. Seattle, Washington, USA.Google Scholar
Shilman, M., Liang, P., and Viola, P., 2005. Learning non-generative grammatical models for document analysis. In Proceedings of the 10th IEEE International Conference on Computer Vision, Beijing, China, pp. 962–9.Google Scholar
Text Retrieval Conference 2010. http://trec.nist.gov.Google Scholar
Tombros, A., and Sanderson, M., 1998. Advantages of query biased summaries in information retrieval. In Proceedings of the 21th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 210.Google Scholar
Weiss, G. M., McCarthy, K., and Zabar, B. 2007. Cost-sensitive learning vs. sampling: which is best for handling unbalanced classes with unequal error costs? In Proceedings of the International Conference on Data Mining, Las Vegas, Nevada, USA, pp. 3541.Google Scholar
White, R. W., Jose, J. M., and Ruthven, I., 2003. A task-oriented study on the influencing effects of query-biased summarization in web searching. Information Processing and Management 39 (5): 707–33.CrossRefGoogle Scholar
Xiao, X., Luo, Q., Hong, D., Fu, H., Xie, X., and Ma, W-Y. 2009. Browsing on small displays by transforming web pages into hierarchically structured subpages. ACM Transactions on the Web 3 (1): 4:1–4:36.CrossRefGoogle Scholar
Xue, Y., Hu, Y., Xin, G., Song, R., Shi, S., Cao, Y., Lin, C. Y., and Li, H., 2007. Web page title extraction and its application. Information Processing and Management 43 (5): 1332–47.CrossRefGoogle Scholar
Yang, C. C., and Wang, F. L., 2008. Hierarchical summarization of large documents. Journal of the American Society for Information Science and Technology 59 (6): 887902.CrossRefGoogle Scholar
Yang, Y., and Zhang, H. J., 2001. HTML page analysis based on visual cues. In Proceedings of the 6th International Conference on Document Analysis and Recognition, Washington, USA, p. 859.Google Scholar