μRaptor: A DOM-based system with appetite for hCard elements

Emir Muñoz, Luca Costabello, Pierre Yves Vandenbussche

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation

Abstract

This paper describes μRaptor, a DOM-based method for extracting hCard microformats from HTML pages that have been stripped of their microformat markup. μRaptor extracts DOM sub-trees, converts them into rules, and applies those rules to extract hCard microformats. In addition, we use co-occurring CSS classes to improve overall precision. Considering only the most common tree patterns, results on the training data show a precision of 0.96 and an F1 measure of 0.83. Furthermore, we propose adopting additional constraint rules on the values of hCard elements to further improve the extraction.
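
The abstract does not say how μRaptor represents its rules, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a rule is a tag path from a `class="vcard"` root to an element carrying an hCard property class, that the most frequent paths are kept, and that a rule fires on a stripped page when the path matches the tail of the current tag stack. The names `learn_rules`, `extract`, `HCARD_PROPS`, and `top_k` are hypothetical, the input is assumed to be well-formed HTML, and the co-occurring CSS class step is not covered.

```python
from collections import Counter
from html.parser import HTMLParser

HCARD_PROPS = {"fn", "tel", "email", "org", "adr"}  # illustrative subset only
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # no closing tag


class RuleLearner(HTMLParser):
    """Counts (tag path, property) pairs found inside class="vcard" roots."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.root_depth = None  # stack index of the enclosing vcard root
        self.rules = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        classes = set((dict(attrs).get("class") or "").split())
        if self.root_depth is None:
            if "vcard" in classes:
                self.root_depth = len(self.stack) - 1
            return
        path = tuple(self.stack[self.root_depth:])
        for prop in classes & HCARD_PROPS:
            self.rules[(path, prop)] += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        self.stack.pop()  # assumes well-formed, properly nested HTML
        if self.root_depth is not None and len(self.stack) <= self.root_depth:
            self.root_depth = None  # left the vcard sub-tree


class RuleExtractor(HTMLParser):
    """Applies learned tag-path rules to markup without microformat classes."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.stack = []
        self.capture = []  # open matches: (stack depth, property, text parts)
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        for path, prop in self.rules:
            if tuple(self.stack[-len(path):]) == path:
                self.capture.append((len(self.stack), prop, []))

    def handle_data(self, data):
        for _, _, parts in self.capture:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        # Close every match that was opened by the element now ending.
        while self.capture and self.capture[-1][0] == len(self.stack):
            _, prop, parts = self.capture.pop()
            text = "".join(parts).strip()
            if text:
                self.results.append((prop, text))
        self.stack.pop()


def learn_rules(pages, top_k=5):
    """Return the top_k most frequent (path, property) rules over all pages."""
    counts = Counter()
    for html in pages:
        learner = RuleLearner()
        learner.feed(html)
        learner.close()
        counts.update(learner.rules)
    return [rule for rule, _ in counts.most_common(top_k)]


def extract(html, rules):
    """Return (property, text) pairs for elements matched by any rule."""
    extractor = RuleExtractor(rules)
    extractor.feed(html)
    extractor.close()
    return extractor.results
```

Under these assumptions, rules learned from a marked-up training page such as `<div class="vcard"><span class="fn">…</span></div>` can be passed to `extract` together with a page whose class attributes have been removed. Keeping only the `top_k` most frequent paths mirrors the abstract's focus on the most common tree patterns, at the cost of missing rarer layouts.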

Original language: English
Pages (from-to): 67-71
Number of pages: 5
Journal: CEUR Workshop Proceedings
Volume: 1267
State: Published - 2014
Externally published: Yes
Event: 2nd International Workshop on Linked Data for Information Extraction, LD4IE 2014, Co-located with the 13th International Semantic Web Conference, ISWC 2014 - Riva del Garda, Italy
Duration: Oct 20 2014 → …
