Abstract
This paper describes μRaptor, a DOM-based method for extracting hCard microformats from HTML pages that have been stripped of microformat markup. μRaptor extracts DOM sub-trees, converts them into rules, and applies these rules to extract hCard microformats. In addition, we use co-occurring CSS classes to improve overall precision. Results on the training data show a precision of 0.96 and an F1 measure of 0.83 when only the most common tree patterns are considered. Furthermore, we propose adding constraint rules on the values of hCard elements to further improve the extraction.
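
The abstract reports precision and F1 but not recall. As a rough reading aid, the sketch below back-computes the recall those two figures imply, assuming F1 is the standard harmonic mean of precision and recall (the abstract does not state which definition is used). The function name `recall_from_precision_and_f1` is hypothetical and is not part of μRaptor.

```python
def recall_from_precision_and_f1(precision: float, f1: float) -> float:
    """Invert F1 = 2PR / (P + R) to recover the recall R.

    Assumption: F1 is the balanced harmonic mean of precision and recall.
    """
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for these values")
    return f1 * precision / denominator


# Figures reported in the abstract (training data, most common tree patterns).
implied_recall = recall_from_precision_and_f1(precision=0.96, f1=0.83)
print(round(implied_recall, 2))
```

Under that assumption the implied recall is roughly 0.73, which fits the abstract's note that only the most common tree patterns were considered: the rules are precise but miss some hCards.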
Original language | English |
---|---|
Pages (from-to) | 67-71 |
Number of pages | 5 |
Journal | CEUR Workshop Proceedings |
Volume | 1267 |
State | Published - 2014 |
Externally published | Yes |
Event | 2nd International Workshop on Linked Data for Information Extraction (LD4IE 2014), co-located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy. Duration: Oct 20 2014 → … |