Linked Data for Information Extraction 
Challenge 2014 
Tasks and Results 
Robert Meusel and Heiko Paulheim
Task 
Creation of an information extraction system that scrapes 
structured information from HTML web sites. 
 The training dataset was created from HTML pages that are 
annotated with Microformats hCard. 
 The data is a subset of the WebDataCommons Microformats 
Dataset. 
 The original data is provided by the Common Crawl Foundation, 
the largest publicly available collection of web crawls 
Linked Data for Information Extraction Challenge 2014 - Task and Results
The Common Crawl Foundation (CC) 
 Non-profit foundation dedicated to building and maintaining 
an open crawl of the Web 
 Nine crawl corpora, from 2008 to 2014, available so far 
 Crawling Strategies: 
• Earlier crawls used BFS (with link discovery), seeded with a large list of seeds 
ranked by PageRank; current crawls are gathered using a seed list of more than 
6 billion URLs from the blekko search index 
• As a result, all crawls represent the popular part of the Web 
 Data availability 
• CC provides three different datasets for each crawl 
• All data can be freely downloaded from AWS S3 
The WebDataCommons Project 
Extraction of Structured Data from the Common Crawl Corpora 
 Extracts information annotated with the markup languages 
Microformats, Microdata, and RDFa 
 So far, three datasets have been extracted, from the crawls of 2010, 
2012, and 2013 
RDFa 
Microdata 
Microformats 
Extracting the Data 
 Webmasters mark up their information directly within the HTML 
page, using one of the three markup languages 
 Using Any23 (http://any23.apache.org/), this information is 
extracted as RDF triples 
Any23 
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fußballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "€ 219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
… 
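Statements like the ones above can be read line by line with a small parser. The following is a minimal sketch, not Any23's own code: the regular expression covers only a simplified subset of N-Triples (blank nodes, IRIs, and plain literals with an optional language tag or datatype), and the names `TRIPLE_RE`, `parse_triple`, and `parse_triples` are made up for illustration.

```python
import re

# Simplified N-Triples pattern (assumption: one complete statement per line).
TRIPLE_RE = re.compile(
    r'^\s*(_:\w+|<[^>]*>)\s+'   # subject: blank node or IRI
    r'(<[^>]*>)\s+'             # predicate: IRI
    r'(_:\w+|<[^>]*>|"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'  # object
    r'\s*\.\s*$'                # terminating dot
)


def parse_triple(line):
    """Return (subject, predicate, object) or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


def parse_triples(text):
    """Parse every matching line of a multi-line string, skipping the rest."""
    triples = []
    for line in text.splitlines():
        triple = parse_triple(line)
        if triple is not None:
            triples.append(triple)
    return triples


if __name__ == "__main__":
    sample = '_:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .'
    print(parse_triple(sample))
```

For anything beyond a quick look at the data, a full RDF parser would be the safer choice, since this pattern ignores several parts of the N-Triples grammar.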
The Original Dataset of 2013 
 Over 1.7 million domains using at least one markup language 
 Over 17 billion quads with over 4 billion records (typed entities) 
 hCard the most dominant among domains 
Extraction of Challenge Dataset 
 Selected a subset of over 10k web pages from the corpus 
including over 450k extracted triples (annotated with MF hCard) 
• Training: 9 877 web pages / 373 501 triples 
• Test: 2 379 web pages / 85 248 triples 
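As a quick sanity check on the figures above, the split sizes can be added up and the training share computed. This is plain arithmetic on the numbers from the slide; the variable names are illustrative only.

```python
# Split sizes as stated on the slide.
train_pages, train_triples = 9877, 373501
test_pages, test_triples = 2379, 85248

total_pages = train_pages + test_pages        # 12256, i.e. "over 10k" pages
total_triples = train_triples + test_triples  # 458749, i.e. "over 450k" triples

# Share of the data that ends up in the training split.
train_share_pages = train_pages / total_pages
train_share_triples = train_triples / total_triples

if __name__ == "__main__":
    print(total_pages, total_triples)
    print(round(train_share_pages, 3), round(train_share_triples, 3))
```

Both shares come out at roughly four fifths, so the split is close to 80/20 by pages and by triples.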
Creation of the Gold Standard 
 Input: Annotated HTML Pages & Triples (extracted with Any23) 
 After extraction of triples, all hCard tags are replaced 
• Replacement by randomly generated tags 
• stable per page, but different across pages 
• Replacement of comments, since CMSs often add revealing comments such as 
<!-- here is the name of the company --> 
 Output 
• Training: 
• Annotated HTML Page 
• Cleaned HTML Page 
• Triples 
• Testing: 
• Cleaned HTML Page 
• Triples (not public) 
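The cleaning step described above could look roughly like the following sketch. It is not the organizers' actual script. It assumes that hCard markup appears as tokens in `class="..."` attributes, uses a small illustrative subset of hCard class names (`HCARD_CLASSES`), and simply drops comments, whereas the slide only says they are replaced. All function names are hypothetical.

```python
import random
import re
import string

# Assumption: a small, illustrative subset of hCard class names.
HCARD_CLASSES = {"vcard", "fn", "n", "org", "adr", "tel", "email", "url"}

COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
CLASS_ATTR_RE = re.compile(r'class="([^"]*)"')


def random_token(rng, length=8):
    """Generate a random lowercase replacement for one class name."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def clean_page(html, rng):
    """Remove comments and replace hCard class names on a single page."""
    mapping = {}  # created per call: stable within a page, new for each page

    def replace_classes(match):
        tokens = []
        for token in match.group(1).split():
            if token in HCARD_CLASSES:
                if token not in mapping:
                    mapping[token] = random_token(rng)
                token = mapping[token]
            tokens.append(token)
        return 'class="' + " ".join(tokens) + '"'

    without_comments = COMMENT_RE.sub("", html)
    return CLASS_ATTR_RE.sub(replace_classes, without_comments)


if __name__ == "__main__":
    page = ('<div class="vcard"><!-- here is the name of the company -->'
            '<span class="fn org">ACME</span></div>')
    print(clean_page(page, random.Random()))
```

Because the mapping is rebuilt on every call, the same hCard class gets the same replacement everywhere on one page, but (with overwhelming probability) a different one on the next page.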
Overview: Dataset Creation and Evaluation Process 
Evaluation 
 Methodology: A triple in the submission and a triple in the 
reference extraction (produced by Any23 from the original test 
HTML pages) are considered equal if they belong to the same page 
and have the same predicate and object. 
 Baseline: Assumes that each page contains at least one statement 
declaring a vCard: 
_:1 rdf:type hcard:Vcard . 
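The matching rule and the baseline can be sketched as follows. This is an illustration under stated assumptions, not the official evaluation script: statements are held as sets of `(predicate, object)` pairs per page (so duplicates are ignored), and counts are pooled over all pages (micro-averaging), which the slide does not specify. The names `evaluate` and `baseline_submission` are hypothetical.

```python
def evaluate(submitted, reference):
    """Compare two dicts mapping page id -> set of (predicate, object) pairs.

    Returns (precision, recall, f1), pooled over all pages.
    """
    true_pos = false_pos = false_neg = 0
    for page in set(submitted) | set(reference):
        sub = submitted.get(page, set())
        ref = reference.get(page, set())
        true_pos += len(sub & ref)   # same page, same predicate and object
        false_pos += len(sub - ref)  # submitted but not in the reference
        false_neg += len(ref - sub)  # in the reference but not submitted
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def baseline_submission(page_ids):
    """Emit exactly one vCard type statement for every page."""
    return {page: {("rdf:type", "hcard:Vcard")} for page in page_ids}


if __name__ == "__main__":
    reference = {
        "page1": {("rdf:type", "hcard:Vcard"), ("hcard:fn", '"ACME"')},
        "page2": {("rdf:type", "hcard:Vcard")},
    }
    print(evaluate(baseline_submission(reference), reference))
```

On this toy reference the baseline is fully precise but misses the name statement, which illustrates why a real system is expected to beat it mainly on recall.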
Challenge Results 
 We received one submission (which you will hear about in a few 
minutes) 
 The submission outperforms the baseline in recall and F-measure 
 The Gold Standard is not perfect: within the data, we also 
find names and other attributes without a type being given 
(whenever webmasters did not model it). Even a perfect 
extraction system would therefore not reach a precision of 1. 
Outlook: LD4IE Challenge 2015 
 Include more classes (e.g. Microdata and/or RDFa) 
 Add negative examples to generate a more realistic setting 
• currently, systems can assume that every test page contains something to extract 
• challenge: making sure that the negative examples contain no unmarked 
data 
 Improve the representativeness of the challenge dataset 
• Widespread CMSs automatically mark up articles, posts, etc. 
• Eliminate such bias, if present, in future challenges 
[Figure residue: HTML pages labeled MF:hCard, Microdata, and RDFa]