Linked Data for Information Extraction Challenge - Tasks and Results @ ISWC 2014

Linked Data for Information Extraction
Challenge 2014
Tasks and Results
Robert Meusel and Heiko Paulheim

2
Task
Creation of an information extraction system that scrapes
structured information from HTML web sites.
 The training dataset was created from HTML pages that are
annotated with Microformats hCard.
 The data is a subset of the WebDataCommons Microformats
Dataset.
 The original data is provided by the Common Crawl Foundation,
the largest publicly available collection of web crawls

3
The Common Crawl Foundation (CC)
 Non-profit foundation dedicated to building and maintaining
an open crawl of the Web
 9 crawl corpora from 2008 till 2014 available so far
 Crawling Strategies:
• Earlier crawls used BFS (with link discovery), seeded with a large list of seeds
ranked by PageRank; current crawls are gathered using a seed list of more than
6 billion URLs from the blekko search index
• As a result, all crawls represent the popular part of the Web
 Data availability
• CC provides three different datasets for each crawl
• All data can be freely downloaded from AWS S3

4
The WebDataCommons Project
Extraction of Structured Data from the Common Crawl Corpora
 Extracts information annotated with the markup languages
Microformats, Microdata and RDFa
 So far, three different datasets have been gathered from the crawls
of 2010, 2012, and 2013
RDFa
Microdata
Microformats

5
Extracting the Data
 Webmasters mark up their information directly within the HTML
page using one of the three markup languages
 Using Any23 (http://any23.apache.org/), this information is
extracted as RDF triples
Any23
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Product> .
_:node1 <http://schema.org/Product/name> "Predator Instinct FG Fu\u00DFballschuh"@de .
_:node1 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://schema.org/Offer> .
_:node1 <http://schema.org/Offer/price> "\u20AC 219,95"@de .
_:node1 <http://schema.org/Offer/priceCurrency> "EUR"@de .
…
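To make the markup-to-triples step more concrete, here is a minimal sketch in Python. It is not how Any23 works internally; it is a stand-in that uses only the standard library and handles a single, flat Microdata item. The class name `MicrodataSketch`, the function `extract_triples`, and the fixed blank node label `_:node1` are assumptions made for illustration.

```python
from html.parser import HTMLParser

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


class MicrodataSketch(HTMLParser):
    """Collects (subject, predicate, object) tuples for ONE flat Microdata item."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.item_type = None     # value of the itemtype attribute, once seen
        self.pending_prop = None  # itemprop still waiting for its text content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs and "itemtype" in attrs:
            self.item_type = attrs["itemtype"]
            self.triples.append(("_:node1", RDF_TYPE, f"<{self.item_type}>"))
        if "itemprop" in attrs and self.item_type:
            self.pending_prop = attrs["itemprop"]

    def handle_data(self, data):
        text = data.strip()
        if self.pending_prop and text:
            # Predicate naming (type URL + "/" + property) mirrors the slide's example.
            predicate = f"<{self.item_type}/{self.pending_prop}>"
            self.triples.append(("_:node1", predicate, f'"{text}"'))
            self.pending_prop = None


def extract_triples(html):
    """Hypothetical helper: parse an HTML string and return the collected triples."""
    parser = MicrodataSketch()
    parser.feed(html)
    parser.close()
    return parser.triples


if __name__ == "__main__":
    page = ('<div itemscope itemtype="http://schema.org/Product">'
            '<span itemprop="name">Predator Instinct FG</span></div>')
    for s, p, o in extract_triples(page):
        print(s, p, o, ".")
```

A real extractor also has to deal with nested items, language tags, attribute-valued properties, and the other two markup languages, all of which this sketch leaves out.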

6
The Original Dataset of 2013
 Over 1.7 million domains using at least one markup language
 Over 17 billion quads with over 4 billion records (typed entities)
 hCard is the most widely used format among domains

7
Extraction of Challenge Dataset
 Selected a subset of over 10k web pages from the corpus
including over 450k extracted triples (annotated with MF hCard)
• Training: 9 877 web pages / 373 501 triples
• Test: 2 379 web pages / 85 248 triples

8
Creation of the Gold Standard
 Input: Annotated HTML Pages & Triples (extracted with Any23)
 After the triples are extracted, all hCard tags are replaced
• Replacement by randomly generated tags
• Stable within a page, but different across pages
• Replacement of comments, as CMSs like to add comments such as
<!-- here is the name of the company -->
 Output
• Training: annotated HTML page, cleaned HTML page, triples
• Testing: cleaned HTML page, triples (not public)
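The cleaning step described on this slide could look roughly like the following Python sketch. The slides do not show the actual implementation, so everything here is an assumption: the function name `clean_page`, the small subset of hCard class names, the restriction to double-quoted `class` attributes, and the choice to drop comments entirely rather than replace them with something else.

```python
import random
import re
import string

# Assumption: a small subset of hCard class names, for illustration only.
HCARD_CLASSES = {"vcard", "fn", "org", "adr", "tel", "email", "url"}

# Assumption: class attributes use double quotes; real pages are messier.
CLASS_ATTR = re.compile(r'(class\s*=\s*")([^"]*)(")')
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def clean_page(html, rng):
    """Return (cleaned_html, mapping) for one page.

    The mapping is created fresh per call, so a replacement is stable
    within a page but differs across pages.
    """
    mapping = {}

    def rename(name):
        if name not in HCARD_CLASSES:
            return name  # non-hCard classes are left untouched
        if name not in mapping:
            mapping[name] = "".join(
                rng.choice(string.ascii_lowercase) for _ in range(8)
            )
        return mapping[name]

    def replace_classes(match):
        renamed = " ".join(rename(n) for n in match.group(2).split())
        return match.group(1) + renamed + match.group(3)

    without_comments = COMMENT.sub("", html)
    return CLASS_ATTR.sub(replace_classes, without_comments), mapping


if __name__ == "__main__":
    page = ('<div class="vcard"><!-- here is the name of the company -->'
            '<span class="fn org">ACME</span></div>')
    cleaned, mapping = clean_page(page, random.Random())
    print(cleaned)
    print(mapping)
```

Passing the random generator in as a parameter keeps the sketch reproducible when a seeded `random.Random` is supplied.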

9
Overview: Dataset Creation and Evaluation Process

10
Evaluation
 Methodology: A triple extracted by the submission and a triple
extracted by Any23 from the original test HTML page are
considered equal if they have the same predicate and object
for the same page.
 Baseline: For each page, a single statement declaring that
there is one vCard
_:1 rdf:type hcard:Vcard .
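The matching rule and the baseline can be sketched in Python as follows. The slides only define when two triples count as equal; the rest is assumed for illustration: the function names `evaluate` and `baseline`, the set semantics (duplicate predicate/object pairs on a page count once), and micro-averaging over all pages.

```python
def evaluate(gold, predicted):
    """Compare triples per page on (predicate, object) only; subjects are ignored.

    gold, predicted: dict mapping page id -> list of (subject, predicate, object).
    Returns (precision, recall, f1), micro-averaged over pages (an assumption).
    """
    tp = fp = fn = 0
    for page in set(gold) | set(predicted):
        gold_pairs = {(p, o) for _, p, o in gold.get(page, ())}
        pred_pairs = {(p, o) for _, p, o in predicted.get(page, ())}
        tp += len(gold_pairs & pred_pairs)
        fp += len(pred_pairs - gold_pairs)
        fn += len(gold_pairs - pred_pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def baseline(page_ids):
    """One statement per page, declaring that the page contains one vCard."""
    return {page: [("_:1", "rdf:type", "hcard:Vcard")] for page in page_ids}


if __name__ == "__main__":
    # Made-up toy data, not taken from the challenge dataset.
    gold = {
        "page1": [("_:a", "rdf:type", "hcard:Vcard"), ("_:a", "hcard:fn", "ACME")],
        "page2": [("_:b", "rdf:type", "hcard:Vcard"), ("_:b", "hcard:fn", "Bob"),
                  ("_:b", "hcard:tel", "123")],
    }
    print(evaluate(gold, baseline(gold)))
```

On this toy data the baseline reaches a precision of 1.0 but a recall of only 0.4, which illustrates why a system that extracts more than the type statement can beat it on recall and F-measure.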

11
Challenge Results
 We received one submission (which you will learn about in a few
minutes)
 The submission outperforms the baseline for Recall and F-Measure
 The Gold Standard is not perfect: within the data, we also
find names and other attributes without a type
(whenever webmasters did not model this). Even a perfect
extraction system would therefore not reach a precision of 1.

12
Outlook: LD4IE Challenge 2015
 Include more classes (e.g. Microdata and/or RDFa)
 Add negative examples to generate a more realistic setting
• As of today, systems can assume there is something to extract in every test sample
• The challenge is making sure that the negative examples do not include any
unmarked data
 Improve representativeness of the challenge dataset
• Widespread CMSs automatically allow marking up of articles, posts, etc.
• Eliminate such bias, if present, in future challenges
