Common Crawl: An Open Repository of Web Data

London HUG
Common Crawl: An Open Repository of Web Data
Lisa Green
10 October 2012

What Does The Data World Mean to Society?
Lisa Green
1 October 2012

Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg

Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg

Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg

Still Nascent
• Even cheaper storage
• Even cheaper compute
• Education
• Open Data

Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)

Gratis

Proprietary Libre

Commercial

Progress

Insight

Analysis

Data

Common Crawl Data
• ~8 Billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
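The two headline figures above imply a rough average size per page. A minimal sketch of that arithmetic follows, assuming decimal terabytes (10^12 bytes) and treating both figures as approximate. The function name is illustrative only, and the slide does not say whether the ~120 TB also covers the metadata and text files, so the result is at best an order-of-magnitude estimate.

```python
def average_bytes_per_page(total_bytes, page_count):
    """Return the mean number of bytes stored per crawled page."""
    return total_bytes / page_count


# Figures from the slide, both approximate.
TOTAL_BYTES = 120e12  # ~120 TB, assuming 1 TB = 10**12 bytes
PAGE_COUNT = 8e9      # ~8 billion web pages

avg = average_bytes_per_page(TOTAL_BYTES, PAGE_COUNT)
print(f"~{avg / 1e3:.0f} KB per page")  # prints "~15 KB per page"
```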

ARC Files - Raw Content

Metadata
• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks
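To make the metadata list above concrete, here is a minimal sketch of reading one JSON metadata record. Every field name in the sample record is a hypothetical stand-in chosen to mirror the bullet points; the slides do not give the actual schema, so check the real files before relying on any of these keys.

```python
import json

# Hypothetical record: every field name here is an assumption for
# illustration, not the actual Common Crawl metadata schema.
SAMPLE_RECORD = """
{
  "http_response_code": 200,
  "arc_file_name": "example.arc.gz",
  "arc_file_offset": 12345,
  "html_title": "Example page",
  "meta_tags": [{"name": "description", "content": "An example page"}],
  "links": [
    {"href": "http://example.com/a", "text": "First link"},
    {"href": "http://example.com/b", "text": "Second link"}
  ]
}
"""


def summarize_record(record_json):
    """Pull a few fields out of one JSON metadata record."""
    record = json.loads(record_json)
    return {
        "ok": record.get("http_response_code") == 200,
        "title": record.get("html_title"),
        # File name and offset together locate the raw content in an ARC file.
        "arc_location": (record.get("arc_file_name"), record.get("arc_file_offset")),
        "link_count": len(record.get("links", [])),
    }


summary = summarize_record(SAMPLE_RECORD)
print(summary)
```

Working from metadata like this means a job can filter pages by status, title, or links without opening the much larger raw ARC files.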

Text Files - Text Only

http://commoncrawl.org/get-started

Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%

http://webdatacommons.org

• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
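As an illustration of what "implementing Open Graph tags" means at the page level, the sketch below collects `<meta property="og:...">` tags from an HTML string using only the Python standard library. It is not the method used to produce the figures above; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class OpenGraphFinder(HTMLParser):
    """Collect <meta property="og:..."> tags from an HTML document."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Attribute values can be None when an attribute has no value.
        prop = attr_map.get("property") or ""
        if prop.startswith("og:"):
            self.properties[prop] = attr_map.get("content")


def open_graph_tags(html):
    """Return a dict of Open Graph properties found in the HTML."""
    finder = OpenGraphFinder()
    finder.feed(html)
    return finder.properties


page = (
    '<html><head>'
    '<meta property="og:title" content="Hello"/>'
    '<meta name="description" content="Not Open Graph">'
    '</head><body></body></html>'
)
print(open_graph_tags(page))  # prints {'og:title': 'Hello'}
```

A page would count as implementing Open Graph when the returned dict is non-empty; running that check over a crawl sample gives a share comparable in spirit to the 8% figure.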

http://wikientities.appspot.com

A corpus of anchortext-WikipediaConcept-Count
from the CommonCrawl dataset, to benefit
research on WSD, NLP and IR.

Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.

Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.
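One way to picture an anchortext-WikipediaConcept-Count corpus is as a counter keyed by (anchor text, Wikipedia page) pairs. The sketch below is an assumption about the general idea, not the project's actual pipeline: the input format (a list of anchor text and URL pairs), the lowercasing, and all function names are illustrative.

```python
from collections import Counter
from urllib.parse import unquote, urlparse


def concept_from_url(url):
    """Return the Wikipedia page name for a /wiki/ URL, else None."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if host != "wikipedia.org" and not host.endswith(".wikipedia.org"):
        return None
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        return None
    return unquote(parsed.path[len(prefix):]) or None


def count_anchor_concepts(links):
    """Count how often each anchor text points at each Wikipedia concept."""
    counts = Counter()
    for anchor_text, url in links:
        concept = concept_from_url(url)
        text = anchor_text.strip().lower()
        if concept and text:
            counts[(text, concept)] += 1
    return counts


def top_terms(counts, concept, n=3):
    """Most common anchor texts used for one concept."""
    per_concept = Counter(
        {text: c for (text, con), c in counts.items() if con == concept}
    )
    return per_concept.most_common(n)


links = [
    ("Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
    ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
    ("big apple ", "http://en.wikipedia.org/wiki/New_York_City"),
    ("home", "http://example.com/"),
]
counts = count_anchor_concepts(links)
print(top_terms(counts, "New_York_City"))  # prints [('big apple', 2), ('nyc', 1)]
```

Reading the counter by concept gives the terms people use to describe it; reading it by anchor text gives candidate concepts for a phrase found in a sentence.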

Mapping French websites related to Open Data

Other Use Examples
• Apache Giraph Testing
• Maplight
• Tineye
• Factual
• Sentiment Analysis Projects

In Development
• N-gram and Link Graph Extracts
• Pig Reader
• More Frequent Full Crawls
• Focused Subset Crawls at High Frequency
• Open Educational Resources

Thank You
London HUG

Lisa Green
lisa@commoncrawl.org
www.commoncrawl.org
@commoncrawl
@boudicca
