London HUG

Common Crawl: An Open Repository of Web Data
Lisa Green
10 October 2012

What Does The Data World Mean to Society?
Lisa Green
1 October 2012
Photo license: Public Domain Origin: http://en.wikipedia.org/wiki/File:Floppy_disk_2009_G1.jpg
Photo license: CC-BY-SA Origin: http://en.wikipedia.org/wiki/File:Wikimedia_Foundation_Servers-8055_08.jpg
Image license: CC-BY Origin: http://en.wikipedia.org/wiki/File:Internet_map_1024.jpg
Still Nascent
• Even cheaper storage
• Even cheaper compute
• Education
• Open Data

Image license: CC-BY Credit: NASA, ESA, and the Hubble Heritage Team (STScI/AURA)
[Diagram: Gratis (top) vs. Commercial (bottom); Proprietary (left) vs. Libre (right)]
[Diagram, top to bottom: Progress, Insight, Analysis, Data]
Gil Elbaz
Common Crawl Data
• ~8 Billion web pages
• ~120 TB
• 2008-2012
• ARC files, JSON metadata, text files
• Available to anyone
ARC Files - Raw Content
Metadata
• Status information
• HTTP response code
• File names & offsets of ARC files
• HTML title
• HTML meta tags
• RSS/Atom information
• All anchors/hyperlinks

Text Files - Text Only

           http://commoncrawl.org/get-started
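
As a rough illustration of how a file name and offset from the metadata could be used to pull one page out of an ARC file, here is a minimal sketch. It assumes the five-field, space-separated URL record header of ARC format version 1 (URL, IP address, archive date, content type, length) and an already decompressed byte string; the names `ArcHeader`, `parse_arc_header` and `read_record` are hypothetical and not part of any Common Crawl library.

```python
from typing import NamedTuple, Tuple


class ArcHeader(NamedTuple):
    url: str
    ip_address: str
    archive_date: str   # timestamp string, e.g. YYYYMMDDhhmmss
    content_type: str
    length: int         # number of bytes in the record body


def parse_arc_header(line: str) -> ArcHeader:
    """Parse one space-separated ARC v1 URL record header line."""
    fields = line.strip().split(" ")
    if len(fields) != 5:
        # Assumption: no field contains a space; real data may need more care.
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    url, ip_address, archive_date, content_type, length = fields
    return ArcHeader(url, ip_address, archive_date, content_type, int(length))


def read_record(blob: bytes, offset: int) -> Tuple[ArcHeader, bytes]:
    """Read the record that starts at `offset` in uncompressed ARC bytes."""
    end_of_header = blob.index(b"\n", offset)
    header = parse_arc_header(blob[offset:end_of_header].decode("utf-8"))
    body_start = end_of_header + 1
    return header, blob[body_start:body_start + header.length]


if __name__ == "__main__":
    sample = b"http://example.com/ 192.0.2.1 20120101000000 text/html 5\nhello\n"
    header, body = read_record(sample, 0)
    print(header.url, header.content_type, body)
```

In the published files each record may be compressed individually, so a real reader would likely need to decompress from the given offset before parsing; that step is left out here.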
Change between 2010 and 2012
• URLs with embedded data +6%
• Microdata +14%
• RDFa +26%

      http://webdatacommons.org
• 22% of Web pages contain Facebook URLs
• 8% of Web pages implement Open Graph tags
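
Figures like these can be produced by scanning each page's HTML for the relevant markers. The sketch below is one possible way to do that with Python's standard `html.parser`; it is not the method used for the numbers above. It treats any `<meta>` tag whose `property` starts with `og:` as an Open Graph tag and any `href` pointing at `facebook.com` or a subdomain as a Facebook URL. The class name `PageScanner` and the function `scan_page` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageScanner(HTMLParser):
    """Collects Open Graph meta tags and notes whether a Facebook URL appears."""

    def __init__(self):
        super().__init__()
        self.open_graph = {}
        self.has_facebook_url = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            prop = attr.get("property") or ""
            if prop.startswith("og:"):
                self.open_graph[prop] = attr.get("content") or ""
        href = attr.get("href")
        if href:
            host = urlparse(href).hostname or ""
            if host == "facebook.com" or host.endswith(".facebook.com"):
                self.has_facebook_url = True


def scan_page(html: str) -> PageScanner:
    scanner = PageScanner()
    scanner.feed(html)
    return scanner


if __name__ == "__main__":
    page = (
        '<html><head><meta property="og:title" content="Hi"/></head>'
        '<body><a href="http://www.facebook.com/commoncrawl">fb</a></body></html>'
    )
    result = scan_page(page)
    print(result.open_graph, result.has_facebook_url)
```

Running `scan_page` over every page in a crawl and dividing the number of hits by the number of pages would give percentages of the kind shown on the slide.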
http://wikientities.appspot.com

A corpus of anchortext-WikipediaConcept-Count from the CommonCrawl dataset, to benefit research on WSD, NLP and IR.

Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts.

Explicit Topic Modeling: Given a concept (represented as a Wikipedia page), it can tell what are the most common terms people use to describe the concept.
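
To make the anchortext-WikipediaConcept-Count idea concrete, here is a minimal sketch of how such counts could be built from (anchor text, link target) pairs such as those in the crawl metadata. It is an illustration under assumptions, not the project's actual pipeline: the function names are hypothetical, anchor text is simply lowercased, and a concept is taken to be the page name after `/wiki/` in a `wikipedia.org` URL.

```python
from collections import Counter
from urllib.parse import unquote, urlparse


def wikipedia_concept(href):
    """Return the Wikipedia page name a link points to, or None."""
    parsed = urlparse(href)
    host = parsed.hostname or ""
    is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
    if not is_wikipedia or not parsed.path.startswith("/wiki/"):
        return None
    return unquote(parsed.path[len("/wiki/"):]) or None


def count_anchor_concepts(links):
    """Count (anchor text, concept) pairs over an iterable of (text, href)."""
    counts = Counter()
    for text, href in links:
        concept = wikipedia_concept(href)
        if concept:
            counts[(text.strip().lower(), concept)] += 1
    return counts


def terms_for_concept(counts, concept):
    """Most common anchor texts used to describe one concept."""
    terms = Counter({text: n for (text, c), n in counts.items() if c == concept})
    return terms.most_common()


if __name__ == "__main__":
    links = [
        ("Big Apple", "http://en.wikipedia.org/wiki/New_York_City"),
        ("NYC", "http://en.wikipedia.org/wiki/New_York_City"),
        ("nyc", "http://en.wikipedia.org/wiki/New_York_City"),
        ("home", "http://example.com/"),
    ]
    counts = count_anchor_concepts(links)
    print(terms_for_concept(counts, "New_York_City"))
```

Reading the same counts in the other direction (all concepts seen for a given anchor text) gives candidate Wikipedia concepts for a phrase found in a sentence.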
Mapping French websites related to Open Data
Other Use Examples
• Apache Giraph Testing
• MapLight
• TinEye
• Factual
• Sentiment Analysis Projects
In Development
• N-gram and Link Graph Extracts
• Pig Reader
• More Frequent Full Crawls
• Focused Subset Crawls at High Frequency
• Open Educational Resources
Thank You

Lisa Green
lisa@commoncrawl.org
www.commoncrawl.org
@commoncrawl

What Does The Data World Mean to Society?
Lisa Green
@boudicca
1 October 2012

