Common Crawl: 5 billion web pages indexed, ranked, graphed and the data made freely available.
Today […] we have an open repository of crawl data that covers approximately 5 billion pages and includes valuable metadata, such as page rank and link graphs. All of our data is stored on Amazon’s S3 and is accessible to anyone via EC2.
Common Crawl is now entering the next phase – spreading the word about the open system we have built and how people can use it. We are actively seeking partners who share our vision of the open web. We want to collaborate with individuals, academic groups, small start-ups, big companies, governments and nonprofits.
(via Common Crawl Enters A New Phase)
It seems like you could do pretty much anything with this data, including getting a head start on building your own search engine. Blekko has a cool Grep the Web section full of ideas for the kinds of information you could discover with access to such a database: basically, any kind of semantic analysis.
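A grep-the-web style analysis could be sketched roughly like this: count pattern matches across a corpus of pages. This is a toy sketch, not Blekko's or Common Crawl's actual tooling; in practice you would stream pages out of the Common Crawl S3 buckets rather than hold them in memory, and the `grep_the_web` name and the in-memory page list are illustrative assumptions.

```python
import re
from collections import Counter

def grep_the_web(pages, pattern):
    """Count case-insensitive regex matches per page.

    `pages` is an iterable of (url, text) pairs; this is a toy
    stand-in for scanning pages pulled from the Common Crawl corpus.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for url, text in pages:
        counts[url] = len(regex.findall(text))
    return counts

# Tiny in-memory corpus standing in for real crawl data.
pages = [
    ("http://example.com/a", "Crawl the web, crawl it all."),
    ("http://example.com/b", "No matches here."),
]
print(grep_the_web(pages, r"crawl"))
```

Scaling this same per-page counting across billions of pages (e.g. with a MapReduce job over the S3 data) is exactly the kind of thing the open corpus makes possible.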