Common Crawl: 5 billion web pages indexed, ranked and graphed, with the data made freely available.
Today […] we have an open repository of crawl data that covers approximately 5 billion pages and includes valuable metadata, such as page rank and link graphs. All of our data is stored on Amazon’s S3 and is accessible to anyone via EC2.
Common Crawl is now entering the next phase – spreading the word about the open system we have built and how people can use it. We are actively seeking partners who share our vision of the open web. We want to collaborate with individuals, academic groups, small start-ups, big companies, governments and nonprofits.
It seems like you could do pretty much anything with this data, including getting a head start on building your own search engine. Blekko have a cool Grep the Web section, which is full of ideas for the kinds of information you could discover if you had access to such a database. Basically, it opens the door to any kind of large-scale text or semantic analysis.
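To make the "grep the web" idea a bit more concrete, here is a minimal Python sketch of what such a query might look like: run a regular expression over a set of pages and count the matches per host. Everything in it is an assumption for illustration. The `(url, text)` records, the `SAMPLE_PAGES` data and both function names are made up, and it says nothing about how Common Crawl actually stores or serves its data. Against the real corpus you would need a distributed job rather than a loop over an in-memory list.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def grep_pages(pages, pattern):
    """Yield (url, matched_text) for every regex match in each page's text.

    `pages` is any iterable of (url, text) pairs; this shape is a
    hypothetical stand-in, not Common Crawl's real record format.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for url, text in pages:
        for match in regex.finditer(text):
            yield url, match.group(0)


def count_matches_by_host(pages, pattern):
    """Count how many matches each host contributes."""
    counts = Counter()
    for url, _ in grep_pages(pages, pattern):
        counts[urlparse(url).netloc] += 1
    return counts


# Hypothetical sample records, invented for illustration only.
SAMPLE_PAGES = [
    ("http://example.com/a", "Open data and the open web."),
    ("http://example.com/b", "Nothing relevant here."),
    ("http://example.org/c", "An OPEN repository of crawl data."),
]

if __name__ == "__main__":
    # Print hosts ordered by how often the word "open" appears on them.
    hits = count_matches_by_host(SAMPLE_PAGES, r"\bopen\b")
    for host, n in hits.most_common():
        print(host, n)
```

The generator keeps memory use flat because pages are scanned one at a time, which is roughly the shape a per-record mapper would take if the same idea were scaled out.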
As the largest and most diverse collection of information in human history, the web grants us tremendous insight if we can only understand it better. For example, web crawl data can be used to spot trends and identify patterns in politics, economics, health, popular culture and many other aspects of life. It provides an immensely rich corpus for scientific research, technological advancement, and innovative new businesses. It is crucial for our information-based society that the web be openly accessible to anyone who desires to utilize it.
“An openly accessible archive of the web – that’s not owned and controlled by Google – levels the playing field pretty significantly for research and innovation.”