The Data Journalism Handbook

The Data Journalism Handbook is intended to be a useful resource for anyone interested in becoming a data journalist, or dabbling in data journalism.

It was born at a 48-hour workshop at MozFest 2011 in London. It subsequently spilled over into an international, collaborative effort involving dozens of data journalism’s leading advocates and best practitioners, including contributors from the Australian Broadcasting Corporation, the BBC, the Chicago Tribune, Deutsche Welle, the Guardian, the Financial Times, Helsingin Sanomat, La Nacion, the New York Times, ProPublica, the Washington Post, the Texas Tribune, Verdens Gang, Wales Online, Zeit Online and many others.

The Common Crawl Foundation

Common Crawl: 5 billion web pages indexed, ranked, graphed and the data made freely available.

Today […] we have an open repository of crawl data that covers approximately 5 billion pages and includes valuable metadata, such as page rank and link graphs. All of our data is stored on Amazon’s S3 and is accessible to anyone via EC2.

Common Crawl is now entering the next phase – spreading the word about the open system we have built and how people can use it. We are actively seeking partners who share our vision of the open web. We want to collaborate with individuals, academic groups, small start-ups, big companies, governments and nonprofits.

(via Common Crawl Enters A New Phase)

It seems like you could do pretty much anything with this data, including getting a head start on building your own search engine. Blekko have a cool Grep the Web section, which is full of ideas for the kinds of information you could discover if you had access to such a database. Basically, any kind of semantic analysis is on the table.
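To make the "grep the web" idea a little more concrete, here is a minimal sketch in Python. It assumes the crawl records have already been downloaded and reduced to (URL, text) pairs. The `grep_pages` helper and the sample records are hypothetical stand-ins, not part of Common Crawl's or Blekko's actual tooling.

```python
import re
from collections import Counter
from typing import Iterable, Tuple


def grep_pages(pages: Iterable[Tuple[str, str]], pattern: str) -> Counter:
    """Count case-insensitive matches of `pattern` for each URL.

    `pages` is any iterable of (url, text) pairs. This is a hypothetical
    stand-in for crawl records, not a real Common Crawl format.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    counts = Counter()
    for url, text in pages:
        hits = len(regex.findall(text))
        if hits:
            # Only record pages that actually contain a match.
            counts[url] = hits
    return counts


# Made-up sample records, for illustration only.
sample_pages = [
    ("http://example.com/a", "Open data and open web standards."),
    ("http://example.com/b", "Nothing relevant here."),
    ("http://example.com/c", "The OPEN web needs open crawl data."),
]

matches = grep_pages(sample_pages, r"\bopen\b")
for url, hits in matches.most_common():
    print(url, hits)
```

On a real crawl the same loop would have to stream records from storage rather than hold them in a list, but the shape of the question (a pattern in, a ranked list of pages out) stays the same.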
