Wikidata
Large, free knowledge base and repository of structured data for Wikipedia.
Quandl
Search engine for financial, economic and social datasets.
ScraperWiki
Community for creating and reusing data scrapers, with support for Ruby, Python and PHP and free scraper and data hosting.
data-science-blogs - github
A curated list of data science blogs.
Greasemonkey
Firefox add-on for running your own JavaScript files in the browser, which can be used to build screen scrapers in JavaScript.
Safecast
Global sensor network for collecting and sharing radiation measurements.
Pachube
Service for creating public or private API feeds from real-time data from sensors, devices or environments; free and paid plans.
Scrapy (Python)
Scraping and web crawling framework for Python.
Django Dynamic Scraper (Python/D...
Django app built on top of Scrapy to manage scrapers via the Django admin interface.
Nokogiri (Ruby)
Ruby library for scraping web pages.
Mechanize (Perl/Python/Ruby)
Library with versions in Perl, Python and Ruby for automating interaction with websites and for website scraping.
Stackoverflow - Overview Scrapin...
Overview on the coding Q&A site Stack Overflow of different HTML scraping solutions for various programming languages.
google-refine
Powerful tool from Google for cleaning up and transforming messy data.
EXMERG
Tool for merging data from different spreadsheet or CSV files and creating dynamic charts.
Google docs - spreadsheets
Google tool for creating and sharing structured data (spreadsheets) online; can be used in combination with other Google tools like Google Chart Tools or google-refine.
CKAN - the Data Hub Software
Open source software for publishing, sharing and finding data, used as a basis for many data catalogues.
Socrata
Open Data company providing a platform for institutions to publish and manage public data, used for many data catalogues.
Google Fusion Tables
Data management tool from Google to import, merge, publish, visualize and share data in the cloud.
Google Correlate
Tool from Google to find searches that correlate with real-world data (e.g. correlation between searches like "how to treat the flu" and actual flu activity).
Apache Solr
Open source search platform with a REST interface that provides full-text search over large amounts of data.
elasticsearch
Distributed, RESTful open source search engine for building search applications or adding search to websites.
80Legs - Custom Web Crawlers
Service for creating custom web crawlers that extract data from websites; free and paid plans.
Apache Mahout
Apache software project offering scalable machine learning libraries in Java for extracting meaning from data.