Say you need a dataset with the names and Insee numbers for all the French Départements. The (hopefully trustworthy) Wikipedia page has a neat table, complete with information on the Prefecture and many tiny coats of arms which are of absolutely no use at all. We could either key in the relevant data (doable, but a nuisance), or we could try to copy and paste the table into a word processor, hoping that we do not lose accents and other funny characters, and that WinWord or whatever we use converts the HTML table into something that we can edit to extract the information we really need.
Or you we could use outwit. One push of the button loads the page
into a sub-window, a second push (data->tables) extracts the HTML tables on the page. Now, we can either mark the lines we are interested in by hand (often the quickest option) or use a filter to selfect them. One final click, and they are exported as a CSV file that can be read into R, OpenOffice, or Stata for post processing and analysis.
While I’m all in favour of scriptable and open-source tools like Perl, Python and R, outwit has a lot to go for it if all you need is a quick hack. Outwit also has functions to mass-download files (say PDFs) from a page and give the unique names. If the job is complex, there is even more functionality under the hood, and you can use the point-and-click interface to program you own scraper, though I would tend use a real programming language for these cases. At any rate, outwit is a useful and free tool for the lazy data analyst.