Scraping the web

I’ve been working on a tool to scrape URLs. The idea is to have a tool that checks my sites for invalid code, broken URLs and broken images.

Below are some tools and ideas that I might consider using:

  • Wget can spider a website (list the URLs without downloading them) with the following command: wget -r --spider -o logfile.txt http://www.myurl.com
  • Ruby: open-uri & Hpricot.
  • PHP is not a good idea when run through a web server, since scripts can only run for a limited time (max_execution_time). PHP on the command line is OK.
  • Put the results in a MySQL database so I can link them with other data.
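The scrape-and-store idea above could be sketched roughly as follows. This is only an illustration in Python rather than one of the tools listed, and it uses the standard library's sqlite3 as a stand-in for MySQL. The names LinkExtractor, extract_urls, store_urls and the scraped_urls table are hypothetical, made up for this sketch.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect link and image URLs from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []  # list of (kind, absolute_url) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            # Resolve relative links against the page they were found on.
            self.found.append(("link", urljoin(self.base_url, attrs["href"])))
        elif tag == "img" and attrs.get("src"):
            self.found.append(("image", urljoin(self.base_url, attrs["src"])))


def extract_urls(html, base_url):
    """Return every link and image URL found in the given HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.found


def store_urls(conn, page_url, urls):
    """Save the URLs found on one page (hypothetical table layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scraped_urls (page TEXT, kind TEXT, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO scraped_urls (page, kind, url) VALUES (?, ?, ?)",
        [(page_url, kind, url) for kind, url in urls],
    )
    conn.commit()
```

Fetching the page itself is left out here. Once the rows are in a table, they can be joined with other data, which is the reason for wanting a database in the first place.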

Check for:

  • Valid HTML
  • 404 errors
  • site speed
  • image compression (JPEGsnoop for Photoshop compression, Smush.it)
  • readability
  • PageRank
  • internal links (links on my domain to pages on my domain)
  • external links (links on my domain to external domains)
  • received links (who links to me, what alt text is used, and their PageRank)
  • keywords
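Two of the checks above, the internal/external split and the 404 check, might look something like this in Python. This is a minimal sketch: it assumes the URLs are already absolute, and the function names is_internal, split_links and http_status are hypothetical. Some servers do not answer HEAD requests properly, so a real tool may need to fall back to GET.

```python
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def is_internal(url, my_domain):
    """True if the URL points at my_domain or one of its subdomains."""
    host = urlparse(url).hostname or ""  # relative URLs have no hostname
    return host == my_domain or host.endswith("." + my_domain)


def split_links(urls, my_domain):
    """Split absolute URLs into (internal, external) lists."""
    internal = [u for u in urls if is_internal(u, my_domain)]
    external = [u for u in urls if not is_internal(u, my_domain)]
    return internal, external


def http_status(url, timeout=10):
    """Return the HTTP status code, or None if the server was unreachable."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # e.g. 404 for a missing page
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...
```

A crawler would call http_status on every URL it finds and log the ones that return 404 or None.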

Try to:

curl into Google Analytics and import the results into MySQL

Update (should read):

  1. Today the new Scrubyt.org site went live. Maybe I should give this tool a try as well.
    In the example section you can find a script to scrape Google Analytics.

  2. Found some tools by Peter Krantz today:
    simplecrawler (Ruby)
    Raakt (Ruby)

  3. PHP: Simple HTML DOM class (http://simplehtmldom.sourceforge.net/) for easy parsing.
