scraping the web

January 7th, 2009 § 3

I’ve been working on a tool to scrape URLs. The idea is to have a tool that checks my sites for invalid code, URLs and images.

Below are some tools and ideas that I might consider using:

  • Wget can be used to spider a website (list URLs without downloading them) with the following command: wget -r --spider -o logfile.txt http://www.myurl.com
  • Ruby: open-uri and Hpricot.
  • PHP is not a good idea, since scripts can only run for a limited time (PHP on the command line is OK).
  • Put the results in a MySQL database so I can link them with other data.
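
As a rough illustration of the last idea, here is a minimal Python sketch that pulls the URLs out of a spider log such as logfile.txt and stores them in a database. It assumes nothing about the log layout beyond "URLs appear somewhere in the text", and it uses SQLite from the standard library as a stand-in for MySQL. The function names, the table name and the sample log lines are all made up for the example.

```python
import re
import sqlite3

# Matches anything that looks like an http(s) URL up to the next
# whitespace, quote or angle bracket.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def urls_from_log(log_text):
    """Return the unique URLs found in the log text, in first-seen order."""
    seen = []
    for match in URL_RE.finditer(log_text):
        url = match.group(0)
        if url not in seen:
            seen.append(url)
    return seen


def store_urls(conn, urls):
    """Insert the URLs into a table called 'urls' (hypothetical schema)."""
    conn.execute("CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY)")
    conn.executemany(
        "INSERT OR IGNORE INTO urls (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()


if __name__ == "__main__":
    # Made-up log text; a real run would read logfile.txt instead.
    sample = (
        "checking http://www.myurl.com/\n"
        "checking http://www.myurl.com/about.html\n"
        "checking http://www.myurl.com/\n"
    )
    conn = sqlite3.connect(":memory:")
    store_urls(conn, urls_from_log(sample))
    for (url,) in conn.execute("SELECT url FROM urls ORDER BY url"):
        print(url)
```

Once the URLs are in a table, the results of the other checks can be stored in extra columns or joined in from other tables.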

Check for:

  • Valid HTML
  • 404 errors
  • Site speed
  • Image compression (JPEGsnoop for Photoshop compression, Smush.it)
  • Readability
  • PageRank
  • Internal links (links on my domain to pages on my domain)
  • External links (links on my domain to external domains)
  • Received links (who links to me, what alt text is used, and their PageRank)
  • Keywords
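
The internal/external split from the list above could be sketched roughly like this in Python, using only the standard library. This is an assumption about how the check might work, not a finished tool: the class and function names are invented for the example, and "internal" is taken to mean "same host name as the page being checked", so a subdomain would count as external.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(html, page_url):
    """Return (internal, external) lists of absolute http(s) links."""
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc.lower() == site:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


if __name__ == "__main__":
    page = (
        '<a href="/about.html">About</a>'
        '<a href="http://example.org/x">Elsewhere</a>'
        '<a href="mailto:me@myurl.com">Mail</a>'
    )
    inside, outside = classify_links(page, "http://www.myurl.com/index.html")
    print("internal:", inside)
    print("external:", outside)
```

Each URL in either list could then be requested to look for 404 responses, which covers a second item from the list with the same data.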

Try to:

curl into Google Analytics and import the results into MySQL

[update]should read[/update]
