I’ve been working on a tool to scrape URLs. The idea is to have a tool that checks my sites for invalid code, URLs and images.
Below are some tools and ideas that I might consider using:
- Wget can be used to spider a website (list the URLs without downloading them) with the following command: wget -r --spider -o logfile.txt http://www.myurl.com
- Ruby: open-uri & Hpricot.
- PHP is not a good idea, since scripts run through a web server are stopped once they hit the maximum execution time (PHP on the command line is OK).
- Put the results in a MySQL database so I can link them with other data.
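As a rough idea of what storing the results could look like, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server. The table name `crawl_results`, its columns, and the helper functions are all assumptions made up for illustration, not a finished schema.

```python
import sqlite3


def open_results_db(path=":memory:"):
    """Create the (hypothetical) crawl_results table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS crawl_results (
            url         TEXT PRIMARY KEY,
            status_code INTEGER,
            checked_at  TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    return conn


def save_result(conn, url, status_code):
    """Insert the status for one URL, replacing any earlier row for the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO crawl_results (url, status_code) VALUES (?, ?)",
        (url, status_code),
    )
    conn.commit()


def broken_urls(conn):
    """Return the URLs whose last recorded status was 404, sorted by URL."""
    rows = conn.execute(
        "SELECT url FROM crawl_results WHERE status_code = 404 ORDER BY url"
    )
    return [row[0] for row in rows]
```

Because each URL is the primary key, re-checking a page overwrites its old row, and other data (speed, links, keywords) could later be joined on the `url` column.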
Check for:
- valid HTML
- 404
- site speed
- image compression (JPEGsnoop for Photoshop compression, Smush.it)
- readability
- PageRank
- internal links (links on my domain to pages on my domain)
- external links (links on my domain to external domains)
- received links (who links to me, what alt text is used, and their PageRank)
- keywords
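The internal/external link check from the list above could be sketched roughly like this, using only Python's standard library. The names `LinkCollector` and `classify_links` are hypothetical, and the rule "same host name means internal" is an assumption; a real tool might also want to treat subdomains as internal.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_links(html, page_url):
    """Split the links found in html into (internal, external) lists of absolute URLs."""
    collector = LinkCollector()
    collector.feed(html)
    site_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links against the page
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if parsed.netloc.lower() == site_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

For a page at http://www.myurl.com/index.html, a link to `/about` would end up in the internal list and a link to another domain in the external list. Each URL in either list could then be requested to look for 404s.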
Try to:
curl into Google Analytics and import the results into MySQL
[update]should read[/update]