I’ve been working on a tool to scrape URLs. The idea is to have a tool that checks my sites for invalid code, URLs and images.
Below are some tools and ideas that I might consider using:
- Wget can be used to spider a website (list URLs without saving the pages) with the following command: wget -r --spider -o logfile.txt http://www.myurl.com
- Ruby: open-uri & Hpricot.
- PHP is not a good idea since scripts served through the web server are cut off after a time limit (max_execution_time). PHP on the command line is OK, since that limit is off there by default.
- Put the results in a MySQL database so I can link them with other data.
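A minimal sketch of how the Wget log could be turned into a list of URLs to check. It does not rely on the exact log layout (which differs between Wget versions); it simply collects every http(s) URL that appears in the text. The function name `urls_from_wget_log` and the pattern are my own choices, not part of Wget.

```ruby
require 'set'

# Matches absolute http(s) URLs and stops at whitespace, quotes or angle brackets.
URL_PATTERN = %r{https?://[^\s'"<>]+}

# Returns the unique URLs mentioned in a wget log, in order of first appearance.
# Assumption: anything that looks like a URL in the log is worth checking.
def urls_from_wget_log(log_text)
  seen = Set.new
  log_text.scan(URL_PATTERN).each_with_object([]) do |url, urls|
    url = url.sub(/[.,;:)]+\z/, '') # drop trailing punctuation picked up from the log text
    urls << url if seen.add?(url)   # Set#add? returns nil for duplicates
  end
end
```

With the log from the command above this could be used as `puts urls_from_wget_log(File.read('logfile.txt'))`, and the resulting list could then be inserted into the database.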
Check for:
- Valid HTML
- 404s (broken links)
- site speed
- image compression (JPEGsnoop for Photoshop compression, Smush.it)
- readability
- PageRank
- internal links (links on my domain to pages on my domain)
- external links (links on my domain to external domains)
- received links (who links to me, what anchor text is used, and their PageRank)
- keywords
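Two of the checks above (internal versus external links, and 404s) can be sketched with Ruby's standard library alone. This is only a rough illustration: `classify_links` and `http_status` are hypothetical helper names, the list of hrefs is assumed to come from whatever HTML parser is used, and "internal" is assumed to mean "same host as the page", so www.myurl.com and myurl.com would count as different sites.

```ruby
require 'uri'
require 'net/http'

# Resolves each href against the page URL and splits the results into
# internal links (same host as the page) and external links.
def classify_links(page_url, hrefs)
  base_host = URI.parse(page_url).host.to_s.downcase
  result = { internal: [], external: [] }
  hrefs.each do |href|
    begin
      link = URI.join(page_url, href)
    rescue URI::Error
      next # skip hrefs that are not valid URIs
    end
    next unless link.is_a?(URI::HTTP) # ignores mailto:, javascript:, etc. (HTTPS is included)
    key = link.host.to_s.downcase == base_host ? :internal : :external
    result[key] << link.to_s
  end
  result
end

# Returns the HTTP status code (as an Integer) of a HEAD request to url.
# A 404 here marks a broken link; redirects are not followed.
def http_status(url)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri).code.to_i
  end
end
```

Something like `classify_links(page, hrefs)[:internal].select { |u| http_status(u) == 404 }` would then list the broken internal links. Some servers answer HEAD differently from GET, so a GET fallback may be needed.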
Try to:
curl into Google Analytics and import the results into MySQL
[update]should read[/update]
Today the new Scrubyt.org site went live. Maybe I should give this tool a try as well.
In the example section you can find a script to scrape Google Analytics.
Found some tools by Peter Krantz today:
simplecrawler (Ruby)
Raakt (Ruby)
PHP: Simple HTML DOM class (http://simplehtmldom.sourceforge.net/) for easy parsing.