scraping the web

January 7th, 2009

I’ve been working on a tool to scrape URLs. The idea is to check my sites for invalid code, broken URLs and broken images.

Below are some tools and ideas that I might consider using:

  • Wget can be used to spider a website (list the URLs without downloading them) with the following command: wget -r --spider -o logfile.txt http://www.myurl.com
  • Ruby: open-uri and Hpricot.
  • PHP is not a good idea for this, since scripts served through the web server can only run for a limited time (PHP on the command line is OK).
  • Put the results in a MySQL database so I can link them with other data.
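
As a rough illustration of the "parse a page and store what it references" idea, here is a minimal sketch in Python using only the standard library. It is not one of the tools listed above. The names LinkCollector, store_page and the resources table are made up for this example, and sqlite3 is used purely as a stand-in for MySQL so the sketch runs without a database server.

```python
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and src values from <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def store_page(conn, page_url, html):
    """Parse one page and record every link and image it references.

    The table layout is hypothetical; sqlite3 stands in for MySQL here.
    Returns the number of rows written.
    """
    collector = LinkCollector()
    collector.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resources (page TEXT, kind TEXT, target TEXT)"
    )
    rows = [(page_url, "link", url) for url in collector.links]
    rows += [(page_url, "image", url) for url in collector.images]
    conn.executemany("INSERT INTO resources VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

Once the rows are in a table, they can be joined with other data (for example a table of response codes per URL), which is the main reason for wanting a database at all.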

Check for:

  • valid HTML
  • 404 errors
  • site speed
  • image compression (JPEGsnoop for Photoshop compression, Smush.it)
  • readability
  • PageRank
  • internal links (links on my domain to pages on my domain)
  • external links (links on my domain to external domains)
  • received links (who links to me, what alt text is used, and their PageRank)
  • keywords
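
Two of the checks above (internal versus external links, and 404s) are easy to sketch. The following is a minimal Python example using only the standard library. The function names classify_links and http_status are made up for illustration, and the rule "same host means internal" is an assumption: it treats www.myurl.com and myurl.com as different sites.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin, urlparse


def classify_links(page_url, hrefs):
    """Split hrefs found on page_url into internal and external absolute URLs."""
    site_host = urlparse(page_url).netloc.lower()
    internal, external = [], []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if parsed.netloc.lower() == site_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external


def http_status(url, timeout=10):
    """Return the HTTP status code for url, for example 404 for a missing page."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code
```

Calling http_status on every URL returned by classify_links and keeping the ones that come back as 404 gives a basic broken-link report. Some servers do not answer HEAD requests properly, so a fallback to GET may be needed in practice.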

Try to:

curl into Google Analytics and import the results into MySQL

[update]should read[/update]


§ 3 Responses to “scraping the web”

  • westworld says:

    Today the new Scrubyt.org site went live. Maybe I should give this tool a try as well.
    In the example section you can find a script to scrape Google Analytics.

  • westworld says:

    Found some tools by Peter Krantz today:
    simplecrawler (Ruby)
    Raakt (Ruby)

  • admin says:

    PHP: Simple HTML DOM class (http://simplehtmldom.sourceforge.net/) for easy parsing.
