<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>westworld: a webmasters best friend &#187; scrape validate</title>
	<atom:link href="http://www.westworld.be/tag/scrape-validate/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.westworld.be</link>
	<description>A webmasters best friend</description>
	<lastBuildDate>Mon, 14 Jun 2010 18:12:24 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>scraping the web</title>
		<link>http://www.westworld.be/a-note-to-self/wget-as-a-spider/</link>
		<comments>http://www.westworld.be/a-note-to-self/wget-as-a-spider/#comments</comments>
		<pubDate>Wed, 07 Jan 2009 08:22:19 +0000</pubDate>
		<dc:creator>westworld</dc:creator>
				<category><![CDATA[a note to self]]></category>
		<category><![CDATA[scrape validate]]></category>

		<guid isPermaLink="false">http://www.westworld.be/?p=65</guid>
		<description><![CDATA[I&#8217;ve been working on a tool to scrape URLs. The idea is to have a tool that checks my sites for invalid code, URLs and images.
Below are some tools and ideas that I might consider using.

Wget can be used to spider a website (list the URLs without downloading them) with the following command: wget -r --spider -o logfile.txt [...]


]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been working on a tool to scrape URLs. The idea is to have a tool that checks my sites for invalid code, URLs and images.</p>
<p>Below are some tools and ideas that I might consider using:</p>
<ul>
<li>Wget can be used to spider a website (list the URLs without downloading them) with the following command: wget -r --spider -o logfile.txt http://www.myurl.com</li>
<li>Ruby: open-uri &amp; Hpricot.</li>
<li>PHP is not a good idea, since scripts served through the web server can only run for a limited time (PHP on the command line is OK).</li>
<li>Put the results in a MySQL database so I can link them with other data.</li>
</ul>
<p>Check for:</p>
<ul>
<li>valid HTML</li>
<li>404 errors</li>
<li>site speed</li>
<li>image compression (JPEGsnoop for Photoshop compression, Smush.it)</li>
<li>readability</li>
<li>PageRank</li>
<li>internal links (links on my domain to pages on my domain)</li>
<li>external links (links on my domain to external domains)</li>
<li>received links (who links to me, what alt text is used, PageRank)</li>
<li>keywords</li>
</ul>
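<p>The internal/external link check above could be sketched roughly as follows. This is only a minimal sketch using the Python standard library, not the tool itself: the names LinkCollector and classify_links are made up for illustration, and it assumes the page has already been fetched and that "internal" means "same host name as the page".</p>

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every anchor tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_links(page_url, html):
    """Split the links found in html into (internal, external) lists.

    Assumption: a link is internal when its host equals the host of page_url.
    """
    own_host = urlparse(page_url).netloc.lower()
    collector = LinkCollector()
    collector.feed(html)
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if parts.netloc.lower() == own_host:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```

<p>Calling classify_links with a page URL and its HTML source returns two lists of absolute URLs, which could then be stored in the database. Note that with this simple host comparison, www.myurl.com and myurl.com would count as different hosts, so that would need normalising first.</p>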
<p>Try to:</p>
<p>use curl to pull data from Google Analytics and import the results into MySQL.</p>
<p>[update]I should <a href="http://www.merchantos.com/makebeta/php/scraping-links-with-php/">read this</a>.[/update]</p>


]]></content:encoded>
			<wfw:commentRss>http://www.westworld.be/a-note-to-self/wget-as-a-spider/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>
