Site Crawler Chronicles, Part 4: I might be dumb
2010-03-23 3:08 pm ∴ Uncategorized ∴ by matt -

Turns out urljoin() wasn’t behaving badly; I just supplied it a lousy URL. After running urlopen, the file-like object that is returned has two additional methods, and one of them, geturl(), gives the true URL (i.e. after redirects). So far, that seems to have fixed the issue.
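Here's a rough sketch of the idea, written against Python 3's urllib.parse rather than the Python 2 modules the crawler used back then. The names resolve_links and FakeResponse are made up for illustration; FakeResponse just stands in for whatever urlopen returns, so the example runs without touching the network.

```python
from urllib.parse import urljoin


def resolve_links(response, hrefs):
    """Join each href against the URL the response actually came from."""
    # geturl() reports the final URL, i.e. after any redirects were followed.
    base = response.geturl()
    return [urljoin(base, href) for href in hrefs]


class FakeResponse:
    """Hypothetical stand-in for the file-like object urlopen() returns."""

    def __init__(self, final_url):
        self._final_url = final_url

    def geturl(self):
        return self._final_url


# Pretend we asked for ".../some/directory" and the server redirected us
# to ".../some/directory/" (trailing slash added).
redirected = FakeResponse("http://www.example.com/some/directory/")
resolved = resolve_links(redirected, ["../../abc.html", "page.html"])
```

With the redirected URL as the base, the relative links land where a browser would put them.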

Hopefully I’ll have a release soon, but I still gotta work out some bugs.

The Site Crawler Chronicles – Part 3
2010-03-19 10:33 am ∴ Programming,Thoughts ∴ by matt -

I managed to find a solution for the problem I had yesterday, though I don’t particularly know if it’s ideal.

Originally I had thought that I would need to store the entire hierarchy of the site in a tree-like structure. Then I figured I could just store a list of the links on each page in a dict and output all of the errors when the crawl was finished. I don’t know why I was hung up on the idea that errors had to be reported as they were encountered.

I was worried that memory use would be a factor, but it seems to be ok.
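Something along these lines, as a minimal sketch. The function name, the sample URLs, and the idea of keeping broken URLs in a set are all assumptions for the sake of the example, not what the actual script does.

```python
def report_broken_links(links_by_page, broken):
    """Map each broken URL to the sorted list of pages that link to it."""
    report = {}
    for page, links in links_by_page.items():
        for link in links:
            if link in broken:
                report.setdefault(link, []).append(page)
    return {url: sorted(pages) for url, pages in report.items()}


# Hypothetical crawl results: page URL -> links found on that page.
links_by_page = {
    "http://www.example.com/": [
        "http://www.example.com/a.html",
        "http://www.example.com/missing.html",
    ],
    "http://www.example.com/a.html": [
        "http://www.example.com/missing.html",
    ],
}
# URLs that came back with a 404 or some other error.
broken = {"http://www.example.com/missing.html"}

report = report_broken_links(links_by_page, broken)
```

The dict only holds one list of strings per page, which is probably why memory use stays reasonable on sites with a few hundred pages.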

But there’s another issue:

    #taken from lxml.html.__init__
    def make_links_absolute(self, base, root):
        """This function exists because urljoin behaves obnoxiously.
        For example, if I'm on the page:
            http://www.example.com/some/directory/index.html, or just:
            http://www.example.com/some/directory/

        And I join the relative URL: ../../abc.html
        I end up with: http://www.example.com/abc.html

        *But*
        If I'm on: http://www.example.com/some/directory  [no trailing slash]
        I end up with: http://www.example.com/../abc.html
        """

My fix for it was stripping out one “../”. Yesterday I thought that it would be a good fix. Today, I can’t figure out why I thought it would fix all cases.
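For what it's worth, the trailing slash matters because urljoin treats the last path segment of a base URL without one as a file name and drops it before resolving. A small sketch with Python 3's urllib.parse (the crawler itself was Python 2) and a single "../", since the result for paths that climb above the site root has varied between Python versions:

```python
from urllib.parse import urljoin

# Base ends in a slash: "directory" is a directory, so "../" lands in /some/.
with_slash = urljoin("http://www.example.com/some/directory/", "../abc.html")

# No trailing slash: "directory" is treated like a file name and dropped
# first, so the same "../" climbs one level higher, to the site root.
without_slash = urljoin("http://www.example.com/some/directory", "../abc.html")
```

So the two bases differ by exactly one level, which is likely why stripping one "../" looked like a fix.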

The Site Crawler Chronicles
2010-03-18 2:04 pm ∴ Programming,Thoughts ∴ by matt -

So I managed to stop v4 of my web crawler from opening up a billion connections in parallel. Turns out that gevent has a Pool object and that was exactly what I needed.

Now my little script (137 lines, including a utility object and comments) will not be a sysadmin’s nightmare.

However, I now have a new problem. I described how the older versions work in my previous post, but this version is quite a bit different. Instead of using a queue or stack data structure to figure out where to go next, this version has a greenlet scrape all links from a page, filter out the ones it’s already been to, then return the rest. The main thread accumulates the lists once all greenlets are finished. After the accumulation, and once it’s ensured that there are no duplicate links, the main thread spawns a greenlet for each link and waits until the greenlets finish again. When the greenlets return no links, the main thread is done and the script terminates.
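The round-by-round loop looks roughly like this. To keep the sketch runnable with only the standard library, it swaps gevent's greenlets and Pool for a ThreadPoolExecutor, and it does all the filtering in the main thread instead of in the workers. The get_links callable and the max_workers value are placeholders, not names from the real script.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(start, get_links, max_workers=10):
    """Crawl level by level; get_links(url) returns the links found on url."""
    seen = {start}
    frontier = [start]
    # The pool caps how many pages are fetched at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # One task per link; iterating the results waits for the round.
            results = pool.map(get_links, frontier)
            # Accumulate, dropping duplicates and pages already visited.
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return seen


# Hypothetical four-page site, so no network is needed to try it.
site = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}
visited = crawl("/", lambda url: site.get(url, []))
```

The loop ends when a round returns nothing new, which matches the termination condition described above.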

The problem is, if there’s a 404 or some kind of error retrieving the page, I have no way of knowing what page that link was found on.

The only solution that I see is to use a custom data structure and hope that it doesn’t kill performance.

New Old Ideas
2010-03-17 5:05 pm ∴ Programming,Thoughts ∴ by matt -

A little while ago, I wrote a post in which I was trying to figure out a way to improve a Web Crawler script I had written — it was one of those that I never published.

Anyway, for some reason, I wrote it using Stackless Python, but I was pretty much a novice and didn’t make it as efficient as I could have. This was version 1, and it basically went to each page, scraped all the valid links (i.e., those that were on the same server and not a mailto: or something) and then went through them recursively.
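The "valid link" check could look something like the sketch below. The function name is invented, it uses Python 3's urllib.parse, and it assumes links have already been made absolute before they are tested.

```python
from urllib.parse import urlparse


def is_crawlable(link, root):
    """True if link is an http(s) URL on the same host as root."""
    parsed = urlparse(link)
    same_host = parsed.netloc == urlparse(root).netloc
    return parsed.scheme in ("http", "https") and same_host


root = "http://www.example.com/"
candidates = [
    "http://www.example.com/about.html",  # same server: keep
    "mailto:someone@example.com",         # not a page: skip
    "http://other.example.org/",          # different server: skip
]
crawlable = [link for link in candidates if is_crawlable(link, root)]
```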

Version 2 was basically the same, just with cleaner code and no recursion. I decided to set up the library so I would extend the SiteCrawler class and get notified of what was going on through callbacks. While it wasn’t any faster than version 1, it did seem a bit more stable.

Version 3 I decided to change drastically and made it multithreaded. It is much, much, much faster. It works like this: there’s an input Queue and an already-checked Queue. There are 4 threads waiting for input on the input Queue; when they get it, they scrape the links, check to see if any of them are in the checked Queue, put what’s been filtered on the input Queue, and put what they just checked on the checked Queue. It seems more complicated than it is. Also the code is more complicated than it needs to be.
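A stripped-down sketch of that arrangement, not the actual version 3 code: it keeps the input Queue and the 4 worker threads, but replaces the checked Queue with a set guarded by a lock, since membership tests on a set are simpler. The get_links callable and the tiny fake site are assumptions so the example runs offline.

```python
import queue
import threading


def crawl_threaded(start, get_links, num_threads=4):
    """Crawl from start using worker threads fed by an input queue."""
    to_check = queue.Queue()
    checked = set()
    lock = threading.Lock()

    def worker():
        while True:
            url = to_check.get()
            try:
                for link in get_links(url):
                    with lock:
                        if link in checked:
                            continue
                        checked.add(link)
                    # New link: hand it to whichever thread is free.
                    to_check.put(link)
            finally:
                to_check.task_done()

    checked.add(start)
    to_check.put(start)
    for _ in range(num_threads):
        threading.Thread(target=worker, daemon=True).start()
    to_check.join()  # returns once every queued URL has been processed
    return checked


# Hypothetical four-page site standing in for real HTTP requests.
site = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}
found = crawl_threaded("/", lambda url: site.get(url, []))
```

Workers queue new links before marking the current page done, so join() can't return while there is still work in flight.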

Anyway, version 3 works well for me when I need to test a site. It’s saved me so much time in going through and checking for broken URLs. There are a few clients with the number of pages on their site in the 300s.

But, I recently found out about gevent, and since I have some free time at work, I wanted to play with it a little bit. If you don’t know, gevent is a package that works with the greenlet package on top of libevent. I’m always interested in concurrent programming, and new technologies involved in it. This is why I had installed Stackless at one time.

So now there’s a version 4 of the SiteCrawler script, using — you guessed it — gevent. I haven’t ironed out all of the kinks yet. I was testing a non-pooled version of the script and it basically crawled through 200 links in a matter of seconds — hopefully none of the server admins look at the logs and see 100 simultaneous connections at 3:30 PM today. I did change how the crawl is done quite a bit too. So I’m going to stop there and probably have more tomorrow or in the next few days.

Good stuff!

Almost forgot
2010-03-12 2:05 pm ∴ Uncategorized ∴ by matt -

I need to start updating this site a bit more. If you’ve ever visited this site around December then you might have noticed that my site’s anniversary is Dec. 4th. Also, you might have noticed that I almost always forget to mention it.

So, this is 3 months late, but my site turned 12 on Dec 4th 2009. If we count this site and Sloatworks combined (and I do) and also ignore the 3 months of downtime I had in 2007, then I’ve had a web presence longer than everyone’s favorite giant corporation, Google.

They are only slightly more successful. :P