The Site Crawler Chronicles
2010-03-18 2:04 pm ∴ Programming,Thoughts ∴ by matt -

So I managed to stop v4 of my web crawler from opening up a billion connections in parallel. Turns out that gevent has a Pool object and that was exactly what I needed.

Now my little script (137 lines, including a utility object and comments) will not be a sysadmin’s nightmare.

However, I now have a new problem. I described how the older versions work in my previous post, but this version is quite a bit different. Instead of using a queue or stack data structure to figure out where to go next, this version has a greenlet scrape all the links from a page, filter out the ones it has already visited, and return the rest. The main thread accumulates the lists once all the greenlets have finished. After the accumulation (which also ensures there are no duplicate links), the main thread spawns a greenlet for each link and waits until those greenlets finish too. When the greenlets return no links, the main thread is done and the script terminates.
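Roughly, the round-by-round crawl looks like the sketch below. This is not the actual script: to keep it runnable without gevent or a network, a standard-library ThreadPoolExecutor stands in for gevent's Pool, and a made-up in-memory SITE dict stands in for fetching and parsing pages. The names scrape, crawl and pool_size are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "site": page -> links found on that page.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/contact"],
    "/blog": ["/about", "/blog/1"],
    "/blog/1": ["/"],
    "/contact": [],
}


def scrape(url, visited):
    """Return the links on `url` that have not been visited yet."""
    return [link for link in SITE.get(url, []) if link not in visited]


def crawl(start, pool_size=4):
    """Crawl in rounds: one task per link, then merge the results."""
    visited = {start}
    current = [start]
    # The pool caps how many tasks run at once (stand-in for gevent's Pool).
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        while current:
            results = pool.map(lambda u: scrape(u, visited), current)
            # Accumulate and de-duplicate once every task has finished.
            found = set()
            for links in results:
                found.update(links)
            found -= visited
            visited |= found
            current = sorted(found)
    return visited


if __name__ == "__main__":
    print(sorted(crawl("/")))
```

The main thread only touches the visited set between rounds, after every task has returned, so the workers never see it change underneath them.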

The problem is, if there’s a 404 or some kind of error retrieving the page, I have no way of knowing what page that link was found on.

The only solution that I see is to use a custom data structure and hope that it doesn't kill performance.
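One possible shape for that data structure, purely as a sketch: a dict that maps each link to the page it was first found on, so a failed fetch can be reported together with its referrer. The SITE dict, the "404" marker, and the names scrape, crawl, found_on and broken are all made up for illustration, and the loop is sequential to keep the idea visible.

```python
# Hypothetical in-memory site; "/missing" is linked to but does not exist.
SITE = {
    "/": ["/about", "/missing"],
    "/about": ["/", "/missing"],
}


def scrape(url):
    """Return (links, error). A page missing from SITE plays the role of a 404."""
    if url not in SITE:
        return [], "404"
    return SITE[url], None


def crawl(start):
    """Return a list of (broken link, page it was first found on)."""
    found_on = {start: None}  # link -> first page it was seen on
    broken = []
    current = [start]
    while current:
        next_round = []
        for url in current:
            links, error = scrape(url)
            if error:
                broken.append((url, found_on[url]))
                continue
            for link in links:
                if link not in found_on:
                    found_on[link] = url
                    next_round.append(link)
        current = next_round
    return broken


if __name__ == "__main__":
    print(crawl("/"))
```

The dict doubles as the "already seen" filter, so the extra cost is one stored string per link. It only remembers the first referrer; keeping every referrer would mean mapping each link to a list instead.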

New Old Ideas
2010-03-17 5:05 pm ∴ Programming,Thoughts ∴ by matt -

A little while ago, I wrote a post in which I was trying to figure out a way to improve a Web Crawler script I had written — it was one of those that I never published.

Anyway, for some reason, I wrote it using Stackless Python, but I was pretty much a novice and didn’t make it as efficient as I could have. This was version 1, and it basically went to each page, scraped all the valid links (i.e., those that were on the same server and not a mailto: or something) and then went through them recursively.

Version 2 was basically the same, just with cleaner code and no recursion. I decided to set up the library so I would extend the SiteCrawler class and get notified of what was going on through callbacks. While it wasn’t any faster than version 1, it did seem a bit more stable.

For version 3, I decided to change things drastically and made it multithreaded. It is much, much, much faster. It works like this: there’s an input Queue and an already-checked Queue. There are 4 threads waiting for input on the input Queue. When a thread gets a URL, it scrapes the links, checks whether any of them are in the checked Queue, puts what’s left after filtering on the input Queue, and puts what it just checked on the checked Queue. It seems more complicated than it is. Also, the code is more complicated than it needs to be.
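A stripped-down sketch of that worker layout, using only the standard library. It is not the version 3 code: the checked Queue is swapped for a set guarded by a lock (membership tests on a Queue are awkward), links are marked as checked when they are queued rather than after they are scraped, and a made-up SITE dict replaces real page fetching. The names crawl, worker, todo and checked are assumptions.

```python
import queue
import threading

# Hypothetical in-memory "site": page -> links found on that page.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/", "/contact"],
    "/blog": ["/about", "/blog/1"],
    "/blog/1": ["/"],
    "/contact": [],
}


def crawl(start, workers=4):
    todo = queue.Queue()     # the input queue
    checked = set()          # stand-in for the "already checked" queue
    lock = threading.Lock()  # protects `checked`

    def worker():
        while True:
            url = todo.get()
            if url is None:  # sentinel: no more work
                todo.task_done()
                break
            for link in SITE.get(url, []):
                with lock:
                    if link in checked:
                        continue
                    checked.add(link)
                todo.put(link)
            todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    checked.add(start)
    todo.put(start)
    todo.join()              # wait until every queued URL has been handled
    for _ in threads:
        todo.put(None)       # one sentinel per worker
    for t in threads:
        t.join()
    return checked


if __name__ == "__main__":
    print(sorted(crawl("/")))
```

Because a worker queues new links before it marks its current URL as done, the queue's unfinished count cannot reach zero while there is still work pending.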

Anyway, version 3 works well for me when I need to test a site. It’s saved me so much time in going through and checking for broken URLs. There are a few clients whose sites have page counts in the 300s.

But, I recently found out about gevent, and since I have some free time at work, I wanted to play with it a little bit. If you don’t know, gevent is a package that works with the greenlet package on top of libevent. I’m always interested in concurrent programming, and new technologies involved in it. This is why I had installed Stackless at one time.

So now there’s a version 4 of the SiteCrawler script, using — you guessed it — gevent. I haven’t ironed out all of the kinks yet. I was testing a non-pooled version of the script and it basically crawled through 200 links in a matter of seconds — hopefully none of the server admins look at the logs and see 100 simultaneous connections at 3:30 PM today. I did change how the crawl is done quite a bit too. So I’m going to stop there and probably have more tomorrow or in the next few days.

Good stuff!

Almost forgot
2010-03-12 2:05 pm ∴ Uncategorized ∴ by matt -

I need to start updating this site a bit more. If you’ve ever visited this site around December then you might have noticed that my site’s anniversary is Dec. 4th. Also, you might have noticed that I almost always forget to mention it.

So, this is 3 months late, but my site turned 12 on Dec 4th 2009. If we count this site and Sloatworks combined (and I do) and also ignore the 3 months of downtime I had in 2007, then I’ve had a web presence longer than everyone’s favorite giant corporation, Google.

They are only slightly more successful. :P

I can’t figure out what to title this
2010-01-19 12:29 pm ∴ Programming,Thoughts ∴ by matt -

This is my first update in quite a while. I’m not dead. I had a few posts in draft mode, but I never finished them. The topics were:

  • A web crawler script I wrote and some design issues I was having with it.
    - I actually figured out some of the problems with it and made it much more efficient. So there really wasn’t any point in finishing the post. It was more of a thinking out loud kind of deal.
  • Why they should let MySQL die and why Monty Widenius should bite me.
    - This was going to be my response to Monty’s Plea to people to save MySQL from Oracle (I don’t feel like finding the link, just google it). The gist was that he probably shouldn’t have sold out to Sun, would be better concentrating on MariaDB — a supposed MySQL successor, and that generally MySQL sucks and there are better alternatives anyway. I also held him responsible for the countless headaches I endure while being forced to use MySQL, and thus decided he could bite me. :)
  • How awful the jQuery plugin site is and some thoughts on how to replace it.
    - I started writing this post because I have such a hard time keeping my plugins up-to-date due to the fact that the plugin directory is terrible. I started writing the post and just went through the site listing the problems I saw. The list became so massive that it clearly was something that I couldn’t handle on my own, so the post turned into a call to action with the suggestion of using Google App Engine to run the site. But the number of problems is depressing, and thinking about spearheading a project of such magnitude in my free time was not at all appealing.

So, the holidays came and went. The new year as well. I’ve been extraordinarily busy at work and at home, extraordinarily lazy and tired.

But I do have one small announcement. Off and on over the past year, I’ve been toying with ActionScript and Flash and Flex because a long, long time ago, I thought I would pursue Flash as a game platform.

After some consideration, I will absolutely *NOT* be doing that. While ActionScript, as a language, isn’t terrible, I don’t particularly like it. It’s similar to JavaScript, but with more crap that you have to do. The real problem I have is with Flash as a platform. In short, it’s probably one of the worst I’ve ever seen. It’s slow, bloated, and full of known security holes. I could make a whole blog dedicated to how much I hate Flash and the problems with it.

The only positive thing I can say about Flash is that it’s everywhere. Though, with all of the security problems, it may not turn out to be a good thing.

Instead (I’m trying to stay positive), I plan on focusing on some of the emerging Web App technologies as a platform. I have a feeling that in 6 months, I’ll be back here griping about how ridiculous canvas, WebGL, etc. are.

I may try to write a simple game to test, or redo a certain game with flying corpses. So, until then.

Am I nuts?
2009-10-26 4:21 pm ∴ Programming,Rant,Thoughts ∴ by matt -

I’m starting to think I’m a masochist. Examine the evidence:

  • I’ve been in 2 serious relationships with crazy (and I do mean crazy) girls — both ended crappily.
  • I still hand wind my guitar strings.
  • I’ve worked with PHP since version 3 and I’ve had a job doing it for 5 years. I actually gave up a job doing embedded systems with C to keep doing PHP work.
  • I keep thinking I want to learn Erlang.
  • Lastly, and I think this is the big one, despite the fact that I’ve worked on no less than 3 failed IRC clients — I started working on an IRC bot project.

(more…)

More fun with PHP ORM libraries
2009-10-14 11:09 am ∴ Programming,Thoughts ∴ by matt -

A few months ago, I was contacted by the developer of Red Bean — a new PHP ORM library. He seems to share my dismay with the overall suckiness of every ORM library and he asked if I would give his project a looksee.

Before that, I have a few thoughts on Kohana and its ORM library.

(more…)

Kind of annoyed with WordPress
2009-09-10 7:50 pm ∴ News,Rant ∴ by matt -

So, Automatic upgrade for WP has been broken for some time. Now there’s a stupid coding error that’s responsible for a worm attack. This is what I get for following the trend and not researching some software that I start using.

A possible switch to a different blog software may be in the works if they don’t get their act together.

Things that I would do if I was in charge of the Python StdLib
2009-08-24 9:36 am ∴ Thoughts ∴ by matt -

I realize that 99% of what I think would break 99% of all applications, but my dictatorship style is more of a slash and burn kind of style. Out with the old, in with the new.

  • Replace the crappy dom, minidom and pulldom libs with lxml.
    XML support in Python has always been kind of crummy, but lxml really changes that. It’s fast, robust, and way better than other XML libs that have been around.
  • Remove Tkinter.
    It’s just taking up space.
  • Remove the stuff for reading and writing audio files.
    I don’t really understand why Python and other stdlibs (I’m looking at you Windows) include stuff for outputting and/or manipulating WAV files. Yet there’s no other format support. This is something that’s best handled by a 3rd party, IMO.
  • Replace telnetlib with an ssh library.
    I hope no one uses telnet anymore. I can’t remember the last time I had to telnet into something.
  • Add an encryption library to the “Cryptographic Services”.
    It’s just hashlib and hmac right now.
  • I would also love for FTP to die, so replacing ftplib with an sftp lib might help facilitate that.

:)

So easy, a caveman can use it.
2009-08-19 10:17 pm ∴ Programming ∴ by matt -

Someone pointed out to me today that if you look at the source of the front page on www.geico.com, you might see that they use a certain timer script for rotating banners…

In fact, it’s this one: http://www.geico.com/public/scripts/jquery/jquery.timer.js

Yeah! That’s me! And all this time I thought the script was pretty useless, but I guess a certain Gecko found a good place for it.

Honestly, I may have used that script once or twice, but that’s it. I don’t use timers a lot. But, if there’s enough interest, I’m open to adding bug fixes and features and such. I’ll also accept patches, etc.

Thanks to Alex for pointing that out. Pretty neat! ;)

PHP date stuff
2009-08-04 12:38 am ∴ Programming ∴ by matt -

I’ve talked before about how I thought PHP’s date handling is the best thing about PHP.

Well, I kinda lied. It sucks…

It works fine if you’re dealing with dates in the here and now, but say you’re accepting user input for a date-of-birth field. strtotime() returns a signed integer timestamp, and if you read the documentation carefully, you’ll see that strtotime() will fail on early PHP5 revisions on Windows for dates before 1/1/1970.

So that’s problem 1. Problem 2 is going past 1/19/2038. You would think that it wouldn’t be a problem on 64-bit systems, but I have no idea if that’s the case or not; I can’t seem to find any documentation on 64-bit versions of PHP using 64-bit timestamps.
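For the curious, here is where those two dates come from. This is just the arithmetic for a signed 32-bit count of seconds since the epoch, sketched in Python rather than PHP; it says nothing about what any particular PHP build actually does. The names EPOCH, INT32_MAX, INT32_MIN, latest, earliest and birthday are mine.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest signed 32-bit value
INT32_MIN = -(2**31)   # smallest signed 32-bit value

# The last and first moments a signed 32-bit timestamp can represent.
latest = EPOCH + timedelta(seconds=INT32_MAX)
earliest = EPOCH + timedelta(seconds=INT32_MIN)

# A date of birth before 1970 needs a negative timestamp.
birthday = datetime(1955, 6, 1, tzinfo=timezone.utc)
birthday_ts = int((birthday - EPOCH).total_seconds())

if __name__ == "__main__":
    print(latest.isoformat())
    print(earliest.isoformat())
    print(birthday_ts)
```

So a platform that rejects negative timestamps loses everything before 1970, and one that accepts them but stores 32 bits still bottoms out in December 1901 and tops out in January 2038.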

Luckily, there’s a new DateTime class for PHP 5.2 and up. The constructor of the DateTime class accepts anything that strtotime() can, and DateTime::format() accepts a date()-style format. So that’s easy enough. Hopefully you can deduce from my lack of griping that pre-epoch and post-2038 dates will work. I didn’t test to see how far back or forward it can go. Far enough for my uses.

Why the new class isn’t listed on the date() and strtotime() pages — I have no idea. Hopefully now I won’t have another Epoch Fail.
