So I managed to stop v4 of my web crawler from opening up a billion connections in parallel. Turns out that gevent has a Pool object and that was exactly what I needed.
Now my little script (137 lines, including a utility object and comments) will not be a sysadmin’s nightmare.
However, I now have a new problem. I described how the older versions work in my previous post, but this version is quite a bit different. Instead of using a queue or stack data structure to figure out where to go next, each greenlet scrapes all the links from one page, filters out the ones that have already been visited, and returns the rest. The main thread waits for all the greenlets to finish, then merges the returned lists and removes any duplicate links. It then spawns a greenlet for each remaining link and waits for those to finish too. When the greenlets return no links at all, the main thread is done and the script terminates.
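Here is a rough sketch of that round-by-round loop, not the actual script. To keep it runnable without any network access or third-party installs, it swaps gevent's Pool for the standard library's ThreadPoolExecutor and reads links from a made-up in-memory site; FAKE_SITE, scrape, and crawl are all names invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for real pages: each URL maps to the links found on it.
FAKE_SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def scrape(url, visited):
    """Return the links on `url` that have not been visited yet."""
    return [link for link in FAKE_SITE.get(url, []) if link not in visited]


def crawl(start, max_workers=4):
    """Crawl in rounds; return the visited set and the number of rounds."""
    visited = {start}
    frontier = [start]
    rounds = 0
    while frontier:
        rounds += 1
        # One worker task per link, capped at max_workers running at once.
        # Leaving the `with` block waits for every task to finish.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(lambda u: scrape(u, visited), frontier))
        # Accumulate the returned lists and drop duplicates.
        next_links = set()
        for links in results:
            next_links.update(links)
        next_links -= visited
        visited |= next_links
        frontier = sorted(next_links)
    return visited, rounds


if __name__ == "__main__":
    print(crawl("/"))
```

The visited set is only modified between rounds, after every worker has finished, so the workers never see it change while they are reading it.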
The problem is, if there’s a 404 or some kind of error retrieving the page, I have no way of knowing what page that link was found on.
The only solution that I see is to use a custom data structure and hope that it doesn’t kill performance.
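One possible shape for that data structure, as a sketch under my own assumptions rather than a finished design: a plain dict that maps each link to the page it was first found on. The dict can double as the "already seen" set, so the extra cost is one stored value per link and lookups stay constant time on average. The sketch is sequential and uses a made-up in-memory site (SITE, where a missing key plays the role of a 404); crawl_with_referrers is a hypothetical name.

```python
# Hypothetical site: a URL that is absent from the dict simulates a 404.
SITE = {
    "/": ["/a", "/missing"],
    "/a": ["/", "/gone"],
}


def crawl_with_referrers(start, site):
    """Return (broken_url, page_it_was_found_on) pairs, in discovery order."""
    found_on = {start: None}  # link -> page where it was first seen
    frontier = [start]
    broken = []
    while frontier:
        next_frontier = []
        for url in frontier:
            links = site.get(url)
            if links is None:
                # Fetch failed: report the page that linked here.
                broken.append((url, found_on[url]))
                continue
            for link in links:
                if link not in found_on:  # also acts as the visited check
                    found_on[link] = url
                    next_frontier.append(link)
        frontier = next_frontier
    return broken


if __name__ == "__main__":
    for url, referrer in crawl_with_referrers("/", SITE):
        print(f"{url} is broken (linked from {referrer})")
```

In a greenlet version, each worker could return (link, source page) pairs instead of bare links, and the main thread would fill in the dict while merging the results.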