The Site Crawler Chronicles – Part 3
2010-03-19 10:33 am ∴ Programming, Thoughts ∴ by matt

I managed to find a solution for the problem I had yesterday, though I don’t particularly know if it’s ideal.

Originally I had thought that I would need to store the entire hierarchy of the site in a tree-like structure. Then I figured I could just store the list of links found on each page in a dict and output all of the errors when the crawl was finished. I don’t know why I was hung up on the idea that errors had to be reported as they were encountered.

I was worried that memory use would be a factor, but it seems to be ok.
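A minimal sketch of the flat structure described above, assuming a dict that maps each page URL to the links found on it and a set of URLs that failed during the crawl. The names `links_by_page`, `broken_urls` and `collect_errors` are made up for illustration; they are not the crawler's actual code.

```python
from collections import defaultdict

# Hypothetical crawl data: each page URL maps to the links found on it.
links_by_page = defaultdict(list)
links_by_page["http://www.example.com/"].append("http://www.example.com/about.html")
links_by_page["http://www.example.com/"].append("http://www.example.com/missing.html")
links_by_page["http://www.example.com/about.html"].append("http://www.example.com/missing.html")

# Hypothetical set of URLs that failed when fetched during the crawl.
broken_urls = {"http://www.example.com/missing.html"}


def collect_errors(links_by_page, broken_urls):
    """Return (page, link) pairs for every link that points at a broken URL."""
    errors = []
    for page in sorted(links_by_page):
        for link in links_by_page[page]:
            if link in broken_urls:
                errors.append((page, link))
    return errors


# Report everything in one pass once the crawl is finished.
for page, link in collect_errors(links_by_page, broken_urls):
    print("%s links to broken URL %s" % (page, link))
```

Nothing here needs to know where a page sits in the site hierarchy, which is why a tree is not required for this kind of report.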

But there’s another issue:

    #taken from lxml.html.__init__
    def make_links_absolute(self, base, root):
        """This function exists because urljoin behaves obnoxiously.
        For example, if I'm on the page:
            http://www.example.com/some/directory/index.html, or just:
            http://www.example.com/some/directory/

        And I join the relative URL: ../../abc.html
        I end up with: http://www.example.com/abc.html

        *But*
        If I'm on: http://www.example.com/some/directory  [no trailing slash]
        I end up with: http://www.example.com/../abc.html
        """

My fix was to strip one “../” out of the result. Yesterday I thought that it would be a good fix. Today, I can’t figure out why I thought it would fix all cases: a link that climbs two or more levels past the root would still leave a “../” behind.
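One way to cover every case, rather than stripping a single “../”, might be to collapse the dot segments of the joined path and simply refuse to climb above the root. The sketch below assumes Python 3 (`urllib.parse`; in Python 2 the same functions live in the `urlparse` module). The helper names `collapse_dot_segments` and `join_url` are hypothetical and not part of lxml or the standard library. It also drops empty path segments (double slashes), which is a simplification that may not suit every site.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def collapse_dot_segments(path):
    """Resolve '.' and '..' in a URL path without climbing above the root.

    Empty segments (from '//') are dropped as a simplification.
    """
    parts = path.split("/")
    kept = []
    for segment in parts:
        if segment == "..":
            # Go up one level if possible; at the root, ignore the '..'.
            if kept:
                kept.pop()
        elif segment not in ("", "."):
            kept.append(segment)
    result = "/" + "/".join(kept)
    # Keep a trailing slash when the path pointed at a directory.
    if kept and parts[-1] in ("", ".", ".."):
        result += "/"
    return result


def join_url(base, relative):
    """Join a relative URL to a base, then clean up any leftover '..'."""
    scheme, netloc, path, query, fragment = urlsplit(urljoin(base, relative))
    return urlunsplit((scheme, netloc, collapse_dot_segments(path), query, fragment))


# Both spellings of the base page from the docstring give the same answer.
print(join_url("http://www.example.com/some/directory/", "../../abc.html"))
print(join_url("http://www.example.com/some/directory", "../../abc.html"))
```

Because the cleanup runs on the joined result, it does not matter how many excess “../” segments the link contains or whether `urljoin` already resolved some of them.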
