I managed to find a solution for the problem I had yesterday, though I don’t particularly know if it’s ideal.
Originally I had thought that I would need to store the entire hierarchy of the site in a tree like structure. I figured I could just store a list of the links on a page in a dict structure and then output all of the errors when the crawl was finished. I don’t know why I was hung up on the idea that errors had to be reported as they were come across.
I was worried that memory use would be a factor, but it seems to be ok.
But there’s another issue:
#taken from lxml.html.__init__
def make_links_absolute(self, base, root):
"""This function exists because urljoin behaves obnoxiously.
For example, if I'm on the page:
http://www.example.com/some/directory/index.html, or just:
http://www.example.com/some/directory/
And I join the relative URL: ../../abc.html
I end up with: http://www.example.com/abc.html
*But*
If I'm on: http://www.example.com/some/directory [no trailing slash]
I end up with: http://www.example.com/../abc.html
"""
My fix for it was stripping out one “../”. Yesterday I thought that it would be a good fix. Today, I can’t figure out why I thought it would fix all cases.