jQuery Delayed Event
2011-07-12 1:45 pm ∴ News,Programming,Releases ∴ by matt

Earlier today, I went looking for a plugin similar to hoverIntent that worked for other events. Turns out there wasn’t one (or at least I couldn’t find one), so I wrote one. Here it is:

Delayed Event Plugin page

Zipped source and example

Enjoy. I hope it’s useful for other people.

Oh, I should mention… As with the Timer plugin, I have no interest in maintenance or support. So, unless there’s a major security problem or a giant memory leak, I don’t really care. I didn’t put a lot of thought into designing an API or anything. You should feel free to modify the code as you see fit.

Two updates less than a year apart — what?
2011-03-03 2:15 pm ∴ Programming,Python ∴ by matt

So I’ve been toying with NLTK and generating text. I’ve written a plugin for the crappy-irc-bot project which uses bash.org as a source. I wrote something similar, which uses one of the built-in NLTK sources to generate dummy text in house. I got sick of using “Lorem ipsum…”

And yesterday, I adapted it to use some of the wild stuff that Charlie Sheen has been saying and feed it into a Twitter account, charliesheenbot.

Right now, it’s just using a trigram-based generator, so it usually doesn’t make a lot of sense. I tried using a grammar-based generator at one point, but it was even worse. Grammatically, it was valid. But as far as looking like something a human being would say, no. Turns out that this is one of the more hilarious parts of the English language. Here are two of my favorite articles on grammatically correct but meaningless sentences:

http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo

http://en.wikipedia.org/wiki/James_while_John_had_had_had_had_had_had_had_had_had_had_had_a_better_effect_on_the_teacher
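As a point of reference for the trigram approach mentioned above, here is a minimal sketch in plain Python. It is not the bot's actual code (which relies on NLTK); the function names and the tiny sample corpus are made up for illustration.

```python
import random
from collections import defaultdict


def build_trigram_model(words):
    """Map each pair of consecutive words to the words seen right after it."""
    model = defaultdict(list)
    for first, second, third in zip(words, words[1:], words[2:]):
        model[(first, second)].append(third)
    return model


def generate(model, seed_pair, length=12, rng=None):
    """Walk the model from seed_pair, picking a random follower each step."""
    rng = rng or random.Random()
    output = list(seed_pair)
    while len(output) < length:
        followers = model.get((output[-2], output[-1]))
        if not followers:  # dead end: this pair was never followed by anything
            break
        output.append(rng.choice(followers))
    return " ".join(output)


# Hypothetical stand-in corpus; the real bot was fed quotes instead.
corpus = "the cat sat on the mat and the cat ate the fish on the mat".split()
model = build_trigram_model(corpus)
print(generate(model, ("the", "cat")))
```

Because each word depends only on the two before it, the output is locally plausible but wanders off topic quickly, which matches the "doesn't make a lot of sense" experience.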

Seems like a good grammar (that is, a natural language grammar, for use with NLTK) would be able to generate meaningful sentences, but I have no idea how to do that with NLTK, or with anything.
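To make the "grammatically valid but meaningless" point concrete, here is a toy grammar-based generator in plain Python. The grammar, its symbols, and the vocabulary are all invented for this sketch, and it does not use NLTK's grammar classes.

```python
import random

# Hypothetical toy grammar: each symbol maps to a list of possible expansions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Det", "Adj", "N"]],
    "VP": [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "Adj": [["green"], ["angry"]],
    "N": [["buffalo"], ["teacher"], ["idea"]],
    "V": [["eats"], ["teaches"], ["paints"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol; anything not in GRAMMAR is a literal word."""
    if symbol not in GRAMMAR:
        return [symbol]
    words = []
    for part in rng.choice(GRAMMAR[symbol]):
        words.extend(expand(part, rng))
    return words


rng = random.Random()
print(" ".join(expand("S", rng)))
```

Every sentence this produces fits the grammar, but nothing in the rules stops "a green idea paints the teacher". The grammar constrains structure, not meaning.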

Things like MegaHAL or Cleverbot, if I understand correctly, learn sentence structure from user input (MegaHAL, at least, uses a Markov model rather than a neural net). That’s like research project stuff. I don’t really have the drive to get something like that done. The only reason Charlie Sheenbot exists is because NLTK has all of the stuff needed for it already done.

I’m open to suggestions on improvement though. In the meantime, enjoy, I guess.

New something
2011-03-01 2:16 pm ∴ Programming,Web Apps ∴ by matt

The other day, I was inspired to make a quick and stupid HTML5 app. It’s a soundboard that plays clips from this video: http://www.youtube.com/watch?v=TugslL45aXk

Angry Ibex Soundboard

I wanted to test out the HTML Audio element. It works best in Firefox, Opera, or Chrome. Oddly, Firefox doesn’t support MP3 format, I assume due to the posturing over codec licensing and such. All seem to support WAV format, although I don’t know why FLAC support is non-existent.

Anyway, enjoy I guess.

Loose Ends
2010-10-11 3:20 pm ∴ Programming,Python ∴ by matt

I have 2 draft posts written, one of which sings the praises of SQLAlchemy, the other describes some of the pitfalls I’ve run into with Pylons, but I can’t really think of a way to finish them. So I’ll just summarize them here:

Pylons

1. Documentation still not very good. I have 8 separate sites that I use for reference.
2. Avoid @validate. It is only useful if you don’t want custom validation error messages from FormEncode.
3. If you want orphan control in your models, know that you can define the cascade rules in your ForeignKey fields. Defining them in the call to relationship() will not handle ‘delete-orphan’ unless you use forward referencing.
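As a rough illustration of what a cascade rule on the foreign key buys you, here is the database-level behaviour using the standard-library sqlite3 module. It deliberately skips SQLAlchemy, and the table names are made up; in SQLAlchemy the same rule is declared with the `ondelete` argument of `ForeignKey`. Note that this only covers deleting the parent row, not the ORM-level ‘delete-orphan’ case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite ignores foreign keys unless this is on
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER NOT NULL REFERENCES parent(id) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.executemany("INSERT INTO child (parent_id) VALUES (?)", [(1,), (1,)])

# Deleting the parent row makes the database remove its children too.
conn.execute("DELETE FROM parent WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
print(remaining)
```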

SQLAlchemy

1. Setting up relationships is so simple, it’s almost criminal. Even many-to-many.
2. No really, it almost seems like witchcraft.
3. Eager loading of relationships is probably the coolest thing ever.

When I wrote raw queries in apps, I hated using joins because it blew up my result table — horizontally and vertically. When I started using ORMs I couldn’t use joins anymore, but I sometimes hated the fact that I knew a million queries were getting sent. SQLAlchemy addresses this through spells and incantations. I know some kind of post-processing is going on because not only can you load an entire hierarchy of objects, but all in one query… with joins. SQLAlchemy maintains the hierarchy and no additional queries will be run at reference time. Witches, I tell ya.
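A rough idea of the kind of post-processing involved, sketched with the standard-library sqlite3 module rather than SQLAlchemy itself: run one joined query, then fold the flat rows back into a hierarchy. The schema and names are hypothetical, and this is not a description of how SQLAlchemy is actually implemented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'matt'), (2, 'guest');
    INSERT INTO post VALUES (1, 1, 'Loose Ends'), (2, 1, 'New something');
""")

# One query with a join: the author columns repeat on every post row.
rows = conn.execute(
    "SELECT a.id, a.name, p.title FROM author a"
    " LEFT JOIN post p ON p.author_id = a.id ORDER BY a.id, p.id"
).fetchall()

# Fold the flat rows back into one entry per author with a list of posts.
authors = {}
for author_id, name, title in rows:
    entry = authors.setdefault(author_id, {"name": name, "posts": []})
    if title is not None:  # LEFT JOIN yields NULL for authors without posts
        entry["posts"].append(title)

for entry in authors.values():
    print(entry["name"], entry["posts"])
```

The joined result is wide and repetitive, exactly the "blown up" table from raw queries, but the grouping step hides that and leaves a plain parent-with-children structure.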

Quick thought on SQLAlchemy
2010-09-15 4:36 pm ∴ Programming,Thoughts ∴ by matt

Just a quick thought on SQLAlchemy today.

Everyone that reads this blog surely knows that I think it is the champion of database libraries despite the fact that time and again they have ditched API backward-compatibility between minor releases (Yes open-source world, 0.x.y is still a minor release). It infuriates me to no end, because I am usually affected and I have to nearly rewrite my application… or blow it off and stay with an older version until the world itself stops turning.

Anyway, I’ve used SA for many types of applications, but using it in web apps always bothers me. Why? Use of the word session.

Sessions, in web applications, almost universally refer to the user’s “session”: their persistent data, stored on the webserver. Sessions, in SA-land, are roughly the equivalent of a database connection, but they’re not exactly the same. A Session handles every aspect of communicating with the database server, in a very smart and efficient manner, I might add.

When developing web applications using SA, it can get confusing. Pylons can initiate a project with some code to start your database Session if you choose to do so. And if you plan on using SA in your Pylons application, why wouldn’t you do this? By default, the variable for the database session is Session. Note the capital letter. session is different — that’s where your persistent user-data goes.

It’s simple enough with Pylons to end the confusion and change the variable. I like using dbs (database session). But, now the Pylons documentation and user snippets will be different from your app — which kinda adds some of that confusion back. To make things worse, Pylons’ docs usually refer to the 3rd party module docs — and in SA’s case, all of the examples in the documentation use the variable session.

Obviously this isn’t a major problem. But imagine it’s the first thing in the morning, you were up late and you haven’t gotten your caffeine fix. This could be catastrophic.

Replacing the jQuery Plugin site
2010-07-09 1:26 pm ∴ Programming,Thoughts ∴ by matt

I touched on this a while back, but I never followed through with it. Having some down time at work, I’ve decided to jump in. I want to replace http://plugins.jquery.com.

There are numerous problems with it, and one of the reasons I got overwhelmed by this project originally was that I wanted to fix the site, rather than replace it. So now, I’ve wised up and decided to start from scratch without considering any aspect of how the site currently works.

Here are the problem areas, as I see them, in no particular order.

Browsing through plugins is ridiculously terrible

First, when you get to the page, you get a bunch of categories. Compare this with http://pypi.python.org. PyPI gives a tabular listing of 40 recently updated packages. For the Latest Releases, PJC (plugins.jquery.com) gives a full body of content for each item and goes on for a zillion pages.

Second, from the start page on PJC (with the category listing), the “Browse by Name” tab doesn’t work. The “Browse by Date” tab does work, but what date? The date the plugin was created, or the date of the last release? It turns out this is the same as the “Latest Releases” page, except that the tab navigation at the top doesn’t disappear. The “All Plugins” link is the same as the “Browse by Name” tab and also doesn’t work.

Lastly, browsing plugins in a category gives a different layout from browsing by date. Why? It’s the same information, just sorted differently and filtered.

Searching is basically useless

Do you know why I’m surprised that people have actually used my timer plugin? Because I can’t even find it myself. Searching for “timer” yields 10 pages of results, and includes plain pages and issue tracker items.

I understand the appeal of having the bug tracker and plugin page tied together, but it’s terrible. A plugin like mine is so small that it doesn’t need a bug tracker. Not to mention that use of a bug tracker is annoying without the use of source control. The plugin author should bear the responsibility of setting up bug tracking, source control, etc. There are plenty of free sites to do that.

The search is easily bombed by adding keywords and tags (which are not moderated). So when I search for timer, the sixth result I get is for dualSlider — perfect for managing timeouts and intervals.

The Rating System

There’s no point to this. The “Top Rated” plugins all have 1-3 votes. Plugins with more votes should have more clout. But it doesn’t really matter anyway. It’s not a popularity contest.
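One common way to give plugins with more votes more clout is a Bayesian-style weighted average, where every plugin starts out with a few phantom votes at the site-wide mean. The sketch below is only an illustration; the prior weight of 5 and the sample numbers are arbitrary assumptions.

```python
def weighted_rating(avg, votes, site_mean, prior_votes=5):
    """Pull a plugin's average toward the site-wide mean when votes are few."""
    return (votes * avg + prior_votes * site_mean) / (votes + prior_votes)


# Hypothetical numbers: (average rating, number of votes), site mean of 3.0
site_mean = 3.0
print(weighted_rating(5.0, 2, site_mean))    # two perfect votes barely move it
print(weighted_rating(4.5, 200, site_mean))  # many votes dominate the prior
```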

This particular part of PJC will have no part whatsoever in my new project. If there will be any spotlighting of plugins, it will be done by moderators.

Other Data Formats

Right now, there are no RSS feeds for plugins at all. Each plugin should have its own release feed, as well as a feed for all latest releases.

Writing a plugin manager currently would involve screen scraping the existing plugin page to see if there have been any changes. Of course, you have to know the URL of the plugin because searching basically gets nowhere, and if by some chance you were able to search, you’d have to scrape the search page as well.

That’s why I want to have everything available as JSON. Plugin details, list of plugins by category, search results… the new site has to be highly query-able. PyPI uses XML-RPC to expose their API. JSONRPC might be an option for this, or XML-RPC, but I’ll cross that bridge when I come to it.
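As a sketch of what "everything available as JSON" could look like, here is a made-up plugin record and a category query in Python. None of this is a real API: the plugin names, field names, version numbers, and the `plugins_in_category` helper are all assumptions for illustration.

```python
import json

# Hypothetical plugin records; every name and field here is an assumption.
PLUGINS = [
    {"name": "example-timer", "version": "1.0", "categories": ["Utilities"]},
    {"name": "example-slider", "version": "0.3", "categories": ["Widgets"]},
]


def plugins_in_category(plugins, category):
    """Return the JSON text for all plugins filed under one category."""
    matches = [p for p in plugins if category in p["categories"]]
    return json.dumps({"category": category, "plugins": matches}, indent=2)


print(plugins_in_category(PLUGINS, "Utilities"))
```

A plugin manager could then poll a response like this and compare version strings instead of scraping HTML.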

Categories

The Categories on PJC are terrible. Not in the way that they aren’t descriptive, but they just suck. They should be hierarchical. For example, “Widgets” and “Windows and Overlays” could fall under “User Interface.” Menus could as well.

I’m not sure how Navigation and Menus are different.

DOM should probably be a child of Utilities.

I don’t know what AJAX means for a category. If the plugin is an AJAX request helper, it should go under “Utilities” or “jQuery Extension.” If it’s something like an auto-complete widget, well it should go under Widgets.

The point is that categories aren’t very helpful in their current state. I put my Timer plugin under jQuery Extensions, Javascript, and Utilities, leading me to believe that they could all be the same category. I don’t know why Javascript is a category actually, since jQuery encapsulates, rather than extends.
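One simple way to model hierarchical categories is a child-to-parent mapping. The arrangement below just follows the suggestions in this section, and the helper name is invented for the sketch.

```python
# Child -> parent; top-level categories map to None.
PARENT = {
    "User Interface": None,
    "Utilities": None,
    "Widgets": "User Interface",
    "Windows and Overlays": "User Interface",
    "Menus": "User Interface",
    "DOM": "Utilities",
}


def category_path(name):
    """Return the chain of categories from the top level down to name."""
    path = []
    while name is not None:
        path.append(name)
        name = PARENT[name]
    return list(reversed(path))


print(" > ".join(category_path("Widgets")))
```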

The New Site

I’ve already started. http://code.google.com/p/jqpi (the app page will be http://jquerypi.appspot.com)

Basically, I want to create PyPI for jQuery plugins. I figured using Google App Engine would be nice. Also, knowing my penchant for dragging out projects, I’m coding it for HTML5, since it will probably be widely supported by the time I’m finished.

There are a few things that I don’t know how to do with GAE though. Hierarchical categories, searching, optimization, JSONRPC or XML-RPC. I’ll figure it out eventually, but help is always appreciated. Create an issue, create a wiki page, send patches, join the project, anything. We shouldn’t have to suffer the damned plugins.jquery.com any more.

Windows output redirection bug
2010-05-21 12:48 pm ∴ News,Programming ∴ by matt

While I was working on the SiteCrawler script, I had a problem getting it to redirect output to a file. In fact it’s one of the reasons I put it on the back burner. I thought it was a Python (on Windows) problem, but it turns out it’s just a Windows® Issue™:

http://support.microsoft.com/kb/321788

Even though the article is about WinXP and 2k, I thought I’d try the registry fix. Sure enough — it works! So I will probably start adding stuff to the site crawler again and may actually release it!

The Site Crawler Chronicles – Part 3
2010-03-19 10:33 am ∴ Programming,Thoughts ∴ by matt

I managed to find a solution for the problem I had yesterday, though I don’t particularly know if it’s ideal.

Originally I had thought that I would need to store the entire hierarchy of the site in a tree-like structure. Then I figured I could just store a list of the links on each page in a dict and output all of the errors when the crawl was finished. I don’t know why I was hung up on the idea that errors had to be reported as they were encountered.

I was worried that memory use would be a factor, but it seems to be ok.
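The approach boils down to remembering, for every link, which pages it was found on, and reporting at the end. A minimal sketch, assuming a hypothetical `record_links`/`report_errors` pair and made-up URLs rather than the actual SiteCrawler code:

```python
from collections import defaultdict

# link URL -> set of pages on which that link was found
found_on = defaultdict(set)


def record_links(page_url, links):
    """Remember that each link in links appeared on page_url."""
    for link in links:
        found_on[link].add(page_url)


def report_errors(broken_urls):
    """After the crawl, list each broken URL with the pages linking to it."""
    return {url: sorted(found_on[url]) for url in broken_urls}


record_links("http://www.example.com/",
             ["http://www.example.com/a", "http://www.example.com/gone"])
record_links("http://www.example.com/a", ["http://www.example.com/gone"])
print(report_errors(["http://www.example.com/gone"]))
```

Memory grows with the number of distinct (link, page) pairs, which stays modest for sites with a few hundred pages.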

But there’s another issue:

    #taken from lxml.html.__init__
    def make_links_absolute(self, base, root):
        """This function exists because urljoin behaves obnoxiously.
        For example, if I'm on the page:
            http://www.example.com/some/directory/index.html, or just:
            http://www.example.com/some/directory/

        And I join the relative URL: ../../abc.html
        I end up with: http://www.example.com/abc.html

        *But*
        If I'm on: http://www.example.com/some/directory  [no trailing slash]
        I end up with: http://www.example.com/../abc.html
        """

My fix for it was stripping out one “../”. Yesterday I thought that it would be a good fix. Today, I can’t figure out why I thought it would fix all cases.
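A more general alternative to stripping a single “../” is to resolve the path after joining and refuse to climb above the site root. This is only a sketch with a made-up helper name, not the fix used in the script. Newer Python 3 versions of `urljoin` already drop the excess “..” on their own, in which case the cleanup loop simply has nothing left to do.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def join_clamped(base, relative):
    """Join like urljoin, then resolve '.' and '..' without leaving the root."""
    parts = urlsplit(urljoin(base, relative))
    cleaned = []
    for segment in parts.path.split("/"):
        if segment == "..":
            # Go up one level, but keep the leading "" that marks the root.
            if len(cleaned) > 1:
                cleaned.pop()
        elif segment != ".":
            cleaned.append(segment)
    path = "/".join(cleaned)
    if not path and parts.path.startswith("/"):
        path = "/"  # everything was cancelled out; stay at the root
    return urlunsplit(parts._replace(path=path))


for base in ("http://www.example.com/some/directory/",
             "http://www.example.com/some/directory"):
    print(join_clamped(base, "../../abc.html"))
```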

The Site Crawler Chronicles
2010-03-18 2:04 pm ∴ Programming,Thoughts ∴ by matt

So I managed to stop v4 of my web crawler from opening up a billion connections in parallel. Turns out that gevent has a Pool object and that was exactly what I needed.

Now my little script (137 lines, including a utility object and comments) will not be a sysadmin’s nightmare.
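The idea behind a pool is just a cap on how many fetches run at once. The sketch below shows that cap using the standard library's `ThreadPoolExecutor` as a stand-in; it is not gevent, and `fake_fetch`, the URLs, and the limit of 4 are made up for the example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

stats = {"active": 0, "peak": 0}  # fetches running now / most seen at once
lock = threading.Lock()


def fake_fetch(url):
    """Pretend to download url while tracking how many fetches overlap."""
    with lock:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
    time.sleep(0.01)  # stands in for network latency
    with lock:
        stats["active"] -= 1
    return url


urls = ["http://www.example.com/page%d" % i for i in range(20)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_fetch, urls))

print(len(results), "pages fetched, at most", stats["peak"], "at a time")
```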

However, I now have a new problem. I described how the older versions work in my previous post, but this version is quite a bit different. Instead of using a queue or stack data structure to figure out where to go next, this version has a greenlet scrape all links from a page, filter out the ones that have already been visited, then return the rest. The main thread accumulates the lists when all greenlets are finished. After the accumulation, once it has ensured that there are no duplicate links, the main thread spawns a greenlet for each link and waits until the greenlets finish again. When the greenlets return no links, the main thread is done, and the script terminates.
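The round-based crawl described above can be sketched without gevent by running each round sequentially. Here the "site" is a hypothetical in-memory dict and `scrape` stands in for what each greenlet does; none of the names come from the actual script.

```python
# Hypothetical in-memory "site": page -> links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}


def scrape(page, visited):
    """What one greenlet does: return the page's links not yet visited."""
    return [link for link in SITE.get(page, []) if link not in visited]


def crawl(start):
    visited = {start}
    frontier = [start]
    rounds = 0
    while frontier:
        # In the real script each scrape would run in its own greenlet.
        results = [scrape(page, visited) for page in frontier]
        # Accumulate the returned lists and drop duplicates, keeping order.
        frontier = list(dict.fromkeys(link for links in results for link in links))
        visited.update(frontier)
        rounds += 1
    return visited, rounds


pages, rounds = crawl("/")
print(sorted(pages), rounds)
```

Note that `scrape` returns bare links, so by the time a link is fetched in the next round nothing records which page it came from.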

The problem is, if there’s a 404 or some kind of error retrieving the page, I have no way of knowing what page that link was found on.

The only solution that I see is using a custom data structure and hope that it doesn’t kill performance.

New Old Ideas
2010-03-17 5:05 pm ∴ Programming,Thoughts ∴ by matt

A little while ago, I wrote a post in which I was trying to figure out a way to improve a Web Crawler script I had written — it was one of those that I never published.

Anyway, for some reason, I wrote it using Stackless Python, but I was pretty much a novice and didn’t make it as efficient as I could have. This was version 1, and it basically went to each page, scraped all the valid links (i.e., those that were on the same server and not a mailto: or something) and then went through them recursively.
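The "valid link" test can be sketched with the standard library. The function name and the exact rules are assumptions for illustration, not the original code: a link counts as valid here if it resolves to an http(s) URL on the same host as the page it was found on.

```python
from urllib.parse import urljoin, urlsplit


def is_valid_link(href, page_url):
    """True if href points to an http(s) page on the same server as page_url."""
    target = urlsplit(urljoin(page_url, href))
    if target.scheme not in ("http", "https"):
        return False  # skips mailto:, javascript:, ftp:, and so on
    return target.netloc == urlsplit(page_url).netloc


page = "http://www.example.com/dir/index.html"
for href in ("about.html", "mailto:someone@example.com", "http://other.example.org/"):
    print(href, is_valid_link(href, page))
```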

Version 2 was basically the same, just with cleaner code and no recursion. I decided to set up the library so I would extend the SiteCrawler class and get notified of what was going on through callbacks. While it wasn’t any faster than version 1, it did seem a bit more stable.

Version 3 I decided to change drastically and made it multithreaded. It is much, much, much faster. It works like this: there’s an input Queue and an already-checked Queue. There are 4 threads waiting for input on the input Queue. When a thread gets a page, it scrapes the links, checks whether any of them are in the checked Queue, puts what’s left on the input Queue, and puts the page it just checked on the checked Queue. It seems more complicated than it is. Also the code is more complicated than it needs to be.
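A stripped-down sketch of that worker arrangement, assuming a hypothetical in-memory site instead of real HTTP requests. One deliberate swap: the "already checked" collection is a set guarded by a lock rather than a second Queue, since membership tests on a set are simpler.

```python
import queue
import threading

# Hypothetical in-memory site: page -> links on that page.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}

to_check = queue.Queue()
checked = set()  # stands in for the "already checked" Queue
checked_lock = threading.Lock()


def worker():
    """Take pages off the input queue and queue up any unseen links."""
    while True:
        page = to_check.get()
        for link in SITE.get(page, []):
            with checked_lock:
                if link in checked:
                    continue
                checked.add(link)
            to_check.put(link)
        to_check.task_done()


checked.add("/")
to_check.put("/")
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()
to_check.join()  # returns once every queued page has been processed
print(sorted(checked))
```

Marking a link as checked before queueing it, under the lock, is what keeps two threads from queueing the same page twice.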

Anyway, version 3 works well for me when I need to test a site. It’s saved me so much time in going through and checking for broken URLs. There are a few clients with the number of pages on their site in the 300s.

But, I recently found out about gevent, and since I have some free time at work, I wanted to play with it a little bit. If you don’t know, gevent is a package that works with the greenlet package on top of libevent. I’m always interested in concurrent programming, and new technologies involved in it. This is why I had installed Stackless at one time.

So now there’s a version 4 of the SiteCrawler script, using — you guessed it — gevent. I haven’t ironed out all of the kinks yet. I was testing a non-pooled version of the script and it basically crawled through 200 links in a matter of seconds — hopefully none of the server admins look at the logs and see 100 simultaneous connections at 3:30 PM today. I did change how the crawl is done quite a bit too. So I’m going to stop there and probably have more tomorrow or in the next few days.

Good stuff!
