Two updates less than a year apart — what?
2011-03-03 2:15 pm ∴ Programming, Python ∴ by matt

So I’ve been toying with NLTK and generating text. I’ve written a plugin for the crappy-irc-bot project that uses bash.org as a source. I also wrote something similar that uses one of the corpora built into NLTK to generate dummy text in-house, because I got sick of using “Lorem ipsum…”.

And yesterday, I adapted it to use some of the wild stuff that Charlie Sheen has been saying and feed it into a Twitter account, charliesheenbot.

Right now, it’s just using a trigram-based generator: each new word is picked based only on the two words before it, so the output usually doesn’t make a lot of sense. I tried using a grammar-based generator at one point, but it was even worse. Grammatically, the output was valid. But as far as looking like something a human being would say, no. Turns out that this is one of the more hilarious parts of the English language. Here are two of my favorite articles on grammatically correct but meaningless sentences:

http://en.wikipedia.org/wiki/Buffalo_buffalo_Buffalo_buffalo_buffalo_buffalo_Buffalo_buffalo

http://en.wikipedia.org/wiki/James_while_John_had_had_had_had_had_had_had_had_had_had_had_a_better_effect_on_the_teacher
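
For the curious, here is roughly what a trigram generator boils down to. This is a plain-Python sketch with no NLTK in it, not the bot's actual code; the names `build_trigram_model` and `generate` and the toy corpus are made up for illustration. It records which words follow each pair of words, then walks that table at random.

```python
import random
from collections import defaultdict


def build_trigram_model(words):
    """Map each pair of consecutive words to the words seen right after it."""
    model = defaultdict(list)
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        model[(w1, w2)].append(w3)
    return model


def generate(model, length=20, seed=None):
    """Random-walk the model: pick a starting pair, then keep sampling followers."""
    rng = random.Random(seed)
    w1, w2 = rng.choice(sorted(model))
    output = [w1, w2]
    while len(output) < length:
        followers = model.get((w1, w2))
        if not followers:
            # Dead end: this pair only appeared at the very end of the corpus.
            break
        w3 = rng.choice(followers)
        output.append(w3)
        w1, w2 = w2, w3
    return " ".join(output)


# Toy corpus (hypothetical); a real bot would feed in a much larger text.
corpus = (
    "the quick brown fox jumps over the lazy dog "
    "and the quick brown cat sleeps"
).split()

model = build_trigram_model(corpus)
print(generate(model, length=12, seed=1))
```

Because the generator only ever looks two words back, every three-word window in the output is something that appeared in the source, but nothing keeps the sentence as a whole on topic. That is why the results read as locally plausible and globally nonsensical.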

Seems like a good grammar (that is, a natural language grammar, for use with NLTK) would be able to generate meaningful sentences, but I have no idea how to do that with NLTK, or with anything.
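
To make the "valid but meaningless" problem concrete, here is a tiny grammar-based generator. Again, this is a plain-Python sketch under my own assumptions rather than anything from NLTK or the bot: the `GRAMMAR` table, its vocabulary, and the function names are all invented for the example. It expands a start symbol by picking a random production for each non-terminal until only words are left.

```python
import random

# A toy context-free grammar (hypothetical). Keys are non-terminals; each value
# lists the possible right-hand sides. Anything that is not a key is a word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Det", "Adj", "N"]],
    "VP": [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "Adj": [["colorless"], ["furious"]],
    "N": [["idea"], ["buffalo"], ["teacher"]],
    "V": [["sleeps"], ["bullies"], ["eats"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of words."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: an actual word
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words


def random_sentence(seed=None):
    """Generate one sentence starting from the S symbol."""
    rng = random.Random(seed)
    return " ".join(expand("S", rng))


for i in range(3):
    print(random_sentence(seed=i))
```

Every sentence this produces follows the grammar, yet the grammar knows nothing about which nouns can sensibly do which verbs. Getting meaning out of it would need extra constraints on top of the syntax, which is exactly the part that is hard.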

Things like MegaHAL or Cleverbot go further and learn from what users type at them. As far as I understand it, MegaHAL does this with higher-order Markov models rather than a neural net, and Cleverbot draws on its huge log of past conversations. That’s like research project stuff. I don’t really have the drive to get something like that done. The only reason Charlie Sheenbot exists is because NLTK has all of the stuff needed for it already done.

I’m open to suggestions on improvement though. In the meantime, enjoy, I guess.