After a discussion with some of the other interhacktives, two things became readily apparent:
- Being able to search for words accompanying trends on Twitter would be really useful
- Currently, support for “sub-trend” searches is fairly lacking
As an example of the first point: imagine you’re at a conference full of Twitter users, and everyone’s using the hashtag #fakeEvent to talk to each other. This works perfectly fine if everyone’s always in the same room, but what if the conference were structured so that attendees had to choose from a variety of talks happening concurrently? Say folks in one talk started using the hashtag “#foo” (to use a long-standing programming convention) to talk about that individual session, while those attending another talk used the hashtag “#bar” (to use another), and these were added to the global #fakeEvent hashtag (example: “@soAndSo says Twitter’s a waste of time… #foo #fakeEvent”). If you’re in either talk and know what the hashtag is, or the global #fakeEvent hashtag isn’t moving very fast, there isn’t a problem. But what if you’re watching an event’s Twitter traffic from afar and don’t know what the additional hashtags are about, or the global hashtag is moving too quickly to find relevant information about a specific aspect of that event? Another example of where this might be true is the worldwide Occupy protests: finding the global hashtag isn’t very difficult, but drilling down into specific subject matter is significantly harder.
I thus began looking for ways to find the words that accompany a specific hashtag. While some tools have limited support for this out of the box (specifically Twazzup, which is pretty cool in itself), none offers the level of effectiveness of a paid, high-level marketing tool like the $150-per-month PeopleBrowsr. A StackExchange question I posted hasn’t yielded anything in the way of tools specifically designed for this purpose, either.
However, thanks to the ever-helpful Tony Hirst (who you should all most certainly follow at @psychemedia for all things related to R and data visualization), a Yahoo! Pipes tutorial on how to accomplish exactly this is now available, with the end result available here.
While Tony’s Pipe is a really good start towards a tool such as the one I’m envisioning, it falls short in a few places. It’s limited on some level by the Yahoo! Pipes API, which caps the number of requests at 200 (this can be expanded to 1500 if you clone and modify the Pipe, but even that might not be sufficient for a really popular hashtag), and it can’t seem to filter out ampersand characters. It would also be great if there were a full dictionary of common words to always ignore (“a”, “if”, “the”, “but”, et cetera), plus maybe a way to view the popularity of sub-trends over time (i.e., “which were the most active sub-trends at #fakeEvent?”). A way to ignore words preceded by “@” would also be useful, so as to filter out users who get a lot of responses in a specific hashtag and thus become a sub-trend unto themselves. Finally, a way to limit words to only those preceded by a hash character would probably also help, so as to get only the sub-hashtags.
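To make that wish list a bit more concrete, here is a minimal Python sketch of the sub-hashtag part, assuming the tweets have already been fetched as plain strings. The function name `sub_hashtags` and the sample tweets are made up for illustration; this is not how Tony’s Pipe works internally.

```python
import html
import re
from collections import Counter

# Matches a hash character followed by word characters, e.g. "#foo".
HASHTAG = re.compile(r"#(\w+)")

def sub_hashtags(tweets, main_tag):
    """Count hashtags that appear in the same tweet as main_tag."""
    main = main_tag.lstrip("#").lower()
    counts = Counter()
    for text in tweets:
        # Decode entities such as "&amp;" so stray ampersand debris
        # does not leak into the results.
        text = html.unescape(text)
        # Only hashtags are collected, so @mentions are ignored for free.
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if main in tags:
            counts.update(tags - {main})
    return counts

# Hypothetical sample data, in the spirit of the #fakeEvent example.
sample_tweets = [
    "@soAndSo says Twitter's a waste of time... #foo #fakeEvent",
    "Great demo in the other room #bar #fakeEvent",
    "Q&amp;A starting now #foo #fakeevent",
    "Unrelated tweet #foo",
]

top = sub_hashtags(sample_tweets, "#fakeEvent").most_common()
print(top)
```

Matching is case-insensitive here, on the assumption that #fakeEvent and #fakeevent are meant to be the same tag.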
That’s not to complain about the solution Tony set up — it’s really well thought-out and I’ve already begun using it — but rather, to point out an interesting project idea for any reasonably-talented RESTful programmer with an afternoon to kill.
Know of a good search tool for terms inside hashtags? Did I miss something? Let me know in the comments below!
If you want to filter out terms that match usernames, just add a condition to the blocking filter so that it blocks terms that match the regex @.*
A stopgap I used for ruling out the ‘if’ ‘and’ style words was to exclude words of 3 characters or fewer (and I think I also included an example of how to select just words of a particular length?).
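Outside of Pipes, those two filters might look something like this in Python. It is only a sketch: `keep_term` and the `min_length` parameter are names invented for the example, and the term list is made up.

```python
import re

# Same idea as the blocking regex above: anything starting with "@".
USERNAME = re.compile(r"@.*")

def keep_term(term, min_length=4):
    """Return True if a term survives both filters."""
    if USERNAME.match(term):
        return False  # looks like a username
    # Drop words of 3 characters or fewer ("if", "and", "the", ...).
    return len(term) >= min_length

terms = ["@psychemedia", "if", "and", "the", "keynote", "#foo", "data"]
kept = [term for term in terms if keep_term(term)]
print(kept)
```

Note that the length check counts the hash character, so a very short hashtag such as “#ab” would be dropped too.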
If you want to actually run the word list through a proper stop list, you’ll probably have to descend into code… a code snippet at http://programmingzen.com/2008/03/18/use-python-to-detect-the-most-frequent-words-in-a-file/ gives a clear example in Python about how to do this. Googling /stopword list/ should turn up some candidate word lists.
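As a rough illustration of the idea (not the code from the linked post), a stop-list word count can be done with the standard library alone. The tiny `STOP_WORDS` set below is a placeholder assumption; a real list would be much longer.

```python
import re
from collections import Counter

# Placeholder stop list; swap in a fuller one from a stopword list.
STOP_WORDS = {"a", "an", "and", "but", "if", "is", "of", "the", "to"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count the words in text, skipping anything in the stop list."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(word for word in words if word not in stop_words)

freqs = word_frequencies(
    "The talk is good, but the demo is the best part of the talk."
)
print(freqs.most_common(3))
```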
If you’re feeling a little more adventurous, you could start exploring the world of text mining. Here’s a starter for ten for you: tf-idf
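For anyone who wants to see what that looks like in practice, here is a small sketch of one common formulation of tf-idf (raw term frequency times the log of the inverse document frequency), treating each tweet as a “document”. Other weightings exist, and the function name and sample tokens are invented for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document (a list of tokens)."""
    n_docs = len(documents)
    # Number of documents each term appears in.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical tokenised tweets from the #fakeEvent example.
docs = [
    ["fakeevent", "foo", "keynote"],
    ["fakeevent", "bar", "demo"],
    ["fakeevent", "foo", "slides"],
]
scores = tf_idf(docs)
print(scores[0])
```

A term that turns up in every tweet, like the global hashtag itself, scores zero, while rarer terms score higher. That is roughly the behaviour you would want for surfacing sub-trends.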
There’s also (at least one) textmining package in R: tm – here’s the vignette: http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
I haven’t played with it yet, so if you post a demo, please tweet me a link;-) (I’ve previously posted an example of how to get started with Twitter searches in R at http://blog.ouseful.info/2011/11/09/getting-started-with-twitter-analysis-in-r/ – maybe that is actually another candidate for the sort of analysis you were after?)
PS for any Python programmers wanting to run with the recipe used to create the Yahoo pipe, the pipe2py compiler should be able to generate a Python version of the pipe… https://github.com/ggaughan/pipe2py
And if it doesn’t – post an issue.
Hi Tony!
Wow, I now have my weekend reading material sorted! Cool stuff, I think I’ll give R a go the next bunch of free time I get.
At the last CityJTech meeting (http://www.facebook.com/groups/270520716324491/), one idea we discussed for next semester was building a Twitter stats toolkit. I can totally see myself doing some of the more interesting text mining things you mention above if we go ahead with that… Thanks for all the resources and additional help!
Cheers,
-Æ.
You may find what Martin Hawksey/@mhawksey has been up to interesting then… http://mashe.hawksey.info/category/web-apps/twitter-web-apps/
This sounds like exactly the sort of thing we could look at in Entrepreneurial next term, Aendrew. Top work thus far, and I can only echo your praise of Tony’s work.