Last year, I published (with a group of 3 others) and presented a paper called “TwitterReporter.” It described a method for identifying breaking news, in near real time, by processing the Twitter Streaming API.
One of the most interesting sub-problems was dealing with the unique nature of social media “noise”. Especially on Twitter, traditional natural language processing algorithms struggle: limited content length, service-specific syntax, heavy shorthand use, and…teenagers.
We came up with a method to reduce the “noise” as much as possible. Although additional steps and tweaks are certainly possible, the method worked fairly well. Here are the basics:
- Skip all posts from accounts designated as non-English. Unfortunately (and obviously), this is no guarantee on the language used.
- Skip all posts containing non-ASCII characters. This removes much of the remaining non-English content, and it has other welcome side effects: posts with mathematical symbols, icons, smiley faces, and other Unicode characters are removed as well. In general, these posts are of little use for mining due to a lack of dictionary content.
- Remove “social media syntax” from all remaining posts. Using Twitter as an example, this removes “RT @[username]”, “@[username]”, etc.
- Replace any XHTML-encoded characters (ex: “&amp;” becomes “and”). These occur once in a while, typically due to careless use of APIs.
- Finally, remove any non-alphanumeric characters and extra whitespace.
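The steps above can be sketched as a single cleaning function. This is an illustrative sketch, not the paper’s actual code; the function name and the exact regular expressions are my own assumptions about how you might implement each step.

```python
import html
import re


def clean_post(text):
    """Clean one social media post following the steps above (a sketch;
    the regexes here are illustrative, not the paper's exact rules)."""
    # Skip posts containing any non-ASCII characters.
    if not all(ord(c) < 128 for c in text):
        return None
    # Decode XHTML-encoded entities, then spell out the ampersand.
    text = html.unescape(text)
    text = text.replace("&", "and")
    # Strip Twitter syntax: retweet markers and @-mentions.
    text = re.sub(r"\bRT\b", "", text)
    text = re.sub(r"@\w+", "", text)
    # Remove remaining non-alphanumeric characters (keeping spaces).
    text = re.sub(r"[^A-Za-z0-9 ]", " ", text)
    # Collapse extra whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_post("RT @bob: Breaking news! &amp; more")` yields `"Breaking news and more"`, while a post containing a π symbol is skipped entirely.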
After all is said and done, you’ll likely find that a decent portion of posts have very little content left. I’d recommend setting an empirically defined threshold; if the remaining content length doesn’t pass it, skip the post. We started with a “safe” threshold of around 4, but it’s arguable that “useful” content starts at much higher lengths.
See any useful steps that were missed? I’d sincerely appreciate feedback and ideas in the comments!