Discussions of automated sentiment analysis “accuracy” are starting to border on the bizarre. In the past couple of weeks, I’ve read claims that SAS’s new tool can identify sentiment “better than most humans”. Just a few days later, I read a post claiming that “sentiment analysis [is] best done by humans”.
At the heart of this ongoing debate (and confusion) surrounding automated sentiment analysis is the issue of “accuracy”: the degree to which software can correctly extract positive, negative, or neutral tone from text. Using “accuracy” as a criterion for useful sentiment analysis demonstrates a fundamental misunderstanding of what sentiment really is and what “accuracy” really means. Unfortunately, this misunderstanding has led media researchers and software programmers to search for “100% sentiment analysis accuracy”, and distracted our industry from what its real focus should be: understanding how the media influences human behavior.
Automated sentiment analysis will never be accurate. Not 1% accurate, 50% accurate, or 100% accurate. To say that an algorithm or statistical model has “accurately” identified a piece of text as positive, negative or neutral requires that sentiment is a real thing in the text that can be correctly identified, like a person’s name or a product. The problem is that positive and negative don’t really exist on paper or on a computer monitor. The scientists and philosophers who study sentiment all agree that it only exists as a property of the animal nervous system. “Positive” and “negative” are neurological states that evolved to help organisms avoid stuff that can harm them or to promote behavior that’s likely to nourish them and help them propagate. Sentiment is absolutely not something that exists “out there” in the world; it only exists in our perceptions of the world.
Because positivity and negativity don’t really exist in blog posts, Tweets, Facebook updates, or New York Times editorials, neither human analysts nor software will ever be able to “accurately” extract sentiment from them. What analysts and software can do instead is approximate or guess what a reader’s reaction to the text might be. Accurate identification could only be done by measuring actual readers’ emotional reactions to the text, which would be too costly and time-consuming to do.
You might be thinking that the distinction between sentiment existing in the external world (e.g., text) vs. the internal world (our brains) is purely academic. But it has serious implications for how marketers, communications professionals, media professionals and software programmers tackle the issue of measuring sentiment in large volumes of text.
One relatively minor implication is that when people talk about measuring “accuracy” in automated sentiment analysis, they’re really referring to “reliability”, or agreement between an analyst’s guess about a reader’s reaction to a text or post and the sentiment decision made by the automated tool. This is more of a pet peeve of mine than anything else (it’s one thing when marketers or software programmers make this mistake, but researchers should know better; a good guide to distinguishing between reliability and accuracy in measurement can be found here).
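To make the reliability idea concrete, here is a minimal sketch of how agreement between an analyst and a tool might be computed. Everything in it is an assumption for illustration: the label lists are made up, and the function names `percent_agreement` and `cohens_kappa` are hypothetical. Cohen’s kappa is not something discussed above; it is included only as one common way to correct raw agreement for matches that would happen by chance.

```python
from collections import Counter


def percent_agreement(analyst, tool):
    """Share of items on which the analyst and the tool chose the same label."""
    if len(analyst) != len(tool) or not analyst:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == t for a, t in zip(analyst, tool))
    return matches / len(analyst)


def cohens_kappa(analyst, tool):
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(analyst)
    observed = percent_agreement(analyst, tool)
    analyst_counts = Counter(analyst)
    tool_counts = Counter(tool)
    # Chance agreement: probability both raters pick the same label at random,
    # given how often each rater used each label.
    expected = sum(
        analyst_counts[label] * tool_counts[label] for label in analyst_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten posts (illustrative data, not real results).
analyst_labels = ["positive", "negative", "neutral", "positive", "neutral",
                  "negative", "positive", "neutral", "neutral", "positive"]
tool_labels = ["positive", "negative", "neutral", "neutral", "neutral",
               "negative", "positive", "positive", "neutral", "positive"]

print(f"Percent agreement: {percent_agreement(analyst_labels, tool_labels):.4f}")
print(f"Cohen's kappa:     {cohens_kappa(analyst_labels, tool_labels):.4f}")
```

In this made-up sample the two sets of labels match on 8 of 10 posts, yet the chance-corrected figure is noticeably lower (about 0.69). Either way, both numbers describe how well the tool agrees with one analyst’s guess, not how “accurate” it is about what readers actually feel.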
A much more serious implication is that PR and marketing pros need to stop focusing on “accuracy” (i.e., reliability) and start caring about how humans actually evaluate positivity and negativity in the external world. The latter will be a much better predictor of how the media influences behavior and, ultimately, will be most useful to companies who analyze large quantities of mainstream and social media coverage. Given that the process through which the human brain evaluates external things as positive or negative is only beginning to be understood by philosophers and scientists, I’m not optimistic that software programmers and artificial intelligence folks will be cracking that anytime soon.
Since automated sentiment analysis relies on set rules, nearly always tweaked by human analysts, I’m sure that reliability rates between a tool and a single analyst can reach 90% and beyond (at least within a single set of text on a specific topic). Still, when it comes to approximating what actual human readers are likely to think of an organization or a product in the media, there’s good reason to believe that human analysts have machines beat. Here are a few things that we do know about how humans evaluate things as good or bad. In each of these cases, human analysts will be better at simulating what an actual reader would do than an automated sentiment analysis tool:
1) Different people are going to have different emotional reactions to text. This point might seem obvious, but it is almost universally overlooked in media measurement conversations. Depending on who you are, you’re probably going to have a different reaction to the same text. The human brain is very good at this perspective switching. Starting at a fairly young age, people can simulate the experiences of other people and make good inferences about their emotions, behaviors, etc. If you’re at a baseball game, for example, and the hitter for the visiting team hits a game-winning home run, you can effortlessly recognize that that guy must feel pretty good even though it may have ruined your afternoon.
The practical implication for media researchers is that a single piece of text is likely going to have very different affective or emotional meaning depending on whose perspective you decide to take. A person could read the Tweet, “Legit…apply to this contest. Almost no one has applied, so chances are…you’ll win. www.dell.com/w3” and quickly infer that the writer has a positive attitude towards the contest but that the marketing folks at Dell will probably have a negative reaction. Similarly, the post, “Did you see the next generation iPhone? It was left on a counter by mistake. Hum” might be read negatively by Apple’s PR team, while iPhone owners will probably have a completely neutral reaction to it. A good analyst is able to quickly and effortlessly take on the perspective of the article or post author, a naïve reader, a company representative, potential customers, legislators, competing companies, or investors when reading an article or social media post. I have yet to see a piece of software that can approximate this.
2) Mood impacts evaluative judgments. Because goodness and badness aren’t properties of the external world that can be detected, humans have to rely on a variety of information to make sentiment-based judgments. One key piece of information that people tend to use is their own mood. A host of research has consistently shown that people tend to make mood-congruent judgments about objects in their world. If someone is in a good mood, they tend to rate a range of things, from their own life satisfaction to the taste of food, as being better than if they are in a bad mood. In one research study, Alice Isen and colleagues experimentally induced positive moods in some people, and then asked them to rate the service records of their household appliances (e.g., washers and dryers, coffee makers, etc.). They found that participants in good moods reported much greater satisfaction with the appliances than everyone else (you’re more likely to really like your coffee maker when you’re having a good day). The implication of this for media researchers is that other news, world events, and even bad weather will likely affect whether or not a reader interprets a blog post or news story as negative or positive.
3) Context matters. One of the most important things that media analysts can learn from existing knowledge on how humans evaluate sentiment is that context often determines whether or not people perceive otherwise ambiguous things as being either good or bad. In yet another interesting psychological experiment by James Russell and colleagues, participants were shown pictures of people displaying prototypical emotions, such as happiness, surprise, anger, etc., and were told a story about what the person in the picture had just experienced. The researchers found that the story played a huge role in what emotion the participant rated the face as showing. When told that a woman making a prototypically fearful face had just been made to wait for a table at a restaurant for over an hour despite having a reservation, participants tended to rate the face as showing anger rather than fear. Findings like this suggest that situational cues are yet another piece of information that people use to make evaluative judgments about the outside world.
The practical implication to be taken from this is that other stories, blog posts, and Tweets that have been recently read will impact whether or not someone perceives a piece of media as being positive or negative. Other stories in the same magazine, blog posts preceding the one being analyzed, nearby Tweets in a Twitter feed, etc., should all be considered when determining whether or not readers are going to have a positive or negative reaction to a specific piece of text.
I don’t mean to suggest that, given the complexities of human evaluation, there’s no point in trying to improve automated sentiment analysis. But I don’t think that the task will be as easy as many media monitoring software providers would like you to believe. The human brain is incredibly complex, and getting the output of automated sentiment engines to approximate the emotional reactions of real human readers (e.g., customers, voters, investors, etc.) will be a challenging task. Once these challenges are recognized, however, I’m sure that automated sentiment analysis will eventually come of age as a useful business tool.