Why analytics companies should stop focusing on “accuracy” in automated sentiment analysis
Discussions on automated sentiment analysis “accuracy” are starting to border on the bizarre. In the past couple of weeks, I’ve read claims that SAS’s new tool can identify sentiment “better than most humans”. Just a few days later, I read a post claiming that “sentiment analysis [is] best done by humans”.
At the heart of this ongoing debate (and confusion) surrounding automated sentiment analysis is the issue of “accuracy”: the degree to which software can correctly extract positive, negative, or neutral tone from text. Using “accuracy” as a criterion for useful sentiment analysis demonstrates a fundamental misunderstanding of what sentiment really is and what “accuracy” really means. Unfortunately, this misunderstanding has led media researchers and software programmers to search for “100% sentiment analysis accuracy”, and distracted our industry from what its real focus should be: understanding how the media influences human behavior.
Automated sentiment analysis will never be accurate. Not 1% accurate, 50% accurate, or 100% accurate. To say that an algorithm or statistical model has “accurately” identified a piece of text as positive, negative or neutral requires that sentiment is a real thing in the text that can be correctly identified, like a person’s name or a product. The problem is that positive and negative don’t really exist on paper or on a computer monitor. The scientists and philosophers who study sentiment all agree that it only exists as a property of the animal nervous system. “Positive” and “negative” are neurological states that evolved to help organisms avoid stuff that can harm them or to promote behavior that’s likely to nourish them and help them propagate. Sentiment is absolutely not something that exists “out there” in the world; it only exists in our perceptions of the world.
Because positivity and negativity don’t really exist in blog posts, Tweets, Facebook updates, or New York Times editorials, neither human analysts nor software will ever be able to “accurately” extract sentiment from them. What analysts and software can do instead is approximate or guess what a reader’s reaction might be to the text. Accurate identification could only be done by measuring actual readers’ emotional reactions to the text, which would be too costly and time-consuming to do.
You might be thinking that the distinction between sentiment existing in the external world (e.g., text) vs. the internal world (our brains) is purely academic. But it has serious implications for how marketers, communications professionals, media professionals and software programmers tackle the issue of measuring sentiment in large volumes of text.
One relatively minor implication is that when people talk about measuring “accuracy” in automated sentiment analysis, they’re really referring to “reliability”, or agreement between an analyst’s guess about readers’ reactions to a text or post and the sentiment decision made by the automated tool. This is more of a pet peeve of mine than anything else (it’s one thing when marketers or software programmers make this mistake, but researchers should know better; a good guide to distinguishing between reliability and accuracy in measurement can be found here).
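To make the reliability point concrete, here is a minimal sketch of how agreement between an analyst and a tool might be scored. The function names and the ten sample labels are hypothetical, invented purely for illustration. Raw percent agreement is the figure usually quoted as “accuracy”; Cohen’s kappa is one common way to adjust that figure for the agreement two raters would reach by chance alone. Note that neither number says anything about what real readers felt.

```python
from collections import Counter


def percent_agreement(analyst, tool):
    """Share of items on which the analyst and the tool chose the same label."""
    if len(analyst) != len(tool) or not analyst:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == t for a, t in zip(analyst, tool))
    return matches / len(analyst)


def cohens_kappa(analyst, tool):
    """Agreement corrected for the agreement expected by chance."""
    n = len(analyst)
    observed = percent_agreement(analyst, tool)
    analyst_counts = Counter(analyst)
    tool_counts = Counter(tool)
    # Chance agreement: probability that both raters pick the same label
    # if each labelled independently at their own observed rates.
    expected = sum(
        analyst_counts[label] * tool_counts[label] for label in analyst_counts
    ) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label for everything.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten posts (illustrative data, not real results).
analyst_labels = ["pos", "neg", "neu", "neg", "pos", "neu", "neg", "pos", "neu", "neg"]
tool_labels = ["pos", "neg", "neu", "pos", "pos", "neu", "neg", "neg", "neu", "neu"]

print("percent agreement:", percent_agreement(analyst_labels, tool_labels))
print("Cohen's kappa:", round(cohens_kappa(analyst_labels, tool_labels), 3))
```

Both figures measure how closely two raters match each other. A tool tuned against one analyst can score highly on either measure and still be a poor guess at how customers, voters, or investors would actually react.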
A much more serious implication is that PR and marketing pros need to stop focusing on “accuracy” (i.e., reliability) and start caring about how humans actually evaluate positivity and negativity in the external world. The latter will be a much better predictor of how the media influences behavior and, ultimately, will be most useful to companies who analyze large quantities of mainstream and social media coverage. Given that the process through which the human brain evaluates external things as positive or negative is only beginning to be understood by philosophers and scientists, I’m not optimistic that software programmers and artificial intelligence folks will be cracking that anytime soon.
Since automated sentiment analysis relies on set rules, nearly always tweaked by human analysts, I’m sure that reliability rates between a tool and a single analyst can reach 90% and beyond (at least within a single set of text on a specific topic). Still, when it comes to approximating what actual human readers are likely to think of an organization or a product in the media, there’s good reason to believe that human analysts have machines beat. Here are a few things that we do know about how humans evaluate things as good or bad. In each of these cases, human analysts will be better at simulating what an actual reader would do than an automated sentiment analysis tool:
1) Different people are going to have different emotional reactions to text. This point might seem obvious, but it is almost universally overlooked in media measurement conversations. Depending on who you are, you’re probably going to have a different reaction. The human brain is very good at this perspective switching. Starting at a fairly young age, people can simulate the experiences of other people and make good inferences about their emotions, behaviors, etc. If you’re at a baseball game, for example, and the hitter for the visiting team hits a game-winning home run, you can effortlessly recognize that that guy must feel pretty good even though it may have ruined your afternoon.
The practical implication for media researchers is that a single piece of text is likely going to have very different affective or emotional meaning depending on whose perspective you decide to take. A person could read the Tweet, “Legit…apply to this contest. Almost no one has applied, so chances are…you’ll win. www.dell.com/w3” and quickly infer that the writer has a positive attitude towards the contest but that the marketing folks at Dell will probably have a negative reaction. Similarly, the post, “Did you see the next generation iPhone? It was left on a counter by mistake. Hum” might be read negatively by Apple’s PR team, while iPhone owners will probably have a completely neutral reaction to it. A good analyst is able to quickly and effortlessly take on the perspective of the article or post author, a naïve reader, company representative, potential customers, legislators, competing companies, or investors when reading an article or social media post. I have yet to see a piece of software that can approximate this.
2) Mood impacts evaluative judgments. Because goodness and badness aren’t properties of the external world that can be detected, humans have to rely on a variety of information to make sentiment-based judgments. One key piece of information that people tend to use is their own mood. A host of research has consistently shown that people tend to make mood-congruent judgments about objects in their world. If someone is in a good mood, they tend to rate a range of things, from their own life satisfaction to the taste of food, as being better than if they are in a bad mood. In one research study, Alice Isen and colleagues experimentally induced positive moods in some people, and then asked them to rate the service records of their household appliances (e.g., washers and dryers, coffee makers, etc.). They found that participants in good moods reported much greater satisfaction with the appliances than everyone else (you’re more likely to really like your coffee maker when you’re having a good day). The implication of this for media researchers is that other news, world events, and even bad weather will likely affect whether or not a reader interprets a blog post or news story as negative or positive.
3) Context matters. One of the most important things that media analysts can learn from existing knowledge on how humans evaluate sentiment is that context often determines whether or not people perceive otherwise ambiguous things as being either good or bad. In yet another interesting psychological experiment by James Russell and colleagues, participants were shown pictures of people displaying prototypical emotions, such as happiness, surprise, anger, etc., and were told a story about what the person in the picture had just experienced. The researchers found that the story played a huge role in what emotion the participant rated the face as showing. When told that a woman making a prototypically fearful face had just been made to wait for a table at a restaurant for over an hour despite having a reservation, participants tended to rate the face as showing anger rather than fear. Findings like this suggest that situational cues are yet another piece of information that people use to make evaluative judgments about the outside world.
The practical implication to be taken from this is that other stories, blog posts, and Tweets that have been recently read will impact whether or not someone perceives a piece of media as being positive or negative. Other stories in the same magazine, blog posts preceding the one being analyzed, nearby Tweets in a Twitter feed, etc., should all be considered when determining whether or not readers are going to have a positive or negative reaction to a specific piece of text.
I don’t mean to suggest that, given the complexities of human evaluation, there’s no point in trying to improve automated sentiment analysis. But I don’t think that the task will be as easy as many media monitoring software providers would like you to believe. The human brain is incredibly complex, and getting the output of automated sentiment engines to approximate the emotional reactions of real human readers (e.g., customers, voters, investors, etc.) will be a challenging task. Once these challenges are recognized, however, I’m sure that automated sentiment analysis will eventually come of age as a useful business tool.
5 Comments
Interesting reading.
I agree with many of the points you make here, about the complexity of “accurate” analysis and what is implied from such analysis.
But given that such development of the process (of deciding what a person “feels” when they post) has to be tested and verified to see if there is positive progress, the “accuracy” of the analysis must be quantified. A very typical way of doing that is scoring the % of annotated vs automated.
It matters less at present whether a post/tweet is negative because of the weather or mood; the goal right now is to be able to tell what a post’s “mood” is. It would of course be nice to say “Yes, this person hates his coffee maker, but he is in a bad mood”. But for now, clients would be happy with only the first part: “Yes, this person hates his coffee maker.” Apparently even THAT part is very hard to do.
One benefit of computer analysis is that computers don’t have feelings. A computer doesn’t care who hit a home run. It should be able to detect that the fans of the winning team are happy while the other team’s fans are sad. And given enough posts/tweets, a machine would/should be able to tell who is happy and who is sad about it.
Currently, the problem with automated sentiment analysis is indeed the accuracy. One cannot proceed to understand why a person feels something, and what it will result in, before a proper analysis can accurately tell you what they feel.
And automated sentiment analysis is having a hard time detecting even that bit. It does not detect double negatives well, it is not good with sarcastic and cynical remarks, it is not good with non-English text, etc.
The problem is that when you show statistics of 80% negative, and the customer checks ten positive posts as an example and sees that 6 of them are actually negative, trust in the entire “system” is shaken. So accuracy is critical at this point.
It is also a relatively well-known fact that today’s accuracy of the better systems is around 60-65%. That means it’s not too far from just randomly picking a score. Doesn’t sound too promising. But it’s getting there, I guess.
My main concern with the field of sentiment analysis is – “so what”?
So, yesterday 65% was positive and today it’s 55%. Is that horrible? Can I live with that? Is it 10% less positive? Or 20% less positive, but with 10% more intensity? Maybe 65% of people thought my brand is “sort of ok, I guess” and now 55% “absolutely adore and can’t live without” it? And… regardless of it all, so what now? What should I do?
To answer all of this, a better understanding of mood/sentiment is necessary. Surely, also the complex system of feelings, motives and motivation needs to be factored in. But “accuracy”, even then, will be a major factor.
Great points and thanks for quoting my post on Sentiment Analysis best done by Humans – which I agree is the best way.
However, what you observed applies in many spheres, including Art. For example, I often note my “mental” and “emotional” state affects my ability to enjoy a museum or art gallery opening. Sometimes, I’m in the mood (or, as you put it – I saw some stuff or felt some things before entering that were supportive and put me in a good state) and I get something tangible out of the experience.
Other times, I simply can’t focus, my mind is distracted and I can’t contain what I’m looking at, and I don’t enjoy the experience. At such a time I might be tempted to write a bad review or say something “negative” about someone’s work, when under other circumstances, in another mood, I might say something entirely different.
So I think we come to the difference between sentiment and opinions that can be “swayed” by momentary considerations – and those that are more ingrained – core beliefs that are unlikely to change regardless of what my mood is or what I just read.
Interesting post and something I will also comment on at Webmetricsguru.com this weekend.
Thanks,
Marshall
webmetricsguru.com
The focus on social media has brought the complexity of sentiments to the forefront. It is an unsolved problem and we have to live with its limitations. I believe that if used in a wise and cautious manner, sentiment analysis is a useful tool.
Part of your question about the subjective nature of sentiment analysis by humans is answered by this research paper, which will be presented at the Association for the Advancement of Artificial Intelligence conference:
The ICWSM 2010 JDPA Sentiment Corpus for the Automotive Domain.
http://www.icwsm.org/2010/data.shtml#papers
Babar