Facebook has released statistics on abusive behavior on its social media network, reporting that it deleted more than 22 million posts for violating its rules against pornography and hate speech, and deleted or added warnings about violence to another 3.5 million posts. Many of those were detected by automated systems monitoring users’ activity, in line with CEO Mark Zuckerberg’s statement to Congress that his company would use artificial intelligence to identify social media posts that might violate the company’s policies. As an academic researching AI and adversarial machine learning, I can say he was right to acknowledge the significant challenges: “Determining if something is hate speech is very linguistically nuanced.”
The task of detecting abusive posts and comments on social media is not entirely technological. Even Facebook’s human moderators have trouble defining hate speech, inconsistently applying the company’s guidelines and even reversing their decisions (especially when those decisions make headlines). Also, abusers adapt to avoid detection, much as email spammers sought to evade filters by replacing “Viagra” with “Vi@gra” in their messages.
Further complications can arise if attackers try to use the machine learning system against itself, tainting the data the algorithm learns from to influence its results. For instance, there is a phenomenon called “Google bombing,” in which people create websites and construct sequences of web links in an effort to affect the results of Google’s search algorithms. A similar “data poisoning” attack could limit Facebook’s efforts to identify hate speech.
Tricking machine learning
Machine learning, a form of artificial intelligence, has proven very useful in detecting many kinds of fraud and abuse, including email spam, phishing scams, credit card fraud and fake product reviews. It works best when there are large amounts of data in which to identify patterns that can reliably separate normal, benign behavior from malicious activity. For example, if people use their email systems to report as spam large numbers of messages that contain the words “urgent,” “investment” and “payment,” then a machine learning algorithm will be more likely to label as spam future messages including those words.
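To make the spam example concrete, here is a minimal sketch of one common approach, a naive Bayes word model. It is an illustration only, not a description of how any real email provider works: the function names `train` and `spam_score` and the four sample messages are invented for this example.

```python
import math
from collections import Counter

def train(messages):
    """Count, per class, how many messages contain each word.

    messages: list of (text, is_spam) pairs, e.g. built from user spam reports.
    """
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(set(text.lower().split()))
        totals[is_spam] += 1
    return counts, totals

def spam_score(text, counts, totals):
    """Return log-odds that a message is spam; positive leans toward spam."""
    score = math.log((totals[True] + 1) / (totals[False] + 1))
    for word in set(text.lower().split()):
        # Add-one smoothing so a word never seen in one class cannot cause log(0).
        p_spam = (counts[True][word] + 1) / (totals[True] + 2)
        p_ham = (counts[False][word] + 1) / (totals[False] + 2)
        score += math.log(p_spam / p_ham)
    return score

# Hypothetical training data standing in for user reports.
reports = [
    ("urgent investment payment required", True),
    ("urgent payment needed for investment", True),
    ("lunch meeting moved to noon", False),
    ("notes from the meeting attached", False),
]
counts, totals = train(reports)
print(spam_score("urgent investment opportunity", counts, totals))
print(spam_score("meeting notes at noon", counts, totals))
```

With this toy data, the first message gets a positive score and the second a negative one. A real system would need far more examples before its word counts could be trusted.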
Detecting abusive posts and comments on social media is a similar problem: An algorithm would look for text patterns that are correlated with abusive or nonabusive behavior. This is faster than reading every comment, more flexible than simply performing keyword searches for slurs and more proactive than waiting for complaints. In addition to the text itself, there are often clues from context, including the user who posted the content and their other actions. A verified Twitter account with a million followers would likely be treated differently than a newly created account with no followers.
Yet as those algorithms are developed, abusers adapt, changing their patterns of behavior to avoid detection. Since the dawn of letter substitution in email spam, every new medium has spawned its own version: People buy Twitter followers, favorable Amazon reviews and Facebook likes, all to fool algorithms and other humans into thinking they’re more reputable.
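The letter-substitution trick, and the simplest countermeasure to it, can be sketched in a few lines. The substitution table and the function names `normalize` and `contains_blocked_term` below are assumptions made for illustration; real abusers use many more variants than any fixed table can list.

```python
# Hypothetical table mapping common look-alike characters back to letters.
SUBSTITUTIONS = str.maketrans(
    {"@": "a", "0": "o", "1": "i", "3": "e", "$": "s", "!": "i"}
)

def normalize(text):
    """Lowercase the text and undo the substitutions listed in the table."""
    return text.lower().translate(SUBSTITUTIONS)

def contains_blocked_term(text, blocked_terms):
    """Check for blocked terms after normalization."""
    normalized = normalize(text)
    return any(term in normalized for term in blocked_terms)

print(normalize("Vi@gra"))
print(contains_blocked_term("Cheap V1@GRA here", {"viagra"}))
```

A fixed table like this also mangles legitimate digits and punctuation, and it stops working as soon as abusers pick a character it does not cover. That weakness is one reason filters that learn from data tend to be preferred over hand-written rules.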
As a result, a big piece of detecting abuse involves creating a stable definition of what is a problem, even as the actual text expressing the abuse changes. This presents an opportunity for artificial intelligence to, effectively, enter an arms race against itself. If an AI system can predict what an attacker might do, it could be adapted to simulate performing that behavior. Another AI system could analyze those actions, learning to detect abusers’ efforts to sneak hate speech past the automated filters. Once both the attacker and defender can be simulated, game theory can identify their best strategies in this competition.
Abusers are not limited to changing their own behavior, such as by substituting different characters for letters or using words or symbols in coded ways. They can also change the machine learning system itself.
Because algorithms are trained on data generated by humans, if enough people change their behavior in particular ways, the system will learn a different lesson than its creators intended. In 2016, for instance, Microsoft unveiled “Tay,” a Twitter bot that was supposed to engage in meaningful conversations with other Twitter users. Instead, trolls flooded the bot with hateful and abusive messages. As the bot analyzed that text, it began to reply in kind – and was quickly shut down.
It can be difficult to determine when human-generated data are causing an AI to perform poorly. When possible, the best defense is for humans to add constraints to the system, such as removing language patterns that are considered sexist. Data poisoning can also be detected by measuring accuracy on a separate, curated data set: If a new model performs poorly on trusted data, then that could mean the new training data are bad. Finally, poisoning can be made less effective by removing outliers, data points that are very different from the rest of the training data.
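Two of these defenses, checking a new model against trusted data and removing outliers, can be sketched as follows. This is a simplified illustration under stated assumptions: a "model" here is any function that maps an input to a label, each training point is reduced to a single number, and the thresholds `max_drop` and `max_deviations` are arbitrary values chosen for the example.

```python
import statistics

def accuracy(model, trusted_examples):
    """Fraction of curated (input, label) pairs the model labels correctly."""
    correct = sum(1 for x, label in trusted_examples if model(x) == label)
    return correct / len(trusted_examples)

def looks_poisoned(old_model, new_model, trusted_examples, max_drop=0.05):
    """Flag the new model if it does noticeably worse on trusted data."""
    drop = accuracy(old_model, trusted_examples) - accuracy(new_model, trusted_examples)
    return drop > max_drop

def remove_outliers(values, max_deviations=3.0):
    """Drop values far from the median, measured in median absolute deviations."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return list(values)  # No spread to measure against; keep everything.
    return [v for v in values if abs(v - median) / mad <= max_deviations]

# Hypothetical curated examples: (text, is_abusive).
trusted = [
    ("free money now", True),
    ("see you at lunch", False),
    ("urgent payment", True),
    ("meeting notes", False),
]
old_model = lambda text: any(w in text for w in ("money", "urgent"))
new_model = lambda text: False  # Stand-in for a model taught to flag nothing.

print(looks_poisoned(old_model, new_model, trusted))
print(remove_outliers([10, 11, 9, 10, 12, 95]))
```

A drop in accuracy on trusted data is only a warning sign, since it can also come from honest changes in how people write. Outlier removal, likewise, will miss poisoned examples that are crafted to look typical.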
Of course, no machine learning system will ever be perfect. Like humans, computers should be used as part of a larger effort to fight abuse. Even email spam, a major success for machine learning, relies on more than just good algorithms: New internet communications standards make it harder for spammers to hide their identities when sending messages. In addition, federal law, such as the 2003 CAN-SPAM Act, sets standards for commercial email, including penalties for violations. Similarly, addressing online abuse may require new standards and policies, not just smarter artificial intelligence.
Daniel Lowd receives funding from NSF, ARO, DARPA, and AFRL.