Soon, you may have to be wary of innocent-looking typos in online content, because they might conceal an attack against the artificial intelligence algorithms that process the mountains of text we consume every day.
“When people see typos right now, they don’t think it’s a security issue. But in the near future, it might be something we will have to contend with,” Stephen Merity, AI researcher and expert on machine learning–based language models, told me in a call last week.
And there’s ample reason to take his warnings seriously. In recent findings, scientists at IBM Research, Amazon and the University of Texas have shown that small modifications to text content can alter the behavior of AI algorithms while remaining unnoticeable to human readers.
In their paper, titled “Discrete Attacks and Submodular Optimization with Applications to Text Classification,” the researchers delve into paraphrasing attacks, the textual equivalent of adversarial examples: small perturbations to input data that cause AI algorithms to behave in erratic ways.
Thanks to advances in deep learning, AI algorithms have become capable of automating text-related tasks that previously required the skills of human operators. Many companies now rely entirely on AI algorithms to process text content and make important decisions.
But deep learning algorithms are also vulnerable to their own unique type of security threats. With AI becoming more and more prominent in tasks such as filtering spam, detecting fake news, processing resumes and analyzing the sentiment of social media posts, it’s important that we understand what these threats are and find ways to deal with them.
The adversarial vulnerabilities of deep learning algorithms
Neural networks, the main component of deep learning algorithms, develop their behavior based on the thousands or millions of examples they examine during their training phase. This is a break from classic artificial intelligence development, in which much of the effort involved programmers meticulously coding the rules that defined the behavior of their software.
This example-based approach makes deep learning algorithms well suited to tasks whose rules are too vague and complicated to encode explicitly. Some of the domains that have benefited immensely from advances in deep learning include computer vision, automated speech recognition and natural language processing (NLP).
However, because humans have little control over the behavior of neural networks, their inner workings often remain elusive even to their own creators. Deep learning algorithms are also statistical machines, albeit very complex ones, which means they are very different from the human mind, even if they often produce similar results on complicated tasks. Despite all the advances in the field, AI’s grasp of human language is still very limited.
The unique characteristics of deep learning algorithms make them vulnerable to adversarial examples. Adversarial examples involve making small changes to the inputs of AI algorithms to force them to change their behavior. For instance, applying small changes to the color values of pixels in an image might cause an image classifier to change its confidence scores.
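To make that concrete, here is a minimal sketch of the idea in Python, using the classic “fast gradient sign” trick rather than anything from the paper; the toy classifier and the epsilon value are purely illustrative.

```python
# A minimal sketch of an image-space adversarial perturbation (FGSM-style),
# assuming a PyTorch classifier; the tiny toy model below is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in classifier: 3x32x32 images -> 10 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10)).eval()

def fgsm_perturb(image, label, epsilon=0.03):
    """Nudge every pixel slightly in the direction that increases the loss."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Each pixel changes by at most epsilon, yet the prediction can flip.
    return (image + epsilon * image.grad.sign()).detach().clamp(0, 1)

x = torch.rand(1, 3, 32, 32)                  # stand-in for a real image
y = model(x).argmax(dim=1)                    # the model's original prediction
x_adv = fgsm_perturb(x, y)
print(model(x).argmax(dim=1).item(), model(x_adv).argmax(dim=1).item())
```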
Researchers at labsix showed how a modified toy turtle could fool deep learning algorithms into classifying it as a rifle (source: labsix.org)
Adversarial examples accentuate the differences between AI and human intelligence. If you show the above picture to a human, they would tell you outright that it’s a turtle. But students at MIT showed that a neural network would classify the same image as a rifle. The key is to make slight changes to the colors and shapes (maybe the weird patterns on the shell) that would make it statistically close to some other object.
Sometimes, adversarial vulnerabilities surface by accident, as in the case where AI software used by the UK Metropolitan Police to detect and flag pictures of child abuse mistakenly labeled pictures of sand dunes as nudes.
But scientists fear that adversarial examples can someday be weaponized and turned into cyberattacks against AI systems. Any machine learning model can become subject to adversarial attacks, but deep learning models are especially vulnerable because of their complexity and poor interpretability.
The challenges of paraphrasing attacks
The idea behind paraphrasing attacks is similar to that of other adversarial attacks. The point is to manipulate the behavior of an NLP model by making changes to the input text that will go unnoticed by a human reader.
“In text attacks, people try to replace words in sentences of an article or email so that the classification will become different,” says Pin-Yu Chen, AI scientist at IBM Research and co-author of the paper.
For instance, consider an AI algorithm that automatically scans emails and blocks spam messages (many popular email providers use AI to filter out spam). A paraphrasing attack would involve rewording a spam email to slip past the AI filter while conveying the same message to the human recipient.
In other respects, however, paraphrasing attacks against NLP models are fundamentally different from adversarial attacks against computer vision algorithms.
Discovering adversarial vulnerabilities in images is a simple process. A malicious actor can continuously feed the same picture to an image classifier, each time making small changes to the pixel values. By taking note of how changes to the pixels affect the output of the AI model, they will be able to find out how to make the right changes to completely change the classification of the image. IBM Research recently introduced a method that automates the discovery of adversarial vulnerabilities in image classifiers and helps make AI models more robust against attacks.
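In code, that trial-and-error probing can be as simple as the following sketch, which assumes nothing more than query access to a hypothetical classify() function returning per-class scores (it is generic random search, not the IBM method mentioned above).

```python
# A rough sketch of the probing loop described above, assuming only query
# access to a hypothetical `classify(image)` that returns per-class scores.
import numpy as np

def probe_classifier(image, classify, target_class, step=0.02, budget=1000):
    rng = np.random.default_rng(0)
    best = image.copy()
    best_score = classify(best)[target_class]
    for _ in range(budget):
        # Make a small random change to the pixel values...
        candidate = np.clip(best + step * rng.standard_normal(best.shape), 0, 1)
        score = classify(candidate)[target_class]
        # ...and keep it only if it nudges the model toward the target class.
        if score > best_score:
            best, best_score = candidate, score
    return best
```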
But creating adversarial text samples is more challenging. “Paraphrasing attacks are very special when compared to adversarial attacks on image, video and audio classifiers. Text is a high-level symbol. We know the meaning of words,” says Lingfei Wu, scientist at IBM Research and co-author of the paraphrasing paper.
In other words, you can increase or decrease color values in images, but you can’t do the same with text. “Text is traditionally harder to attack. It’s discrete. You can’t say I want 10 percent more of the word ‘dog’ in this sentence. You either have the word ‘dog’ or you take it out,” says Merity.
In many cases, a text will lose its smoothness and consistency if you take out or change a single word.
Creating paraphrasing attacks
The paraphrasing paper is not the first effort aimed at attacking text-processing AI models. But previous efforts mostly focused on swapping individual words with their synonyms, which limited the scope of the attacks and often resulted in artificial-sounding output. In contrast, the new work by IBM, Amazon and UT broadens the attack to alter entire sequences of text.
“We are paraphrasing words and sentences. This gives the attack a larger space by creating sequences that are semantically similar to the target sentence. We then see if the model classifies them like the original sentence,” Chen says.
This approach makes the attack much more versatile and ensures that the output also passes the test of being “adversarial,” which means it should go unnoticed by humans. “We believe paraphrasing is the right way to define adversarial for text-based AI models,” Wu says.
But that versatility and flexibility come at a huge cost in complexity. “We search in a very large space that looks for both word and sentence paraphrasing. Finding the best adversarial example in that space is very time consuming,” Wu says.
To overcome this challenge, the researchers have developed a gradient-guided greedy algorithm that tries to search for optimal modifications that will have the most impact on the output of the targeted AI model. “The algorithm is computationally efficient and also provides theoretical guarantees that it’s the best search you can find,” Wu says.
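The paper’s actual algorithm is considerably more sophisticated, but a heavily simplified greedy sketch conveys the flavor; the candidates() and original_label_prob() helpers below are hypothetical stand-ins, and the search simply scores every candidate instead of using gradients.

```python
# A heavily simplified greedy sketch of the search described above.
# `candidates(i, words)` (paraphrase options for position i) and
# `original_label_prob(words)` (the model's confidence in the original
# label) are hypothetical stand-ins, not the paper's actual components.
def greedy_paraphrase(words, candidates, original_label_prob, max_edits=3):
    words = list(words)
    for _ in range(max_edits):
        base = original_label_prob(words)
        best_drop, best_edit = 0.0, None
        for i in range(len(words)):
            for option in candidates(i, words):
                trial = words[:i] + [option] + words[i + 1:]
                drop = base - original_label_prob(trial)
                if drop > best_drop:       # the edit that hurts the model most
                    best_drop, best_edit = drop, (i, option)
        if best_edit is None:              # no replacement helps any further
            break
        i, option = best_edit
        words[i] = option
    return " ".join(words)
```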
The paper includes some examples of paraphrasing attacks produced by the algorithm. In some cases, rewording a single sentence caused the targeted AI algorithm to change its behavior.
Examples of paraphrased content that force AI algorithms to change their output
To create their paraphrasing attacks, the researchers had access to the structure and architecture of the model they were targeting. This is generally known as a “white box” attack.
But this doesn’t mean that “black box” attacks against NLP models are impossible. Merity points out that even closed AI models, such as those provided as online services by large tech companies, can be vulnerable to paraphrasing attacks. “Even if you keep your machine learning model secret, there have been past papers showing that if you vaguely know the training dataset or the type of AI architecture they’re using, or if you just have enough samples, you can work backwards and attack these AI systems even though you’re basically talking to a black box,” he says.
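A rough sketch of that “working backwards” idea: label your own samples by querying the remote service, train a local substitute model, and craft attacks against the substitute in the hope that they transfer; black_box_label() is a hypothetical stand-in for the remote API.

```python
# A sketch of the "work backwards" approach Merity alludes to: label your own
# samples with the remote black-box service, train a local substitute model,
# and craft attacks against the substitute, hoping they transfer.
# `black_box_label(text)` is a hypothetical stand-in for the remote API.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_substitute(texts, black_box_label):
    labels = [black_box_label(t) for t in texts]   # only query access is needed
    substitute = make_pipeline(TfidfVectorizer(),
                               LogisticRegression(max_iter=1000))
    substitute.fit(texts, labels)
    return substitute
```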
Humans are not sensitive to paraphrasing attacks
To assess the effectiveness of their algorithm, the researchers had to make sure the output read smoothly and carried the same meaning for human readers. They tested this by running the content past human evaluators and asking them to identify the original and modified sentences.
“We gave the original and modified paragraph to human evaluators, and it was very hard for them to see difference in meanings. But for the machine, it was completely different,” Wu says.
But Merity pointed out that to be effective, paraphrasing attacks don’t need to be perfectly coherent. When testing the algorithm, human evaluators know in advance that the text might be generated by an AI algorithm and will look for inconsistencies and flaws. But in everyday life, most of us dismiss grammatical and typographical errors as the blunders of careless or unknowledgeable humans.
“Humans aren’t the correct level to try to detect these kinds of attacks, because they deal with faulty input every day. Except that for us, faulty input is just incoherent sentences from real people,” he says.
Bad actors can actually turn this lack of sensitivity to their advantage. They can use the AI algorithm to generate paraphrased examples, then quietly feed them into an online community platform to evaluate their coherence.
“For instance, they can post the modified content on Reddit and see how many upvotes it gets. That could be indicative of whether the sample is semantically or functionally coherent,” Merity says.
Creating more robust AI models
One of the ways to protect AI models against adversarial attacks is to retrain them with adversarial examples and their correct labels. This will make the model more robust and resilient against such attacks.
The same rule applies to NLP models. However, to their surprise, the researchers discovered that adversarial training not only makes the models more robust against paraphrasing attacks, but it also makes them more accurate and versatile.
“This was one of the surprising findings we had in this project. Initially, we started with the angle of robustness. But we found out that this method not only improves robustness, but also improves generalizability,” Wu says. “If instead of attacks, you just think about what is the best way to augment your model, paraphrasing is a very good generalization tool to increase the capability of your AI model.”
After retraining their AI models with paraphrased examples, the researchers noticed an improvement both in performance and robustness against attacks.
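In practice, that retraining step amounts to data augmentation: add paraphrased variants of the training texts, keep their original labels, and refit the model. The sketch below assumes a generate_paraphrase() function like the attack described earlier; the names are illustrative.

```python
# A sketch of paraphrase-augmented retraining: augment the training set with
# adversarial variants that keep their original, correct labels, then refit.
# `generate_paraphrase(text, model)` is a hypothetical stand-in for the attack.
def augment_with_paraphrases(texts, labels, model, generate_paraphrase):
    aug_texts, aug_labels = list(texts), list(labels)
    for text, label in zip(texts, labels):
        aug_texts.append(generate_paraphrase(text, model))
        aug_labels.append(label)           # the paraphrase keeps the same label
    return aug_texts, aug_labels

# model.fit(*augment_with_paraphrases(texts, labels, model, generate_paraphrase))
```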
“It’s interesting to note that this work isn’t necessarily bad. Training the classifier against these adversarial examples that they’re actually able to produce makes the classifier not just more resilient to attacks but also can improve the accuracy of these models,” Merity notes.
The future of AI security
As artificial intelligence algorithms assume a more important role in processing and moderating online content, adversarial attacks will give rise to a new trend of security risks.
“A lot of tech companies rely on automated decisions to classify content, and there isn’t actually a human-to-human interaction involved. This makes the process vulnerable to such attacks,” Merity says. “It will run in parallel to data breaches, except that we’re going to find logic breaches.”
These kinds of attacks might be used to tamper with all kinds of AI systems. Malicious actors might use them to slip hateful content past hate-speech classifiers, or to dupe resume-processing algorithms into pushing their job applications higher up the list of shortlisted candidates.
“This is the evolution of technology that you have to deal with. In the same way that in the early 2000s, spam became epidemic, we’re going to have the same thing in this era and it’s going to be more concerning. These systems potentially working against democracies and enflaming entire communities for political reasons,” Merity warns. “These types of issues are going to be a new security era, and I’m worried that companies will just spend as little on this as they do on security, because they’re focused on automation and scalability.”