Advance BERT Model Via Transferring Knowledge From Cross-Encoders To Bi-Encoders

Chien Vu
January 5, 2021 AI & Machine Learning

Data Augmentation Method to improve SBERT Bi-Encoders for Pairwise Sentence Scoring Tasks (Semantic sentence tasks)

Background and challenges

Currently, state-of-the-art NLP models typically build on BERT, which was pre-trained on large text corpora such as Wikipedia and the Toronto Books Corpus [1]. By fine-tuning the deep pre-trained BERT model, many alternative architectures such as DeBERTa, RetriBERT, and RoBERTa have been developed, achieving substantial improvements on a variety of language-understanding benchmarks. Among common NLP tasks, pairwise sentence scoring has a wide range of applications in information retrieval, question answering, duplicate question detection, and clustering. Generally, two typical approaches are used: Cross-encoders and Bi-encoders.

  • Cross-encoders [2]: perform full (cross) self-attention over a given input and label candidate, and tend to attain much higher accuracies than their counterparts. However, they must recompute the encoding for every input–label pair; because they do not yield independent representations for the inputs, they cannot be used for end-to-end retrieval and are prohibitively slow at test time. For example, clustering 10,000 sentences has quadratic complexity and requires about 65 hours of computation with a cross-encoder [4].
[Figure: Cross-encoders]
  • Bi-encoders [3]: perform self-attention over the input and the candidate label separately, map them to a dense vector space, and then combine them at the end for a final score. Because the encoded candidates can be indexed and compared against each input representation, Bi-encoders achieve very fast prediction times: for the same task of clustering 10,000 sentences, the time drops from about 65 hours to about 5 seconds [4]. A well-known Bi-encoder BERT model, called Sentence-BERT (SBERT), was presented by the Ubiquitous Knowledge Processing Lab (UKP-TUDA). This article gives a hands-on tutorial on using SBERT bi-encoders.
[Figure: Bi-Encoder]

On the other hand, no methodology is perfect in every respect, and Bi-encoders are no exception. Bi-encoders usually achieve lower performance than Cross-encoders and require a large amount of training data. The reason is that Cross-encoders can compare both inputs simultaneously, while Bi-encoders have to map each input independently to a meaningful vector space, which requires a sufficient number of training examples for fine-tuning.

To address this problem, Poly-encoders were introduced [5]. Poly-encoders use two separate transformers (as Bi-encoders do) but apply attention between the two inputs only at the top layer, resulting in better performance than Bi-encoders and large speed gains over Cross-encoders. However, Poly-encoders still have drawbacks: the score function is asymmetric, so they cannot be applied to tasks with symmetric similarity relations, and their representations cannot be efficiently indexed, which causes issues for retrieval over large corpora.

[Figure: Poly-encoders]

In this article, I want to introduce an approach that uses both Cross-encoders and Bi-encoders effectively: data augmentation. The strategy, known as Augmented SBERT (AugSBERT) [6], uses a BERT cross-encoder to label a larger set of input pairs and thereby augment the training data for the SBERT bi-encoder. The bi-encoder is then fine-tuned on this larger augmented training set, which yields a significant performance increase. The idea is very similar to self-supervised learning by relational reasoning in computer vision; put simply, we can think of it as self-supervised learning for natural language processing. The details are presented in the next section.

Technique highlight

There are three major scenarios for the Augmented SBERT approach, covering both pairwise sentence regression and classification tasks.

Scenario 1: Fully annotated datasets (all labeled sentence-pairs)

In this scenario, a straightforward data augmentation strategy is applied to prepare and extend the labeled dataset. Augmentation is commonly applied at three levels: character, word, and sentence.

[Figure: Text augmentation levels]

The word level is the most suitable one for sentence-pair tasks. Based on the training performance of Bi-encoders, a few methodologies are suggested: insert or substitute words using contextual word embeddings (BERT, DistilBERT, RoBERTa, or XLNet), or substitute words with synonyms (WordNet, PPDB). After the augmented text data is created, it is combined with the original data and used to train the Bi-encoder.

However, when only a few labeled examples are available, or in special cases, simple word replacement or insertion strategies like these do not help with data augmentation for sentence-pair tasks and can even lead to worse performance than models trained without augmentation.

In short, the straightforward data augmentation strategy involves three steps:

  • Step 1: Prepare the fully labeled Semantic Textual Similarity dataset (gold data)
  • Step 2: Replace words in the sentence pairs with synonyms (silver data)
  • Step 3: Train a bi-encoder (SBERT) on the extended (gold + silver) training dataset
[Figure: Training SBERT bi-encoders with fully annotated datasets]

Scenario 2: Limited or small annotated datasets (few labeled sentence-pairs)

In this case, because the labeled dataset (gold dataset) is limited, a pre-trained Cross-encoder is used to weakly label unlabeled data from the same domain. However, randomly pairing two sentences usually yields a dissimilar (negative) pair, while positive pairs are extremely rare. This skews the label distribution of the silver dataset heavily towards negative pairs. Therefore, two appropriate sampling approaches are suggested:

  • BM25 Sampling (BM25): the algorithm is based on lexical overlap and is commonly used as a scoring function by many search engines [7]. For each sentence, the top k most similar sentences are retrieved from an index of the unique sentences.
  • Semantic Search Sampling (SS): a pre-trained Bi-encoder (SBERT) [4] is used to retrieve the top k most similar sentences in the collection. For large collections, approximate nearest-neighbor search libraries such as Faiss can be used to quickly retrieve the k most similar sentences. This addresses the drawback of BM25 with synonymous sentences that have little or no lexical overlap.

After that, the sampled sentence pairs are weakly labeled by the pre-trained Cross-encoder and merged with the gold dataset. The Bi-encoder is then trained on this extended training dataset. This model is called Augmented SBERT (AugSBERT). AugSBERT can improve the performance of existing Bi-encoders and reduce the gap with Cross-encoders.

In summary, AugSBERT for a limited dataset involves three steps:

  • Step 1: Fine-tune a Cross-encoder (BERT) on the small gold dataset
  • Step 2.1: Create new sentence pairs by recombination and reduce them via BM25 or semantic search sampling
  • Step 2.2: Weakly label the new pairs with the Cross-encoder (silver dataset)
  • Step 3: Train a Bi-encoder (SBERT) on the extended (gold + silver) training dataset
[Figure: Training SBERT with limited annotated datasets]

Scenario 3: No annotated datasets (Only unlabeled sentence-pairs)

This scenario occurs when we want SBERT to perform well on data from a different domain without annotations. Basically, SBERT fails to map sentences with unseen terminology to a sensible vector space. Hence, a domain-adaptation data augmentation strategy was proposed:

  • Step 1: Train a Cross-encoder (BERT) from scratch on a source dataset for which we have annotations
  • Step 2: Use this Cross-encoder (BERT) to label the target dataset, i.e. the unlabeled sentence pairs
  • Step 3: Finally, train a Bi-encoder (SBERT) on the labeled target dataset

Generally, AugSBERT benefits most when the source domain is rather generic and the target domain is rather specific. Going the other way, from a specific source domain to a generic target domain, only a slight performance increase is observed.

[Figure: Training SBERT bi-encoders with no annotated datasets]

Experimental evaluation

In this section, I will walk through a demo of how to apply AugSBERT in the different scenarios. First, we need to import some packages.

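A minimal sketch of the imports, assuming the sentence-transformers, nlpaug, and PyTorch packages are installed:

import csv
import gzip
import math

from torch.utils.data import DataLoader

# Bi-encoder (SBERT) and cross-encoder components from sentence-transformers
from sentence_transformers import SentenceTransformer, InputExample, losses, models, util
from sentence_transformers.cross_encoder import CrossEncoder
from sentence_transformers.cross_encoder.evaluation import CECorrelationEvaluator
from sentence_transformers.evaluation import (
    EmbeddingSimilarityEvaluator,
    BinaryClassificationEvaluator,
)

# Word-level data augmentation for scenario 1
import nlpaug.augmenter.word as naw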

Scenario 1: Fully annotated datasets (all labeled sentence-pairs)

The main purpose of this scenario is to extend the labeled dataset with the straightforward data augmentation strategy. We therefore prepare train, dev, and test splits of the Semantic Textual Similarity (STS) benchmark dataset and define the batch size, number of epochs, and model name (you can specify any Hugging Face/transformers pre-trained model).

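A sketch of the data preparation, assuming the STS benchmark is available locally as a tab-separated file (the path and column names below are illustrative):

model_name = 'bert-base-uncased'        # any Hugging Face/transformers checkpoint can be used
batch_size = 16
num_epochs = 1
sts_path = 'data/stsbenchmark.tsv.gz'   # illustrative local path to the STS benchmark file

gold_samples, dev_samples, test_samples = [], [], []
with gzip.open(sts_path, 'rt', encoding='utf8') as f:
    reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        score = float(row['score']) / 5.0   # normalize the 0-5 similarity score to 0-1
        example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)
        if row['split'] == 'train':
            gold_samples.append(example)
        elif row['split'] == 'dev':
            dev_samples.append(example)
        else:
            test_samples.append(example)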

Then, we insert words using our BERT model (you can apply any other augmentation technique mentioned in the Technique highlight section) to create the silver dataset.

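To build the silver dataset, a contextual word-insertion augmenter from nlpaug can be applied to both sentences of each gold pair; the settings below are an illustrative sketch:

# Insert words predicted by a BERT model into both sentences of every gold pair
aug = naw.ContextualWordEmbsAug(model_path=model_name, action="insert")

silver_samples = []
for sample in gold_samples:
    augmented_texts = aug.augment(sample.texts)   # one augmented copy of each sentence in the pair
    silver_samples.append(InputExample(texts=augmented_texts, label=sample.label))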

Next, we define our Bi-encoder with mean pooling and train it on the combined (gold + silver) STS benchmark data.

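A sketch of the bi-encoder (transformer module plus mean pooling) trained on the combined gold + silver data with a cosine-similarity loss; the hyperparameters and output path are illustrative:

# BERT encoder + pooling layer (mean pooling by default) = SBERT bi-encoder
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
bi_encoder = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_dataloader = DataLoader(gold_samples + silver_samples, shuffle=True, batch_size=batch_size)
train_loss = losses.CosineSimilarityLoss(model=bi_encoder)
dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)   # 10% of steps as warm-up
bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)],
               evaluator=dev_evaluator,
               epochs=num_epochs,
               warmup_steps=warmup_steps,
               output_path='output/augsbert-scenario1')   # illustrative output directory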

Finally, we evaluate our model on the STS benchmark test set.

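Evaluation on the test split can reuse the embedding-similarity evaluator (sketch):

test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(bi_encoder, output_path='output/augsbert-scenario1')   # reports Pearson/Spearman correlation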

Scenario 2: Limited or small annotated datasets (few labeled sentence-pairs)

In this scenario, we use a Cross-encoder trained on the limited labeled dataset (gold dataset) to soft-label an in-domain unlabeled dataset (silver dataset) and train a Bi-encoder on both datasets (gold + silver). In this simulation, I again use the STS benchmark dataset and create new sentence pairs with a pre-trained SBERT model. First, we define the Cross-encoder and the Bi-encoder.

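A sketch of the two models, a cross-encoder with a single regression head and an SBERT bi-encoder with mean pooling:

# Cross-encoder: encodes a sentence pair jointly and outputs a single similarity score
cross_encoder = CrossEncoder(model_name, num_labels=1)

# Bi-encoder: encodes each sentence independently into a dense vector (mean pooling)
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
bi_encoder = SentenceTransformer(modules=[word_embedding_model, pooling_model])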

In step 1, we prepare the train, dev, and test splits as before and fine-tune our Cross-encoder.

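A sketch of step 1, fine-tuning the cross-encoder on the small gold training set prepared earlier:

train_dataloader = DataLoader(gold_samples, shuffle=True, batch_size=batch_size)
dev_evaluator = CECorrelationEvaluator.from_input_examples(dev_samples, name='sts-dev')

warmup_steps = math.ceil(len(train_dataloader) * num_epochs * 0.1)
cross_encoder.fit(train_dataloader=train_dataloader,
                  evaluator=dev_evaluator,
                  epochs=num_epochs,
                  warmup_steps=warmup_steps,
                  output_path='output/cross-encoder-sts')   # illustrative output directory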

In step 2, we use the fine-tuned Cross-encoder to label the unlabeled sentence pairs.

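A sketch of step 2, here using semantic-search sampling with a pre-trained SBERT model to recombine the gold sentences into new pairs, which the fine-tuned cross-encoder then soft-labels. The sampling model, top_k, and helper names are illustrative; BM25 sampling (e.g. with the rank_bm25 package) would be a drop-in alternative:

# Unique sentences from the gold data serve as the pool for new pair creation
sentences = list({s for sample in gold_samples for s in sample.texts})

# Semantic-search sampling: pair every sentence with its top-k nearest neighbours
sampling_model = SentenceTransformer('all-MiniLM-L6-v2')   # illustrative pre-trained SBERT model
embeddings = sampling_model.encode(sentences, convert_to_tensor=True)
hits = util.semantic_search(embeddings, embeddings, top_k=4)   # 3 neighbours + the sentence itself

pair_ids = set()
for idx, neighbours in enumerate(hits):
    for hit in neighbours:
        if hit['corpus_id'] != idx:
            pair_ids.add(tuple(sorted((idx, hit['corpus_id']))))   # deduplicate (a, b) vs (b, a)

# Weakly label the sampled pairs with the fine-tuned cross-encoder -> silver dataset
pair_texts = [[sentences[i], sentences[j]] for i, j in pair_ids]
silver_scores = cross_encoder.predict(pair_texts)
silver_samples = [InputExample(texts=texts, label=float(score))
                  for texts, score in zip(pair_texts, silver_scores)]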

In step 3, we train our Bi-encoder on both the gold and the silver datasets.

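A sketch of step 3, which mirrors the training loop from scenario 1 but on the merged gold + silver data:

train_dataloader = DataLoader(gold_samples + silver_samples, shuffle=True, batch_size=batch_size)
train_loss = losses.CosineSimilarityLoss(model=bi_encoder)
dev_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)],
               evaluator=dev_evaluator,
               epochs=num_epochs,
               warmup_steps=math.ceil(len(train_dataloader) * num_epochs * 0.1),
               output_path='output/augsbert-scenario2')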

Finally, we evaluate our model on the STS benchmark test set.

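And the evaluation on the STS test split (sketch):

test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(bi_encoder, output_path='output/augsbert-scenario2')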

Scenario 3: No annotated datasets (Only unlabeled sentence-pairs)

In this scenario, the steps are very similar to scenario 2 but span two domains. Exploiting the capability of our Cross-encoder, we use a generic source dataset (the STS benchmark) and transfer the knowledge to a specific target dataset (Quora Question Pairs). First, we prepare the target-domain data.

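A sketch of loading the target-domain data, assuming the Quora duplicate-questions pairs are available as a tab-separated file with question1, question2, and is_duplicate columns (the path, column names, and held-out split are illustrative):

quora_path = 'data/quora_duplicate_questions.tsv'   # illustrative local path

quora_pairs, quora_test = [], []
with open(quora_path, 'rt', encoding='utf8') as f:
    reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    for i, row in enumerate(reader):
        pair = [row['question1'], row['question2']]
        if i % 10 == 0:
            # hold out ~10% of the labeled pairs for the final evaluation
            quora_test.append(InputExample(texts=pair, label=int(row['is_duplicate'])))
        else:
            quora_pairs.append(pair)   # treated as unlabeled target-domain pairs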

Next, we train our Cross-encoder on the source data.

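A sketch of training the cross-encoder on the generic source data (the STS benchmark gold pairs from above):

cross_encoder = CrossEncoder(model_name, num_labels=1)
train_dataloader = DataLoader(gold_samples, shuffle=True, batch_size=batch_size)

cross_encoder.fit(train_dataloader=train_dataloader,
                  evaluator=CECorrelationEvaluator.from_input_examples(dev_samples, name='sts-dev'),
                  epochs=num_epochs,
                  warmup_steps=math.ceil(len(train_dataloader) * num_epochs * 0.1),
                  output_path='output/cross-encoder-source')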

Next, we label the Quora Question Pairs dataset (silver dataset). In this case, the task is classification, so we have to convert the scores to binary labels.

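A sketch of the weak labeling step; the 0.5 threshold used to binarize the cross-encoder scores is an illustrative choice:

# Score the unlabeled target-domain pairs with the source-domain cross-encoder
scores = cross_encoder.predict(quora_pairs)

# The target task is binary classification, so threshold the scores into 0/1 labels
silver_samples = [InputExample(texts=pair, label=int(score > 0.5))
                  for pair, score in zip(quora_pairs, scores)]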

Then, we train our Bi-encoder on the labeled target data.

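A sketch of training the bi-encoder on the weakly labeled target data; OnlineContrastiveLoss is one reasonable choice for binary duplicate labels (the original notebook may use a different loss):

word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
bi_encoder = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_dataloader = DataLoader(silver_samples, shuffle=True, batch_size=batch_size)
train_loss = losses.OnlineContrastiveLoss(model=bi_encoder)   # suits 0/1 duplicate labels

bi_encoder.fit(train_objectives=[(train_dataloader, train_loss)],
               epochs=num_epochs,
               warmup_steps=math.ceil(len(train_dataloader) * num_epochs * 0.1),
               output_path='output/augsbert-scenario3')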

Finally, we evaluate on the Quora Question Pairs test set.

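Finally, a sketch of the evaluation on the held-out labeled Quora pairs, treating the task as binary classification over embedding similarity:

test_evaluator = BinaryClassificationEvaluator.from_input_examples(quora_test, name='quora-test')
test_evaluator(bi_encoder, output_path='output/augsbert-scenario3')   # reports accuracy/F1 at the best threshold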

Final Thoughts

AugSBERT is a simple and effective data augmentation strategy for improving Bi-encoders on pairwise sentence scoring tasks. The idea is to label new sentence pairs with a pre-trained Cross-encoder and add them to the training set. Selecting the right sentence pairs for soft labeling is crucial for improving performance. The AugSBERT approach can also be used for domain adaptation by soft-labeling data in the target domain.

References

[1] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding.

[2] Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. TransferTransfo: A transfer learning approach for neural network-based conversational agents.

[3] Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, and Antoine Bordes. Training millions of personalized dialogue agents.

[4] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks.

[5] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring.

[6] Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. Augmented SBERT: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks.

[7] Giambattista Amati. BM25, Springer US, Boston, MA.
