Transformer debugging: when BERT falls to pieces

Jun 4, 2023

Tags: dev, nlp


Madeesh and I have been developing a library of transformer models called curated transformers. These models can be used in spaCy through the spacy-curated-transformers package. spaCy has supported Hugging Face Transformers models for a while now through spacy-transformers, but we wanted to make a smaller library that integrates more deeply with spaCy.

After doing most of our initial testing with the XLM-RoBERTa base model, we wanted to do a dry run retraining all existing spaCy transformer models with curated transformers.

We started off finetuning the German transformer model, which uses the German BERT model. The part-of-speech tagging and dependency parsing accuracies of our retrained model were the same as the existing model. However, the accuracies of morphological tagging and lemmatization were down 0.59% and 0.24%, respectively. This is kinda surprising, since usually when something is broken, all accuracies go down.

And so a model debugging journey begins…

A primer to word piecing

Morphosyntactic tagging and lemmatization are largely surface-oriented tasks in German. In both cases you can usually glean a lot of useful information from a word’s suffix and prefix. So, our hunch was that something must be wrong related to the so-called word piecing.

BERT models use a vocabulary that is trained to have a certain target size. Suppose that a model uses a vocabulary of size 30,000. It is then not possible to store every word in that vocabulary. In fact, in German you can compose an infinite number of noun compounds by combining two or more nouns. BERT’s word piecing solves this coverage issue by splitting words into… pieces. The vocabulary consists of 30,000 pieces rather than words.

A word piece vocabulary contains two types of pieces: initial pieces and continuation pieces. If a word is split into the pieces p0…pn, then p0 is an initial piece and p1…pn are continuation pieces.

The word piece vocabulary of the bert-base-german-cased model splits the words in the German sentence

Freudenstadt ist ein anerkannter heilklimatischer und Kneippkurort.

into the following pieces:

Freude ##ns ##ta ##dt ist ein anerkannte ##r heil ##kl ##ima ##ti
##scher und Knei ##pp ##kur ##ort .

Many words occur as an initial piece in the vocabulary and are not split. ist, ein, and und are examples of such words. Other words do not occur in the piece vocabulary as-is and are split into pieces. For example, Kneippkurort is split into the initial piece Knei and the continuation pieces ##pp, ##kur, and ##ort (continuation pieces are prefixed by ## by convention).

When a piece vocabulary contains single-letter initial and continuation pieces for every letter in a language’s orthography, every word can be split into a sequence of pieces.1 In such a case the piece vocabulary provides full coverage. If a vocabulary does not provide full coverage, the special [UNK] piece is used to mark that the remainder of a word could not be split.
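To make the splitting concrete, here is a minimal sketch of greedy longest-match word piecing, which is how BERT-style word piece tokenizers pick pieces. The function name word_pieces and the toy vocabulary are made up for illustration. The [UNK] handling follows the description above (mark the remainder that could not be split); other implementations may replace the whole word instead.

```python
UNK = "[UNK]"


def word_pieces(word, vocab):
    """Split one word into pieces, always taking the longest matching piece."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Everything after the first piece is a continuation piece.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: mark the remainder as unknown and stop.
            pieces.append(UNK)
            break
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption, not the real German BERT vocabulary).
TOY_VOCAB = {"ist", "ein", "und", "Knei", "##pp", "##kur", "##ort"}

print(word_pieces("ist", TOY_VOCAB))           # ['ist']
print(word_pieces("Kneippkurort", TOY_VOCAB))  # ['Knei', '##pp', '##kur', '##ort']
print(word_pieces("Kneippbad", TOY_VOCAB))     # ['Knei', '##pp', '[UNK]']
```

The last call shows what missing coverage looks like: the toy vocabulary has no piece that can continue after ##pp, so the rest of the word collapses into [UNK].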

Back to our bug: are we breaking up words?

Our first hypothesis for the regression in accuracy was that we are applying the transformer model incorrectly to the word pieces.

spaCy doesn’t assume that inputs are sentence-split, so each component processes documents. However, the length of documents can be an issue for transformer models like BERT: the standard transformer architecture uses an attention mechanism where time and space grow quadratically with the sequence length, and most BERT models only support inputs consisting of at most 512 pieces.

We solve this issue by processing documents in strided spans. We iterate over the pieces of a document extracting spans. For example, if we iterate over the previous example sentence extracting spans of length 5, the first two iterations will extract the following spans:

Freude ##ns ##ta ##dt ist ein anerkannte ##r heil ##kl ##ima ##ti
<-------- Span 1 -------> <--------- Span 2 --------->

We could then pass these two spans as inputs to the transformer. However, this is not really great, since each span is isolated and the succeeding/preceding text is not used as context. So instead we extract these spans as part of longer windows. For example, with a span length of 5 pieces and a window of 6 pieces, the first two iterations look like this:

Freude ##ns ##ta ##dt ist ein anerkannte ##r heil ##kl ##ima ##ti
<-------- Span 1 -------> <--------- Span 2 --------->
<--------- Window 1 -------->
                          <----------- Window 2 ----------->

We increase the context in two different ways:

  1. We give the windows as inputs to the transformer and then slice the window representations to get the spans. Using larger windows than spans allows the transformer to take succeeding pieces into account.
  2. We average the representations of the overlap between Window N and Span N+1. This adds information from the preceding context to the representation.

In our models the strides and windows are typically over 100 pieces. The concatenated span representations are used by downstream components such as the morphological tagger or dependency parser.
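The span and window scheme can be sketched in a few lines. This is an illustration under assumptions, not the spacy-curated-transformers implementation: the function names are hypothetical, a single float per piece stands in for a hidden vector, and the averaging simply takes the mean wherever two windows cover the same piece.

```python
def spans_and_windows(n_pieces, span_len, window_len):
    """Return [(span, window), ...]; each is a (start, end) pair of piece offsets."""
    if not 0 < span_len <= window_len:
        raise ValueError("need 0 < span_len <= window_len")
    result = []
    for start in range(0, n_pieces, span_len):
        span = (start, min(start + span_len, n_pieces))
        # The window starts with the span and adds succeeding pieces as context.
        window = (start, min(start + window_len, n_pieces))
        result.append((span, window))
    return result


def average_overlaps(windows, window_outputs, n_pieces):
    """Average per-piece values wherever two windows cover the same piece."""
    sums = [0.0] * n_pieces
    counts = [0] * n_pieces
    for (start, _end), output in zip(windows, window_outputs):
        for offset, value in enumerate(output):
            sums[start + offset] += value
            counts[start + offset] += 1
    return [total / count for total, count in zip(sums, counts)]


pieces = ("Freude ##ns ##ta ##dt ist ein anerkannte ##r heil ##kl ##ima ##ti "
          "##scher und Knei ##pp ##kur ##ort .").split()

layout = spans_and_windows(len(pieces), span_len=5, window_len=6)
for (s0, s1), (w0, w1) in layout[:2]:
    print("span:  ", pieces[s0:s1])
    print("window:", pieces[w0:w1])

# Stand-in for transformer output: window i yields the value i + 1 for each piece.
windows = [window for _span, window in layout]
outputs = [[float(i + 1)] * (w1 - w0) for i, (w0, w1) in enumerate(windows)]
merged = average_overlaps(windows, outputs, len(pieces))
print(merged[4:7])  # [1.0, 1.5, 2.0]
```

The piece ein sits in both Window 1 and Window 2, so its merged value is the mean of the two window outputs, while its neighbours only see one window each.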

Since we specify the window and span size in terms of pieces, the span boundary can lie within a word, such as in heilklimatischer in the example above. We thought that this wouldn’t be an issue, since we are averaging the overlap between a span and the preceding window. However, with the regression in accuracy of surface-oriented tasks, we started to think that this may be the culprit.

We addressed this shortcoming by always rounding up the span boundary to the next word boundary. With this change, we never cut up any words. So in the earlier example, we would extract the following spans:

Freude ##ns ##ta ##dt ist ein anerkannte ##r heil ##kl ##ima ##ti
<-------- Span 1 -------> <-------------------- Span 2 ----------

##scher und Knei ##pp ##kur ##ort .
------>
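Rounding a span boundary up to the next word boundary could look roughly like the sketch below. The helper names are hypothetical, and detecting word starts via the ## prefix is a simplification made for this example; a real implementation would more likely rely on a token-to-piece mapping.

```python
def round_up_to_word_boundary(pieces, end):
    """Move a span end forward until the next piece starts a new word."""
    while end < len(pieces) and pieces[end].startswith("##"):
        end += 1
    return end


def word_aligned_spans(pieces, span_len):
    """Extract consecutive spans of at least span_len pieces that never cut a word."""
    spans = []
    start = 0
    while start < len(pieces):
        end = min(start + span_len, len(pieces))
        end = round_up_to_word_boundary(pieces, end)
        spans.append((start, end))
        start = end
    return spans


pieces = ("Freude ##ns ##ta ##dt ist ein anerkannte ##r heil ##kl ##ima ##ti "
          "##scher und Knei ##pp ##kur ##ort .").split()

for start, end in word_aligned_spans(pieces, span_len=5):
    print(pieces[start:end])
```

With a span length of 5, the second span grows from 5 to 8 pieces so that heilklimatischer stays in one piece sequence.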

We retrained some of the models and… we didn’t get any improvement at all.

Nix saves the day?

After a good night’s sleep it dawned upon me that a colleague at the University of Tübingen had seen a similar regression a few years before. He was working on a multi-task annotator that used Hugging Face Transformers. Overnight, the morphological tagging and lemmatization accuracies of his German models went down. He spent several days tinkering with different git revisions of his project and different versions of his project’s dependencies, but he couldn’t reproduce the previous, higher accuracies.

At the same time, I was working on the precursor to SyntaxDot and was not seeing any regressions. I was quite a fan of Nix and meticulously wrote Nix derivations for everything, including derivations for training data and pretrained models. Nix derivations that fetch data (which are so-called fixed-output derivations) specify the hash of the data. The hash is used to verify the data when it is first downloaded, but is also used for addressing the data in the Nix store.

A few days after my colleague bumped into this regression, I wanted to finetune a model on a different machine and let Nix reproduce my full experimental environment. However, pretty quickly the build failed with:

error: hash mismatch in file downloaded from '':
  wanted: sha256:0g91w3nlq9li113lvw73j4b5v00kpml6wlhk71qlzih6wzw2gqbj
    got:    sha256:18078asvjdf4ippbjh93n02905nf4x85vcyls8c598d1pv1d256z

So, there was a hash mismatch in the German BERT piece vocabulary, meaning that the upstream vocabulary was changed in-place! Given that my colleague had a regression for German specifically, this warranted a deeper dive into the vocabulary changes.
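The same safety net is easy to get outside Nix by pinning a hash next to every downloaded artifact. The sketch below is a plain Python illustration of that idea, not how Nix itself works: the function names and the toy vocabulary contents are assumptions, and the hashes are printed as hex digests rather than in Nix's base-32 format.

```python
import hashlib


def sha256_hex(data):
    """Hex-encoded SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def verify_download(data, expected_sha256):
    """Return data unchanged if it matches the pinned hash, else raise."""
    actual = sha256_hex(data)
    if actual != expected_sha256:
        raise ValueError(
            f"hash mismatch:\n  wanted: sha256:{expected_sha256}\n  got:    sha256:{actual}"
        )
    return data


# Toy stand-ins for two upstream versions of a vocabulary file.
old_vocab = "A\n##W\n##O\n##-\n".encode("utf-8")
new_vocab = "A\n##W\n##O\n-\n".encode("utf-8")

pinned = sha256_hex(old_vocab)  # recorded when the file was first fetched

verify_download(old_vocab, pinned)  # passes silently
try:
    verify_download(new_vocab, pinned)  # upstream changed the file in place
except ValueError as err:
    print(err)
```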

Vocab debugging

The first thing I did was to create a diff of the two versions of the vocabulary to understand the changes. One part of the diff that immediately grabbed my attention was

@@ -26933,7 +26933,7 @@

This change removes the continuation piece ##- and replaces it with the initial piece -. Taking the rest of the vocabulary into account, this is a pretty fundamental change, because this removes the possibility to split up most words with a dash in them. So for instance, the compound AWO-Mitarbeiter (AWO employee(s)) would be tokenized as ['A', '##W', '##O', '##-', '##Mitarbeiter'] before this change. However, after the removal of the dash continuation piece, there is nothing we can do after the first three pieces except for calling it a day and using the special unknown piece: ['A', '##W', '##O', '[UNK]']

This removes the head of the compound and along with it any chance to do a proper morphological analysis. Given that roughly 1% of the tokens in the German TüBa-D/Z treebank (the treebank that my colleague used) contain a dash, it’s not surprising to see these large regressions in accuracy.
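Comparing two vocabulary versions like this takes only the standard library. The sketch below uses difflib on two tiny, made-up vocabularies (one piece per line, as in BERT vocabulary files); the file names are hypothetical labels and the contents are not the real German BERT vocabulary.

```python
import difflib

# Toy vocabularies, one piece per entry (assumptions for illustration).
old_vocab = ["A", "##W", "##O", "##-", "##Mitarbeiter"]
new_vocab = ["A", "##W", "##O", "-", "##Mitarbeiter"]

diff = list(
    difflib.unified_diff(
        old_vocab, new_vocab, "vocab.old.txt", "vocab.new.txt", lineterm=""
    )
)
print("\n".join(diff))

# Set differences give a quick summary of which pieces vanished or appeared.
removed = set(old_vocab) - set(new_vocab)
added = set(new_vocab) - set(old_vocab)
print("removed:", removed)
print("added:  ", added)
```

A continuation piece showing up under removed while its bare form shows up under added is exactly the kind of swap that is worth a closer look.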

The fix

We e-mailed our findings to the German BERT authors (who were very responsive and helpful). They hand-edited the piece vocabulary to adapt it to the tokenizer used by the BERT implementation in Hugging Face Transformers. This tokenizer splits words on whitespace and then greedily on punctuation characters, before splitting words into pieces. So a German sentence like

AWO-Mitarbeiter fordern mehr Gehalt.

is tokenized to:

AWO - Mitarbeiter fordern mehr Gehalt .

When punctuation characters are always split up into separate tokens, it does not make sense to have continuation pieces for punctuation characters such as the dash in your piece vocabulary. This makes it clear why the authors of the German BERT converted the continuation piece into an initial piece.

However, spaCy’s tokenizer tokenizes the sentence as:

AWO-Mitarbeiter fordern mehr Gehalt .

which is the correct tokenization, since AWO-Mitarbeiter is considered to be a single word in German. However, this is where things went wrong in spacy-curated-transformers. Since Hugging Face Transformers uses a cruder form of tokenization, the wordpiece vocabularies of most BERT models do not have continuation pieces for punctuation within tokens, and we end up with a pesky [UNK] piece when splitting such tokens.

Once we figured out that this was the root of our problem, the solution was simple. spacy-curated-transformers keeps track of the mapping between tokens and pieces. So we can safely apply the same greedy punctuation splitting within our piece tokenizer. This allows us to feed the expected tokenization to the transformer, while we retain spaCy’s tokenization at a higher level.
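Here is a rough sketch of the idea, not the actual spacy-curated-transformers code. All names and the toy vocabulary are assumptions, punctuation is detected with Unicode categories only (a simplification), and the token-to-piece mapping is kept as one list of pieces per token.

```python
import unicodedata

UNK = "[UNK]"


def word_pieces(word, vocab):
    """Greedy longest-match split; the unsplittable remainder becomes [UNK]."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            candidate = ("##" if start > 0 else "") + word[start:end]
            if candidate in vocab:
                pieces.append(candidate)
                start = end
                break
        else:
            pieces.append(UNK)
            break
    return pieces


def split_on_punctuation(token):
    """Turn every punctuation character into its own chunk."""
    chunks, current = [], ""
    for char in token:
        if unicodedata.category(char).startswith("P"):
            if current:
                chunks.append(current)
                current = ""
            chunks.append(char)
        else:
            current += char
    if current:
        chunks.append(current)
    return chunks


def encode_tokens(tokens, vocab, split_punct):
    """Return one list of pieces per token, preserving the token alignment."""
    pieces_per_token = []
    for token in tokens:
        chunks = split_on_punctuation(token) if split_punct else [token]
        pieces_per_token.append(
            [piece for chunk in chunks for piece in word_pieces(chunk, vocab)]
        )
    return pieces_per_token


# Toy vocabulary in the style of the edited German BERT vocabulary: "-" only
# exists as an initial piece.
VOCAB = {"A", "##W", "##O", "-", "Mitarbeiter", "fordern", "mehr", "Gehalt", "."}
tokens = ["AWO-Mitarbeiter", "fordern", "mehr", "Gehalt", "."]

print(encode_tokens(tokens, VOCAB, split_punct=False)[0])
# ['A', '##W', '##O', '[UNK]']
print(encode_tokens(tokens, VOCAB, split_punct=True)[0])
# ['A', '##W', '##O', '-', 'Mitarbeiter']
```

Because the pieces stay grouped per token, downstream components still see five tokens, even though the transformer sees the dash as a piece of its own.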

After making this change, the accuracy of spacy-curated-transformers is on par with our existing transformer models. 😮‍💨

  1. Unfortunately, this is not always the case: some BERT vocabularies don’t contain all characters as pieces.