
This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.

We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.

Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.
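To make the distillation setting concrete, here is a minimal NumPy sketch of a soft-label loss in which the compact student is trained to match the larger teacher's predicted distribution. This is an illustration only: the function names, the toy logits, and the temperature value are assumptions, not code from this repository or the paper.

```python
# Illustrative soft-label distillation loss (NumPy), not the paper's training code.
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Cross-entropy of the student against the teacher's softened predictions."""
    teacher_probs = softmax(teacher_logits / temperature)
    student_log_probs = np.log(softmax(student_logits / temperature))
    return -(teacher_probs * student_log_probs).sum(axis=-1).mean()

# Hypothetical task logits for a batch of 2 examples and 3 classes.
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
student_logits = np.array([[2.5, 1.2, 0.3], [0.4, 2.0, 0.6]])
print(distillation_loss(student_logits, teacher_logits, temperature=2.0))
```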

You can download all 24 from here, or individually from the table below. Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.

Here are the corresponding GLUE scores on the test set.

For each task, we selected the best fine-tuning hyperparameters from the lists below, and trained for 4 epochs.
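The selection procedure described above can be pictured as a small grid search. The sketch below is hypothetical: the candidate values and the fine_tune_and_evaluate stub are placeholders, not the actual lists or tooling from this release.

```python
# Hypothetical sketch: try each (batch size, learning rate) pair for 4 epochs
# and keep the configuration with the best dev-set score.
import itertools

BATCH_SIZES = [16, 32]         # placeholder candidates
LEARNING_RATES = [3e-5, 5e-5]  # placeholder candidates
EPOCHS = 4

def fine_tune_and_evaluate(batch_size, learning_rate, epochs):
    """Stub standing in for fine-tuning a model and returning its dev-set score."""
    return -abs(learning_rate - 4e-5) - abs(batch_size - 24) / 1000.0  # fake score

best = max(
    itertools.product(BATCH_SIZES, LEARNING_RATES),
    key=lambda hp: fine_tune_and_evaluate(hp[0], hp[1], EPOCHS),
)
print("best (batch size, learning rate):", best)
```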

If you use these models, please cite the following paper:

@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962},
  year={2019}
}

***** New May 31st, 2019: Whole Word Masking Models *****

This is a release of several new models which were the result of an improvement to the pre-processing code.

In the original pre-processing code, we randomly select WordPiece tokens to mask. For example:

Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head
Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once.

Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

The training is identical - we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too 'easy' for words that had been split into multiple WordPieces.
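To make the two strategies concrete, here is a minimal Python sketch that contrasts them on the example sentence. It is illustrative only: the helper names, the 15% rate, and the per-word masking decision are assumptions, not the logic in this repository's pre-processing code (which also keeps the overall masking rate the same).

```python
# Toy contrast of random WordPiece masking vs. Whole Word Masking (illustrative only).
import random

TOKENS = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()

def random_wordpiece_mask(tokens, rate=0.15, rng=random):
    """Original strategy: each WordPiece token is considered for masking independently."""
    return ["[MASK]" if rng.random() < rate else tok for tok in tokens]

def whole_word_mask(tokens, rate=0.15, rng=random):
    """Whole Word Masking: a word and all of its ## continuation pieces are masked together."""
    groups = []  # indices of the WordPieces that make up each whole word
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(i)  # ## pieces attach to the preceding word
        else:
            groups.append([i])
    masked = list(tokens)
    for group in groups:
        if rng.random() < rate:
            for i in group:
                masked[i] = "[MASK]"  # every piece of the word is masked at once
    return masked

if __name__ == "__main__":
    rng = random.Random(7)
    print(" ".join(random_wordpiece_mask(TOKENS, rng=rng)))
    print(" ".join(whole_word_mask(TOKENS, rng=rng)))
```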
