@@ -4,12 +4,20 @@ There are two multilingual models currently available. We do not plan to release
 more single-language models, but we may release `BERT-Large` versions of these
 two in the future:
 
-*   **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
+*   **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
+    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
     102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
 *   **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
     Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
     parameters
 
+**The `Multilingual Cased (New)` model also fixes normalization issues in many
+languages, so it is recommended in languages with non-Latin alphabets (and is
+often better for most languages with Latin alphabets). When using this model,
+make sure to pass `--do_lower_case=false` to `run_pretraining.py` and other
+scripts.**
+
 See the [list of languages](#list-of-languages) that the Multilingual model
 supports. The Multilingual model does include Chinese (and English), but if your
 fine-tuning data is Chinese-only, then the Chinese model will likely produce
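The casing behavior behind the `--do_lower_case` flag added above can be sketched in plain Python. This is a hypothetical approximation of uncased preprocessing (lowercasing plus NFD accent stripping), not the repository's actual tokenizer; `uncased_normalize` is an illustrative helper, not a function from the codebase:

```python
import unicodedata

def uncased_normalize(text):
    # Approximation of uncased preprocessing: lowercase the text, then
    # decompose it (NFD) and drop combining accent marks (category "Mn").
    text = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Distinct words collapse to the same string once uncased, e.g. German
# "schon" (already) vs. "schön" (beautiful):
print(uncased_normalize("schon"))  # schon
print(uncased_normalize("schön"))  # schon
```

This collapse is why the cased model, run with `--do_lower_case=false`, tends to help in languages where case and accent markers carry meaning.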
@@ -26,13 +34,14 @@ XNLI, not Google NMT). For clarity, we only report on 6 languages below:
 
 <!-- mdformat off(no table) -->
 
-| System                          | English  | Chinese  | Spanish  | German   | Arabic   | Urdu     |
-| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
-| XNLI Baseline - Translate Train | 73.7     | 67.0     | 68.8     | 66.5     | 65.8     | 56.6     |
-| XNLI Baseline - Translate Test  | 73.7     | 68.3     | 70.7     | 68.7     | 66.8     | 59.3     |
-| BERT - Translate Train          | **81.4** | **74.2** | **77.3** | **75.2** | **70.5** | 61.7     |
-| BERT - Translate Test           | 81.4     | 70.1     | 74.9     | 74.4     | 70.4     | **62.1** |
-| BERT - Zero Shot                | 81.4     | 63.8     | 74.3     | 70.5     | 62.1     | 58.3     |
+| System                          | English  | Chinese  | Spanish  | German   | Arabic   | Urdu     |
+| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
+| XNLI Baseline - Translate Train | 73.7     | 67.0     | 68.8     | 66.5     | 65.8     | 56.6     |
+| XNLI Baseline - Translate Test  | 73.7     | 68.3     | 70.7     | 68.7     | 66.8     | 59.3     |
+| BERT - Translate Train Cased    | **81.9** | **76.6** | **77.8** | **75.9** | **70.7** | 61.6     |
+| BERT - Translate Train Uncased  | 81.4     | 74.2     | 77.3     | 75.2     | 70.5     | 61.7     |
+| BERT - Translate Test Uncased   | 81.4     | 70.1     | 74.9     | 74.4     | 70.4     | **62.1** |
+| BERT - Zero Shot Uncased        | 81.4     | 63.8     | 74.3     | 70.5     | 62.1     | 58.3     |
 
 <!-- mdformat on -->
 
@@ -292,8 +301,5 @@ chosen because they are the top 100 languages with the largest Wikipedias:
 *   Western Punjabi
 *   Yoruba
 
-The only language which we had to unfortunately exclude was Thai, since it is
-the only language (other than Chinese) that does not use whitespace to delimit
-words, and it has too many characters-per-word to use character-based
-tokenization. Our WordPiece algorithm is quadratic with respect to the size of
-the input token so very long character strings do not work with it.
+The **Multilingual Cased (New)** release additionally contains **Thai** and
+**Mongolian**, which were not included in the original release.