This repository was archived by the owner on Sep 25, 2025. It is now read-only.

Commit 332a687

Adding new multilingual model

1 parent 1cd50d7, commit 332a687

3 files changed: 38 additions, 15 deletions

README.md: 18 additions, 1 deletion

@@ -1,5 +1,20 @@
 # BERT
 
+**\*\*\*\*\* New November 23rd, 2018: Un-normalized multilingual model + Thai +
+Mongolian \*\*\*\*\***
+
+We uploaded a new multilingual model which does *not* perform any normalization
+on the input (no lower casing, accent stripping, or Unicode normalization), and
+additionally includes Thai and Mongolian.
+
+**It is recommended to use this version for developing multilingual models,
+especially on languages with non-Latin alphabets.**
+
+This does not require any code changes, and can be downloaded here:
+
+*   **[`BERT-Base, Multilingual Cased`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
+    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+
 **\*\*\*\*\* New November 15th, 2018: SOTA SQuAD 2.0 System \*\*\*\*\***
 
 We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is
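To make "does *not* perform any normalization" concrete: the original uncased multilingual pipeline lower-cases input and strips accents by dropping combining marks after Unicode NFD decomposition, which is lossy for many non-Latin scripts. A minimal sketch of that preprocessing, which the new cased model skips (the standalone function name is illustrative, not from the BERT codebase):

```python
import unicodedata

def uncased_normalize(text):
    """Sketch of the normalization the *uncased* models apply and the new
    cased multilingual model skips: lower casing plus accent stripping
    (drop combining marks, category Mn, after NFD decomposition)."""
    text = text.lower()
    text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

print(uncased_normalize("Émile Zöla"))  # emile zola
```

For Latin-script text this mostly merges accented and unaccented variants, but for scripts where combining marks carry essential information (vowel signs, tone marks), stripping them destroys distinctions, which is why the cased model is preferred for non-Latin alphabets.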
@@ -207,7 +222,9 @@ The links to the models are here (right-click, 'Save link as...' on the name):
     12-layer, 768-hidden, 12-heads, 110M parameters
 *   **`BERT-Large, Cased`**: 24-layer, 1024-hidden, 16-heads, 340M parameters
     (Not available yet. Needs to be re-generated).
-*   **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
+*   **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
+    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
     102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
 *   **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
     Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
multilingual.md: 19 additions, 13 deletions

@@ -4,12 +4,20 @@ There are two multilingual models currently available. We do not plan to release
 more single-language models, but we may release `BERT-Large` versions of these
 two in the future:
 
-*   **[`BERT-Base, Multilingual`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
+*   **[`BERT-Base, Multilingual Cased (New, recommended)`](https://storage.googleapis.com/bert_models/2018_11_23/multi_cased_L-12_H-768_A-12.zip)**:
+    104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
+*   **[`BERT-Base, Multilingual Uncased (Orig, not recommended)`](https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H-768_A-12.zip)**:
     102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
 *   **[`BERT-Base, Chinese`](https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip)**:
     Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M
     parameters
 
+**The `Multilingual Cased (New)` model also fixes normalization issues in many
+languages, so it is recommended in languages with non-Latin alphabets (and is
+often better for most languages with Latin alphabets). When using this model,
+make sure to pass `--do_lower_case=false` to `run_pretraining.py` and other
+scripts.**
+
 See the [list of languages](#list-of-languages) that the Multilingual model
 supports. The Multilingual model does include Chinese (and English), but if your
 fine-tuning data is Chinese-only, then the Chinese model will likely produce
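As a usage sketch, a fine-tuning run with the cased checkpoint might look like the following (the paths, environment variables, and XNLI task setup are illustrative; the key point is passing `--do_lower_case=false` so tokenization matches the cased vocabulary):

```shell
export BERT_BASE_DIR=/path/to/multi_cased_L-12_H-768_A-12

python run_classifier.py \
  --task_name=XNLI \
  --do_train=true \
  --data_dir=$XNLI_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --do_lower_case=false \
  --output_dir=/tmp/xnli_output/
```

If `--do_lower_case` is left at its default, the tokenizer will lower-case and accent-strip input that the cased vocabulary was never trained on, silently degrading results.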
@@ -26,13 +34,14 @@ XNLI, not Google NMT). For clarity, we only report on 6 languages below:
 
 <!-- mdformat off(no table) -->
 
-| System                          | English  | Chinese  | Spanish  | German   | Arabic   | Urdu     |
-| ------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
-| XNLI Baseline - Translate Train | 73.7     | 67.0     | 68.8     | 66.5     | 65.8     | 56.6     |
-| XNLI Baseline - Translate Test  | 73.7     | 68.3     | 70.7     | 68.7     | 66.8     | 59.3     |
-| BERT - Translate Train          | **81.4** | **74.2** | **77.3** | **75.2** | **70.5** | 61.7     |
-| BERT - Translate Test           | 81.4     | 70.1     | 74.9     | 74.4     | 70.4     | **62.1** |
-| BERT - Zero Shot                | 81.4     | 63.8     | 74.3     | 70.5     | 62.1     | 58.3     |
+| System                            | English  | Chinese  | Spanish  | German   | Arabic   | Urdu     |
+| --------------------------------- | -------- | -------- | -------- | -------- | -------- | -------- |
+| XNLI Baseline - Translate Train   | 73.7     | 67.0     | 68.8     | 66.5     | 65.8     | 56.6     |
+| XNLI Baseline - Translate Test    | 73.7     | 68.3     | 70.7     | 68.7     | 66.8     | 59.3     |
+| BERT - Translate Train Cased      | **81.9** | **76.6** | **77.8** | **75.9** | **70.7** | 61.6     |
+| BERT - Translate Train Uncased    | 81.4     | 74.2     | 77.3     | 75.2     | 70.5     | 61.7     |
+| BERT - Translate Test Uncased     | 81.4     | 70.1     | 74.9     | 74.4     | 70.4     | **62.1** |
+| BERT - Zero Shot Uncased          | 81.4     | 63.8     | 74.3     | 70.5     | 62.1     | 58.3     |
 
 <!-- mdformat on -->
 
@@ -292,8 +301,5 @@ chosen because they are the top 100 languages with the largest Wikipedias:
 *   Western Punjabi
 *   Yoruba
 
-The only language which we unfortunately had to exclude was Thai, since it is
-the only language (other than Chinese) that does not use whitespace to delimit
-words, and it has too many characters-per-word to use character-based
-tokenization. Our WordPiece algorithm is quadratic with respect to the size of
-the input token, so very long character strings do not work with it.
+The **Multilingual Cased (New)** release additionally contains **Thai** and
+**Mongolian**, which were not included in the original release.
tokenization.py: 1 addition, 1 deletion

@@ -249,7 +249,7 @@ def _clean_text(self, text):
 class WordpieceTokenizer(object):
   """Runs WordPiece tokenization."""
 
-  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=100):
+  def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
     self.vocab = vocab
     self.unk_token = unk_token
     self.max_input_chars_per_word = max_input_chars_per_word
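The `max_input_chars_per_word` bump matters because greedy longest-match-first WordPiece is quadratic in word length, so over-long "words" (such as long unsegmented Thai character runs) are mapped straight to the unknown token rather than tokenized. A self-contained sketch of that behavior, simplified from the repo's `WordpieceTokenizer` (the standalone function is illustrative):

```python
def wordpiece(word, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    # Words over the cap are not tokenized at all -- they become unk_token,
    # because the greedy matching loop below is quadratic in word length.
    if len(word) > max_input_chars_per_word:
        return [unk_token]
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        # Greedy longest-match-first: shrink the span until it is in vocab.
        while start < end:
            piece = ("##" if start > 0 else "") + word[start:end]
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return [unk_token]  # no subword in vocab covers this position
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece("unaffable", {"un", "##aff", "##able"}))
# ['un', '##aff', '##able']
```

Raising the cap from 100 to 200 lets longer whitespace-delimited runs (common once Thai is included) be split into subwords instead of collapsing to `[UNK]`, at a bounded quadratic cost per word.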
