Description
Hi, I'm trying to pretrain BERT Large, and I'm running into multiple issues while downloading and preprocessing the data:
1. Downloading: very few of the BookCorpus links are valid; after downloading I got only 250 txt files. I know dead links are a known problem, but only 250?
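For reference, this is the quick check I use to count the downloaded files (the path is an assumption based on the repo's default layout, rooted at `BERT_PREP_WORKING_DIR`):

```shell
# Hypothetical sanity check: count how many BookCorpus text files
# actually made it to disk. DOWNLOAD_DIR is assumed from the repo's
# default directory layout.
DOWNLOAD_DIR="${BERT_PREP_WORKING_DIR:-/workspace/bert/data}/download/bookscorpus"
find "$DOWNLOAD_DIR" -name '*.txt' 2>/dev/null | wc -l
```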
2. Text formatting: the Wikipedia text-formatting step extracts only the AA and AB directories under data/extracted/wikicorpus_en/. The downloaded dump was 73G; is this normal?
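For context, this is how I counted the extracted shards. A full English dump run through the extractor normally produces many two-letter shard directories (AA, AB, AC, ...), so stopping at AB looks like the extraction aborted early. The path is an assumption based on the repo's default layout:

```shell
# Hypothetical check: count the two-letter shard directories produced by
# the Wikipedia extraction step. EXTRACT_DIR is assumed from the repo's
# default directory layout.
EXTRACT_DIR="${BERT_PREP_WORKING_DIR:-/workspace/bert/data}/extracted/wikicorpus_en"
ls "$EXTRACT_DIR" 2>/dev/null | wc -l
```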
3. Sharding: this step just dies. I'm running the script and getting:
data/create_datasets_from_start.sh: line 38: 60 Killed python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action sharding --dataset books_wiki_en_corpus
My input files are:
input file: /workspace/bert/data/formatted_one_article_per_line/bookscorpus_one_book_per_line.txt
input file: /workspace/bert/data/formatted_one_article_per_line/wikicorpus_en_one_article_per_line.txt
Both exist.
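One hedged guess about the bare "Killed": that message means the process received SIGKILL, which on Linux is most often the kernel OOM killer reclaiming memory during the sharding step. A process terminated by signal 9 exits with status 137 (128 + 9), which is easy to demonstrate:

```shell
# A process killed with SIGKILL (signal 9) exits with status 137 = 128 + 9,
# which matches the bare "Killed" the script printed.
sh -c 'kill -9 $$'
echo "exit status: $?"   # prints "exit status: 137"
# If the OOM killer was responsible, the kernel log should say so
# (assumed diagnostic, requires root on some systems):
#   dmesg | grep -iE 'killed process|out of memory'
```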
Since the error message is not informative at all, I have no idea how to move forward on this.
Thanks a lot.