create_datasets_from_start.sh Multiple issues #489

@Esaada

Description


Hi, I'm trying to pretrain BERT Large, and I'm downloading and preprocessing the data.
I've hit multiple issues:
1. Downloading: very few of the BookCorpus links are still valid; after downloading I ended up with only 250 txt files.
I know BookCorpus is a problematic dataset, but only 250?
2. Text formatting: the Wikipedia text-formatting step only produces the AA and AB directories under data/extracted/wikicorpus_en/.
The downloaded dump was 73G; is this normal?
3. Sharding: this step just dies. Running the script I get:
data/create_datasets_from_start.sh: line 38: 60 Killed python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py --action sharding --dataset books_wiki_en_corpus
My input files are:
/workspace/bert/data/formatted_one_article_per_line/bookscorpus_one_book_per_line.txt
/workspace/bert/data/formatted_one_article_per_line/wikicorpus_en_one_article_per_line.txt
Both exist.
Since the error is not informative at all, I have no idea how to take the next step toward solving this; see the diagnostic sketch below.
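My best guess (an assumption on my part, not something the script reports) is that "Killed" means the kernel's OOM killer terminated the Python process while it was loading both corpora for sharding. A minimal way to check, assuming the same container and paths quoted above and that GNU time is installed:

# Look for OOM-killer messages in the kernel log (may require root or host access).
dmesg -T | grep -iE 'out of memory|killed process'

# Re-run only the sharding step outside the wrapper script and record peak memory;
# GNU time's -v flag prints "Maximum resident set size".
/usr/bin/time -v python3 ${BERT_PREP_WORKING_DIR}/bertPrep.py \
    --action sharding --dataset books_wiki_en_corpus

If the maximum resident set size gets close to the machine's RAM, that would point to needing more memory (or a smaller input) for this step.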
Thanks a lot.

Labels

bug
