Closed as not planned
Labels
CI/CD (Affects CI/CD configuration) · bug (Something isn't working) · ci-failure (PR has at least one CI failure) · stale
Description
Describe the bug
https://github.com/instructlab/instructlab/actions/runs/14519599987/job/40737534990?pr=3295
```
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Permuting layer 13
Permuting layer 14
Permuting layer 15
Permuting layer 16
Permuting layer 17
Permuting layer 18
Permuting layer 19
Permuting layer 20
Permuting layer 21
Permuting layer 22
Permuting layer 23
Permuting layer 24
Permuting layer 25
Permuting layer 26
Permuting layer 27
Permuting layer 28
Permuting layer 29
Permuting layer 30
Permuting layer 31
model.embed_tokens.weight -> token_embd.weight | F32 | [32008, 4096]
model.layers.0.self_attn.q_proj.weight -> blk.0.attn_q.weight | F32 | [4096, 4096]
model.layers.0.self_attn.k_proj.weight -> blk.0.attn_k.weight | F32 | [4096, 4096]
model.layers.0.self_attn.v_proj.weight -> blk.0.attn_v.weight | F32 | [4096, 4096]
model.layers.0.self_attn.o_proj.weight -> blk.0.attn_output.weight | F32 | [4096, 4096]
model.layers.0.mlp.gate_proj.weight -> blk.0.ffn_gate.weight | F32 | [11008, 4096]
model.layers.0.mlp.up_proj.weight -> blk.0.ffn_up.weight | F32 | [11008, 4096]
model.layers.0.mlp.down_proj.weight -> blk.0.ffn_down.weight | F32 | [4096, 11008]
model.layers.0.input_layernorm.weight -> blk.0.attn_norm.weight | F32 | [4096]
model.layers.0.post_attention_layernorm.weight -> blk.0.ffn_norm.weight | F32 | [4096]
model.layers.1.self_attn.q_proj.weight -> blk.1.attn_q.weight | F32 | [4096, 4096]
model.layers.1.self_attn.k_proj.weight -> blk.1.attn_k.weight | F32 | [4096, 4096]
model.layers.1.self_attn.v_proj.weight -> blk.1.attn_v.weight | F32 | [4096, 4096]
model.layers.1.self_attn.o_proj.weight -> blk.1.attn_output.weight | F32 | [4096, 4096]
model.layers.1.mlp.gate_proj.weight -> blk.1.ffn_gate.weight | F32 | [11008, 4096]
[155/291] Writing tensor blk.17.attn_q.weight | size 4096 x 4096 | type F32 | T+ 109
```
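For context, the `model.layers.N.* -> blk.N.*` lines above are the converter renaming Hugging Face tensor names to GGUF tensor names. A simplified illustration of that renaming (not the actual `llamacpp_convert_to_gguf` code, just a sketch reconstructed from the mappings visible in the log):

```python
import re

# Suffix mapping reconstructed from the log output above (illustrative only).
SUFFIX_MAP = {
    "self_attn.q_proj.weight": "attn_q.weight",
    "self_attn.k_proj.weight": "attn_k.weight",
    "self_attn.v_proj.weight": "attn_v.weight",
    "self_attn.o_proj.weight": "attn_output.weight",
    "mlp.gate_proj.weight": "ffn_gate.weight",
    "mlp.up_proj.weight": "ffn_up.weight",
    "mlp.down_proj.weight": "ffn_down.weight",
    "input_layernorm.weight": "attn_norm.weight",
    "post_attention_layernorm.weight": "ffn_norm.weight",
}

def map_name(hf_name: str) -> str:
    """Map a Hugging Face tensor name to its GGUF equivalent (sketch)."""
    if hf_name == "model.embed_tokens.weight":
        return "token_embd.weight"
    m = re.match(r"model\.layers\.(\d+)\.(.+)", hf_name)
    if m and m.group(2) in SUFFIX_MAP:
        return f"blk.{m.group(1)}.{SUFFIX_MAP[m.group(2)]}"
    return hf_name

print(map_name("model.layers.0.self_attn.q_proj.weight"))  # blk.0.attn_q.weight
```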
```
    sys.exit(ilab())
             ^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 524, in train
    full_train.train(
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/model/full_train.py", line 395, in train
    llamacpp_convert_to_gguf.convert_llama_to_gguf(
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/llamacpp/llamacpp_convert_to_gguf.py", line 1731, in convert_llama_to_gguf
    OutputFile.write_all(
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/llamacpp/llamacpp_convert_to_gguf.py", line 1340, in write_all
    of.gguf.write_tensor_data(ndarray)
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/gguf/gguf_writer.py", line 417, in write_tensor_data
    tensor.tofile(fout)
OSError: Not enough free space to write 67108864 bytes
```
That is exactly 64 MiB. The error comes from the gguf library, which relies on NumPy's `ndarray.tofile` to write tensor data. 64 MiB doesn't seem like a lot to me, and the `df -h` output in the job shows that the machine has an EBS volume of 2 TB that is nearly empty before the failure. This may be a bug in gguf or some other underlying component.
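To illustrate the arithmetic: a single 4096 × 4096 F32 tensor (the shape being written when the job died, per the log) is exactly 67108864 bytes. A minimal sketch of a pre-flight check one could run at the failure site, assuming the output directory is the current directory; the `free` comparison is a hypothetical diagnostic, not part of the gguf API:

```python
import shutil
import numpy as np

# A 4096 x 4096 float32 tensor is exactly 64 MiB, matching the size
# the writer failed to append (67108864 bytes).
tensor = np.zeros((4096, 4096), dtype=np.float32)
assert tensor.nbytes == 4096 * 4096 * 4 == 67_108_864

# Hypothetical pre-flight check: compare the tensor size against the
# free space reported for the output directory. If this check passes
# but tofile() still raises OSError, the failure is likely not a plain
# full disk (e.g. the file may live on a size-capped tmpfs, or a quota
# or file-size limit applies to the runner).
free = shutil.disk_usage(".").free
print(f"need {tensor.nbytes} bytes, {free} bytes free")
```

This would help distinguish "the volume is genuinely full" from the scenario described above, where `df -h` says the 2 TB EBS volume is nearly empty yet the write still fails.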