Closed as not planned
Labels
CI/CD (Affects CI/CD configuration) · bug (Something isn't working) · ci-failure (PR has at least one CI failure) · stale
Description
Describe the bug
https://github.com/instructlab/instructlab/actions/runs/14519599987/job/40737534990?pr=3295
```
Permuting layer 0
Permuting layer 1
Permuting layer 2
Permuting layer 3
Permuting layer 4
Permuting layer 5
Permuting layer 6
Permuting layer 7
Permuting layer 8
Permuting layer 9
Permuting layer 10
Permuting layer 11
Permuting layer 12
Permuting layer 13
Permuting layer 14
Permuting layer 15
Permuting layer 16
Permuting layer 17
Permuting layer 18
Permuting layer 19
Permuting layer 20
Permuting layer 21
Permuting layer 22
Permuting layer 23
Permuting layer 24
Permuting layer 25
Permuting layer 26
Permuting layer 27
Permuting layer 28
Permuting layer 29
Permuting layer 30
Permuting layer 31
model.embed_tokens.weight -> token_embd.weight | F32 | [32008, 4096]
model.layers.0.self_attn.q_proj.weight -> blk.0.attn_q.weight | F32 | [4096, 4096]
model.layers.0.self_attn.k_proj.weight -> blk.0.attn_k.weight | F32 | [4096, 4096]
model.layers.0.self_attn.v_proj.weight -> blk.0.attn_v.weight | F32 | [4096, 4096]
model.layers.0.self_attn.o_proj.weight -> blk.0.attn_output.weight | F32 | [4096, 4096]
model.layers.0.mlp.gate_proj.weight -> blk.0.ffn_gate.weight | F32 | [11008, 4096]
model.layers.0.mlp.up_proj.weight -> blk.0.ffn_up.weight | F32 | [11008, 4096]
model.layers.0.mlp.down_proj.weight -> blk.0.ffn_down.weight | F32 | [4096, 11008]
model.layers.0.input_layernorm.weight -> blk.0.attn_norm.weight | F32 | [4096]
model.layers.0.post_attention_layernorm.weight -> blk.0.ffn_norm.weight | F32 | [4096]
model.layers.1.self_attn.q_proj.weight -> blk.1.attn_q.weight | F32 | [4096, 4096]
model.layers.1.self_attn.k_proj.weight -> blk.1.attn_k.weight | F32 | [4096, 4096]
model.layers.1.self_attn.v_proj.weight -> blk.1.attn_v.weight | F32 | [4096, 4096]
model.layers.1.self_attn.o_proj.weight -> blk.1.attn_output.weight | F32 | [4096, 4096]
model.layers.1.mlp.gate_proj.weight -> blk.1.ffn_gate.weight | F32 | [11008, 4096]
[155/291] Writing tensor blk.17.attn_q.weight | size 4096 x 4096 | type F32 | T+ 109
```
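For context, the `model.layers.N.* -> blk.N.*` lines above are the converter renaming Hugging Face tensor names to GGUF tensor names. A simplified illustration of that renaming (not the actual `llamacpp_convert_to_gguf` code, just a sketch reconstructed from the mappings visible in the log):

```python
import re

# Suffix mapping reconstructed from the log output above (illustrative only).
SUFFIX_MAP = {
    "self_attn.q_proj.weight": "attn_q.weight",
    "self_attn.k_proj.weight": "attn_k.weight",
    "self_attn.v_proj.weight": "attn_v.weight",
    "self_attn.o_proj.weight": "attn_output.weight",
    "mlp.gate_proj.weight": "ffn_gate.weight",
    "mlp.up_proj.weight": "ffn_up.weight",
    "mlp.down_proj.weight": "ffn_down.weight",
    "input_layernorm.weight": "attn_norm.weight",
    "post_attention_layernorm.weight": "ffn_norm.weight",
}

def map_name(hf_name: str) -> str:
    """Map a Hugging Face tensor name to its GGUF equivalent (sketch)."""
    if hf_name == "model.embed_tokens.weight":
        return "token_embd.weight"
    m = re.match(r"model\.layers\.(\d+)\.(.+)", hf_name)
    if m and m.group(2) in SUFFIX_MAP:
        return f"blk.{m.group(1)}.{SUFFIX_MAP[m.group(2)]}"
    return hf_name

print(map_name("model.layers.0.self_attn.q_proj.weight"))  # blk.0.attn_q.weight
```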
```
    sys.exit(ilab())
             ^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 524, in train
    full_train.train(
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/model/full_train.py", line 395, in train
    llamacpp_convert_to_gguf.convert_llama_to_gguf(
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/llamacpp/llamacpp_convert_to_gguf.py", line 1731, in convert_llama_to_gguf
    OutputFile.write_all(
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/instructlab/llamacpp/llamacpp_convert_to_gguf.py", line 1340, in write_all
    of.gguf.write_tensor_data(ndarray)
  File "/actions-runner/_work/instructlab/instructlab/venv/lib64/python3.11/site-packages/gguf/gguf_writer.py", line 417, in write_tensor_data
    tensor.tofile(fout)
OSError: Not enough free space to write 67108864 bytes
```
That is exactly 64 MiB. The error comes from the gguf library, which relies on NumPy's `ndarray.tofile` to write tensor data. 64 MiB doesn't seem like a lot to me, and the `df -h` output in the job shows that the machine has an EBS volume of 2 TB that is nearly empty before the failure. This may be a bug in gguf or some other underlying component.
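To illustrate the arithmetic: a single 4096 × 4096 F32 tensor (the shape being written when the job died, per the log) is exactly 67108864 bytes. A minimal sketch of a pre-flight check one could run at the failure site, assuming the output directory is the current directory; the `free` comparison is a hypothetical diagnostic, not part of the gguf API:

```python
import shutil
import numpy as np

# A 4096 x 4096 float32 tensor is exactly 64 MiB, matching the size
# the writer failed to append (67108864 bytes).
tensor = np.zeros((4096, 4096), dtype=np.float32)
assert tensor.nbytes == 4096 * 4096 * 4 == 67_108_864

# Hypothetical pre-flight check: compare the tensor size against the
# free space reported for the output directory. If this check passes
# but tofile() still raises OSError, the failure is likely not a plain
# full disk (e.g. the file may live on a size-capped tmpfs, or a quota
# or file-size limit applies to the runner).
free = shutil.disk_usage(".").free
print(f"need {tensor.nbytes} bytes, {free} bytes free")
```

This would help distinguish "the volume is genuinely full" from the scenario described above, where `df -h` says the 2 TB EBS volume is nearly empty yet the write still fails.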