There is a working bert.cpp implementation. We should try to implement this in llama.cpp and update the embedding example to use it.

The implementation should mostly follow what we did to integrate Falcon.
Here are the main steps:
- Update `gguf.py` with BERT arch KV pairs and tensors (a sketch follows this list)
- Python convert script using `gguf.py` to generate F16 model (see the second sketch below)
- add tokenizer implementation in `llama.cpp` (the WordPiece algorithm is sketched below)
- add function to build BERT graph
- add any new ops in `ggml` if needed
- add CUDA offloading
- add tokenizer tests