Description
Is your feature request related to a problem? Please describe.
Currently, llama-cpp-python's tokenizer behaves differently from HF's AutoTokenizer in two ways:
1. `model.tokenize` (llama-cpp-python) != `tokenizer.encode` (HF): llama-cpp's tokenizer seems to always generate a different token than HF's AutoTokenizer for the token immediately after every special token.
2. `model.detokenize` returns an empty string when trying to convert a special token back into text.
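A minimal sketch of how the two problems show up, assuming a local Functionary GGUF and the corresponding HF repo; the model path, repo id, and the `<|END_OF_ASSISTANT|>` token shown are placeholders, not confirmed values:

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

# Placeholder path / repo id -- substitute your local GGUF and the matching HF repo.
llm = Llama(model_path="./functionary-7b-v1.4.Q4_0.gguf")
hf_tok = AutoTokenizer.from_pretrained("meetkai/functionary-7b-v1.4")

# Problem 1: the token right after a special token differs between the two tokenizers.
prompt = "<|END_OF_ASSISTANT|>user:"  # hypothetical added special token
print(llm.tokenize(prompt.encode("utf-8")))  # llama.cpp tokenization
print(hf_tok.encode(prompt))                 # HF tokenization -- diverges after the special token

# Problem 2: detokenizing a special token yields an empty byte string.
special_id = hf_tok.convert_tokens_to_ids("<|END_OF_ASSISTANT|>")
print(llm.detokenize([special_id]))          # b"" instead of the special token's text
```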
I am currently working on implementing the chat handlers for Functionary-7b-v1.4 and all v2 models, as mentioned in this issue. However, all of these models use added special tokens both in the prompt template and as stop tokens. This results in suboptimal generation (due to problem 1) and an inability to stop generation (due to problem 2).
Describe the solution you'd like
Refactor the current Functionary chat_handler to use HF's AutoTokenizer instead of llama-cpp-python's tokenizer. We can then use the Jinja chat template shipped with the various Functionary models directly, which also avoids any tokenization discrepancies.
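A rough sketch of the proposed direction; the repo id and model path are placeholders, and it assumes the HF repo ships a `chat_template` in its tokenizer config:

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("meetkai/functionary-7b-v1.4")  # placeholder repo id
llm = Llama(model_path="./functionary-7b-v1.4.Q4_0.gguf")              # placeholder path

messages = [{"role": "user", "content": "What is the weather in Istanbul?"}]

# Build the prompt with the model's own Jinja chat template and HF tokenization,
# then hand the resulting token ids straight to llama.cpp for generation.
prompt_ids = hf_tok.apply_chat_template(messages, add_generation_prompt=True)

completion_ids = []
for token in llm.generate(prompt_ids, temp=0.0):
    # Stop on any special token the HF tokenizer knows about (sidesteps problem 2).
    if token in hf_tok.all_special_ids:
        break
    completion_ids.append(token)

print(hf_tok.decode(completion_ids))
```

This keeps tokenization and detokenization entirely on the HF side, so the special-token discrepancies described above never come into play.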
Describe alternatives you've considered
Fixing the tokenizer issue in llama.cpp directly, but I lack knowledge of the details of that codebase.
Additional context
NA