I propose refactoring `main.cpp` into a library (`llama.cpp`, compiled to `llama.so`/`llama.a`/whatever) and making `main.cpp` a simple driver program. A simple C API should be exposed to access the model, so that bindings can more easily be written for Python, Node.js, or whatever other language.
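To make that concrete, here is a minimal sketch of what such a C API header could look like. Every name in it (`llama_model`, `llama_load_model`, `llama_new_context`, and so on) is purely illustrative, not an existing interface:

```c
/* llama.h -- hypothetical sketch of the proposed C API; all names are
 * illustrative only. */
#ifndef LLAMA_H
#define LLAMA_H

#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

/* Opaque handles keep the ABI stable and hide the C++ internals. */
typedef struct llama_model   llama_model;
typedef struct llama_context llama_context;

/* Load model weights from a ggml file; returns NULL on failure. */
llama_model *llama_load_model(const char *path);
void         llama_free_model(llama_model *model);

/* Per-inference state (KV cache, sampling state, ...), created per caller. */
llama_context *llama_new_context(llama_model *model, int n_threads);
void           llama_free_context(llama_context *ctx);

/* Tokenize a prompt into the caller-provided buffer; returns the token count. */
int llama_tokenize(llama_context *ctx, const char *text,
                   int *tokens, size_t max_tokens);

/* Feed tokens through the model, then sample the next token id. */
int llama_eval(llama_context *ctx, const int *tokens, size_t n_tokens);
int llama_sample_next(llama_context *ctx);

/* Convert a token id back to its string piece. */
const char *llama_token_to_str(const llama_context *ctx, int token);

#ifdef __cplusplus
}
#endif

#endif /* LLAMA_H */
```

Keeping the handles opaque and the functions plain C would keep the ABI stable across releases and make FFI bindings from Python, Node.js, etc. straightforward.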
This would partially solve #82 and #162.
Edit: on that note, is it possible to run inference on two or more prompts from different threads? If so, multiple users could be served without keeping multiple copies of the model weights in RAM.
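If the weights really are read-only during inference, one way this could look is a single shared model handle with a separate context (KV cache, sampling state) per thread. The sketch below is speculative: it reuses the hypothetical names from the header above, and the model path is just the conventional example location:

```c
/* Hypothetical sketch: weights are loaded once into a shared llama_model,
 * and each thread gets its own llama_context, so no mutable state is shared. */
#include <pthread.h>
#include <stdio.h>

#include "llama.h"  /* the hypothetical header sketched above */

static llama_model *g_model;  /* read-only weights, shared across threads */

static void *serve_prompt(void *arg) {
    const char *prompt = arg;
    /* Each request owns its own context (KV cache, sampling state). */
    llama_context *ctx = llama_new_context(g_model, /*n_threads=*/1);

    int tokens[512];
    int n = llama_tokenize(ctx, prompt, tokens, 512);
    llama_eval(ctx, tokens, n);

    for (int i = 0; i < 16; i++) {  /* generate a few tokens */
        int tok = llama_sample_next(ctx);
        printf("%s", llama_token_to_str(ctx, tok));
        llama_eval(ctx, &tok, 1);
    }

    llama_free_context(ctx);
    return NULL;
}

int main(void) {
    g_model = llama_load_model("models/7B/ggml-model-q4_0.bin");

    pthread_t t1, t2;
    pthread_create(&t1, NULL, serve_prompt, "First user prompt");
    pthread_create(&t2, NULL, serve_prompt, "Second user prompt");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    llama_free_model(g_model);
    return 0;
}
```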