Description
The NUMA feature of llama.cpp does not appear to be supported here, which causes significant performance degradation on servers with multiple NUMA nodes.
Additional Context
NUMA support (quoted from the llama.cpp README):
--numa: Attempt optimizations that help on some systems with non-uniform memory access. This currently consists of pinning an equal proportion of the threads to the cores on each NUMA node, and disabling prefetch and readahead for mmap. The latter causes mapped pages to be faulted in on first access instead of all at once, and in combination with pinning threads to NUMA nodes, more of the pages end up on the NUMA node where they are used. Note that if the model is already in the system page cache, for example because of a previous run without this option, this will have little effect unless you drop the page cache first. This can be done by rebooting the system or on Linux by writing '3' to '/proc/sys/vm/drop_caches' as root.
https://github.com/ggerganov/llama.cpp/blob/master/examples/main/README.md
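For reference, a minimal sketch of how this is exercised with llama.cpp directly on Linux, based only on the README text quoted above. The model path and generation parameters are placeholders; adjust them for your setup.

```sh
# Drop the page cache first so mapped pages fault in on the NUMA node where
# they are used (needs root; rebooting has the same effect).
echo 3 | sudo tee /proc/sys/vm/drop_caches

# Run the main example with NUMA optimizations enabled.
# Placeholder model path and prompt; --numa is the flag described above.
./main -m ./models/model.gguf -p "Hello" -n 128 --numa
```

Exposing an equivalent option here would let downstream users get the same thread-pinning and mmap behavior without calling the llama.cpp binary directly.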
This feature is also needed by a downstream project:
oobabooga/text-generation-webui#3444