This combination is not in the list, so here it is. Most of the LLM packages seem to lack an easy way to run a llama.cpp server with the load split between CPU and GPU; Ollama, for example, appears to only load a model with the whole thing in the GPU. That simplification pushes users toward smaller, far less capable models. If the model is split between CPU and GPU, you can run a much larger quantized model in GGUF format that runs nearly as fast as the smaller, less capable model loaded into the GPU alone, and then you do not need to resort to cloud-hosted or proprietary models.
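(In llama.cpp terms the split is the n-gpu-layers setting: that many layers are offloaded to the GPU and the rest stay on the CPU, so you can raise it until the model just fits in VRAM. Oobabooga exposes the same setting in its llama.cpp model loader.)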
The Oobabooga front end also gives a nice interface for model loading and sampling settings.
gptel is at:
https://github.com/karthink/gptel or MELPA
Oobabooga is at:
https://github.com/oobabooga/text-generation-webui
With a model loaded and the --api flag set, the model will be available to gptel.
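For reference, that means launching the webui with something like python server.py --api --n-gpu-layers 20 (or setting the same options in the UI); as far as I can tell the OpenAI-compatible API then listens on localhost:5000 by default, which is what the config below points at. I am going from the webui's --help here, so check the flag names against your version.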
In packages.el:
(package! gptel)
In config.el:
(setq gptel-model 'test  ; placeholder name; the webui serves whichever model is loaded
      gptel-backend (gptel-make-openai "llama-cpp"
                      :stream t
                      :protocol "http"
                      :host "localhost:5000"
                      :models '(test)))
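To sanity-check the connection you can evaluate something like the sketch below once gptel is loaded; it uses gptel-request and its callback as described in gptel's README, with a throwaway prompt:

;; One-off request through the local backend; the reply (or the error status)
;; is echoed in the minibuffer.
(gptel-request "Say hello in one short sentence."
  :callback (lambda (response info)
              (if response
                  (message "llama-cpp replied: %s" response)
                (message "gptel request failed: %S" (plist-get info :status)))))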
With the load split like this it easily runs an 8×7B model. Most of you probably already know this or have other methods; I just thought I would mention it after getting it working. Share if you have a better way.