llama.cpp, llama-xpp and pralin
pralin has had experimental support for llama.cpp since version 2, but it was hard to maintain, as the llama API moves and breaks on a regular basis, and the feature has been disabled for the past few months. Issue 5215 suggests a future lamax
library with a common, and presumably stable, API. I experimented with a possible API for that library as llama-xpp.
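To make the idea concrete, here is a purely hypothetical sketch of what such a small, stable surface could look like; the names below are illustrative placeholders and are not taken from llama-xpp or the proposed library:

// Hypothetical interface sketch, not the actual llama-xpp API: the point is a
// minimal surface that can stay stable while the underlying llama.cpp API changes.
#include <string>

namespace stable_llm_sketch {
  class model {
  public:
    // Load a GGUF model once; backend and context details stay hidden.
    explicit model(const std::string &gguf_path);
    // Prompt in, completion out; tokenisation, sampling and KV-cache
    // management are kept internal to the library.
    std::string generate(const std::string &prompt);
  };
}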
For now, I rewrote the ggml/llama/inference algorithm to use the library. It can be used directly in code, or using pralin/compose:
In C++:

#include <iostream>
#include <string>

#include <pralin/algorithms_registry.h>

int main()
{
  // Initialise Pralin
  pralin::algorithms_registry::load_default_definitions();
  // Initialise the inference algorithm
  auto inference = pralin::algorithms_registry::create(
      "ggml/llama", "inference",
      { {"filename", "ggml-model-q4_0.gguf"},
        {"source", "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"} });
  // Process the prompt
  std::string answer;
  inference.process("Hello.", &answer);
  std::cout << answer << std::endl;
}
In Python:

import pralin
import pralin.values

# Initialise Pralin
pralin.AlgorithmsRegistry.load_default_definitions()
# Initialise the inference algorithm
inference = pralin.AlgorithmsRegistry.create("ggml/llama", "inference", {
    "filename": "ggml-model-q4_0.gguf",
    "source": "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"
})
# Process the prompt
answer = pralin.Output()
inference.process("Hello.", answer)
print(answer.value())
In Ruby:

require 'pralin/values'

# Initialise Pralin
Pralin::AlgorithmsRegistry.load_default_definitions()
# Initialise the inference algorithm
inference = Pralin::AlgorithmsRegistry.create("ggml/llama", "inference", {
  "filename" => "ggml-model-q4_0.gguf",
  "source" => "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"
})
# Process the prompt
answer = Pralin::Output.new
inference.process("Hello.", answer)
puts answer.value
With pralin/compose, the composition below has one input, prompt, and one output, llm[0]:

compose:
  inputs: [prompt]
  outputs: ["llm[0]"]
  process:
    - ggml/llama/inference:
        id: llm
        inputs:
          - prompt
        parameters:
          filename: ggml-model-q4_0.gguf
          source: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf
This can be run with the pralin compose command, or as part of the computation server.
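The exact command-line syntax is not shown here; purely as a hypothetical illustration, assuming the composition above is saved as llama.yaml and that the command takes the composition file as an argument (how the prompt input is supplied is not covered):

pralin compose llama.yaml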
This is coming in pralin/3.1, sometime in December.