pralin has had experimental support for llama.cpp since version 2, but it has been hard to maintain, as the llama.cpp API moves and breaks on a regular basis, and the feature has been disabled for the past few months. Issue 5215 suggests a future llamax library with a common, and presumably stable, API. I experimented with a possible API for that library in llama-xpp.

For now, I have rewritten the ggml/llama/inference algorithm to use that library. It can be used directly from code, or through pralin/compose:

  • #include <pralin/algorithms_registry.h>
    #include <iostream>
    #include <string>
    
    // Initialise Pralin
    pralin::algorithms_registry::load_default_definitions();
    
    // Initialise the inference algorithm
    auto inference = pralin::algorithms_registry::create("ggml/llama", "inference",
      {
        {"filename", "ggml-model-q4_0.gguf"},
        {"source", "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"}
      });
    
    // Process the prompt
    std::string answer;
    inference.process("Hello.", &answer);
    std::cout << answer << std::endl;
    
  • import pralin
    import pralin.values
    
    # Initialise Pralin
    pralin.AlgorithmsRegistry.load_default_definitions()
    
    # Initialise the inference algorithm
    inference = pralin.AlgorithmsRegistry.create("ggml/llama", "inference",
      {
        "filename": "ggml-model-q4_0.gguf",
        "source": "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"
      })
    
    # Process the prompt
    answer = pralin.Output()
    inference.process("Hello.", answer)
    print(answer.value())
    
  • require 'pralin/values'
    
    # Initialise Pralin
    Pralin::AlgorithmsRegistry.load_default_definitions()
    
    # Initialise the inference algorithm
    inference = Pralin::AlgorithmsRegistry.create("ggml/llama", "inference",
      {
        "filename" => "ggml-model-q4_0.gguf",
        "source" => "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"
      })
    
    # Process the prompt
    answer = Pralin::Output.new
    inference.process("Hello.", answer)
    puts answer.value
    
  • This composition has one input, prompt, and one output, llm[0] (the first output of the step with id llm):

    compose:
      inputs: [prompt]
      outputs: ["llm[0]"]
      process:
        - ggml/llama/inference:
            id: llm
            inputs:
              - prompt
            parameters:
              filename: ggml-model-q4_0.gguf
              source: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf
    

    This can be run with the pralin compose command or as part of the computation server.
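
    As a rough sketch, assuming the composition above is saved as llm.yaml (an assumed file name), the invocation could look like the following; the exact command-line syntax is an assumption, not a documented pralin option:

    # Hypothetical: file name and argument syntax are assumptions
    pralin compose llm.yaml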

This is coming in pralin/3.1 sometime in December.
