pralin has had experimental support for llama.cpp since version 2, but it has been hard to maintain, as the llama.cpp API moves and breaks on a regular basis, and the feature has been disabled for the past few months. Issue 5215 suggests a future llamax library with a common, and presumably stable, API. I experimented with a possible API for that library in llama-xpp.

For now, I have rewritten the ggml/llama/inference algorithm to use that library. It can be used directly from code, or through pralin/compose:

  • #include <pralin/algorithms_registry.h>
    #include <iostream>
    #include <string>
    
    // Initialise Pralin
    pralin::algorithms_registry::load_default_definitions();
    
    // Initialise the inference algorithm
    auto inference = pralin::algorithms_registry::create("ggml/llama", "inference",
      {
        {"filename", "ggml-model-q4_0.gguf"},
        {"source", "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"}
      });
    
    // Process the prompt
    std::string answer;
    inference.process("Hello.", &answer);
    std::cout << answer << std::endl;
    
  • import pralin
    import pralin.values
    
    # Initialise Pralin
    pralin.AlgorithmsRegistry.load_default_definitions()
    
    # Initialise the inference algorithm
    inference = pralin.AlgorithmsRegistry.create("ggml/llama", "inference",
      {
        "filename": "ggml-model-q4_0.gguf",
        "source": "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"
      })
    
    # Process the prompt
    answer = pralin.Output()
    inference.process("Hello.", answer)
    print(answer.value())
    
  • require 'pralin/values'
    
    # Initialise Pralin
    Pralin::AlgorithmsRegistry.load_default_definitions()
    
    # Initialise the inference algorithm
    inference = Pralin::AlgorithmsRegistry.create("ggml/llama", "inference",
      {
        "filename" => "ggml-model-q4_0.gguf",
        "source" => "https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf"
      })
    
    # Process the prompt
    answer = Pralin::Output.new
    inference.process("Hello.", answer)
    puts answer.value
    
  • This composition has one input, prompt, and one output, llm[0] (the first output of the step with id llm):

    compose:
      inputs: [prompt]
      outputs: ["llm[0]"]
      process:
        - ggml/llama/inference:
            id: llm
            inputs:
              - prompt
            parameters:
              filename: ggml-model-q4_0.gguf
              source: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v0.2-GGUF/resolve/main/ggml-model-q4_0.gguf
    

    This can be run with the pralin compose command or as part of the computation server.
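
    As a rough sketch, assuming the composition above is saved as llm.yaml (an assumed file name), the invocation could look like the following; the exact command-line syntax is an assumption, not a documented pralin option:

    # Hypothetical: file name and argument syntax are assumptions
    pralin compose llm.yaml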

This is coming in pralin/3.1 sometime in December.
