Install Llama Stack

This document describes how to install and deploy Llama Stack Server on Kubernetes using the Llama Stack Operator.

Upload Operator

Download the Llama Stack Operator installation file (e.g., llama-stack-operator.alpha.ALL.xxxx.tgz).

Use the violet command-line tool to publish the installation package to the platform repository:

violet push --platform-address=platform-access-address --platform-username=platform-admin --platform-password=platform-admin-password llama-stack-operator.alpha.ALL.xxxx.tgz

Install Operator

  1. Go to the Administrator view in the Alauda Container Platform.

  2. In the left navigation, select Marketplace / Operator Hub.

  3. In the right panel, find Alauda build of Llama Stack and click Install.

  4. Keep all parameters as default and complete the installation.
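Before proceeding, the installation can be verified from the command line. This is a sketch, assuming kubectl access; the CRD name is derived from the apiVersion and kind shown in the manifest later in this document, and the operator namespace placeholder must be replaced with the namespace chosen during installation.

```shell
# Confirm the operator registered the LlamaStackDistribution CRD.
kubectl get crd llamastackdistributions.llamastack.io

# List the operator's workloads (substitute the actual namespace).
kubectl get pods -n <operator-namespace>
```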

Deploy Llama Stack Server

After the operator is installed, deploy Llama Stack Server by creating a LlamaStackDistribution custom resource:

Note: Prepare the following in advance; otherwise the distribution may not become ready:

  • Inference URL: VLLM_URL must point at a vLLM OpenAI-compatible HTTP base URL (for example an in-cluster vLLM or KServe InferenceService) that serves the target model.
  • Secret (optional): VLLM_API_TOKEN is only needed when the vLLM endpoint requires authentication. If vLLM has no auth, do not set it. When required, create a Secret in the same namespace and reference it from containerSpec.env (see the commented example in the manifest below).
  • Storage Class: Ensure the default Storage Class exists in the cluster; otherwise the PVC cannot be bound and the resource will not become ready.
  • PostgreSQL storage: The starter distribution in this release uses PostgreSQL for Llama Stack persistence. Configure POSTGRES_* environment variables for the server pod before deploying.
  • PGVector (optional): To use vector_stores with provider_id="pgvector", provide PGVECTOR_* environment variables to the server pod. ACP-provided PostgreSQL can be used directly because it already includes the pgvector extension.
  • Milvus (optional): To use vector_stores with provider_id="milvus-remote", provide MILVUS_ENDPOINT and, when authentication is enabled, MILVUS_TOKEN. Set MILVUS_CONSISTENCY_LEVEL to a valid Milvus consistency level such as Strong.
  • Embedding model download: Llama Stack includes a default embedding model configuration for vector-store usage, but the model artifacts are downloaded from Hugging Face on first use. If a mirror or proxy is needed, configure HF_ENDPOINT. For fully offline environments, pre-download the model files into the server PVC before running the first vector-store request.
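The checklist above can be verified from the command line before applying the manifest below. This is a sketch, assuming kubectl access; the Secret name vllm-api-token matches the commented example in the manifest, and the token value is a placeholder.

```shell
# A default StorageClass must exist; look for the "(default)" marker.
kubectl get storageclass

# Only when the vLLM endpoint requires authentication: create the token
# Secret in the same namespace as the LlamaStackDistribution.
kubectl create secret generic vllm-api-token -n default --from-literal=token=<TOKEN>
```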

apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  annotations:
    cpaas.io/display-name: ""
  name: demo
  namespace: default
spec:
  network:
    exposeRoute: false                             # Whether to expose the route externally
  replicas: 1                                      # Number of server replicas
  server:
    containerSpec:
      name: llama-stack
      port: 8321
      env:
        - name: VLLM_URL
          value: "http://vllm-predictor.default.svc.cluster.local/v1"   # vLLM OpenAI-compatible base URL
        - name: VLLM_MAX_TOKENS
          value: "8192"                            # Maximum output tokens

        # Optional: VLLM_API_TOKEN — add only when the vLLM endpoint requires authentication.
        # If vLLM is deployed without auth, omit the entire block below (do not set VLLM_API_TOKEN).
        # Example after creating: kubectl create secret generic vllm-api-token -n default --from-literal=token=<TOKEN>
        # - name: VLLM_API_TOKEN
        #   valueFrom:
        #     secretKeyRef:
        #       key: token
        #       name: vllm-api-token

        # Required: PostgreSQL-backed Llama Stack persistence for this starter
        # distribution image.
        - name: POSTGRES_HOST
          value: "<postgresql-service>"
        - name: POSTGRES_PORT
          value: "5432"
        - name: POSTGRES_DB
          value: "<database-name>"
        - name: POSTGRES_USER
          value: "<database-username>"
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: <postgresql-credentials-secret>
              key: password

        # Optional: enable PGVector-backed vector stores.
        # Omit the entire block below if you do not need PGVector vector stores.
        # These settings configure the vector DB provider and are separate from
        # the POSTGRES_* persistence settings above, although they may point to
        # the same PostgreSQL instance when it has the pgvector extension.
        # ACP-provided PostgreSQL already includes the pgvector extension.
        # - name: ENABLE_PGVECTOR
        #   value: "true"
        # - name: PGVECTOR_HOST
        #   value: "<acp-postgresql-service>"
        # - name: PGVECTOR_PORT
        #   value: "5432"
        # - name: PGVECTOR_DB
        #   value: "<database-name>"
        # - name: PGVECTOR_USER
        #   value: "<database-username>"
        # - name: PGVECTOR_PASSWORD
        #   valueFrom:
        #     secretKeyRef:
        #       name: <pgvector-credentials-secret>
        #       key: password

        # Optional: enable remote Milvus-backed vector stores.
        # Use provider_id="milvus-remote" from the client API.
        # - name: MILVUS_ENDPOINT
        #   value: "http://<milvus-endpoint-host-and-port>"
        # - name: MILVUS_TOKEN
        #   valueFrom:
        #     secretKeyRef:
        #       name: <milvus-credentials-secret>
        #       key: token
        # - name: MILVUS_CONSISTENCY_LEVEL
        #   value: "Strong"

        # Required for PGVector or Milvus vector stores that use local
        # sentence-transformers embeddings.
        # - name: ENABLE_SENTENCE_TRANSFORMERS
        #   value: "true"
        #
        # Optional: configure a Hugging Face mirror or proxy for the default
        # embedding model download path.
        # - name: HF_ENDPOINT
        #   value: "<huggingface-mirror-or-proxy>"
        #
        # Optional: configure fully offline model loading. Pre-populate the
        # Hugging Face cache under /home/lls/.lls/huggingface/hub, then set:
        # - name: HF_HUB_CACHE
        #   value: "/home/lls/.lls/huggingface/hub"
        # - name: HF_HUB_OFFLINE
        #   value: "1"
        # - name: TRANSFORMERS_OFFLINE
        #   value: "1"
        # - name: HF_HUB_DISABLE_XET
        #   value: "1"

    distribution:
      name: starter                                # Distribution name (options: starter, postgres-demo, meta-reference-gpu)
    storage:
      mountPath: /home/lls/.lls
      size: 1Gi                                    # Requires the "default" Storage Class to be configured beforehand

After deployment, the Llama Stack Server is reachable within the cluster. The access URL appears in the resource's status.serviceURL field, for example:

status:
  phase: Ready
  serviceURL: http://demo-service.default.svc.cluster.local:8321
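A sketch of checking readiness from the command line, assuming the resource name demo in namespace default from the example manifest:

```shell
# Phase should report Ready once the PVC is bound and the server is up.
kubectl get llamastackdistribution demo -n default -o jsonpath='{.status.phase}'

# In-cluster base URL for client configuration.
kubectl get llamastackdistribution demo -n default -o jsonpath='{.status.serviceURL}'
```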

Configure PostgreSQL Storage

The starter distribution image used by this release requires PostgreSQL for Llama Stack persistence. Configure these server environment variables in the LlamaStackDistribution:

  • POSTGRES_HOST
  • POSTGRES_PORT
  • POSTGRES_DB
  • POSTGRES_USER
  • POSTGRES_PASSWORD

These settings are for Llama Stack server state. They are not the same as PGVECTOR_*, which only configures the optional PGVector vector-store provider. You may use the same PostgreSQL instance for both roles when it has the required database, credentials, and pgvector extension.
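When reusing one PostgreSQL instance for both roles, the pgvector requirement can be checked with a query. This is a sketch, assuming psql access with the same placeholders as the manifest; inside PostgreSQL the pgvector extension is registered under the name "vector".

```shell
# Returns one row ("vector") when the pgvector extension is installed
# in the target database.
psql -h <postgresql-service> -U <database-username> -d <database-name> \
  -c "SELECT extname FROM pg_extension WHERE extname = 'vector';"
```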

Tool calling with vLLM on KServe

The following applies to the vLLM predictor on KServe, not to the LlamaStackDistribution manifest. For agent flows that use tools (client-side tools or MCP), the vLLM process must expose tool-call support. Add predictor container args as required by upstream vLLM, for example:

args:
  - --enable-auto-tool-choice
  - --tool-call-parser
  - hermes

Choose --tool-call-parser (and any related flags) according to the served model and the vLLM documentation for that model family.
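For reference, a minimal sketch of the OpenAI-compatible tools payload that such a vLLM deployment parses; the function name, its parameters, and the model placeholder are illustrative, not part of this release.

```python
import json

# Illustrative client-side tool in the OpenAI-compatible "tools" format.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Chat-completions request body carrying the tool; with
# --enable-auto-tool-choice, "tool_choice": "auto" lets the model decide
# whether to emit a tool call.
request_body = {
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}

print(json.dumps(request_body, indent=2))
```

The served model must match one of the families supported by the chosen --tool-call-parser, otherwise tool calls are returned as plain text.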

Enable PGVector Vector Store

When ENABLE_PGVECTOR=true is set on the server, clients can create PGVector-backed vector stores by passing provider_id="pgvector" through the client API.

Recommended preparation:

  1. Prepare an ACP PostgreSQL instance and record its service name, database name, username, and password.
  2. Expose the database connection to the LlamaStackDistribution with PGVECTOR_HOST, PGVECTOR_PORT, PGVECTOR_DB, PGVECTOR_USER, and PGVECTOR_PASSWORD.
  3. Set ENABLE_SENTENCE_TRANSFORMERS=true and make sure the default embedding model files can be fetched on first use.
  4. If the cluster uses a Hugging Face mirror or proxy, set HF_ENDPOINT accordingly.
  5. If the cluster is fully offline, pre-download the embedding model files into the server PVC and enable offline cache-related environment variables.

After the distribution is ready, you can validate the setup with the PGVector section in the Quickstart notebook.

Enable Milvus Vector Store

When MILVUS_ENDPOINT is set on the server, clients can create Milvus-backed vector stores by passing provider_id="milvus-remote" through the client API.

Recommended preparation:

  1. Prepare a Milvus endpoint reachable from the Llama Stack Server pod. MILVUS_ENDPOINT must include the scheme, either http:// or https://, and the port required by your Milvus service.
  2. Expose the Milvus connection to the LlamaStackDistribution with MILVUS_ENDPOINT.
  3. If Milvus authentication is enabled, set MILVUS_TOKEN from a Secret.
  4. Set MILVUS_CONSISTENCY_LEVEL to a string value such as Strong; the Milvus provider requires this field.
  5. Set ENABLE_SENTENCE_TRANSFORMERS=true and make sure the embedding model files can be fetched or are already present in the server PVC.

After the distribution is ready, validate the setup with the Milvus section in the Quickstart notebook. The client creates the vector store with provider_id="milvus-remote" and passes the selected embedding model id and embedding dimension in extra_body.
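The create call described above can be sketched as a request payload. The embedding model id and dimension below are illustrative values, and the exact client method for sending it depends on the installed client version.

```python
import json

# Payload for creating a remote-Milvus vector store through the
# OpenAI-compatible vector-stores surface. Replace the embedding model
# placeholder with an id served by your stack; 384 is an illustrative
# dimension that must match the chosen model.
create_request = {
    "name": "docs",
    "extra_body": {
        "provider_id": "milvus-remote",
        "embedding_model": "<embedding-model-id>",
        "embedding_dimension": 384,
    },
}

print(json.dumps(create_request, indent=2))
```

The same shape applies to PGVector-backed stores with provider_id="pgvector".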

Hugging Face Access For Embedding Models

Llama Stack uses a default embedding model for vector-store operations. On first use, the server downloads the model files from Hugging Face into its local cache.

Recommended cache path:

  • /home/lls/.lls/huggingface/hub

Common deployment modes:

  1. Mirror or proxy access:

    - name: HF_ENDPOINT
      value: "<huggingface-mirror-or-proxy>"
    - name: HF_HUB_CACHE
      value: "/home/lls/.lls/huggingface/hub"
  2. Fully offline access:

    Pre-download the required model files into the PVC-backed cache directory /home/lls/.lls/huggingface/hub, then set:

    - name: HF_HUB_CACHE
      value: "/home/lls/.lls/huggingface/hub"
    - name: HF_HUB_OFFLINE
      value: "1"
    - name: TRANSFORMERS_OFFLINE
      value: "1"
    - name: HF_HUB_DISABLE_XET
      value: "1"

If the cache path is pre-populated correctly, the server can create PGVector-backed or Milvus-backed vector stores without downloading model artifacts at runtime.
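Pre-populating the cache can be sketched with the Hugging Face CLI. The model id is a placeholder — substitute the embedding model your distribution actually uses — and the command assumes network access to Hugging Face (or a mirror via HF_ENDPOINT) from wherever it runs.

```shell
# Download the embedding model into the PVC-backed cache directory.
huggingface-cli download <embedding-model-id> --cache-dir /home/lls/.lls/huggingface/hub
```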