Ik llama swap in container step by step guide (#1249)

* Create README.md

* Add container files and llama-swap configs

* Update main README.md

* Build without GGML_IQK_FA_ALL_QUANTS

Otherwise fails with CUDA_DOCKER_ARCH=default

* Mention GGML_IQK_FA_ALL_QUANTS usage

* First step more explicit
Author: mcm007
Date: 2026-02-07 18:30:19 +02:00
Committed by: GitHub
Parent: 82c4f27332
Commit: dbcbfdb0ef
6 changed files with 387 additions and 0 deletions

README.md

@@ -8,6 +8,8 @@ This repository is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp)
## Latest News
### [Step by step guide](./docker/README.md) for ik_llama.cpp in podman/docker container including llama-swap
### Model Support
LlaMA-3-Nemotron [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377), Qwen3 [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355), GLM-4 [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344), Command-A [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341), bitnet-b1.58-2B-4T [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337), LLaMA-4 [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321), Gemma3 [PR 276](https://github.com/ikawrakow/ik_llama.cpp/pull/276), DeepSeek-V3 [PR 176](https://github.com/ikawrakow/ik_llama.cpp/pull/176), Kimi-2 [PR 609](https://github.com/ikawrakow/ik_llama.cpp/pull/609), dots.llm1 [PR 573](https://github.com/ikawrakow/ik_llama.cpp/pull/573), Hunyuan [PR 565](https://github.com/ikawrakow/ik_llama.cpp/pull/565)

138
docker/README.md Normal file

@@ -0,0 +1,138 @@
# Build and use ik_llama.cpp with CPU or CPU+CUDA
Built on top of [ikawrakow/ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) and [llama-swap](https://github.com/mostlygeek/llama-swap)
All commands are provided for Podman and Docker.
The CPU or CUDA subsections under [Build](#Build) and [Run](#Run) are enough to get up and running.
## Overview
- [Build](#Build)
- [Run](#Run)
- [Troubleshooting](#Troubleshooting)
- [Extra Features](#Extra)
- [Credits](#Credits)
# Build
Builds two image tags:
- `swap`: Includes only `llama-swap` and `llama-server`.
- `full`: Includes `llama-server`, `llama-quantize`, and other utilities.
To start, download the four files below into a new directory (e.g. `~/ik_llama/`), then follow the steps in order; one way to fetch them is sketched after the file tree.
```
└── ik_llama
├── ik_llama-cpu.Containerfile
├── ik_llama-cpu-swap.config.yaml
├── ik_llama-cuda.Containerfile
└── ik_llama-cuda-swap.config.yaml
```
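A minimal fetch sketch, assuming the files live under `docker/` on the repository's default branch (adjust the path or branch if the layout differs):
```
mkdir -p ~/ik_llama && cd ~/ik_llama
# fetch the four files (repository path and branch are assumptions)
for f in ik_llama-cpu.Containerfile ik_llama-cpu-swap.config.yaml \
         ik_llama-cuda.Containerfile ik_llama-cuda-swap.config.yaml; do
  curl -LO "https://raw.githubusercontent.com/ikawrakow/ik_llama.cpp/main/docker/$f"
done
```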
## CPU
```
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full && podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap
```
```
docker image build --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full . && docker image build --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap .
```
## CUDA
```
podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full && podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap
```
```
docker image build --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap .
```
# Run
- Download `.gguf` model files to your favorite directory (e.g. `/my_local_files/gguf`).
- Map it to `/models` inside the container.
- Open `http://localhost:9292` in your browser and enjoy the features.
- API endpoints are available at `http://localhost:9292/v1` for use in other applications; see the example request below.
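A chat-completion sketch through `llama-swap`; the `model` field must match one of the model names defined in the llama-swap config (here the sample `smollm2` entry):
```
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "smollm2 (will be downloaded automatically from huggingface.co)",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```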
## CPU
```
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
```
```
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro ik_llama-cpu:swap
```
## CUDA
- Install the NVIDIA drivers and CUDA on the host.
- For Docker, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
- For Podman, set up the [Container Device Interface (CDI)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html).
```
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
```
```
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia ik_llama-cuda:swap
```
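If the GPU is not picked up, a quick sanity check that the container runtime can see it (a sketch; the CUDA image tag is only an example):
```
# Podman with CDI (regenerate the CDI spec if it is missing)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi
# Docker with the NVIDIA runtime
docker run --rm --runtime nvidia docker.io/nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi
```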
# Troubleshooting
- If CUDA is not available, use `ik_llama-cpu` instead.
- If models are not found, ensure you mount the correct directory: `-v /my_local_files/gguf:/models:ro` (you can verify the mount as shown below).
- If you need to install `podman` or `docker`, follow the [Podman Installation](https://podman.io/docs/installation) or [Install Docker Engine](https://docs.docker.com/engine/install) guide for your OS.
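A quick way to verify the mount and the model files from inside a running container (a sketch, using the container name from the run commands above):
```
podman exec -it ik_llama ls -lh /models
docker exec -it ik_llama ls -lh /models
```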
# Extra
- `CUSTOM_COMMIT` can be used to build a specific `ik_llama.cpp` commit (e.g. `1ec12b8`).
```
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:full && podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:swap
```
```
docker image build --file ik_llama-cuda.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:swap .
```
- Using the tools in the `full` image:
```
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
```
```
docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
```
- Customize the `llama-swap` config: save `ik_llama-cpu-swap.config.yaml` or `ik_llama-cuda-swap.config.yaml` locally (e.g. under `/my_local_files/`), then map it to `/app/config.yaml` inside the container by appending `-v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro` to your `podman run ...` or `docker run ...` command; see the example below.
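Running the CPU image with a locally customized config (paths are illustrative):
```
podman run -it --name ik_llama --rm -p 9292:8080 \
  -v /my_local_files/gguf:/models:ro \
  -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro \
  localhost/ik_llama-cpu:swap
```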
- To run the container in the background, replace `-it` with `-d`: `podman run -d ...` or `docker run -d ...`. To stop it: `podman stop ik_llama` or `docker stop ik_llama`. A short sketch of this workflow follows.
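A minimal sketch of the detached workflow, reusing the CPU run command from above:
```
podman run -d --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
podman logs -f ik_llama   # follow the server logs, Ctrl+C to detach
podman stop ik_llama      # stops and, because of --rm, removes the container
```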
- If you build the image on the same machine where it will be used, change `-DGGML_NATIVE=OFF` to `-DGGML_NATIVE=ON` in the `.Containerfile`; a one-liner for this edit is sketched below.
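One way to make that change from the shell (a sketch; assumes the Containerfiles are in the current directory):
```
sed -i 's/-DGGML_NATIVE=OFF/-DGGML_NATIVE=ON/' ik_llama-cpu.Containerfile ik_llama-cuda.Containerfile
```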
- For a smaller CUDA build, look up your GPU's [CUDA GPU Compute Capability](https://developer.nvidia.com/cuda/gpus) (e.g. `8.6` for the RTX 30 series), then change `CUDA_DOCKER_ARCH` in `ik_llama-cuda.Containerfile` from `default` to your GPU architecture (e.g. `CUDA_DOCKER_ARCH=86`), or pass it as a build argument as sketched below.
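Since `CUDA_DOCKER_ARCH` is a build `ARG`, it can also be overridden without editing the file (a sketch for a compute-capability `8.6` GPU):
```
podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target swap \
  --build-arg CUDA_DOCKER_ARCH=86 --tag ik_llama-cuda-86:swap
```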
- If you build only for your GPU architecture and want to make use of more KV quantization types, build with `-DGGML_IQK_FA_ALL_QUANTS=ON`.
- Get the best quants from [ubergarm](https://huggingface.co/ubergarm/models) if available (perplexity measurements are kindly provided on each model card).
- Useful graphs and numbers in @magikRUKKOLA's [Perplexity vs Size Graphs for the recent quants (GLM-4.7, Kimi-K2-Thinking, Deepseek-V3.1-Terminus, Deepseek-R1, Qwen3-Coder, Kimi-K2, Chimera etc.)](https://github.com/ikawrakow/ik_llama.cpp/discussions/715) discussion.
- Build custom quants with [Thireus](https://github.com/Thireus/GGUF-Tool-Suite)'s tools.
- If you cannot build locally, download prebuilt binaries from [ik_llama.cpp's Thireus fork with release builds for macOS/Windows/Ubuntu CPU and Windows CUDA](https://github.com/Thireus/ik_llama.cpp).
- For a KoboldCPP-style experience, try [Croco.Cpp](https://github.com/Nexesenex/croco.cpp), a fork of KoboldCPP that infers GGML/GGUF models on CPU/CUDA with KoboldAI's UI; it is powered in part by ik_llama.cpp and is compatible with most of ikawrakow's quants except Bitnet.
# Credits
All credit goes to the awesome community:
[ikawrakow](https://github.com/ikawrakow/ik_llama.cpp)
[llama-swap](https://github.com/mostlygeek/llama-swap)

44
docker/ik_llama-cpu-swap.config.yaml Normal file

@@ -0,0 +1,44 @@
healthCheckTimeout: 1800
logRequests: true
metricsMaxInMemory: 1000

models:
  "qwen3 (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-0.6B-Q6_K.gguf
      --alias qwen3
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on

  "qwen3-vl (you need to download .gguf and mmproj first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-VL-4B-Instruct-IQ4_NL.gguf
      --mmproj /models/Qwen_Qwen3-VL-4B-Instruct-mmproj-f16.gguf
      --alias qwen3-vl
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on

  "smollm2 (will be downloaded automatically from huggingface.co)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --hf-repo mradermacher/SmolLM2-135M-i1-GGUF --hf-file SmolLM2-135M.i1-IQ4_NL.gguf
      --alias smollm2
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on

73
docker/ik_llama-cpu.Containerfile Normal file

@@ -0,0 +1,73 @@
ARG UBUNTU_VERSION=22.04
# Stage 1: Build
FROM docker.io/ubuntu:$UBUNTU_VERSION AS build
ENV LLAMA_CURL=1
ENV LC_ALL=C.utf8
ARG CUSTOM_COMMIT
RUN apt-get update && apt-get install -yq build-essential git libcurl4-openssl-dev curl libgomp1 cmake
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git /app
WORKDIR /app
RUN if [ -n "$CUSTOM_COMMIT" ]; then git switch --detach "$CUSTOM_COMMIT"; fi
RUN cmake -B build -DGGML_NATIVE=OFF -DLLAMA_CURL=ON -DGGML_IQK_FA_ALL_QUANTS=ON && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/build/src && \
find build -name "*.so" -exec cp {} /app/build/src \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
# Stage 2: Base
FROM docker.io/ubuntu:$UBUNTU_VERSION AS base
RUN apt-get update && apt-get install -yq libgomp1 curl \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
# Stage 3: Full
FROM base AS full
COPY --from=build /app/full /app
RUN mkdir -p /app/build/src
COPY --from=build /app/build/src /app/build/src
WORKDIR /app
RUN apt-get update && apt-get install -yq \
git \
python3 \
python3-pip \
&& pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/full/tools.sh"]
# Stage 4: Server
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app/llama-server
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
# Stage 5: Swap
FROM server AS swap
ARG LS_REPO=mostlygeek/llama-swap
ARG LS_VER=189
RUN curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" \
&& tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" \
&& rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
COPY ./ik_llama-cpu-swap.config.yaml /app/config.yaml
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080"]
ENTRYPOINT [ "/app/llama-swap", "-config", "/app/config.yaml" ]

54
docker/ik_llama-cuda-swap.config.yaml Normal file

@@ -0,0 +1,54 @@
healthCheckTimeout: 1800
logRequests: true
metricsMaxInMemory: 1000

models:
  "qwen3 (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-0.6B-Q6_K.gguf
      --alias qwen3
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999 --threads-batch 1
      -ctk q8_0 -ctv q8_0

  "oss-moe (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/kldzj_gpt-oss-120b-heretic-MXFP4_MOE-00001-of-00002.gguf
      --alias gpt-oss
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999
      --n-cpu-moe 30
      -ctk q8_0 -ctv q8_0
      --grouped-expert-routing
      --reasoning-format auto --chat-template-kwargs '{"reasoning_effort": "medium"}'

  "smollm2 (will be downloaded automatically from huggingface.co)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --hf-repo mradermacher/SmolLM2-135M-i1-GGUF --hf-file SmolLM2-135M.i1-IQ4_NL.gguf
      --alias smollm2
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999 --threads-batch 1

76
docker/ik_llama-cuda.Containerfile Normal file

@@ -0,0 +1,76 @@
ARG UBUNTU_VERSION=24.04
ARG CUDA_VERSION=12.6.2
ARG BASE_CUDA_DEV_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
ARG BASE_CUDA_RUN_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
# Stage 1: Build
FROM ${BASE_CUDA_DEV_CONTAINER} AS build
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default
RUN apt-get update && apt-get install -yq build-essential git libcurl4-openssl-dev curl libgomp1 cmake
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git /app
WORKDIR /app
RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
fi && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/build/src && \
find build -name "*.so" -exec cp {} /app/build/src \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
# Stage 2: base
FROM ${BASE_CUDA_RUN_CONTAINER} AS base
RUN apt-get update && apt-get install -yq libgomp1 curl \
&& update-ca-certificates \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
# Stage 3: full
FROM base AS full
COPY --from=build /app/full /app
RUN mkdir -p /app/build/src
COPY --from=build /app/build/src /app/build/src
WORKDIR /app
RUN apt-get update && apt-get install -yq \
git \
python3 \
python3-pip \
&& pip3 install --break-system-packages -r requirements.txt \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
# Stage 4: Server
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app/llama-server
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
# Stage 5: Swap
FROM server AS swap
ARG LS_REPO=mostlygeek/llama-swap
ARG LS_VER=189
RUN curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" \
&& tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" \
&& rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
COPY ./ik_llama-cuda-swap.config.yaml /app/config.yaml
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080"]
ENTRYPOINT [ "/app/llama-swap", "-config", "/app/config.yaml" ]