Ik llama swap in container step by step guide (#1249)

* Create README.md

* Add container files and llama-swap configs

* Update main README.md

* Build without GGML_IQK_FA_ALL_QUANTS

Otherwise fails with CUDA_DOCKER_ARCH=default

* Mention GGML_IQK_FA_ALL_QUANTS usage

* First step more explicit
Author: mcm007
Date: 2026-02-07 18:30:19 +02:00
Committed by: GitHub
Parent: 82c4f27332
Commit: dbcbfdb0ef
6 changed files with 387 additions and 0 deletions

README.md

@@ -8,6 +8,8 @@ This repository is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp)
## Latest News
### [Step by step guide](./docker/README.md) for ik_llama.cpp in podman/docker container including llama-swap
### Model Support
LlaMA-3-Nemotron [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377), Qwen3 [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355), GLM-4 [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344), Command-A [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341), bitnet-b1.58-2B-4T [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337), LLaMA-4 [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321), Gemma3 [PR 276](https://github.com/ikawrakow/ik_llama.cpp/pull/276), DeepSeek-V3 [PR 176](https://github.com/ikawrakow/ik_llama.cpp/pull/176), Kimi-2 [PR 609](https://github.com/ikawrakow/ik_llama.cpp/pull/609), dots.llm1 [PR 573](https://github.com/ikawrakow/ik_llama.cpp/pull/573), Hunyuan [PR 565](https://github.com/ikawrakow/ik_llama.cpp/pull/565)

138
docker/README.md Normal file

@@ -0,0 +1,138 @@
# Build and use ik_llama.cpp with CPU or CPU+CUDA
Built on top of [ikawrakow/ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) and [llama-swap](https://github.com/mostlygeek/llama-swap)
All commands are provided for Podman and Docker.
The CPU or CUDA subsections under [Build](#Build) and [Run](#Run) are enough to get up and running.
## Overview
- [Build](#Build)
- [Run](#Run)
- [Troubleshooting](#Troubleshooting)
- [Extra Features](#Extra)
- [Credits](#Credits)
# Build
Builds two image tags:
- `swap`: Includes only `llama-swap` and `llama-server`.
- `full`: Includes `llama-server`, `llama-quantize`, and other utilities.
To start, download the four files below into a new directory (e.g. `~/ik_llama/`), then follow the steps in order; one way to fetch them is sketched after the file tree.
```
└── ik_llama
├── ik_llama-cpu.Containerfile
├── ik_llama-cpu-swap.config.yaml
├── ik_llama-cuda.Containerfile
└── ik_llama-cuda-swap.config.yaml
```
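A minimal fetch sketch, assuming the files live under `docker/` on the repository's default branch (adjust the path or branch if the layout differs):
```
mkdir -p ~/ik_llama && cd ~/ik_llama
# fetch the four files (repository path and branch are assumptions)
for f in ik_llama-cpu.Containerfile ik_llama-cpu-swap.config.yaml \
         ik_llama-cuda.Containerfile ik_llama-cuda-swap.config.yaml; do
  curl -LO "https://raw.githubusercontent.com/ikawrakow/ik_llama.cpp/main/docker/$f"
done
```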
## CPU
```
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full && podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap
```
```
docker image build --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full . && docker image build --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap .
```
## CUDA
```
podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full && podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap
```
```
docker image build --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap .
```
# Run
- Download `.gguf` model files to your favorite directory (e.g. `/my_local_files/gguf`).
- Map it to `/models` inside the container.
- Open `http://localhost:9292` in your browser and enjoy the features.
- API endpoints are available at `http://localhost:9292/v1` for use in other applications; see the example request below.
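A chat-completion sketch through `llama-swap`; the `model` field must match one of the model names defined in the llama-swap config (here the sample `smollm2` entry):
```
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "smollm2 (will be downloaded automatically from huggingface.co)",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```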
## CPU
```
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
```
```
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro ik_llama-cpu:swap
```
## CUDA
- Install the NVIDIA drivers and CUDA on the host.
- For Docker, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
- For Podman, set up the [Container Device Interface (CDI)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html).
```
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
```
```
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia ik_llama-cuda:swap
```
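If the GPU is not picked up, a quick sanity check that the container runtime can see it (a sketch; the CUDA image tag is only an example):
```
# Podman with CDI (regenerate the CDI spec if it is missing)
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi
# Docker with the NVIDIA runtime
docker run --rm --runtime nvidia docker.io/nvidia/cuda:12.6.2-base-ubuntu24.04 nvidia-smi
```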
# Troubleshooting
- If CUDA is not available, use `ik_llama-cpu` instead.
- If models are not found, ensure you mount the correct directory: `-v /my_local_files/gguf:/models:ro` (you can verify the mount as shown below).
- If you need to install `podman` or `docker`, follow the [Podman Installation](https://podman.io/docs/installation) or [Install Docker Engine](https://docs.docker.com/engine/install) guide for your OS.
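A quick way to verify the mount and the model files from inside a running container (a sketch, using the container name from the run commands above):
```
podman exec -it ik_llama ls -lh /models
docker exec -it ik_llama ls -lh /models
```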
# Extra
- `CUSTOM_COMMIT` can be used to build a specific `ik_llama.cpp` commit (e.g. `1ec12b8`).
```
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:full && podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:swap
```
```
docker image build --file ik_llama-cuda.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:swap .
```
- Using the tools in the `full` image:
```
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
```
```
docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
```
- Customize the `llama-swap` config: save `ik_llama-cpu-swap.config.yaml` or `ik_llama-cuda-swap.config.yaml` locally (e.g. under `/my_local_files/`), then map it to `/app/config.yaml` inside the container by appending `-v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro` to your `podman run ...` or `docker run ...` command; see the example below.
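Running the CPU image with a locally customized config (paths are illustrative):
```
podman run -it --name ik_llama --rm -p 9292:8080 \
  -v /my_local_files/gguf:/models:ro \
  -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro \
  localhost/ik_llama-cpu:swap
```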
- To run the container in the background, replace `-it` with `-d`: `podman run -d ...` or `docker run -d ...`. To stop it: `podman stop ik_llama` or `docker stop ik_llama`. A short sketch of this workflow follows.
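A minimal sketch of the detached workflow, reusing the CPU run command from above:
```
podman run -d --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
podman logs -f ik_llama   # follow the server logs, Ctrl+C to detach
podman stop ik_llama      # stops and, because of --rm, removes the container
```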
- If you build the image on the same machine where it will be used, change `-DGGML_NATIVE=OFF` to `-DGGML_NATIVE=ON` in the `.Containerfile`; a one-liner for this edit is sketched below.
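One way to make that change from the shell (a sketch; assumes the Containerfiles are in the current directory):
```
sed -i 's/-DGGML_NATIVE=OFF/-DGGML_NATIVE=ON/' ik_llama-cpu.Containerfile ik_llama-cuda.Containerfile
```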
- For a smaller CUDA build, look up your GPU's [CUDA GPU Compute Capability](https://developer.nvidia.com/cuda/gpus) (e.g. `8.6` for the RTX 30 series), then change `CUDA_DOCKER_ARCH` in `ik_llama-cuda.Containerfile` from `default` to your GPU architecture (e.g. `CUDA_DOCKER_ARCH=86`), or pass it as a build argument as sketched below.
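Since `CUDA_DOCKER_ARCH` is a build `ARG`, it can also be overridden without editing the file (a sketch for a compute-capability `8.6` GPU):
```
podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target swap \
  --build-arg CUDA_DOCKER_ARCH=86 --tag ik_llama-cuda-86:swap
```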
- If you build only for your GPU architecture and want to make use of more KV quantization types, build with `-DGGML_IQK_FA_ALL_QUANTS=ON`.
- Get the best quants from [ubergarm](https://huggingface.co/ubergarm/models) if available (perplexity measurements are kindly provided on each model card).
- Useful graphs and numbers in @magikRUKKOLA's [Perplexity vs Size Graphs for the recent quants (GLM-4.7, Kimi-K2-Thinking, Deepseek-V3.1-Terminus, Deepseek-R1, Qwen3-Coder, Kimi-K2, Chimera etc.)](https://github.com/ikawrakow/ik_llama.cpp/discussions/715) discussion.
- Build custom quants with [Thireus](https://github.com/Thireus/GGUF-Tool-Suite)'s tools.
- If you cannot build locally, download prebuilt binaries from [ik_llama.cpp's Thireus fork with release builds for macOS/Windows/Ubuntu CPU and Windows CUDA](https://github.com/Thireus/ik_llama.cpp).
- For a KoboldCPP-style experience, try [Croco.Cpp](https://github.com/Nexesenex/croco.cpp), a fork of KoboldCPP that infers GGML/GGUF models on CPU/CUDA with KoboldAI's UI; it is powered in part by ik_llama.cpp and is compatible with most of ikawrakow's quants except Bitnet.
# Credits
All credit goes to the awesome community:
[ikawrakow](https://github.com/ikawrakow/ik_llama.cpp)
[llama-swap](https://github.com/mostlygeek/llama-swap)

44
docker/ik_llama-cpu-swap.config.yaml Normal file

@@ -0,0 +1,44 @@
healthCheckTimeout: 1800
logRequests: true
metricsMaxInMemory: 1000

models:
  "qwen3 (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-0.6B-Q6_K.gguf
      --alias qwen3
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on

  "qwen3-vl (you need to download .gguf and mmproj first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-VL-4B-Instruct-IQ4_NL.gguf
      --mmproj /models/Qwen_Qwen3-VL-4B-Instruct-mmproj-f16.gguf
      --alias qwen3-vl
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on

  "smollm2 (will be downloaded automatically from huggingface.co)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --hf-repo mradermacher/SmolLM2-135M-i1-GGUF --hf-file SmolLM2-135M.i1-IQ4_NL.gguf
      --alias smollm2
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on

73
docker/ik_llama-cpu.Containerfile Normal file

@@ -0,0 +1,73 @@
ARG UBUNTU_VERSION=22.04
# Stage 1: Build
FROM docker.io/ubuntu:$UBUNTU_VERSION AS build
ENV LLAMA_CURL=1
ENV LC_ALL=C.utf8
ARG CUSTOM_COMMIT
RUN apt-get update && apt-get install -yq build-essential git libcurl4-openssl-dev curl libgomp1 cmake
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git /app
WORKDIR /app
RUN if [ -n "$CUSTOM_COMMIT" ]; then git switch --detach "$CUSTOM_COMMIT"; fi
RUN cmake -B build -DGGML_NATIVE=OFF -DLLAMA_CURL=ON -DGGML_IQK_FA_ALL_QUANTS=ON && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/build/src && \
find build -name "*.so" -exec cp {} /app/build/src \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
# Stage 2: Base
FROM docker.io/ubuntu:$UBUNTU_VERSION AS base
RUN apt-get update && apt-get install -yq libgomp1 curl \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
# Stage 3: Full
FROM base AS full
COPY --from=build /app/full /app
RUN mkdir -p /app/build/src
COPY --from=build /app/build/src /app/build/src
WORKDIR /app
RUN apt-get update && apt-get install -yq \
git \
python3 \
python3-pip \
&& pip install --upgrade pip setuptools wheel \
&& pip install -r requirements.txt \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/full/tools.sh"]
# Stage 4: Server
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app/llama-server
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
# Stage 5: Swap
FROM server AS swap
ARG LS_REPO=mostlygeek/llama-swap
ARG LS_VER=189
RUN curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" \
&& tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" \
&& rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
COPY ./ik_llama-cpu-swap.config.yaml /app/config.yaml
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080"]
ENTRYPOINT [ "/app/llama-swap", "-config", "/app/config.yaml" ]

54
docker/ik_llama-cuda-swap.config.yaml Normal file

@@ -0,0 +1,54 @@
healthCheckTimeout: 1800
logRequests: true
metricsMaxInMemory: 1000

models:
  "qwen3 (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-0.6B-Q6_K.gguf
      --alias qwen3
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999 --threads-batch 1
      -ctk q8_0 -ctv q8_0

  "oss-moe (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/kldzj_gpt-oss-120b-heretic-MXFP4_MOE-00001-of-00002.gguf
      --alias gpt-oss
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999
      --n-cpu-moe 30
      -ctk q8_0 -ctv q8_0
      --grouped-expert-routing
      --reasoning-format auto --chat-template-kwargs '{"reasoning_effort": "medium"}'

  "smollm2 (will be downloaded automatically from huggingface.co)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --hf-repo mradermacher/SmolLM2-135M-i1-GGUF --hf-file SmolLM2-135M.i1-IQ4_NL.gguf
      --alias smollm2
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999 --threads-batch 1

76
docker/ik_llama-cuda.Containerfile Normal file

@@ -0,0 +1,76 @@
ARG UBUNTU_VERSION=24.04
ARG CUDA_VERSION=12.6.2
ARG BASE_CUDA_DEV_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
ARG BASE_CUDA_RUN_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
# Stage 1: Build
FROM ${BASE_CUDA_DEV_CONTAINER} AS build
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default
RUN apt-get update && apt-get install -yq build-essential git libcurl4-openssl-dev curl libgomp1 cmake
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git /app
WORKDIR /app
RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
fi && \
cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/build/src && \
find build -name "*.so" -exec cp {} /app/build/src \;
RUN mkdir -p /app/full \
&& cp build/bin/* /app/full \
&& cp *.py /app/full \
&& cp -r gguf-py /app/full \
&& cp -r requirements /app/full \
&& cp requirements.txt /app/full \
&& cp .devops/tools.sh /app/full/tools.sh
# Stage 2: base
FROM ${BASE_CUDA_RUN_CONTAINER} AS base
RUN apt-get update && apt-get install -yq libgomp1 curl \
&& update-ca-certificates \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
# Stage 3: full
FROM base AS full
COPY --from=build /app/full /app
RUN mkdir -p /app/build/src
COPY --from=build /app/build/src /app/build/src
WORKDIR /app
RUN apt-get update && apt-get install -yq \
git \
python3 \
python3-pip \
&& pip3 install --break-system-packages -r requirements.txt \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /tmp/* /var/tmp/* \
&& find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
&& find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]
# Stage 4: Server
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app/llama-server
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]
# Stage 5: Swap
FROM server AS swap
ARG LS_REPO=mostlygeek/llama-swap
ARG LS_VER=189
RUN curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" \
&& tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" \
&& rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
COPY ./ik_llama-cuda-swap.config.yaml /app/config.yaml
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080"]
ENTRYPOINT [ "/app/llama-swap", "-config", "/app/config.yaml" ]