ik_llama.cpp/docker/README.md
Yadir Hernandez Batista db31e7d803 Added workflow to build container images (#1279)
* ci: implement build matrix for CUDA/CPU containers with dynamic tagging

* fix: Updated Docker images/build-container.yml

* fix: Updated the documentation about Docker

* fix: Set Arch for 3090s

* fix: Updated build step name.

* fix: Set target ARCH as a variable

* feat: Added cleanup step

* feat: Added docker-bake and updated workflow

* fix: Issue with REPO_OWNER variable

* fix: Updated workflow to solve errors

* fix: Updated branch format

* fix: Wrong naming

* Update docker-bake.hcl

* Update build-container.yml

* Update ik_llama-cuda.Containerfile

* Update ik_llama-cpu.Containerfile

* Update docker-bake.hcl

* Update build-container.yml

* Removed action/cache

* added -sSL for reliability and fixed the URL path

* added -sSL for reliability and fixed the URL path CUDA containerfile

* fix: correct Dockerfile RUN command syntax errors

- Combine split apt-get install commands in both Containerfiles
- Fix broken cmake command continuation in ik_llama-cuda.Containerfile

* fix: correct llama-swap download URL in Containerfiles

- Fix broken line continuation in curl download URL for llama-swap

* perf: improve ccache configuration in Containerfiles

- Add CCACHE_UMASK=000 for cache accessibility across stages
- Add CCACHE_MAXSIZE=1G to prevent unbounded growth
- Initialize ccache with ccache -i during build stage

* fix: remove problematic ccache initialization from Containerfiles

- ccache -i fails because CCACHE_DIR mount doesn't exist yet at build time

* fix: add git to CPU Containerfile build dependencies

- Resolves CMake warning about missing Git for build info

* chore: optimize Containerfile with smaller images and better healthchecks

- Add --no-install-recommends to all apt-get commands for smaller image size
- Add ca-certificates to base stage for HTTPS support
- Merge redundant build copy commands from 3 layers to 1
- Fix llama-swap version from 198 to v199 (latest release)
- Add HEALTHCHECK configuration with interval/timeout/retries to server and swap stages
- Copy /app/lib in server stage to fix container startup

* chore: fix CUDA Containerfile healthchecks and swap version

- Add /app/lib copy in server stage to fix container startup
- Fix llama-swap version from 198 to v199 (latest release)
- Add HEALTHCHECK configuration with interval/timeout/retries

* chore: fix indentation in Containerfiles and add LD_LIBRARY_PATH for server target

* fix: add --break-system-packages flag for pip in CPU Containerfile

* feat: add git bind mount for build info and NCCL support for CUDA

* fix: remove libnccl-dev from CUDA build (already included in base image)

* fix: added Markdown files to ignore files

* feat: use BUILD_NUMBER-COMMIT pattern for docker image tags

- Add BUILD_NUMBER and LLAMA_COMMIT to build workflow
- Update docker-bake.hcl to use version tag format matching llama-server --version output
- Format: VARIANT-BUILD_NUMBER-COMMIT (e.g., cu12-full-4406-3bc90dfd)

* fix: fetch full git history for accurate BUILD_NUMBER

- Add fetch-depth: 0 to actions/checkout to get all commits
- This ensures git rev-list --count HEAD returns correct total commit count

* fix: fetch full git history in Dockerfile for accurate BUILD_NUMBER

- Add git fetch --unshallow to get complete commit history during build
- This ensures build-info.cpp is generated with correct LLAMA_BUILD_NUMBER

* chore: update GitHub Actions to latest versions for Node.js 24 compatibility

- docker/setup-buildx-action@v3 -> v4
- docker/login-action@v3 -> v4

* chore: update all GitHub Actions to Node.js 24 compatible versions

- actions/checkout@v4 -> v6
- docker/setup-buildx-action@v3 -> v5
- docker/login-action@v3 -> v6
- docker/bake-action@v5 -> v7

* fix: use CI-passed BUILD_NUMBER and LLAMA_COMMIT in Dockerfile

- Add BUILD_NUMBER and LLAMA_COMMIT as build args
- Fall back to git commands if not provided
- Pass values explicitly to cmake for accurate build info

* fix: pass BUILD_NUMBER and LLAMA_COMMIT as Docker build args

- Add BUILD_NUMBER and LLAMA_COMMIT to docker bake args
- These will be used by the Containerfile for accurate build info

* fix: revert docker actions to v4 (latest available versions)

* fix: calculate BUILD_NUMBER and LLAMA_COMMIT directly in Containerfile

- Removed ARG defaults since we calculate from git during build
- Use git rev-list --count HEAD and git rev-parse for accurate version info
- Falls back to 0/unknown if git commands fail

* feat: calculate BUILD_NUMBER and LLAMA_COMMIT in Containerfiles

- Add git-based version calculation in both CPU and CUDA Containerfiles
- Remove .git bind mount (git is copied with COPY .)
- Pass build info to CMake for accurate llama-server --version output

* feat: calculate BUILD_NUMBER and LLAMA_COMMIT in Containerfiles

- Add git-based version calculation using git rev-list and git rev-parse
- Copy .git directory separately to ensure git commands work during build
- Pass build info to CMake for accurate llama-server --version output

* fix: cache improvements for CUDA and CPU builds

* fix: "/.git": not found

* fix: Unnecessary mv llama-swap

* fix: Remove BUILD_NUMBER and LLAMA_COMMIT from docker file, calculated by cmake proc

* fix: remove .git from dockerignore for local and CI builds

- Enables cmake to access .git directory during Docker build
- Required for version calculation in llama-server binary
- GitHub Actions uses explicit mount via bake action set parameter

* fix: Remove mounts key from Build and Push step in gh workflow

* ci: add .git verification step before build

* refactor: standardize Containerfile structure and remove .git mount dependency

- Remove --mount=type=bind,source=.git,target=.git from both Containerfiles
- Replace COPY . . with git clone for cleaner build context
- Add CUSTOM_COMMIT ARG for optional custom commit switching
- Standardize ARG/ENV ordering and comment formatting across CPU/CUDA variants
- Install ca-certificates before git clone to fix SSL verification issues
- Rename 'Structured artifact collection' to 'Collect build artifacts'

* ci: remove broken cache pruning step

* ci: remove broken prune-cache job

- Remove prune-cache job that was failing due to missing .git directory
- The job required a checkout step and the cache pruning logic was non-critical

* chore: Removed step for Verifying .git existence in GH workflow

* fix: ensure build always proceeds even if git switch fails

- Add '|| true' to git switch command so build continues on failure
- This prevents the entire RUN step from failing when CUSTOM_COMMIT is invalid

* fix: resolve Docker build pipeline issues

- Remove external git clone from Containerfiles, use build context directly
- Add BUILD_NUMBER and BUILD_COMMIT as CMake cache variables in build-info.cmake
- Fix .devops/tools.sh inclusion by using explicit COPY for hidden directories
- Set USE_CCACHE=true for CI builds
- Clean up unused SHA_SHORT variable from docker-bake.hcl

Fixes: Build steps were cached incorrectly due to external git clone ignoring the actual build context source.

* fix: include .git in Docker build context and add verification

* ci: add .git directory verification step after checkout

* build: fix .git mount path for Docker build context compatibility

* build: fix .git mount path for Docker build context compatibility

* docker: include .git in build context for version calculation

* ci: add .git directory verification step after checkout

* chore: Removed unnecessary Verify .git step (It was a test)

* docs: update README with docker-bake and build-local.sh instructions

* docs: remove build-local.sh reference (not in repo)

* ci: optimize disk usage by limiting fetch depth and cleaning workspace

---------

Co-authored-by: HP Prodesk <sourceupdev@gmail.com>
2026-04-10 08:06:47 +02:00


Build and use ik_llama.cpp with CPU or CPU+CUDA

Built on top of ikawrakow/ik_llama.cpp and llama-swap

All commands are provided for Podman and Docker.

The CPU or CUDA sections under Build and Run are enough to get up and running.

Overview

Build

The project uses Docker Bake for building multiple targets efficiently.

CPU Variant

docker buildx bake --builder ik-llama-builder full swap

Or with custom tags:

REPO_OWNER=yourname docker buildx bake --builder ik-llama-builder \
  -f ./docker-bake.hcl \
  full swap

CUDA Variant

First, set the CUDA version and GPU architecture in ik_llama-cuda.Containerfile:

  • CUDA_DOCKER_ARCH: Your GPU's compute capability (e.g., 86 for RTX 30*, 89 for RTX 40*, 12.0 for RTX 50*)
  • CUDA_VERSION: CUDA Toolkit version (e.g., 12.6.2, 13.1.1)

Then build:

VARIANT=cu12 docker buildx bake --builder ik-llama-builder full swap
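If docker-bake.hcl plumbs these settings through as build arguments (an assumption — check the bake file and Containerfile first), you could also override them on the command line instead of editing ik_llama-cuda.Containerfile:

```shell
# Hypothetical override via buildx bake --set; the arg names CUDA_DOCKER_ARCH and
# CUDA_VERSION come from the Containerfile and are assumed to be exposed as ARGs.
VARIANT=cu12 docker buildx bake --builder ik-llama-builder \
  --set "full.args.CUDA_DOCKER_ARCH=86" \
  --set "full.args.CUDA_VERSION=12.6.2" \
  full swap
```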

Build Targets

Builds two image tags per variant:

  • full: Includes llama-server, llama-quantize, and other utilities.
  • swap: Includes only llama-swap and llama-server.
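After the build finishes, you can sanity-check which tags were produced (the reference pattern below is illustrative; actual tags follow the VARIANT-BUILD_NUMBER-COMMIT scheme from docker-bake.hcl):

```shell
# List locally built ik_llama images and their tags
docker image ls --filter "reference=ik_llama-*"
# or with Podman
podman image ls --filter "reference=ik_llama-*"
```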

Local Development

  1. Clone the repository: git clone https://github.com/ikawrakow/ik_llama.cpp
  2. Enter the repo: cd ik_llama.cpp
  3. Use docker-bake as shown above.

Run

  • Download .gguf model files to your favorite directory (e.g., /my_local_files/gguf).
  • Map it to /models inside the container.
  • Open http://localhost:9292 in your browser and enjoy the features.
  • API endpoints are available at http://localhost:9292/v1 for use in other applications.
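Once the container is running, the OpenAI-compatible endpoint can be exercised with curl. A minimal sketch (the model name is a placeholder — list the models actually configured first):

```shell
# List the models the server/llama-swap knows about
curl -sS http://localhost:9292/v1/models

# Send a chat completion request (replace "my-model" with a name from the list above)
curl -sS http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'
```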

CPU

podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap

CUDA

podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia localhost/ik_llama-cuda:swap

Troubleshooting

  • If CUDA is not available, use ik_llama-cpu instead.
  • If models are not found, ensure you mount the correct directory: -v /my_local_files/gguf:/models:ro
  • If you need to install Podman or Docker, follow the Podman Installation or Install Docker Engine instructions for your OS.

Extra

  • Custom commit: Build a specific ik_llama.cpp commit by modifying the Containerfile or using build args.
docker buildx bake --builder ik-llama-builder --set full.args.BUILD_COMMIT=1ec12b8 full
  • Using the tools in the full image:
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash localhost/ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
  • Customize llama-swap config: Save ./docker/ik_llama-cpu-swap.config.yaml or ./docker/ik_llama-cuda-swap.config.yaml locally (e.g., under /my_local_files/), then map it to /app/config.yaml inside the container by appending -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro to your podman run ... or docker run ... command.
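Putting that together, a full run command with the custom config mounted might look like this (paths are the examples used throughout this README):

```shell
# CPU swap image with a locally edited llama-swap config mapped to /app/config.yaml
podman run -it --name ik_llama --rm -p 9292:8080 \
  -v /my_local_files/gguf:/models:ro \
  -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro \
  localhost/ik_llama-cpu:swap
```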

  • Run in background: Replace -it with -d: podman run -d ... or docker run -d .... To stop it: podman stop ik_llama or docker stop ik_llama.

  • GGML_NATIVE: If you build the image on a different machine, change -DGGML_NATIVE=ON to -DGGML_NATIVE=OFF in the .Containerfile.

  • KV quantization types: To use more KV quantization types, build with -DGGML_IQK_FA_ALL_QUANTS=ON.

  • Clean up unused CUDA images: If you experiment with several CUDA_VERSION values, delete unused images (they are several GB each):

    podman image rm docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 && \
      podman image rm docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04
    
  • Build without llama-swap: Change --target swap to --target server in docker-bake or Containerfiles.

  • Pre-made quants: Look for pre-made quants from ubergarm.

  • GGUF tools: Build custom quants with Thireus's tools.

  • Download prebuilt binaries: Thireus's fork of ik_llama.cpp provides release builds for macOS/Windows/Ubuntu CPU and Windows CUDA.

  • KoboldCPP experience: Croco.Cpp is a fork of KoboldCPP that runs GGUF/GGML models on CPU/CUDA with KoboldAI's UI. It is powered in part by ik_llama.cpp and is compatible with most of Ikawrakow's quants except Bitnet.

Credits

All credits to the awesome community:

llama-swap