Build and use ik_llama.cpp with CPU or CPU+CUDA
Built on top of ikawrakow/ik_llama.cpp and llama-swap
All commands are provided for Podman and Docker.
The CPU or CUDA sections under Build and Run are enough to get up and running.
Overview
Build
Using docker-bake (Recommended)
The project uses Docker Bake for building multiple targets efficiently.
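For orientation, the relevant pieces of `docker-bake.hcl` look roughly like the following. This is a sketch, not the authoritative file — the repo's actual `docker-bake.hcl` defines more variables (e.g. `VARIANT`) and cache settings:

```hcl
# Overridable from the environment, e.g. REPO_OWNER=yourname docker buildx bake ...
variable "REPO_OWNER" { default = "localhost" }

target "full" {
  dockerfile = "ik_llama-cpu.Containerfile"
  target     = "full"
  tags       = ["${REPO_OWNER}/ik_llama-cpu:full"]
}

target "swap" {
  dockerfile = "ik_llama-cpu.Containerfile"
  target     = "swap"
  tags       = ["${REPO_OWNER}/ik_llama-cpu:swap"]
}
```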
CPU Variant
```shell
docker buildx bake --builder ik-llama-builder full swap
```
Or with custom tags:
```shell
REPO_OWNER=yourname docker buildx bake --builder ik-llama-builder \
  -f ./docker-bake.hcl \
  full swap
```
CUDA Variant
First, set the CUDA version and GPU architecture in `ik_llama-cuda.Containerfile`:

- `CUDA_DOCKER_ARCH`: your GPU's compute capability (e.g., `86` for RTX 30*, `89` for RTX 40*, `12.0` for RTX 50*)
- `CUDA_VERSION`: the CUDA Toolkit version (e.g., `12.6.2`, `13.1.1`)
```shell
VARIANT=cu12 docker buildx bake --builder ik-llama-builder full swap
```
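If you are unsure of your GPU's compute capability, recent NVIDIA drivers can report it via `nvidia-smi`; the snippet below (a sketch — `cap` is a hard-coded example value, substitute your GPU's output) shows the conversion to the `CUDA_DOCKER_ARCH` form:

```shell
# The compute capability can be read from recent NVIDIA drivers with:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# CUDA_DOCKER_ARCH is that value with the dot removed:
cap="8.6"            # example output for an RTX 3090; substitute yours
arch="${cap//./}"    # "8.6" -> "86"
echo "$arch"
```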
Build Targets
Builds two image tags per variant:
- `full`: includes `llama-server`, `llama-quantize`, and other utilities.
- `swap`: includes only `llama-swap` and `llama-server`.
Local Development
- Clone the repository: `git clone https://github.com/ikawrakow/ik_llama.cpp`
- Enter the repo: `cd ik_llama.cpp`
- Use docker-bake as shown above.
Run
- Download `.gguf` model files to your favorite directory (e.g., `/my_local_files/gguf`).
- Map it to `/models` inside the container.
- Open `http://localhost:9292` in a browser and enjoy the features.
- API endpoints are available at `http://localhost:9292/v1` for use in other applications.
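Once the container is up, the API endpoints can be exercised with plain `curl`. A sketch — the model name comes from your `llama-swap` config, so `my-model` below is a placeholder:

```shell
base="http://localhost:9292"

# List available models (requires the container to be running):
# curl -s "${base}/v1/models"

# OpenAI-compatible chat completion:
# curl -s "${base}/v1/chat/completions" \
#   -H 'Content-Type: application/json' \
#   -d '{"model":"my-model","messages":[{"role":"user","content":"Hello"}]}'
```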
CPU
```shell
# Podman
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap

# Docker
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
```
CUDA
- Install NVIDIA drivers and CUDA on the host.
- For Docker, install the NVIDIA Container Toolkit.
- For Podman, install CDI (Container Device Interface) support.
- Identify your GPU:
  - CUDA GPU compute capability (e.g., `8.6` for RTX 30*, `8.9` for RTX 40*, `12.0` for RTX 50*)
  - Supported CUDA Toolkit version
```shell
# Podman
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap

# Docker
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia localhost/ik_llama-cuda:swap
```
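Before loading a model, it can be worth confirming the GPU is visible inside the container. A sketch, assuming the image tag from the local build above; the entrypoint is overridden to run `nvidia-smi` once and exit:

```shell
image="localhost/ik_llama-cuda:swap"   # as built by docker-bake above

# Podman (CDI):
# podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable \
#   --entrypoint nvidia-smi "$image"

# Docker:
# docker run --rm --runtime nvidia --entrypoint nvidia-smi "$image"
```

If `nvidia-smi` does not list your GPU here, recheck the NVIDIA Container Toolkit or CDI setup before debugging the server itself.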
Troubleshooting
- If CUDA is not available, use `ik_llama-cpu` instead.
- If models are not found, ensure you mount the correct directory: `-v /my_local_files/gguf:/models:ro`.
- If you need to install `podman` or `docker`, follow the Podman Installation or Install Docker Engine guide for your OS.
Extra
- Custom commit: build a specific `ik_llama.cpp` commit by modifying the Containerfile or using build args:

  ```shell
  docker buildx bake --builder ik-llama-builder --set full.args.BUILD_COMMIT=1ec12b8 full
  ```

- Using the tools in the `full` image:

  ```shell
  # Podman (CPU image)
  podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full

  # Docker (CUDA image)
  docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash localhost/ik_llama-cuda:full
  ```

  Then, inside the container:

  ```shell
  ./llama-quantize ...
  python3 gguf-py/scripts/gguf_dump.py ...
  ./llama-perplexity ...
  ./llama-sweep-bench ...
  ```

- Customize the `llama-swap` config: save `./docker/ik_llama-cpu-swap.config.yaml` or `./docker/ik_llama-cuda-swap.config.yaml` locally (e.g., under `/my_local_files/`), then map it to `/app/config.yaml` inside the container by appending `-v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro` to your `podman run ...` or `docker run ...`.
- Run in the background: replace `-it` with `-d` (`podman run -d ...` or `docker run -d ...`). To stop the container: `podman stop ik_llama` or `docker stop ik_llama`.
- `GGML_NATIVE`: if you build the image on a different machine than the one it will run on, change `-DGGML_NATIVE=ON` to `-DGGML_NATIVE=OFF` in the `.Containerfile`.
- KV quantization types: to enable more KV quantization types, build with `-DGGML_IQK_FA_ALL_QUANTS=ON`.
- Clean up unused CUDA images: if you experiment with several `CUDA_VERSION`s, delete the unused images (they are several GB each):

  ```shell
  podman image rm docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 && \
  podman image rm docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04
  ```

- Build without `llama-swap`: change `--target swap` to `--target server` in docker-bake or the Containerfiles.
- Pre-made quants: look for pre-made quants from ubergarm.
- GGUF tools: build custom quants with Thireus's tools.
- Prebuilt binaries: download from ik_llama.cpp's Thireus fork, which provides release builds for macOS/Windows/Ubuntu CPU and Windows CUDA.
- KoboldCPP experience: Croco.Cpp is a fork of KoboldCPP that runs GGUF/GGML models on CPU/CUDA with KoboldAI's UI. It is powered in part by ik_llama.cpp and is compatible with most of Ikawrakow's quants, except Bitnet.
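The `GGML_NATIVE` edit above can be scripted rather than done by hand. A minimal sketch, assuming GNU `sed` and the flag spelled exactly as above (it is demonstrated on a throwaway file; point `sed` at the real `ik_llama-cpu.Containerfile` or `ik_llama-cuda.Containerfile` instead):

```shell
# Create a stand-in file containing the flag, then flip ON to OFF in place:
printf 'cmake -B build -DGGML_NATIVE=ON\n' > /tmp/demo.Containerfile
sed -i 's/-DGGML_NATIVE=ON/-DGGML_NATIVE=OFF/' /tmp/demo.Containerfile
cat /tmp/demo.Containerfile   # -> cmake -B build -DGGML_NATIVE=OFF
```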
Credits
All credit goes to the awesome community: