Build and use ik_llama.cpp with CPU or CPU+CUDA
Built on top of ikawrakow/ik_llama.cpp and llama-swap
Commands are provided for Podman and Docker.
The CPU or CUDA sections under Prebuilt, Build, and Run are enough to get up and running.
Overview
Prebuilt Docker images
Pull one of the available images from ghcr.io. View all tags
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-swap
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-server
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-full
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-swap
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-server
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-full
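To try a prebuilt image directly, the swap variant can be started the same way as the locally built images described under Run below. This is a sketch and assumes the prebuilt images expose the same port and entrypoint as the local builds:
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro ghcr.io/ikawrakow/ik-llama-cpp:cpu-swap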
Build
The project uses Docker Bake for building multiple targets efficiently.
Clone the repository: git clone https://github.com/ikawrakow/ik_llama.cpp
Create a buildx builder to use with docker-bake:
docker buildx create --name ik-llama-builder --use
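You can confirm the builder was created and selected with:
docker buildx ls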
CPU Variant
VARIANT=cpu docker buildx bake --builder ik-llama-builder --load full swap
Or with custom tags:
REPO_OWNER=yourname VARIANT=cpu docker buildx bake --builder ik-llama-builder --load \
-f ./docker-bake.hcl \
full swap
CUDA Variant
First, set the CUDA version and GPU architecture in ik_llama-cuda.Containerfile:
- CUDA_DOCKER_ARCH: Your GPU's compute capability (e.g., 86 for RTX 30*, 89 for RTX 40*, 12.0 for RTX 50*)
- CUDA_VERSION: CUDA Toolkit version (e.g., 12.6.2, 13.1.1)
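As an illustration, the corresponding build arguments in ik_llama-cuda.Containerfile would look roughly like the lines below. The exact ARG names and defaults in the file are authoritative; the values shown (an RTX 30* card with CUDA 12.6.2) are only examples:
ARG CUDA_VERSION=12.6.2
ARG CUDA_DOCKER_ARCH=86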
VARIANT=cu12 docker buildx bake --builder ik-llama-builder --load full swap
Build Targets
Builds two image tags per variant:
- full: Includes llama-server, llama-quantize, and other utilities.
- swap: Includes only llama-swap and llama-server.
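After a successful bake, both tags should appear in the local image list (tag names assume the default REPO_OWNER and the cpu variant):
docker image ls | grep ik_llama
podman image ls | grep ik_llama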
Run
- Download .gguf model files to your favorite directory (e.g., /my_local_files/gguf).
- Map it to /models inside the container.
- Open the browser at http://localhost:9292 and enjoy the features.
- API endpoints are available at http://localhost:9292/v1 for use in other applications.
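As a quick check of the API once a container is running, the server speaks the usual OpenAI-compatible endpoints; the model name below is a placeholder for whatever your llama-swap config defines:
curl http://localhost:9292/v1/models
curl http://localhost:9292/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'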
CPU
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
CUDA
- Install Nvidia Drivers and CUDA on the host.
- For Docker, install NVIDIA Container Toolkit
- For Podman, install CDI Container Device Interface
- Identify your GPU:
  - CUDA GPU Compute Capability (e.g., 8.6 for RTX 30*, 8.9 for RTX 40*, 12.0 for RTX 50*)
  - CUDA Toolkit supported version
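Before starting the ik_llama container, you can sanity-check that GPU passthrough works by running nvidia-smi inside a plain CUDA container. This assumes the NVIDIA Container Toolkit/CDI setup above; the CUDA base image tag is only an example:
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 nvidia-smi
docker run --rm --runtime nvidia docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 nvidia-smi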
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia localhost/ik_llama-cuda:swap
Troubleshooting
- If CUDA is not available, use ik_llama-cpu instead.
- If models are not found, ensure you mount the correct directory: -v /my_local_files/gguf:/models:ro
- If you need to install podman or docker, follow the Podman Installation or Install Docker Engine guides for your OS.
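- If the CUDA image starts but no GPU is detected, first confirm that the driver works on the host itself before debugging the container:
nvidia-smi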
Extra
- Custom commit: Build a specific ik_llama.cpp commit by modifying the Containerfile or using build args.
docker buildx bake --builder ik-llama-builder --set full.args.BUILD_COMMIT=1ec12b8 full
- Using the tools in the full image:
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash localhost/ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
- Customize llama-swap config: Save ./docker/ik_llama-cpu-swap.config.yaml or ./docker/ik_llama-cuda-swap.config.yaml locally (e.g., under /my_local_files/), then map it to /app/config.yaml inside the container by appending -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro to your podman run ... or docker run ....
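For reference, a llama-swap model entry typically looks roughly like the sketch below. The shipped ./docker/ik_llama-cpu-swap.config.yaml is the authoritative reference for the schema used by the llama-swap version in the image; the model name, binary path, and model path here are placeholders:
models:
  "my-model":
    # command llama-swap runs to start the backend; paths are assumptions
    cmd: /app/llama-server --port ${PORT} -m /models/my-model.gguf
    # where llama-swap proxies requests for this model
    proxy: "http://127.0.0.1:${PORT}"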
- Run in background: Replace -it with -d: podman run -d ... or docker run -d .... To stop it: podman stop ik_llama or docker stop ik_llama.
- GGML_NATIVE: If you build the image on a different machine than the one that will run it, change -DGGML_NATIVE=ON to -DGGML_NATIVE=OFF in the .Containerfile.
- KV quantization types: To use more KV quantization types, build with -DGGML_IQK_FA_ALL_QUANTS=ON.
- Cleanup unused CUDA images: If you experiment with several CUDA_VERSION values, delete unused images (they are several GB):
podman image rm docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 && \
podman image rm docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04
- Build without llama-swap: Change --target swap to --target server in docker-bake or Containerfiles.
- Pre-made quants: Look for pre-made quants from ubergarm.
- GGUF tools: Build custom quants with Thireus's tools.
- Download prebuilt binaries: Download from ik_llama.cpp's Thireus fork with release builds for macOS/Windows/Ubuntu CPU and Windows CUDA.
- KoboldCPP experience: Croco.Cpp is a fork of KoboldCPP that infers GGUF/GGML models on CPU/CUDA with KoboldAI's UI. It is powered partly by ik_llama.cpp and is compatible with most of Ikawrakow's quants except Bitnet.
Credits
All credits to the awesome community: