Build and use ik_llama.cpp with CPU or CPU+CUDA
Built on top of ikawrakow/ik_llama.cpp and llama-swap
Commands are provided for Podman and Docker.
The CPU or CUDA sections under Prebuilt, Build, and Run are enough to get up and running.
Overview
Prebuilt Docker images
Pull one of the available images from ghcr.io. View all tags
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-swap
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-server
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cpu-full
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-swap
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-server
docker pull ghcr.io/ikawrakow/ik-llama-cpp:cu12-full
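To try a prebuilt image directly, the swap variant can be started the same way as the locally built images described under Run below. This is a sketch and assumes the prebuilt images expose the same port and entrypoint as the local builds:
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro ghcr.io/ikawrakow/ik-llama-cpp:cpu-swap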
Build
The project uses Docker Bake for building multiple targets efficiently.
Clone the repository: git clone https://github.com/ikawrakow/ik_llama.cpp
Create a buildx builder to use with docker-bake:
docker buildx create --name ik-llama-builder --use
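You can confirm the builder was created and selected with:
docker buildx ls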
CPU Variant
VARIANT=cpu docker buildx bake --builder ik-llama-builder --load full swap
Or with custom tags:
REPO_OWNER=yourname VARIANT=cpu docker buildx bake --builder ik-llama-builder --load \
-f ./docker-bake.hcl \
full swap
CUDA Variant
First, set the CUDA version and GPU architecture in ik_llama-cuda.Containerfile:
- CUDA_DOCKER_ARCH: Your GPU's compute capability (e.g., 86 for RTX 30*, 89 for RTX 40*, 12.0 for RTX 50*)
- CUDA_VERSION: CUDA Toolkit version (e.g., 12.6.2, 13.1.1)
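As an illustration, the corresponding build arguments in ik_llama-cuda.Containerfile would look roughly like the lines below. The exact ARG names and defaults in the file are authoritative; the values shown (an RTX 30* card with CUDA 12.6.2) are only examples:
ARG CUDA_VERSION=12.6.2
ARG CUDA_DOCKER_ARCH=86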
VARIANT=cu12 docker buildx bake --builder ik-llama-builder --load full swap
Build Targets
Builds two image tags per variant:
- full: Includes llama-server, llama-quantize, and other utilities.
- swap: Includes only llama-swap and llama-server.
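After a successful bake, both tags should appear in the local image list (tag names assume the default REPO_OWNER and the cpu variant):
docker image ls | grep ik_llama
podman image ls | grep ik_llama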
Run
- Download .gguf model files to your favorite directory (e.g., /my_local_files/gguf).
- Map it to /models inside the container.
- Open the browser at http://localhost:9292 and enjoy the features.
- API endpoints are available at http://localhost:9292/v1 for use in other applications.
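As a quick check of the API once a container is running, the server speaks the usual OpenAI-compatible endpoints; the model name below is a placeholder for whatever your llama-swap config defines:
curl http://localhost:9292/v1/models
curl http://localhost:9292/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'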
CPU
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
CUDA
- Install Nvidia Drivers and CUDA on the host.
- For Docker, install NVIDIA Container Toolkit
- For Podman, install CDI Container Device Interface
- Identify your GPU:
  - CUDA GPU Compute Capability (e.g., 8.6 for RTX 30*, 8.9 for RTX 40*, 12.0 for RTX 50*)
  - CUDA Toolkit supported version
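Before starting the ik_llama container, you can sanity-check that GPU passthrough works by running nvidia-smi inside a plain CUDA container. This assumes the NVIDIA Container Toolkit/CDI setup above; the CUDA base image tag is only an example:
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 nvidia-smi
docker run --rm --runtime nvidia docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 nvidia-smi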
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia localhost/ik_llama-cuda:swap
Troubleshooting
- If CUDA is not available, use ik_llama-cpu instead.
- If models are not found, ensure you mount the correct directory: -v /my_local_files/gguf:/models:ro
- If you need to install podman or docker, follow the Podman Installation or Install Docker Engine guides for your OS.
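- If the CUDA image starts but no GPU is detected, first confirm that the driver works on the host itself before debugging the container:
nvidia-smi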
Extra
- Custom commit: Build a specific ik_llama.cpp commit by modifying the Containerfile or using build args.
docker buildx bake --builder ik-llama-builder --set full.args.BUILD_COMMIT=1ec12b8 full
- Using the tools in the full image:
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash localhost/ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
- Customize llama-swap config: Save ./docker/ik_llama-cpu-swap.config.yaml or ./docker/ik_llama-cuda-swap.config.yaml locally (e.g., under /my_local_files/), then map it to /app/config.yaml inside the container by appending -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro to your podman run ... or docker run ....
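For reference, a llama-swap model entry typically looks roughly like the sketch below. The shipped ./docker/ik_llama-cpu-swap.config.yaml is the authoritative reference for the schema used by the llama-swap version in the image; the model name, binary path, and model path here are placeholders:
models:
  "my-model":
    # command llama-swap runs to start the backend; paths are assumptions
    cmd: /app/llama-server --port ${PORT} -m /models/my-model.gguf
    # where llama-swap proxies requests for this model
    proxy: "http://127.0.0.1:${PORT}"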
- Run in background: Replace -it with -d: podman run -d ... or docker run -d .... To stop it: podman stop ik_llama or docker stop ik_llama.
- GGML_NATIVE: If you build the image on a different machine than the one that will run it, change -DGGML_NATIVE=ON to -DGGML_NATIVE=OFF in the .Containerfile.
- KV quantization types: To use more KV quantization types, build with -DGGML_IQK_FA_ALL_QUANTS=ON.
- Cleanup unused CUDA images: If you experiment with several CUDA_VERSION values, delete unused images (they are several GB):
podman image rm docker.io/nvidia/cuda:12.4.0-runtime-ubuntu22.04 && \
podman image rm docker.io/nvidia/cuda:12.4.0-devel-ubuntu22.04
- Build without llama-swap: Change --target swap to --target server in docker-bake or Containerfiles.
- Pre-made quants: Look for pre-made quants from ubergarm.
- GGUF tools: Build custom quants with Thireus's tools.
- Download prebuilt binaries: Download from ik_llama.cpp's Thireus fork with release builds for macOS/Windows/Ubuntu CPU and Windows CUDA.
- KoboldCPP experience: Croco.Cpp is a fork of KoboldCPP that infers GGUF/GGML models on CPU/CUDA with KoboldAI's UI. It is powered partly by ik_llama.cpp and is compatible with most of Ikawrakow's quants except Bitnet.
Credits
All credits to the awesome community: