# Build and use ik_llama.cpp with CPU or CPU+CUDA

Built on top of ikawrakow/ik_llama.cpp and llama-swap.

All commands are provided for both Podman and Docker. The CPU or CUDA sections under Build and Run are enough to get up and running.
## Overview
## Build

Builds two image tags:

- `swap`: includes only `llama-swap` and `llama-server`.
- `full`: includes `llama-server`, `llama-quantize`, and other utilities.

To start, download the 4 files into a new directory (e.g. `~/ik_llama/`), then follow the next steps:
```
└── ik_llama
    ├── ik_llama-cpu.Containerfile
    ├── ik_llama-cpu-swap.config.yaml
    ├── ik_llama-cuda.Containerfile
    └── ik_llama-cuda-swap.config.yaml
```
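For example, the working directory can be prepared like this (a minimal sketch; the four files come from wherever this repository is hosted):

```sh
# Create the working directory and place the four files listed above in it
mkdir -p ~/ik_llama && cd ~/ik_llama
# ...download or copy the two .Containerfile files and the two .config.yaml files here...
```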
### CPU

Podman:

```sh
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full && podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap
```

Docker:

```sh
docker image build --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full . && docker image build --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap .
```
### CUDA

Podman:

```sh
podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full && podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap
```

Docker:

```sh
docker image build --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap .
```
## Run

- Download `.gguf` model files to your favorite directory (e.g. `/my_local_files/gguf`).
- Map it to `/models` inside the container.
- Open http://localhost:9292 in a browser and enjoy the features.
- API endpoints are available at http://localhost:9292/v1 for use in other applications (see the example after the run commands below).
### CPU

Podman:

```sh
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
```

Docker:

```sh
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro ik_llama-cpu:swap
```
### CUDA

- Install NVIDIA drivers and CUDA on the host.
- For Docker, install the NVIDIA Container Toolkit.
- For Podman, install CDI (Container Device Interface) support.

Podman:

```sh
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
```

Docker:

```sh
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia ik_llama-cuda:swap
```
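The endpoints under `/v1` are OpenAI-compatible, so they can be tested directly with `curl` (a hedged example; `my-model` is a placeholder that must match a model name exposed by the llama-swap config):

```sh
# List the models currently exposed by llama-swap
curl http://localhost:9292/v1/models

# Send a chat completion request ("my-model" is a placeholder model name)
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello!"}]}'
```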
## Troubleshooting

- If CUDA is not available, use `ik_llama-cpu` instead (or verify GPU visibility with the check below).
- If models are not found, ensure you mount the correct directory: `-v /my_local_files/gguf:/models:ro`.
- If you need to install `podman` or `docker`, follow the Podman Installation or Install Docker Engine guide for your OS.
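A quick way to verify GPU pass-through is to run `nvidia-smi` on the host and then inside a container (a hedged sketch; the CUDA base image tag is only an example):

```sh
# On the host: verify the NVIDIA driver works
nvidia-smi

# Podman with CDI
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable docker.io/nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# Docker with the NVIDIA runtime
docker run --rm --runtime nvidia nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```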
## Extra

`CUSTOM_COMMIT` can be used to build a specific `ik_llama.cpp` commit (e.g. `1ec12b8`):
Podman:

```sh
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:full && podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:swap
```

Docker:

```sh
docker image build --file ik_llama-cuda.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:swap .
```
Using the tools in the `full` image:

Podman (CPU image):

```sh
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
```

Docker (CUDA image):

```sh
$ docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
```
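As one concrete example, `llama-quantize` takes an input GGUF, an output path, and a quantization type (a hedged sketch; the file names are placeholders, and note that `/models` is mounted read-only in the commands above, so write the output somewhere writable or drop `:ro`):

```sh
# Inside the full container: quantize an F16 GGUF to Q4_K_M (placeholder file names)
./llama-quantize /models/my-model-F16.gguf /tmp/my-model-Q4_K_M.gguf Q4_K_M
```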
- Customize the `llama-swap` config: save `ik_llama-cpu-swap.config.yaml` or `ik_llama-cuda-swap.config.yaml` locally (e.g. under `/my_local_files/`), then map it to `/app/config.yaml` inside the container by appending `-v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro` to your `podman run ...` or `docker run ...` (see the config sketch after this list).
- To run the container in the background, replace `-it` with `-d`: `podman run -d ...` or `docker run -d ...`. To stop it: `podman stop ik_llama` or `docker stop ik_llama`.
- If you build the image on the same machine where it will be used, change `-DGGML_NATIVE=OFF` to `-DGGML_NATIVE=ON` in the `.Containerfile`.
- For a smaller CUDA build, find your GPU's CUDA Compute Capability (e.g. `8.6` for RTX 30*0), then change `CUDA_DOCKER_ARCH` in `ik_llama-cuda.Containerfile` from `default` to your GPU architecture (e.g. `CUDA_DOCKER_ARCH=86`).
- If you build only for your GPU architecture and want to use more KV quantization types, build with `-DGGML_IQK_FA_ALL_QUANTS=ON`.
- Get the best quants (measurements kindly provided on each model card) from ubergarm, if available.
- Useful graphs and numbers in @magikRUKKOLA's "Perplexity vs Size Graphs for the recent quants (GLM-4.7, Kimi-K2-Thinking, Deepseek-V3.1-Terminus, Deepseek-R1, Qwen3-Coder, Kimi-K2, Chimera, etc.)" topic.
- Build custom quants with Thireus's tools.
- If you cannot build, download from Thireus's ik_llama.cpp fork, which provides release builds for macOS/Windows/Ubuntu CPU and Windows CUDA.
- For a KoboldCPP experience, Croco.Cpp is a fork of KoboldCPP that infers GGML/GGUF models on CPU/CUDA with KoboldAI's UI. It is powered partly by ik_llama.cpp and is compatible with most of Ikawrakow's quants except Bitnet.
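For reference, a llama-swap model entry usually looks roughly like the following (a hedged sketch, not the exact contents of the bundled files; the model name, file path, and server flags are placeholders, so check `ik_llama-cpu-swap.config.yaml` or `ik_llama-cuda-swap.config.yaml` for the real schema and defaults):

```yaml
# Hypothetical llama-swap entry; names, paths, and flags are placeholders
models:
  "my-model":
    # Command llama-swap runs when this model is first requested via the API
    cmd: >
      /app/llama-server
      --model /models/my-model.gguf
      --ctx-size 8192
      --port ${PORT}
    # Unload the model after 5 minutes of inactivity
    ttl: 300
```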
## Credits

All credits to the awesome community: