diff --git a/README.md b/README.md
index 307a85fc..f4ac461f 100644
--- a/README.md
+++ b/README.md
@@ -8,6 +8,8 @@ This repository is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp)
 ## Latest News
 
+### [Step by step guide](./docker/README.md) for running ik_llama.cpp in a Podman/Docker container, including llama-swap
+
 ### Model Support LlaMA-3-Nemotron [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377), Qwen3 [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355), GLM-4 [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344), Command-A [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341), bitnet-b1.58-2B-4T [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337), LLaMA-4 [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321), Gemma3 [PR 276](https://github.com/ikawrakow/ik_llama.cpp/pull/276), DeepSeek-V3 [PR 176](https://github.com/ikawrakow/ik_llama.cpp/pull/176), Kimi-2 [PR 609](https://github.com/ikawrakow/ik_llama.cpp/pull/609), dots.llm1 [PR 573](https://github.com/ikawrakow/ik_llama.cpp/pull/573), Hunyuan [PR 565](https://github.com/ikawrakow/ik_llama.cpp/pull/565)
diff --git a/docker/README.md b/docker/README.md
new file mode 100644
index 00000000..6589d128
--- /dev/null
+++ b/docker/README.md
@@ -0,0 +1,138 @@
+# Build and use ik_llama.cpp with CPU or CPU+CUDA
+
+Built on top of [ikawrakow/ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) and [llama-swap](https://github.com/mostlygeek/llama-swap).
+
+All commands are provided for both Podman and Docker.
+
+The CPU or CUDA sections under [Build](#build) and [Run](#run) are enough to get up and running.
+
+## Overview
+
+- [Build](#build)
+- [Run](#run)
+- [Troubleshooting](#troubleshooting)
+- [Extra Features](#extra)
+- [Credits](#credits)
+
+# Build
+
+The build produces two image tags:
+
+- `swap`: includes only `llama-swap` and `llama-server`.
+- `full`: includes `llama-server`, `llama-quantize`, and other utilities.
+
+To start, download the four files below into a new directory (e.g. `~/ik_llama/`), then follow the steps for your platform.
+
+```
+└── ik_llama
+    ├── ik_llama-cpu.Containerfile
+    ├── ik_llama-cpu-swap.config.yaml
+    ├── ik_llama-cuda.Containerfile
+    └── ik_llama-cuda-swap.config.yaml
+```
+
+## CPU
+
+```
+podman image build --format docker --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full && podman image build --format docker --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap
+```
+
+```
+docker image build --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full . && docker image build --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap .
+```
+
+## CUDA
+
+```
+podman image build --format docker --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full && podman image build --format docker --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap
+```
+
+```
+docker image build --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap .
+```
+
+# Run
+
+- Download `.gguf` model files to your favorite directory (e.g. `/my_local_files/gguf`).
+- Map that directory to `/models` inside the container.
+- Start a container as shown below, then open `http://localhost:9292` in your browser.
+- API endpoints are available at `http://localhost:9292/v1` for use in other applications; see the example request below.
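+
+For example, an OpenAI-compatible chat completion request through llama-swap could look like the sketch below. The `model` value has to match one of the keys under `models:` in the bundled `config.yaml`; the `smollm2` entry is used here because it downloads its `.gguf` automatically:
+
+```
+curl http://localhost:9292/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "smollm2 (will be downloaded automatically from huggingface.co)", "messages": [{"role": "user", "content": "Hello!"}]}'
+```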
+
+## CPU
+
+```
+podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
+```
+
+```
+docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro ik_llama-cpu:swap
+```
+
+## CUDA
+
+- Install the NVIDIA drivers and CUDA on the host.
+- For Docker, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
+- For Podman, set up [CDI (Container Device Interface)](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html) support.
+
+```
+podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
+```
+
+```
+docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia ik_llama-cuda:swap
+```
+
+# Troubleshooting
+
+- If CUDA is not available, use `ik_llama-cpu` instead.
+- If models are not found, make sure you mount the correct directory: `-v /my_local_files/gguf:/models:ro`.
+- If you need to install `podman` or `docker`, follow the [Podman Installation](https://podman.io/docs/installation) or [Install Docker Engine](https://docs.docker.com/engine/install) instructions for your OS.
+
+# Extra
+
+- `CUSTOM_COMMIT` can be used to build a specific `ik_llama.cpp` commit (e.g. `1ec12b8`).
+
+```
+podman image build --format docker --file ik_llama-cpu.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:full && podman image build --format docker --file ik_llama-cpu.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:swap
+```
+
+```
+docker image build --file ik_llama-cuda.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:full . && docker image build --file ik_llama-cuda.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:swap .
+```
+
+- Using the tools in the `full` image:
+
+```
+$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
+# ./llama-quantize ...
+# python3 gguf-py/scripts/gguf_dump.py ...
+# ./llama-perplexity ...
+# ./llama-sweep-bench ...
+```
+
+```
+docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash ik_llama-cuda:full
+# ./llama-quantize ...
+# python3 gguf-py/scripts/gguf_dump.py ...
+# ./llama-perplexity ...
+# ./llama-sweep-bench ...
+```
+
+- Customize the `llama-swap` config: save `ik_llama-cpu-swap.config.yaml` or `ik_llama-cuda-swap.config.yaml` locally (e.g. under `/my_local_files/`), then map it to `/app/config.yaml` inside the container by appending `-v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro` to your `podman run ...` or `docker run ...` command.
+- To run the container in the background, replace `-it` with `-d`: `podman run -d ...` or `docker run -d ...`. To stop it: `podman stop ik_llama` or `docker stop ik_llama`.
+- If you build the image on the same machine where it will be used, change `-DGGML_NATIVE=OFF` to `-DGGML_NATIVE=ON` in the `.Containerfile`.
+- For a smaller CUDA build, look up your GPU's [CUDA Compute Capability](https://developer.nvidia.com/cuda/gpus) (e.g. `8.6` for RTX 30x0 cards), then change `CUDA_DOCKER_ARCH` in `ik_llama-cuda.Containerfile` from `default` to your GPU architecture (e.g. `CUDA_DOCKER_ARCH=86`).
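+
+Since `CUDA_DOCKER_ARCH` is a regular build argument, it can also be overridden at build time without editing the Containerfile. A sketch for an `8.6` card (the `-sm86` tag suffix is only an example; the same `--build-arg` works with `docker image build`):
+
+```
+podman image build --format docker --file ik_llama-cuda.Containerfile --target swap --build-arg CUDA_DOCKER_ARCH="86" --tag ik_llama-cuda-sm86:swap
+```
+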
+- If you build only for your GPU architecture and want to make use of more KV quantization types, build with `-DGGML_IQK_FA_ALL_QUANTS=ON`.
+- Get high-quality quants from [ubergarm](https://huggingface.co/ubergarm/models) if available (measurements are kindly provided on each model card).
+- Useful graphs and numbers can be found in @magikRUKKOLA's [Perplexity vs Size Graphs for the recent quants (GLM-4.7, Kimi-K2-Thinking, Deepseek-V3.1-Terminus, Deepseek-R1, Qwen3-Coder, Kimi-K2, Chimera etc.)](https://github.com/ikawrakow/ik_llama.cpp/discussions/715) discussion.
+- Build custom quants with [Thireus's GGUF Tool Suite](https://github.com/Thireus/GGUF-Tool-Suite).
+- If you cannot build yourself, download prebuilt binaries from [Thireus's ik_llama.cpp fork with release builds for macOS/Windows/Ubuntu CPU and Windows CUDA](https://github.com/Thireus/ik_llama.cpp).
+- For a KoboldCPP experience, try [Croco.Cpp](https://github.com/Nexesenex/croco.cpp), a fork of KoboldCPP that runs GGML/GGUF models on CPU/CUDA with KoboldAI's UI. It is powered in part by ik_llama.cpp and is compatible with most of Ikawrakow's quants except Bitnet.
+
+# Credits
+
+All credit to the awesome community:
+
+[ikawrakow](https://github.com/ikawrakow/ik_llama.cpp)
+
+[llama-swap](https://github.com/mostlygeek/llama-swap)
diff --git a/docker/ik_llama-cpu-swap.config.yaml b/docker/ik_llama-cpu-swap.config.yaml
new file mode 100644
index 00000000..1ff1b445
--- /dev/null
+++ b/docker/ik_llama-cpu-swap.config.yaml
@@ -0,0 +1,44 @@
+healthCheckTimeout: 1800
+logRequests: true
+metricsMaxInMemory: 1000
+
+models:
+  "qwen3 (you need to download .gguf first)":
+    proxy: "http://127.0.0.1:9999"
+    cmd: >
+      /app/llama-server
+      --model /models/Qwen_Qwen3-0.6B-Q6_K.gguf
+      --alias qwen3
+      --port 9999
+      --parallel 1
+      --webui llamacpp
+      --jinja
+      --ctx-size 12288
+      -fa on
+
+  "qwen3-vl (you need to download .gguf and mmproj first)":
+    proxy: "http://127.0.0.1:9999"
+    cmd: >
+      /app/llama-server
+      --model /models/Qwen_Qwen3-VL-4B-Instruct-IQ4_NL.gguf
+      --mmproj /models/Qwen_Qwen3-VL-4B-Instruct-mmproj-f16.gguf
+      --alias qwen3-vl
+      --port 9999
+      --parallel 1
+      --webui llamacpp
+      --jinja
+      --ctx-size 12288
+      -fa on
+
+  "smollm2 (will be downloaded automatically from huggingface.co)":
+    proxy: "http://127.0.0.1:9999"
+    cmd: >
+      /app/llama-server
+      --hf-repo mradermacher/SmolLM2-135M-i1-GGUF --hf-file SmolLM2-135M.i1-IQ4_NL.gguf
+      --alias smollm2
+      --port 9999
+      --parallel 1
+      --webui llamacpp
+      --jinja
+      --ctx-size 12288
+      -fa on
diff --git a/docker/ik_llama-cpu.Containerfile b/docker/ik_llama-cpu.Containerfile
new file mode 100644
index 00000000..5d76dbb2
--- /dev/null
+++ b/docker/ik_llama-cpu.Containerfile
@@ -0,0 +1,73 @@
+ARG UBUNTU_VERSION=22.04
+
+# Stage 1: Build
+FROM docker.io/ubuntu:$UBUNTU_VERSION AS build
+ENV LLAMA_CURL=1
+ENV LC_ALL=C.utf8
+ARG CUSTOM_COMMIT
+
+RUN apt-get update && apt-get install -yq build-essential git libcurl4-openssl-dev curl libgomp1 cmake
+RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git /app
+WORKDIR /app
+RUN if [ -n "$CUSTOM_COMMIT" ]; then git switch --detach "$CUSTOM_COMMIT"; fi
+RUN cmake -B build -DGGML_NATIVE=OFF -DLLAMA_CURL=ON -DGGML_IQK_FA_ALL_QUANTS=ON && \
+    cmake --build build --config Release -j$(nproc)
+RUN mkdir -p /app/lib && \
+    find build -name "*.so" -exec cp {} /app/lib \;
+RUN mkdir -p /app/build/src && \
+    find build -name "*.so" -exec cp {} /app/build/src \;
+RUN mkdir -p /app/full \
+    && cp build/bin/* /app/full \
+    && cp *.py /app/full \
+    && cp -r gguf-py /app/full \
+    && cp -r requirements /app/full \
+    && cp requirements.txt /app/full \
+    && cp .devops/tools.sh /app/full/tools.sh
+
+# Stage 2: Base
+FROM docker.io/ubuntu:$UBUNTU_VERSION AS base
+RUN apt-get update && apt-get install -yq libgomp1 curl \
+    && apt-get autoremove -y \
+    && apt-get clean -y \
+    && rm -rf /tmp/* /var/tmp/* \
+    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
+    && find /var/cache -type f -delete
+COPY --from=build /app/lib/ /app
+
+# Stage 3: Full
+FROM base AS full
+COPY --from=build /app/full /app
+RUN mkdir -p /app/build/src
+COPY --from=build /app/build/src /app/build/src
+WORKDIR /app
+RUN apt-get update && apt-get install -yq \
+    git \
+    python3 \
+    python3-pip \
+    && pip install --upgrade pip setuptools wheel \
+    && pip install -r requirements.txt \
+    && apt-get autoremove -y \
+    && apt-get clean -y \
+    && rm -rf /tmp/* /var/tmp/* \
+    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
+    && find /var/cache -type f -delete
+ENTRYPOINT ["/app/tools.sh"]
+
+# Stage 4: Server
+FROM base AS server
+ENV LLAMA_ARG_HOST=0.0.0.0
+COPY --from=build /app/full/llama-server /app/llama-server
+WORKDIR /app
+HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
+ENTRYPOINT [ "/app/llama-server" ]
+
+# Stage 5: Swap
+FROM server AS swap
+ARG LS_REPO=mostlygeek/llama-swap
+ARG LS_VER=189
+RUN curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" \
+    && tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" \
+    && rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
+COPY ./ik_llama-cpu-swap.config.yaml /app/config.yaml
+HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080" ]
+ENTRYPOINT [ "/app/llama-swap", "-config", "/app/config.yaml" ]
diff --git a/docker/ik_llama-cuda-swap.config.yaml b/docker/ik_llama-cuda-swap.config.yaml
new file mode 100644
index 00000000..6fd0b3b1
--- /dev/null
+++ b/docker/ik_llama-cuda-swap.config.yaml
@@ -0,0 +1,54 @@
+healthCheckTimeout: 1800
+logRequests: true
+metricsMaxInMemory: 1000
+
+models:
+  "qwen3 (you need to download .gguf first)":
+    proxy: "http://127.0.0.1:9999"
+    cmd: >
+      /app/llama-server
+      --model /models/Qwen_Qwen3-0.6B-Q6_K.gguf
+      --alias qwen3
+      --port 9999
+      --parallel 1
+      --webui llamacpp
+      --jinja
+      --ctx-size 12288
+      -fa on
+      --merge-qkv
+      -ngl 999 --threads-batch 1
+      -ctk q8_0 -ctv q8_0
+
+  "oss-moe (you need to download .gguf first)":
+    proxy: "http://127.0.0.1:9999"
+    cmd: >
+      /app/llama-server
+      --model /models/kldzj_gpt-oss-120b-heretic-MXFP4_MOE-00001-of-00002.gguf
+      --alias gpt-oss
+      --port 9999
+      --parallel 1
+      --webui llamacpp
+      --jinja
+      --ctx-size 12288
+      -fa on
+      --merge-qkv
+      -ngl 999
+      --n-cpu-moe 30
+      -ctk q8_0 -ctv q8_0
+      --grouped-expert-routing
+      --reasoning-format auto --chat-template-kwargs '{"reasoning_effort": "medium"}'
+
+  "smollm2 (will be downloaded automatically from huggingface.co)":
+    proxy: "http://127.0.0.1:9999"
+    cmd: >
+      /app/llama-server
+      --hf-repo mradermacher/SmolLM2-135M-i1-GGUF --hf-file SmolLM2-135M.i1-IQ4_NL.gguf
+      --alias smollm2
+      --port 9999
+      --parallel 1
+      --webui llamacpp
+      --jinja
+      --ctx-size 12288
+      -fa on
+      --merge-qkv
+      -ngl 999 --threads-batch 1
diff --git a/docker/ik_llama-cuda.Containerfile b/docker/ik_llama-cuda.Containerfile
new file mode 100644
index 00000000..42b5c433
--- /dev/null
+++ b/docker/ik_llama-cuda.Containerfile
@@ -0,0 +1,76 @@
+ARG UBUNTU_VERSION=24.04
+ARG CUDA_VERSION=12.6.2
+ARG BASE_CUDA_DEV_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
+ARG BASE_CUDA_RUN_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
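+# Note: UBUNTU_VERSION and CUDA_VERSION can be overridden at build time with --build-arg,
+# as long as a matching devel/runtime tag exists on docker.io/nvidia/cuda.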
+
+# Stage 1: Build
+FROM ${BASE_CUDA_DEV_CONTAINER} AS build
+# CUDA architecture to build for (defaults to all supported archs)
+ARG CUDA_DOCKER_ARCH=default
+RUN apt-get update && apt-get install -yq build-essential git libcurl4-openssl-dev curl libgomp1 cmake
+
+RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git /app
+WORKDIR /app
+RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
+        export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
+    fi && \
+    cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
+    cmake --build build --config Release -j$(nproc)
+RUN mkdir -p /app/lib && \
+    find build -name "*.so" -exec cp {} /app/lib \;
+RUN mkdir -p /app/build/src && \
+    find build -name "*.so" -exec cp {} /app/build/src \;
+RUN mkdir -p /app/full \
+    && cp build/bin/* /app/full \
+    && cp *.py /app/full \
+    && cp -r gguf-py /app/full \
+    && cp -r requirements /app/full \
+    && cp requirements.txt /app/full \
+    && cp .devops/tools.sh /app/full/tools.sh
+
+# Stage 2: Base
+FROM ${BASE_CUDA_RUN_CONTAINER} AS base
+RUN apt-get update && apt-get install -yq libgomp1 curl \
+    && update-ca-certificates \
+    && apt-get autoremove -y \
+    && apt-get clean -y \
+    && rm -rf /tmp/* /var/tmp/* \
+    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
+    && find /var/cache -type f -delete
+COPY --from=build /app/lib/ /app
+
+# Stage 3: Full
+FROM base AS full
+COPY --from=build /app/full /app
+RUN mkdir -p /app/build/src
+COPY --from=build /app/build/src /app/build/src
+WORKDIR /app
+RUN apt-get update && apt-get install -yq \
+    git \
+    python3 \
+    python3-pip \
+    && pip3 install --break-system-packages -r requirements.txt \
+    && apt-get autoremove -y \
+    && apt-get clean -y \
+    && rm -rf /tmp/* /var/tmp/* \
+    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
+    && find /var/cache -type f -delete
+ENTRYPOINT ["/app/tools.sh"]
+
+# Stage 4: Server
+FROM base AS server
+ENV LLAMA_ARG_HOST=0.0.0.0
+COPY --from=build /app/full/llama-server /app/llama-server
+WORKDIR /app
+HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
+ENTRYPOINT [ "/app/llama-server" ]
+
+# Stage 5: Swap
+FROM server AS swap
+ARG LS_REPO=mostlygeek/llama-swap
+ARG LS_VER=189
+RUN curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" \
+    && tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" \
+    && rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
+COPY ./ik_llama-cuda-swap.config.yaml /app/config.yaml
+HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080" ]
+ENTRYPOINT [ "/app/llama-swap", "-config", "/app/config.yaml" ]
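+
+# To verify that the GPU is visible from the resulting image (a sketch; assumes the host
+# driver utilities are injected via CDI or the NVIDIA container runtime), e.g.:
+#   podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable --entrypoint nvidia-smi localhost/ik_llama-cuda:swap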