Ik llama swap in container step by step guide (#1249)
* Create README.md
* Add container files and llama-swap configs
* Update main README.md
* Build without GGML_IQK_FA_ALL_QUANTS (otherwise fails with CUDA_DOCKER_ARCH=default)
* Mention GGML_IQK_FA_ALL_QUANTS usage
* First step more explicit
@@ -8,6 +8,8 @@ This repository is a fork of [llama.cpp](https://github.com/ggerganov/llama.cpp)
## Latest News

### [Step by step guide](./docker/README.md) for ik_llama.cpp in podman/docker container including llama-swap

### Model Support

LlaMA-3-Nemotron [PR 377](https://github.com/ikawrakow/ik_llama.cpp/pull/377), Qwen3 [PR 355](https://github.com/ikawrakow/ik_llama.cpp/pull/355), GLM-4 [PR 344](https://github.com/ikawrakow/ik_llama.cpp/pull/344), Command-A [PR 341](https://github.com/ikawrakow/ik_llama.cpp/pull/341), bitnet-b1.58-2B-4T [PR 337](https://github.com/ikawrakow/ik_llama.cpp/pull/337), LLaMA-4 [PR 321](https://github.com/ikawrakow/ik_llama.cpp/pull/321), Gemma3 [PR 276](https://github.com/ikawrakow/ik_llama.cpp/pull/276), DeepSeek-V3 [PR 176](https://github.com/ikawrakow/ik_llama.cpp/pull/176), Kimi-2 [PR 609](https://github.com/ikawrakow/ik_llama.cpp/pull/609), dots.llm1 [PR 573](https://github.com/ikawrakow/ik_llama.cpp/pull/573), Hunyuan [PR 565](https://github.com/ikawrakow/ik_llama.cpp/pull/565)
docker/README.md (new file, 138 lines)
@@ -0,0 +1,138 @@
# Build and use ik_llama.cpp with CPU or CPU+CUDA

Built on top of [ikawrakow/ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) and [llama-swap](https://github.com/mostlygeek/llama-swap).

All commands are provided for both Podman and Docker.

The CPU or CUDA subsections under [Build](#build) and [Run](#run) are enough to get up and running.

## Overview

- [Build](#build)
- [Run](#run)
- [Troubleshooting](#troubleshooting)
- [Extra Features](#extra)
- [Credits](#credits)

# Build

The build produces two image tags:

- `swap`: includes only `llama-swap` and `llama-server`.
- `full`: includes `llama-server`, `llama-quantize`, and other utilities.

To start, download the four files below into a new directory (e.g. `~/ik_llama/`), then follow the CPU or CUDA steps.

```
└── ik_llama
    ├── ik_llama-cpu.Containerfile
    ├── ik_llama-cpu-swap.config.yaml
    ├── ik_llama-cuda.Containerfile
    └── ik_llama-cuda-swap.config.yaml
```
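
If you prefer the command line, the four files can be fetched straight from the repository's `docker/` directory (a minimal sketch; adjust the branch name if the default branch differs):

```
mkdir -p ~/ik_llama && cd ~/ik_llama
for f in ik_llama-cpu.Containerfile ik_llama-cpu-swap.config.yaml \
         ik_llama-cuda.Containerfile ik_llama-cuda-swap.config.yaml; do
  curl -LO "https://raw.githubusercontent.com/ikawrakow/ik_llama.cpp/main/docker/$f"
done
```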

## CPU

Podman:

```
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full && \
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap
```

Docker:

```
docker image build --file ik_llama-cpu.Containerfile --target full --tag ik_llama-cpu:full . && \
docker image build --file ik_llama-cpu.Containerfile --target swap --tag ik_llama-cpu:swap .
```
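
Either way, both tags should now be visible in the local image store:

```
podman image ls localhost/ik_llama-cpu
# or: docker image ls ik_llama-cpu
```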

## CUDA

Podman:

```
podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full && \
podman image build --format Dockerfile --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap
```

Docker:

```
docker image build --file ik_llama-cuda.Containerfile --target full --tag ik_llama-cuda:full . && \
docker image build --file ik_llama-cuda.Containerfile --target swap --tag ik_llama-cuda:swap .
```

# Run

- Download `.gguf` model files to your favorite directory (e.g. `/my_local_files/gguf`).
- Map that directory to `/models` inside the container.
- Open `http://localhost:9292` in a browser and enjoy the features.
- OpenAI-compatible API endpoints are available at `http://localhost:9292/v1` for use in other applications (see the example below).
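
For example, a quick API test from the host (a sketch; the `model` value must match one of the model names defined in the llama-swap config, such as the `smollm2` entry that downloads its weights automatically):

```
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "smollm2 (will be downloaded automatically from huggingface.co)",
       "messages": [{"role": "user", "content": "Hello!"}]}'
```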

## CPU

Podman:

```
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro localhost/ik_llama-cpu:swap
```

Docker:

```
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro ik_llama-cpu:swap
```

## CUDA

- Install the NVIDIA drivers and CUDA on the host.
- For Docker, install the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
- For Podman, set up [CDI (Container Device Interface) support](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/cdi-support.html), as sketched below.
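
A typical host-side CDI setup looks like this (a sketch; it assumes the NVIDIA Container Toolkit is already installed on the host):

```
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk cdi list   # should list nvidia.com/gpu=all among the devices
```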

Podman:

```
podman run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --device nvidia.com/gpu=all --security-opt=label=disable localhost/ik_llama-cuda:swap
```

Docker:

```
docker run -it --name ik_llama --rm -p 9292:8080 -v /my_local_files/gguf:/models:ro --runtime nvidia ik_llama-cuda:swap
```

# Troubleshooting

- If CUDA is not available, use `ik_llama-cpu` instead.
- If models are not found, ensure you mount the correct directory: `-v /my_local_files/gguf:/models:ro`
- If you need to install `podman` or `docker`, follow the [Podman Installation](https://podman.io/docs/installation) or [Install Docker Engine](https://docs.docker.com/engine/install) instructions for your OS.
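
To check whether the container runtime can see the GPU at all, you can run `nvidia-smi` through the already-built CUDA image (a sketch; with a working CDI/toolkit setup the driver utilities are injected into the container):

```
podman run --rm --device nvidia.com/gpu=all --security-opt=label=disable --entrypoint nvidia-smi localhost/ik_llama-cuda:swap
# or: docker run --rm --runtime nvidia --entrypoint nvidia-smi ik_llama-cuda:swap
```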

# Extra

- `CUSTOM_COMMIT` can be used to build a specific `ik_llama.cpp` commit (e.g. `1ec12b8`).

Podman:

```
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:full && \
podman image build --format Dockerfile --file ik_llama-cpu.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cpu-1ec12b8:swap
```

Docker:

```
docker image build --file ik_llama-cuda.Containerfile --target full --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:full . && \
docker image build --file ik_llama-cuda.Containerfile --target swap --build-arg CUSTOM_COMMIT="1ec12b8" --tag ik_llama-cuda-1ec12b8:swap .
```

- Using the tools in the `full` image:

Podman:

```
$ podman run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --entrypoint bash localhost/ik_llama-cpu:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
```

Docker:

```
docker run -it --name ik_llama_full --rm -v /my_local_files/gguf:/models:ro --runtime nvidia --entrypoint bash ik_llama-cuda:full
# ./llama-quantize ...
# python3 gguf-py/scripts/gguf_dump.py ...
# ./llama-perplexity ...
# ./llama-sweep-bench ...
```
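
For instance, re-quantizing a model from inside the `full` container could look like this (a sketch with hypothetical file names; note that `/models` is mounted read-only in the examples above, so write the output somewhere writable):

```
# inside the container started above
./llama-quantize /models/my-model-F16.gguf /tmp/my-model-Q4_K_M.gguf Q4_K_M
```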

- Customize the `llama-swap` config: save `ik_llama-cpu-swap.config.yaml` or `ik_llama-cuda-swap.config.yaml` locally (e.g. under `/my_local_files/`), then map it to `/app/config.yaml` inside the container by appending `-v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro` to your `podman run ...` or `docker run ...` (see the example after this list).
- To run the container in the background, replace `-it` with `-d`: `podman run -d ...` or `docker run -d ...`. To stop it: `podman stop ik_llama` or `docker stop ik_llama`.
- If you build the image on the same machine where it will be used, change `-DGGML_NATIVE=OFF` to `-DGGML_NATIVE=ON` in the `.Containerfile`.
- For a smaller CUDA build, look up your GPU's [CUDA compute capability](https://developer.nvidia.com/cuda/gpus) (e.g. `8.6` for RTX 30x0) and change `CUDA_DOCKER_ARCH` in `ik_llama-cuda.Containerfile` from `default` to your GPU architecture (e.g. `CUDA_DOCKER_ARCH=86`).
- If you build only for your GPU architecture and want to use more KV-cache quantization types, build with `-DGGML_IQK_FA_ALL_QUANTS=ON`.
- Get the best quants (with measurements kindly provided on each model card) from [ubergarm](https://huggingface.co/ubergarm/models) when available.
- Useful graphs and numbers can be found in @magikRUKKOLA's topic [Perplexity vs Size Graphs for the recent quants (GLM-4.7, Kimi-K2-Thinking, Deepseek-V3.1-Terminus, Deepseek-R1, Qwen3-Coder, Kimi-K2, Chimera etc.)](https://github.com/ikawrakow/ik_llama.cpp/discussions/715).
- Build custom quants with [Thireus](https://github.com/Thireus/GGUF-Tool-Suite)'s tools.
- If you cannot build yourself, download prebuilt binaries from [Thireus' ik_llama.cpp fork with release builds for macOS/Windows/Ubuntu CPU and Windows CUDA](https://github.com/Thireus/ik_llama.cpp).
- For a KoboldCPP-style experience, see [Croco.Cpp](https://github.com/Nexesenex/croco.cpp), a fork of KoboldCPP that runs inference on GGML/GGUF models on CPU/CUDA with KoboldAI's UI. It is powered partly by ik_llama.cpp and is compatible with most of Ikawrakow's quants except Bitnet.
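
For example, running the CPU image with a customized llama-swap config mapped in:

```
podman run -it --name ik_llama --rm -p 9292:8080 \
  -v /my_local_files/gguf:/models:ro \
  -v /my_local_files/ik_llama-cpu-swap.config.yaml:/app/config.yaml:ro \
  localhost/ik_llama-cpu:swap
```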

# Credits

All credits to the awesome community:

[ikawrakow](https://github.com/ikawrakow/ik_llama.cpp)

[llama-swap](https://github.com/mostlygeek/llama-swap)
docker/ik_llama-cpu-swap.config.yaml (new file, 44 lines)
@@ -0,0 +1,44 @@
healthCheckTimeout: 1800
logRequests: true
metricsMaxInMemory: 1000

models:
  "qwen3 (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-0.6B-Q6_K.gguf
      --alias qwen3
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on

  "qwen3-vl (you need to download .gguf and mmproj first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-VL-4B-Instruct-IQ4_NL.gguf
      --mmproj /models/Qwen_Qwen3-VL-4B-Instruct-mmproj-f16.gguf
      --alias qwen3-vl
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on

  "smollm2 (will be downloaded automatically from huggingface.co)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --hf-repo mradermacher/SmolLM2-135M-i1-GGUF --hf-file SmolLM2-135M.i1-IQ4_NL.gguf
      --alias smollm2
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
docker/ik_llama-cpu.Containerfile (new file, 73 lines)
@@ -0,0 +1,73 @@
ARG UBUNTU_VERSION=22.04

# Stage 1: Build
FROM docker.io/ubuntu:$UBUNTU_VERSION AS build
ENV LLAMA_CURL=1
ENV LC_ALL=C.utf8
ARG CUSTOM_COMMIT

RUN apt-get update && apt-get install -yq build-essential git libcurl4-openssl-dev curl libgomp1 cmake
RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git /app
WORKDIR /app
RUN if [ -n "$CUSTOM_COMMIT" ]; then git switch --detach "$CUSTOM_COMMIT"; fi
RUN cmake -B build -DGGML_NATIVE=OFF -DLLAMA_CURL=ON -DGGML_IQK_FA_ALL_QUANTS=ON && \
    cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
    find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/build/src && \
    find build -name "*.so" -exec cp {} /app/build/src \;
RUN mkdir -p /app/full \
    && cp build/bin/* /app/full \
    && cp *.py /app/full \
    && cp -r gguf-py /app/full \
    && cp -r requirements /app/full \
    && cp requirements.txt /app/full \
    && cp .devops/tools.sh /app/full/tools.sh

# Stage 2: Base
FROM docker.io/ubuntu:$UBUNTU_VERSION AS base
RUN apt-get update && apt-get install -yq libgomp1 curl \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /tmp/* /var/tmp/* \
    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
    && find /var/cache -type f -delete
COPY --from=build /app/lib/ /app

# Stage 3: Full
FROM base AS full
COPY --from=build /app/full /app
RUN mkdir -p /app/build/src
COPY --from=build /app/build/src /app/build/src
WORKDIR /app
RUN apt-get update && apt-get install -yq \
    git \
    python3 \
    python3-pip \
    && pip install --upgrade pip setuptools wheel \
    && pip install -r requirements.txt \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /tmp/* /var/tmp/* \
    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
    && find /var/cache -type f -delete
# The contents of /app/full from the build stage are copied into /app above,
# so tools.sh ends up at /app/tools.sh (matching the CUDA Containerfile).
ENTRYPOINT ["/app/tools.sh"]

# Stage 4: Server
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app/llama-server
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]

# Stage 5: Swap
FROM server AS swap
ARG LS_REPO=mostlygeek/llama-swap
ARG LS_VER=189
RUN curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" \
    && tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" \
    && rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
COPY ./ik_llama-cpu-swap.config.yaml /app/config.yaml
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080"]
ENTRYPOINT [ "/app/llama-swap", "-config", "/app/config.yaml" ]
docker/ik_llama-cuda-swap.config.yaml (new file, 54 lines)
@@ -0,0 +1,54 @@
healthCheckTimeout: 1800
logRequests: true
metricsMaxInMemory: 1000

models:
  "qwen3 (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/Qwen_Qwen3-0.6B-Q6_K.gguf
      --alias qwen3
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999 --threads-batch 1
      -ctk q8_0 -ctv q8_0

  "oss-moe (you need to download .gguf first)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --model /models/kldzj_gpt-oss-120b-heretic-MXFP4_MOE-00001-of-00002.gguf
      --alias gpt-oss
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999
      --n-cpu-moe 30
      -ctk q8_0 -ctv q8_0
      --grouped-expert-routing
      --reasoning-format auto --chat-template-kwargs '{"reasoning_effort": "medium"}'

  "smollm2 (will be downloaded automatically from huggingface.co)":
    proxy: "http://127.0.0.1:9999"
    cmd: >
      /app/llama-server
      --hf-repo mradermacher/SmolLM2-135M-i1-GGUF --hf-file SmolLM2-135M.i1-IQ4_NL.gguf
      --alias smollm2
      --port 9999
      --parallel 1
      --webui llamacpp
      --jinja
      --ctx-size 12288
      -fa on
      --merge-qkv
      -ngl 999 --threads-batch 1
docker/ik_llama-cuda.Containerfile (new file, 76 lines)
@@ -0,0 +1,76 @@
ARG UBUNTU_VERSION=24.04
ARG CUDA_VERSION=12.6.2
ARG BASE_CUDA_DEV_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
ARG BASE_CUDA_RUN_CONTAINER=docker.io/nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}

# Stage 1: Build
FROM ${BASE_CUDA_DEV_CONTAINER} AS build
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default
RUN apt-get update && apt-get install -yq build-essential git libcurl4-openssl-dev curl libgomp1 cmake

RUN git clone https://github.com/ikawrakow/ik_llama.cpp.git /app
WORKDIR /app
RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
        export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
    fi && \
    cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
    find build -name "*.so" -exec cp {} /app/lib \;
RUN mkdir -p /app/build/src && \
    find build -name "*.so" -exec cp {} /app/build/src \;
RUN mkdir -p /app/full \
    && cp build/bin/* /app/full \
    && cp *.py /app/full \
    && cp -r gguf-py /app/full \
    && cp -r requirements /app/full \
    && cp requirements.txt /app/full \
    && cp .devops/tools.sh /app/full/tools.sh

# Stage 2: Base
FROM ${BASE_CUDA_RUN_CONTAINER} AS base
RUN apt-get update && apt-get install -yq libgomp1 curl \
    && update-ca-certificates \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /tmp/* /var/tmp/* \
    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
    && find /var/cache -type f -delete
COPY --from=build /app/lib/ /app

# Stage 3: Full
FROM base AS full
COPY --from=build /app/full /app
RUN mkdir -p /app/build/src
COPY --from=build /app/build/src /app/build/src
WORKDIR /app
RUN apt-get update && apt-get install -yq \
    git \
    python3 \
    python3-pip \
    && pip3 install --break-system-packages -r requirements.txt \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /tmp/* /var/tmp/* \
    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
    && find /var/cache -type f -delete
ENTRYPOINT ["/app/tools.sh"]

# Stage 4: Server
FROM base AS server
ENV LLAMA_ARG_HOST=0.0.0.0
COPY --from=build /app/full/llama-server /app/llama-server
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
ENTRYPOINT [ "/app/llama-server" ]

# Stage 5: Swap
FROM server AS swap
ARG LS_REPO=mostlygeek/llama-swap
ARG LS_VER=189
RUN curl -LO "https://github.com/${LS_REPO}/releases/download/v${LS_VER}/llama-swap_${LS_VER}_linux_amd64.tar.gz" \
    && tar -zxf "llama-swap_${LS_VER}_linux_amd64.tar.gz" \
    && rm "llama-swap_${LS_VER}_linux_amd64.tar.gz"
COPY ./ik_llama-cuda-swap.config.yaml /app/config.yaml
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080"]
ENTRYPOINT [ "/app/llama-swap", "-config", "/app/config.yaml" ]