📝 #353 - Binaries releases for Windows ?

Author lbarasc
State Open
Created 2025-04-28
Updated 2025-06-06

Description

Hi,

Can you release binaries for Windows working on different types of CPU (AVX, AVX2, etc.)?

Thank you.


💬 Conversation

👤 ikawrakow commented on 2025-04-29 at 13:55:36:

If this repository gains more momentum and there are users testing on Windows and providing feedback, sure, we can consider releasing Windows binaries.

But in the meantime:

  • I don't have access to a Windows machine
  • This is just a hobby project that does not have the funds to go out and rent something in the cloud
  • I don't feel OK releasing builds that were never tested

Another thing is that this project does not aim at providing the broad hardware support that mainline llama.cpp offers. The optimizations here are targeted towards newer CPUs and GPUs. For instance, a CPU old enough to not support AVX2 will not benefit at all from this project compared to mainline llama.cpp.


👤 PmNz8 commented on 2025-04-30 at 22:54:13:

I managed to compile from source for Windows on CPU, but not for CUDA - that is above my skill level. Having (ideally automatically) compiled binaries available on GitHub would be great! I can always test some binaries if that would be helpful; one of my machines runs Intel with AVX512 (Rocket Lake), the other is AMD Zen 3 + Nvidia Ada.


👤 saood06 commented on 2025-05-01 at 07:32:23:

> • I don't have access to a Windows machine
> • I don't feel OK releasing builds that were never tested

If you want to do occasional releases (since we don't have CI like mainline does, which generates over a dozen Windows builds), I can provide the Windows builds made with MSVC 2019 and CUDA v12.1 with AVX2 that have been tested, and also Android builds. I could try cross-compiling with AVX512, but those wouldn't be tested. (I know this exists, but I've never used it and so don't know how much of a slowdown it would have.)


👤 SpookyT00th commented on 2025-05-01 at 22:11:05:

I noticed you mentioned that this is intended to support newer GPUs. Do you know if the Nvidia V100 (Volta architecture) is supported? Also, does this support tensor parallelism? I want to fit this model across 128 GB of VRAM: https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF


👤 saood06 commented on 2025-05-02 at 03:05:53:

> Also, does this support tensor parallelism? I want to fit this model across 128 GB of VRAM: https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF

For MoE models such as the one you linked, --split-mode row does not function; see https://github.com/ikawrakow/ik_llama.cpp/issues/254.


👤 sousekd commented on 2025-05-29 at 20:39:13:

I would be happy to test on AMD EPYC Turin + RTX 4090 / RTX Pro 6000, if builds are provided.


👤 Thireus commented on 2025-06-03 at 17:54:35:

If anyone wants to give the build I've created a go, please report back whether it works decently: https://github.com/Thireus/ik_llama.cpp/releases

Built using CUDA 12.8 (Blackwell compatible), with -DGGML_AVX512=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1. See https://github.com/Thireus/ik_llama.cpp/blob/main/.github/workflows/release.yml#L448-L450
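
For reference, flags like these would be passed in a configure step roughly as follows (a sketch only: the three -DGGML_* flags are quoted from the comment above, while -DGGML_CUDA=ON is assumed to be the CUDA switch and is not confirmed in this thread):

  cmake -B build -DGGML_CUDA=ON -DGGML_AVX512=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
  cmake --build build --config Release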


👤 lbarasc commented on 2025-06-03 at 19:25:40:

Well, thank you! I will test this on my server.


👤 ikawrakow commented on 2025-06-05 at 07:05:32:

How is the testing going here?

@Thireus

On x86_64, the CPU implementation has basically two paths:

  • Vanilla AVX2, i.e., /arch:AVX2 for MSVC.
  • "Fancy AVX512", which requires /arch:AVX512 plus __AVX512VNNI__, __AVX512VL__, __AVX512BW__ and __AVX512DQ__ being defined (if they are not defined, the implementation falls back to vanilla AVX2). These are supported on Zen4/Zen5 CPUs and, I guess, some recent Intel CPUs. On Linux they get defined with -march=native if the CPU supports them; I'm not sure how this works under Windows.

There is also a GEMM/GEMV implementation for CPUs natively supporting bf16 (e.g., Zen4/Zen5 and some recent Intel CPUs). To be turned on, it requires __AVX512BF16__ to be defined.
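
In code, the dispatch just described looks roughly like this (a minimal sketch, not the fork's actual source; the guard names HAVE_FANCY_AVX512 and HAVE_NATIVE_BF16 are hypothetical, while the __AVX512...__ macros are the ones listed above):

  // "Fancy" AVX512 path: compiled in only when all four feature macros
  // are defined; otherwise everything falls back to vanilla AVX2.
  #if defined(__AVX512VNNI__) && defined(__AVX512VL__) && defined(__AVX512BW__) && defined(__AVX512DQ__)
  #define HAVE_FANCY_AVX512 1
  #else
  #define HAVE_FANCY_AVX512 0
  #endif

  // The bf16 GEMM/GEMV path additionally requires native bf16 support.
  #if defined(__AVX512BF16__)
  #define HAVE_NATIVE_BF16 1
  #else
  #define HAVE_NATIVE_BF16 0
  #endif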

So, to cover pre-built binaries for Windows users, one would need 6 different builds: vanilla AVX2, fancy AVX512 without bf16, and fancy AVX512 with bf16, each with and without CUDA (without CUDA for the users who don't have a supported GPU and don't want to get involved with installing CUDA toolkits and such just so the app can run).
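
Spelled out as configure commands, such a matrix could look like this (a hypothetical sketch: only -DGGML_AVX512=ON is confirmed earlier in this thread; -DGGML_AVX2=ON, -DGGML_AVX512_BF16=ON and -DGGML_CUDA=ON are assumptions borrowed from mainline llama.cpp's CMake options and may differ in this fork):

  cmake -B build-avx2        -DGGML_AVX2=ON                          # vanilla AVX2, CPU only
  cmake -B build-avx512      -DGGML_AVX512=ON                        # fancy AVX512, CPU only
  cmake -B build-avx512-bf16 -DGGML_AVX512=ON -DGGML_AVX512_BF16=ON  # fancy AVX512 + bf16, CPU only
  # ...plus the same three configurations with -DGGML_CUDA=ON added for the CUDA variants.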


👤 PmNz8 commented on 2025-06-06 at 19:01:35:

@Thireus, for me your binaries do not run. I tried something simple like .\llama-cli.exe -m "D:\LLMs\bartowski\Qwen_Qwen3-4B-GGUF\Qwen_Qwen3-4B-Q8_0.gguf" and all I get in the log is:

[1749236397] Log start
[1749236397] Cmd: C:\Users\dawidgaming\Downloads\ik_llama-main-b3770-5a8bb97-bin-win-cuda-12.8-x64\llama-cli.exe -m D:\LLMs\bartowski\Qwen_Qwen3-4B-GGUF\Qwen_Qwen3-4B-Q8_0.gguf
[1749236397] main: build = 1 (5a8bb97)
[1749236397] main: built with MSVC 19.29.30159.0 for 
[1749236397] main: seed  = 1749236397
[1749236397] main: llama backend init
[1749236397] main: load the model and apply lora adapter, if any

Then it just shuts down.

Windows 11 + RTX 4090 @ 576.52 drivers.


👤 kiron111 commented on 2025-06-06 at 19:55:45:

> If anyone wants to give the build I've created a go, please report back whether it works decently: https://github.com/Thireus/ik_llama.cpp/releases
>
> Built using CUDA 12.8 (Blackwell compatible), with -DGGML_AVX512=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1. See https://github.com/Thireus/ik_llama.cpp/blob/main/.github/workflows/release.yml#L448-L450

Thanks, it's great - I've been stuck compiling the CUDA version myself, failing for hours.