Kawrakow 3d9ee861f8 WIP - GPT-OSS working
However, it is extremely stupid: the only way I could correctly repack
the up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, and copy the result back into the
ffn_up_gate_exps tensor. This is going to be very slow for giant
500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.
2026-01-12 11:10:03 +02:00