* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the openai topk fusing for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.
* Also fuse sum_rows and div
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>