* convert_hf_to_gguf for Kimi-K2-Instruct
Adapt mainline `PR14653` for the tokenizer while keeping the proper MLA
tensors. Tested with a workflow that uses DeepSeek's fp8_cast_bf16.py with
triton-cpu to upcast the fp8 safetensors to bf16 safetensors, then runs this
convert_hf_to_gguf.
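For reference, a minimal sketch of what the upcast step amounts to, assuming the DeepSeek-style checkpoint layout (float8_e4m3fn weights with per-128x128-block `weight_scale_inv` tensors). The real fp8_cast_bf16.py from the DeepSeek-V3 repo additionally handles sharded checkpoints and the weight-map index and uses a Triton dequant kernel (hence triton-cpu on CPU-only machines); this is only an illustration of the dequantization itself:

```python
# Sketch: upcast one fp8 safetensors shard to bf16 (assumes DeepSeek-style
# float8_e4m3fn weights with per-128x128-block `weight_scale_inv` tensors).
import torch
from safetensors.torch import load_file, save_file

BLOCK = 128  # assumed block size of the per-block inverse scales

def upcast_shard(path_in: str, path_out: str) -> None:
    tensors = load_file(path_in)
    out = {}
    for name, t in tensors.items():
        if name.endswith("_scale_inv"):
            continue  # consumed together with its weight below
        scale_name = name + "_scale_inv"
        if t.dtype == torch.float8_e4m3fn and scale_name in tensors:
            scale = tensors[scale_name].to(torch.float32)
            w = t.to(torch.float32)
            rows, cols = w.shape
            # expand the per-block scales to the full weight shape
            scale_full = scale.repeat_interleave(BLOCK, dim=0)[:rows] \
                              .repeat_interleave(BLOCK, dim=1)[:, :cols]
            out[name] = (w * scale_full).to(torch.bfloat16)
        else:
            out[name] = t  # non-fp8 tensors pass through unchanged
    save_file(out, path_out)
```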
* Add Kimi-K2 chat template
moonshotai/Kimi-K2-Instruct
https://github.com/ikawrakow/ik_llama.cpp/pull/609#issuecomment-3071259454
* kimi-k2: add the assistant prefix (add_ass) to the template so the model produces a response
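For context, a rough Python sketch of the prompt layout the chat template and the add_ass change are meant to produce. The special-token names are my reading of the upstream moonshotai/Kimi-K2-Instruct tokenizer config and should be treated as assumptions; the C++ template code in this PR is the source of truth:

```python
# Sketch of the Kimi-K2 prompt layout (token names are assumptions taken
# from the upstream tokenizer config, not from this repo's implementation).
def kimi_k2_prompt(messages, add_ass=True):
    out = ""
    for m in messages:
        role = m["role"]  # "system", "user" or "assistant"
        out += f"<|im_{role}|>{role}<|im_middle|>{m['content']}<|im_end|>"
    if add_ass:
        # the add_ass commit: append the assistant prefix so generation
        # starts an assistant reply instead of a new user turn
        out += "<|im_assistant|>assistant<|im_middle|>"
    return out

print(kimi_k2_prompt([
    {"role": "system", "content": "You are Kimi."},
    {"role": "user", "content": "Hello!"},
]))
```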
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often the case
with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>