mirror of https://github.com/kvcache-ai/ktransformers.git, synced 2026-05-20 04:19:17 +00:00
[fix](kt-kernel): fix double mem used by safetensor loader (#1997)
Release the SafeTensor mmap loader singleton after each layer's load_weights() completes. The C++ engine already holds a deep copy (cpu_infer.sync() guarantees this), so releasing the mmap handles is safe. The next layer recreates the loader on demand.

This halves peak memory usage during model loading (e.g. DSv3.2: 1.2T -> 613G).

Based on #1966 by @poryfly, adapted to the v0.6.2.post3 codebase (adds MXFP4 support missing from the original PR).

Co-authored-by: xiongchenhui <xiongchenhui@hisense.com>
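
The per-layer release described above can be sketched with the standard library alone. This is an illustrative stand-in under stated assumptions, not the kt-kernel code: `MmapLoader`, `load_layers`, and `make_demo_files` are hypothetical names, stdlib `mmap` takes the place of `safetensors.safe_open`, and a `bytes` copy stands in for the deep copy held by the C++ engine.

```python
import gc
import mmap
import os
import tempfile


class MmapLoader:
    """Hypothetical loader that caches one read-only mmap per file."""

    def __init__(self):
        self.file_handle_map = {}  # path -> (file object, mmap object)

    def view(self, path):
        # Open and map the file lazily, on first access.
        if path not in self.file_handle_map:
            f = open(path, "rb")
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            self.file_handle_map[path] = (f, m)
        return self.file_handle_map[path][1]

    def close_all_handles(self):
        # stdlib mmap objects can be closed explicitly; the loader in the
        # diff instead drops its references and relies on refcounting.
        for f, m in self.file_handle_map.values():
            m.close()
            f.close()
        self.file_handle_map.clear()
        gc.collect()


def load_layers(paths):
    """Copy each file into owned memory, releasing the mapping after each layer."""
    engine_copies = []  # stands in for the engine's deep copies
    max_open = 0        # most mappings alive at any one time
    for path in paths:
        loader = MmapLoader()                       # recreated on demand
        engine_copies.append(loader.view(path)[:])  # slicing an mmap copies to bytes
        max_open = max(max_open, len(loader.file_handle_map))
        loader.close_all_handles()                  # release before the next layer
    return engine_copies, max_open


def make_demo_files(directory, count=3, size=1024):
    """Write small placeholder 'layer' files and return their paths."""
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"layer_{i}.bin")
        with open(path, "wb") as f:
            f.write(bytes([i]) * size)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    copies, max_open = load_layers(make_demo_files(tmp))

# The copies outlive both the mappings and the files themselves.
print(len(copies), max_open)
```

Because each mapping is released before the next file is opened, at most one mapping is alive at a time while the copies accumulate. The same ordering is what keeps the mapped files and the engine's copies from both being resident for the whole model at once.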
@@ -166,11 +166,15 @@ class SafeTensorLoader:
     def close_all_handles(self):
         """Close all file handles and clear the handle map.

-        Note: safetensors.safe_open doesn't have a close() method,
-        so we just clear the references and let garbage collection handle cleanup.
+        Note: safetensors.safe_open doesn't expose a close() method. Releasing
+        the mmap relies on reference counting: once file_handle_map is cleared
+        and no tensor holds a reference to the underlying mmap region, the OS
+        will reclaim the page cache. gc.collect() is called here to trigger
+        immediate reclamation rather than waiting for the next GC cycle.
         """
-        # safetensors.safe_open doesn't have close(), just clear references
+        import gc
         self.file_handle_map.clear()
+        gc.collect()

     def load_experts(self, base_key: str, device: str = "cpu"):
         """