Mirror of https://github.com/theroyallab/tabbyAPI.git, synced 2026-04-26 09:18:53 +00:00
Model: Add Tensor Parallel support
Use the tensor parallel loader when the flag is enabled. The new loader has its own autosplit implementation, so gpu_split_auto isn't valid here. Also simplify determining which cache type to use, replacing multiple if/else statements.

Signed-off-by: kingbri <bdashore3@proton.me>
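The commit message mentions replacing multiple if/else statements for cache-type selection with something simpler. A common pattern for this is a lookup table mapping mode strings to cache classes. The sketch below is illustrative only; the class and mode names (`CacheFP16`, `CacheQ4`, etc.) are hypothetical and not tabbyAPI's actual identifiers:

```python
# Hypothetical sketch: replace an if/else chain for cache selection
# with a dict lookup. Class and mode names are placeholders, not
# tabbyAPI's real ones.

class CacheFP16:
    """Placeholder for a full-precision KV cache."""

class CacheQ8:
    """Placeholder for an 8-bit quantized KV cache."""

class CacheQ4:
    """Placeholder for a 4-bit quantized KV cache."""

# One table instead of a chain of if/elif branches
CACHE_CLASSES = {
    "FP16": CacheFP16,
    "Q8": CacheQ8,
    "Q4": CacheQ4,
}

def pick_cache_class(cache_mode: str):
    """Look up the cache class for a mode string, defaulting to FP16."""
    return CACHE_CLASSES.get(cache_mode, CacheFP16)
```

Adding a new cache type then only requires a new dictionary entry rather than another branch.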
@@ -107,6 +107,11 @@ def add_model_args(parser: argparse.ArgumentParser):
         type=str_to_bool,
         help="Overrides base model context length",
     )
+    model_group.add_argument(
+        "--tensor-parallel",
+        type=str_to_bool,
+        help="Use tensor parallelism to load models",
+    )
     model_group.add_argument(
         "--gpu-split-auto",
         type=str_to_bool,
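The diff passes `type=str_to_bool` so the flag accepts an explicit true/false value rather than acting as a store-true switch. A minimal self-contained sketch of how such a converter and argument group fit together is shown below; this `str_to_bool` body is an assumption, not necessarily tabbyAPI's implementation:

```python
import argparse

def str_to_bool(value: str) -> bool:
    """Convert common truthy/falsy spellings to bool (illustrative sketch)."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes", "y"):
        return True
    if lowered in ("false", "0", "no", "n"):
        return False
    # argparse reports this as a clean usage error
    raise argparse.ArgumentTypeError(f"invalid boolean value: {value!r}")

parser = argparse.ArgumentParser()
model_group = parser.add_argument_group("model")
model_group.add_argument(
    "--tensor-parallel",
    type=str_to_bool,
    help="Use tensor parallelism to load models",
)

# Example invocation: --tensor-parallel true
args = parser.parse_args(["--tensor-parallel", "true"])
```

Because the converter raises `argparse.ArgumentTypeError`, a malformed value like `--tensor-parallel maybe` produces a normal argparse usage error instead of a traceback.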