Mirror of https://github.com/kvcache-ai/ktransformers.git

Merge branch 'main' into work-concurrent

.github/workflows/score.yml (new file, 24 lines)
@@ -0,0 +1,24 @@
name: Human Eval Score
run-name: Human Eval Score
on: workflow_dispatch
jobs:
  Human-Eval-Score:
    runs-on: self-hosted
    steps:
      - run: echo "🎉 The job was automatically triggered by a ${{ github.event_name }} event."
      - run: echo "🔎 The name of your branch is ${{ github.ref }} and your repository is ${{ github.repository }}."
      - name: Check out repository code
        uses: actions/checkout@v4
      - run: echo "💡 The ${{ github.repository }} repository has been cloned to the runner."
      - name: Human Eval Run
        run: |
          set -e
          source /home/qujing3/anaconda3/etc/profile.d/conda.sh
          conda activate ktransformers-dev
          export PATH=/usr/local/cuda-12.4/bin:$PATH
          export LD_LIBRARY_PATH=/usr/local/cuda-12.4/lib64:$LD_LIBRARY_PATH
          export CUDA_HOME=/usr/local/cuda-12.4
          cd ${{ github.workspace }}
          python ktransformers/tests/score.py

      - run: echo "This job's status is ${{ job.status }}."
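Since `workflow_dispatch` is the only trigger, this workflow never fires on its own; it has to be started from the Actions tab or, for instance, with the GitHub CLI (a usage sketch, assuming an authenticated `gh`):

```
gh workflow run score.yml --repo kvcache-ai/ktransformers --ref main
```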
@@ -25,7 +25,7 @@ Our vision for KTransformers is to serve as a flexible platform for experimentin

 * **Mar 27, 2025**: Support Multi-concurrency.
 * **Mar 15, 2025**: Support ROCm on AMD GPU ([Tutorial](./doc/en/ROCm.md)).
-* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context) for DeepSeek-V3 and R1 in 24GB VRAM.
+* **Mar 5, 2025**: Support unsloth 1.58/2.51 bits weights and [IQ1_S/FP8 hybrid](./doc/en/fp8_kernel.md) weights. Support 139K [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022--v023-longer-context--fp8-kernel) for DeepSeek-V3 and R1 in 24GB VRAM.
 * **Feb 25, 2025**: Support [FP8 GPU kernel](./doc/en/fp8_kernel.md) for DeepSeek-V3 and R1; [Longer Context](./doc/en/DeepseekR1_V3_tutorial.md#v022-longer-context).
 * **Feb 15, 2025**: Longer Context (from 4K to 8K for 24GB VRAM) & Slightly Faster Speed (+15%, up to 16 Tokens/s), update [docs](./doc/en/DeepseekR1_V3_tutorial.md) and [online books](https://kvcache-ai.github.io/ktransformers/).
 * **Feb 10, 2025**: Support Deepseek-R1 and V3 on single (24GB VRAM)/multi gpu and 382G DRAM, up to 3~28x speedup. For detailed show case and reproduction tutorial, see [here](./doc/en/DeepseekR1_V3_tutorial.md).
@@ -163,9 +163,9 @@ If you are interested in our design principles and the implementation of the inj

 <h2 id="ack">Acknowledgment and Contributors</h2>

-The development of KTransformer is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.
+The development of KTransformers is based on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang and flashinfer. We are planning to contribute back to the community by upstreaming our modifications.

-KTransformer is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformer faster and easier to use.
+KTransformers is actively maintained and developed by contributors from the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members from <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.

 <h2 id="ack">Discussion</h2>
@@ -152,9 +152,9 @@ Each rule in the YAML file has two parts: `match` and `replace`. The `match`

 <h2 id="ack">Acknowledgments and Contributors</h2>

-The development of KTransformer is built on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by upstreaming our modifications.
+The development of KTransformers is built on the flexible and versatile framework provided by Transformers. We also benefit from advanced kernels such as GGUF/GGML, Llamafile, Marlin, sglang, and flashinfer. We plan to give back to the community by upstreaming our modifications.

-KTransformer is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members of <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformer faster and easier to use.
+KTransformers is actively maintained and developed by members of the <a href="https://madsys.cs.tsinghua.edu.cn/">MADSys group</a> at Tsinghua University and members of <a href="http://approaching.ai/">Approaching.AI</a>. We welcome new contributors to join us in making KTransformers faster and easier to use.

 <h2 id="ack">Discussion</h2>
BIN WeChatGroup.png (binary file not shown; before: 258 KiB, after: 420 KiB)
@@ -1,4 +1,4 @@
-# Ktransformer
+# Ktransformers

 [Introduction](./README.md)
 # Install
@@ -9,7 +9,7 @@ There is a Docker image available for our project, you can pull the docker image

 ```
 docker pull approachingai/ktransformers:0.2.1
 ```
-**Notice**: In this image, ktransformers was compiled on AVX512-capable CPUs. If your CPU does not support AVX512, it is suggested to recompile and install ktransformer in the /workspace/ktransformers directory within the container.
+**Notice**: In this image, ktransformers was compiled on AVX512-capable CPUs. If your CPU does not support AVX512, it is suggested to recompile and install ktransformers in the /workspace/ktransformers directory within the container.

 ## Building docker image locally
 - Download the Dockerfile from [here](../../Dockerfile)
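For reference, the in-container recompile might look like the following (a minimal sketch: `my-ktransformers` is a hypothetical container name, and the exact build command depends on the release, so check the install docs):

```
docker exec -it my-ktransformers bash
cd /workspace/ktransformers
bash install.sh  # or the pip-based install documented for your version
```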
@@ -118,7 +118,7 @@ From: https://github.com/kvcache-ai/ktransformers/issues/374

 1. First, download the latest source code using git.
 2. Then, modify the DeepSeek-V3-Chat-multi-gpu-4.yaml in the source code and all related yaml files, replacing all instances of KLinearMarlin with KLinearTorch (a scripted version of this step follows the list).
-3. Next, you need to compile from the ktransformer source code until it successfully compiles on your local machine.
+3. Next, you need to compile from the ktransformers source code until it successfully compiles on your local machine.
 4. Then, install flash-attn. It won't be used, but not installing it will cause an error.
 5. Then, modify local_chat.py, replacing all instances of flash_attention_2 with eager.
 6. Then, run local_chat.py. Be sure to follow the official tutorial's commands and adjust according to your local machine's parameters.
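The replacement in step 2 can be scripted (a sketch, assuming the rule files sit under ktransformers/optimize/optimize_rules/ in your checkout):

```
sed -i 's/KLinearMarlin/KLinearTorch/g' ktransformers/optimize/optimize_rules/*.yaml
```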
@@ -132,6 +132,7 @@ def grouped_topk(hidden_states: torch.Tensor,
                  renormalize: bool,
                  num_expert_group: int = 0,
                  topk_group: int = 0,
+                 routed_scaling_factor: float = 1.0,
                  scoring_func: str = "sigmoid",
                  e_score_correction_bias: Optional[torch.Tensor] = None):
@@ -163,8 +164,8 @@ def grouped_topk(hidden_states: torch.Tensor,
     score_mask = group_mask.unsqueeze(-1).expand(
         num_token, num_expert_group,
         scores.shape[-1] // num_expert_group).reshape(num_token, -1)  # [n, e]
-    tmp_scores = scores.masked_fill(~score_mask.bool(),
-                                    float("-inf"))  # [n, e]
+    tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0)
+    #float("-inf"))  # [n, e]

     if e_score_correction_bias is not None:
         topk_ids = torch.topk(tmp_scores, k=topk, dim=-1, sorted=False)[1]
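One effect of switching the fill value from float("-inf") to 0.0: under sigmoid scoring the masked scores go straight into torch.topk with no softmax afterwards, so a -inf fill can surface as a -inf weight whenever top-k reaches into a masked group, which would turn the renormalization below into NaN; a zero fill keeps everything finite. A small sketch with illustrative values:

```
import torch

scores = torch.tensor([[0.9, 0.8, 0.3, 0.1]])      # sigmoid scores for one token
keep = torch.tensor([[True, True, False, False]])  # group mask keeps experts 0 and 1

masked_inf = scores.masked_fill(~keep, float("-inf"))
masked_zero = scores.masked_fill(~keep, 0.0)

# top-3 has to reach past the two kept experts:
print(torch.topk(masked_inf, k=3).values)   # tensor([[0.9000, 0.8000,   -inf]])
print(torch.topk(masked_zero, k=3).values)  # tensor([[0.9000, 0.8000, 0.0000]])
```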
@@ -176,9 +177,10 @@ def grouped_topk(hidden_states: torch.Tensor,
         dim=-1,
         sorted=False)

-    if renormalize:
-        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
+    if topk > 1 and renormalize:
+        denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
+        topk_weights = topk_weights / denominator
+    topk_weights = topk_weights * routed_scaling_factor # must multiply the scaling factor
     return topk_ids.to(torch.long), topk_weights.to(torch.float32)

 class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
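Numerically, the rewritten tail of grouped_topk guards the division against an all-zero row and applies the scaling factor outside the renormalize branch (a sketch with illustrative values; routed_scaling_factor = 2.5 is just an example):

```
import torch

topk_weights = torch.tensor([[0.9, 0.3]])                     # gate weights for one token
denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20  # guard: no 0/0 on an all-zero row
topk_weights = topk_weights / denominator                     # tensor([[0.7500, 0.2500]])
topk_weights = topk_weights * 2.5                             # routed_scaling_factor (illustrative)
print(topk_weights)                                           # tensor([[1.8750, 0.6250]])
```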
@@ -204,6 +206,7 @@ class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
         self.is_windows = os.name == 'nt'
         self.use_quant = use_quant
         if not self.is_windows and use_quant:
+            print("injecting gate_linear")
             self.gate_linear = nn.Linear(self.gating_dim, self.n_routed_experts, device=generate_device)
             self.gate_linear = KTransformersLinear(key + ".ffn_gate_inp",
                                                    gguf_loader, config, self.gate_linear, #orig_module
@@ -212,22 +215,20 @@ class KMoEGateDeepSeekV3(BaseInjectedModule, KMoEGateBase):
             self.gate_linear = None

     def forward(self, hidden_states) -> torch.Tensor:
-        if self.is_windows:
+        if True or self.is_windows:
             return self.orig_module.forward(hidden_states)

         bsz, seq_len, h = hidden_states.shape
         ### compute gating score
         hidden_states = hidden_states.view(-1, h)
         if self.use_quant:
-            logits = self.gate_linear.forward(logits)
+            logits = self.gate_linear.forward(hidden_states)
         else:
             logits = F.linear(
                 hidden_states.type(torch.float32), self.weight.type(torch.float32), None
             )

-        return grouped_topk(hidden_states, logits,
-                            self.top_k, self.norm_topk_prob,
-                            self.n_group, self.topk_group)
+        return grouped_topk(hidden_states, logits, self.top_k, self.norm_topk_prob, self.n_group,
+                            self.topk_group, self.routed_scaling_factor, "sigmoid", self.e_score_correction_bias)

     def load(self, w: dict | nn.Parameter | tuple | None = None, device: str|None = None):
         if device is None: device = self.device

ktransformers/tests/score.py (new file, 137 lines)
@@ -0,0 +1,137 @@
import subprocess
import time
import requests
import sys
import os

def wait_for_server(base_url: str, timeout: int = None) -> None:
    # Poll the OpenAI-compatible /v1/models endpoint until the server answers.
    start_time = time.time()
    while True:
        try:
            response = requests.get(
                f"{base_url}/v1/models",
                headers={"Authorization": "Bearer None"},
            )
            if response.status_code == 200:
                print("Server is ready.")
                break
        except requests.exceptions.RequestException:
            time.sleep(1)
        if timeout and time.time() - start_time > timeout:
            raise TimeoutError("Server did not become ready within timeout period")

server_cmd = [
    "numactl", "-N", "1", "-m", "1",
    "/home/qujing3/anaconda3/envs/ktransformers-dev/bin/ktransformers",
    "--model_path", "/home/qujing3/models/DeepSeek-R1-Q4_K_M/config",
    "--gguf_path", "/home/qujing3/models/DeepSeek-V3-GGUF/DeepSeek-V3-Q4_K_M",
    "--port", "10002",
    "--cpu_infer", "48",
    "--optimize_config_path", "ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml",
    "--max_new_tokens", "3000",
    "--cache_lens", "6000"
]

print("Starting ktransformers server...")
print(" ".join(server_cmd))
with open("/tmp/server_log.txt", "w") as f:
    server_process = subprocess.Popen(server_cmd, stdout=f, stderr=f, text=True)

try:
    wait_for_server("http://localhost:10002", timeout=600)

    eval_cmd = ["python", "ktransformers/tests/humaneval/eval_api.py"]
    print("Running eval_api.py...")
    print(f"Command: {' '.join(eval_cmd)}")

    env = os.environ.copy()
    env["PYTHONUNBUFFERED"] = "1"

    eval_process = subprocess.Popen(
        eval_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        bufsize=1,
        env=env,
        universal_newlines=True
    )

    import threading
    import queue

    # Drain stdout/stderr on daemon threads so a full pipe buffer can
    # never block the child process.
    def enqueue_output(out, queue):
        for line in iter(out.readline, ''):
            queue.put(line)
        out.close()

    stdout_queue = queue.Queue()
    stderr_queue = queue.Queue()

    stdout_thread = threading.Thread(target=enqueue_output, args=(eval_process.stdout, stdout_queue))
    stderr_thread = threading.Thread(target=enqueue_output, args=(eval_process.stderr, stderr_queue))

    stdout_thread.daemon = True
    stderr_thread.daemon = True
    stdout_thread.start()
    stderr_thread.start()

    # Relay the child's output in near real time while it runs.
    while eval_process.poll() is None:
        try:
            line = stdout_queue.get_nowait()
            print(line, end='', flush=True)
        except queue.Empty:
            pass

        try:
            line = stderr_queue.get_nowait()
            print(line, end='', file=sys.stderr, flush=True)
        except queue.Empty:
            pass

        time.sleep(1)

    # Flush whatever is still queued after the process exits.
    while not stdout_queue.empty():
        print(stdout_queue.get(), end='', flush=True)
    while not stderr_queue.empty():
        print(stderr_queue.get(), end='', file=sys.stderr, flush=True)

    eval_process.wait()
    print(f"eval_api.py completed with exit code: {eval_process.returncode}")

    evaluate_cmd = [
        "evaluate_functional_correctness",
        "ktransformers/tests/humaneval/results/api/eval_b.jsonl"
    ]
    print("Running evaluate_functional_correctness...")
    print(f"Command: {' '.join(evaluate_cmd)}")

    evaluate_process = subprocess.Popen(
        evaluate_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        bufsize=1,
        universal_newlines=True
    )

    for line in evaluate_process.stdout:
        print(line, end='', flush=True)
    for line in evaluate_process.stderr:
        print(line, end='', file=sys.stderr, flush=True)

    evaluate_process.wait()

    print(f"evaluate_functional_correctness completed with exit code: {evaluate_process.returncode}")
    if evaluate_process.returncode != 0:
        print(f"evaluate_functional_correctness exited with code {evaluate_process.returncode}")
        sys.exit(evaluate_process.returncode)

finally:
    print("Stopping ktransformers server...")
    server_process.terminate()
    try:
        server_process.wait(timeout=30)
    except subprocess.TimeoutExpired:
        print("Server did not terminate gracefully, forcing...")
        server_process.kill()
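The script can also be run outside CI (a sketch; the conda, model, and CUDA paths baked into server_cmd are machine-specific and need adjusting first):

```
conda activate ktransformers-dev
python ktransformers/tests/score.py
```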